CPU power allocation scheme
Tiger/Wimpy is designed to run single-threaded computational tasks in parallel
on the nodes connected to the cluster. All processes are started immediately after
invocation by the mpirun command, i.e.
they do not have to wait in a queue until
a free processor is available. The total number of running processes is not strictly
limited; a soft limit is roughly twice as many running processes as there are
processors in the cluster (see the
utilization page
for the current state). The Wimpy scheduler allocates CPU power to the processes
according to their priority by migrating them among nodes of different
CPU speeds. High-priority processes can obtain the maximum available CPU time even
when the cluster is heavily overloaded.
Assigning an appropriate priority to each process is crucial for
optimal cluster utilization. The priority is defined by the value of nice within
the range <0,19>. With some exceptions, due to the discreteness of the cluster, a process
with nice N will obtain approximately X times more CPU time than a process
with nice 19, where X = 3^((19 - N) / 5) (a small numerical sketch of this
scaling follows the list below). Users
should decide how important each job is (i.e. how soon it needs to finish)
and set its priority accordingly. They should also respect the needs of the
others. This requires a common interpretation of the values of nice. Hence,
- we define a reference value of nice to be 12. This is also the default
value set by mpirun.
- processes with a larger value
of nice will be automatically pushed to slower processors when the
cluster is overloaded
- on the other hand, nice < 12 indicates an urgent job, which will be
preferentially placed on fast processors by the scheduler
- the priority scaling is rather steep. Therefore, processes with nice = 8
will run exclusively on the fastest processors under standard circumstances
(cluster load < 150%). Assigning a lower value of nice will not speed them up;
it should therefore only be used to declare a very urgent job that must stay on
the fastest CPUs even in the case of an extreme cluster overload.
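For illustration only, the following sketch (not part of the cluster software)
tabulates the scaling formula X = 3^((19 - N) / 5) quoted above, i.e. the
approximate CPU-time factor of a process with nice N relative to a process
with nice 19:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* Approximate CPU-time factor relative to a nice-19 process,
         * following the formula X = 3^((19 - N) / 5) given above. */
        for (int n = 0; n <= 19; n++) {
            double x = pow(3.0, (19.0 - n) / 5.0);
            printf("nice %2d -> ~%5.1f times the CPU time of a nice-19 process\n",
                   n, x);
        }
        return 0;
    }

For example, the reference nice 12 corresponds to roughly 4.7 times, and nice 8
to roughly 11 times, the CPU time of a nice-19 process; the actual ratios are
subject to the exceptions mentioned above.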
Improper assignment or interpretation of nice, or users' attempts to alter the
allocation of cluster resources by hand (e.g. by checkpointing and restarting
processes according to the current cluster load), usually leads to wasted
CPU time.
Other general rules (to which exceptions are possible) are:
- the longer a process is expected to run, the lower priority (higher nice)
it should have.
It can be shown that increasing the priority of a long-running process does not
considerably shorten the real time of its execution; however, it can considerably
increase the real time of simultaneously running short-lived jobs (see the toy
model after this list)
- there are no quotas on CPU time for individual users. Nevertheless, those
who know that their usage exceeds their claim (proportional to their
contribution)
should start their jobs with a lower priority
- it is good when the cluster is slightly overloaded (preferably with low-priority
processes) -- such a configuration prevents it from running only partially
utilized when some process exits.
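The point about long-running processes can be illustrated with a toy model (a
sketch only, not a description of the actual Wimpy scheduler): one CPU is
shared, under weighted fair sharing with the weights of the scaling formula
above, by a hypothetical long job of 300 CPU-hours and a short job of
1 CPU-hour that start at the same time.

    #include <math.h>
    #include <stdio.h>

    /* Weight of a process according to the scaling formula 3^((19 - N) / 5). */
    static double weight(int nice)
    {
        return pow(3.0, (19.0 - nice) / 5.0);
    }

    static void scenario(const char *label, int nice_long, int nice_short)
    {
        const double work_long = 300.0, work_short = 1.0;   /* CPU-hours */
        double wl = weight(nice_long), ws = weight(nice_short);

        /* Both jobs share the CPU until the short one finishes ... */
        double t_short = work_short * (wl + ws) / ws;
        /* ... then the long job runs alone until its remaining work is done. */
        double t_long = t_short + (work_long - t_short * wl / (wl + ws));

        printf("%s: short job %6.1f h, long job %6.1f h of real time\n",
               label, t_short, t_long);
    }

    int main(void)
    {
        scenario("both jobs at nice 12", 12, 12);
        scenario("long job at nice 4  ",  4, 12);
        return 0;
    }

In both scenarios the long job finishes after about 301 hours of real time,
while boosting it from nice 12 to nice 4 stretches the short job's turnaround
from about 2 to about 7 hours.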
Memory requirements
Wimpy makes only limited checks of the compatibility between a process' memory
requirements and the resources available on the individual nodes. Each node has
2GB RAM per (logical) processor, i.e. it is more or less safe to migrate processes
that require up to 500MB of RAM. If a process exceeds this limit, it is
automatically sent to a specific node which has more RAM per CPU. Users are asked
to force 'local' execution of processes that will allocate more than 500MB via
mpirun -N. There is no strict limit on the allocated
memory of locally running processes; nevertheless, the sum of their memory
requirements should not exceed 64GB.
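To decide whether a job needs local execution, its peak resident memory can be
measured in a short test run, e.g. with the standard POSIX getrusage() call (a
minimal sketch, not a Wimpy-specific interface; the 500MB threshold is the
migration limit mentioned above):

    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        /* Report the peak resident memory of the calling process, e.g. at the
         * end of a short test run of the computational code. */
        struct rusage ru;
        if (getrusage(RUSAGE_SELF, &ru) == 0) {
            /* On Linux ru_maxrss is given in kilobytes. */
            double mb = ru.ru_maxrss / 1024.0;
            printf("peak resident memory: %.1f MB%s\n", mb,
                   mb > 500.0 ? " -> consider local execution (mpirun -N)" : "");
        }
        return 0;
    }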
Input/Output activity
All processes run on the diskless nodes and their reads and writes
have to be transferred from/to the headnode via the network. Its physical limit is
several tens of MB/s. A safe load on the network is a few MB/s. Hence, users
should carefully consider the number of their running jobs with large I/O activity.
The current load can be checked on the
monitors page.
Note that each transfer has an additional overhead in processing the network
packets. A large number of small packets can arise when write operations are
followed by a flush() or close() call too frequently.
Processes that write more than ~1kB/s should let the system optimise the write
operations, i.e. they should not call flush() explicitly (see the sketch below).
If large disk I/O is unavoidable, start the process (only a limited number of them)
directly on the headnode Sirrah.
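As a sketch of the buffering advice above (assuming ordinary C stdio; the same
principle applies to other runtimes), let the library coalesce small writes
into large blocks instead of forcing them onto the network with frequent
flush() calls; run.log below is just a hypothetical output file:

    #include <stdio.h>

    int main(void)
    {
        FILE *log = fopen("run.log", "w");       /* hypothetical output file */
        if (!log)
            return 1;

        /* Optional: enlarge the stdio buffer so writes reach the headnode in
         * fewer, larger packets (1 MB, fully buffered). */
        setvbuf(log, NULL, _IOFBF, 1 << 20);

        for (int step = 0; step < 100000; step++) {
            fprintf(log, "step %d done\n", step);
            /* Do NOT call fflush(log) after every record -- on a diskless node
             * each flush can turn into a separate small network packet. */
        }

        fclose(log);                             /* one final flush is enough */
        return 0;
    }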