How to use the cluster resources

CPU power allocation scheme

Tiger/Wimpy is designed to run single-threaded computational tasks in parallel on the nodes connected to the cluster. All processes are started immediately after invocation by the mpirun command, i.e. they do not have to wait in a queue until a processor becomes free. The total number of running processes is not strictly limited; as a soft limit, consider about twice as many running processes as there are processors in the cluster (see the utilization page for the current state). The Wimpy scheduler allocates CPU power to processes according to their priority by migrating them among nodes of different CPU speeds. High-priority processes can obtain the maximum available CPU time even when the cluster is heavily overloaded.

Assigning an appropriate priority to processes is crucial for optimal cluster utilization. The priority is defined by the value of nice, within the range 0 to 19 (inclusive). With some exceptions due to the discreteness of the cluster, a process with nice N will obtain approximately X times more CPU time than a process with nice 19, where X = 3^((19 - N) / 5). Users should decide how important each job is (how soon it needs to finish) and set the priority accordingly. They should also respect the needs of others. This requires a common interpretation of the values of nice. Hence,

  • we define a reference value of nice to be 12. This is also the default value set by mpirun.
  • processes with a larger value of nice will be automatically pushed to slower processors when the cluster is overloaded
  • on the other hand, nice < 12 indicates an urgent job, which the scheduler will preferentially place on fast processors
  • the priority scaling is rather steep. Therefore, processes with nice = 8 will be located exclusively on the fastest processors under standard circumstances (cluster load < 150%). Assigning a lower value of nice will not speed them up; it should therefore be used only to declare a very urgent job that must stay on the fastest CPUs even under extreme cluster overload.
Improper assignment or interpretation of nice, or users' own attempts to alter the allocation of cluster resources by hand (e.g. by checkpointing and restarting processes according to the current cluster load), usually lead to wasted CPU time.
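
As a rough illustration of the scaling formula X = 3^((19 - N) / 5), the following Python sketch tabulates X for a few values of nice. The function name relative_cpu_share is made up for this example and is not part of Wimpy or mpirun; the results are only the approximate ratios given by the formula, not measured values.

```python
def relative_cpu_share(nice: int) -> float:
    """Approximate CPU time relative to a nice-19 process: X = 3^((19 - N) / 5)."""
    if not 0 <= nice <= 19:
        raise ValueError("nice must be within 0..19")
    return 3 ** ((19 - nice) / 5)


# Tabulate the ratio for a few priorities, including the reference value 12.
for n in (19, 14, 12, 9, 8, 4):
    print(f"nice {n:2d}: about {relative_cpu_share(n):.1f}x the CPU time of a nice-19 process")
```

Every step of 5 in nice changes the ratio by a factor of 3, so nice 14 corresponds to about 3x and nice 9 to about 9x, while the reference value 12 comes out at roughly 4.7x and nice 8 at roughly 11x.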

Other general rules (which do not exclude exceptions) are:

  • the longer a process is expected to run, the lower priority (higher nice) it should have. It can be proven that increasing the priority of a long-running process does not considerably shorten its real execution time; however, it can considerably increase the real execution time of short-lived jobs running at the same time
  • there are no quotas on CPU time for individual users. Nevertheless, those who know that their claim (proportional to their contribution) exceeds their usage should start their jobs with lower priority
  • it is good for the cluster to be slightly overloaded (preferably with low-priority processes) -- this prevents it from being only partially utilized when a process exits.

Memory requirements

Wimpy performs only limited checks of whether a process's memory requirements are compatible with the resources available on individual nodes. Each node has 2GB RAM per (logical) processor, i.e. it is more or less safe to migrate processes that require up to 500MB of RAM. If a process exceeds this limit, it is automatically sent to a specific node which has more RAM per CPU. Users are asked to force 'local' execution of processes that will allocate more than 500MB via mpirun -N . There is no strict limit on the memory allocated by locally running processes; nevertheless, the sum of their memory requirements should not exceed 64GB.
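
The two thresholds above (500MB for migration, 64GB in total for locally running processes) can be turned into a simple pre-submission check. The Python sketch below is only an illustration: the function plan_placement and both constant names are hypothetical, it assumes 1GB = 1024MB, and it only sees the jobs passed to it, whereas the 64GB figure applies to all locally running processes taken together.

```python
MIGRATION_LIMIT_MB = 500          # from the text: safe upper bound for migrated processes
LOCAL_TOTAL_LIMIT_MB = 64 * 1024  # from the text: 64GB in total (assuming 1GB = 1024MB)


def plan_placement(requirements_mb):
    """Split jobs into migratable and local ones and check the local total.

    requirements_mb: expected peak memory of each job, in MB.
    Returns (migratable, local, local_total_ok).
    """
    migratable = [m for m in requirements_mb if m <= MIGRATION_LIMIT_MB]
    local = [m for m in requirements_mb if m > MIGRATION_LIMIT_MB]
    local_total_ok = sum(local) <= LOCAL_TOTAL_LIMIT_MB
    return migratable, local, local_total_ok


migratable, local, ok = plan_placement([200, 500, 800, 4000])
print("can migrate:", migratable)
print("should run locally:", local, "| total within 64GB:", ok)
```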

Input/Output activity

All processes run on the diskless nodes, and their reads and writes have to be transferred from/to the headnode via the network. Its physical limit is several tens of MB/s. A safe load on the network is a few MB/s. Hence, users should carefully consider how many jobs with heavy I/O activity they run at once. The current load can be checked on the monitors page. Note that each transfer incurs additional overhead for processing the network packets. A large number of small packets can arise when write operations are followed too frequently by a flush() or close() call. Processes that write more than ~1kB/s should let the system optimise the write operations, i.e. they should not call flush() explicitly. If heavy disk I/O is unavoidable, start the process (only a limited number of them) directly on the headnode Sirrah.
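
The advice on flush() and close() can be illustrated with a short Python sketch. It assumes a job that writes one result per line; the function name write_results and the 1MB buffer size are choices made for this example, not cluster requirements. The file is opened once, written through a large buffer, and flushed only when it is closed at the end.

```python
import os
import tempfile


def write_results(path, values, buffer_size=1024 * 1024):
    """Write one value per line, leaving flushing to the buffer and the final close."""
    with open(path, "w", buffering=buffer_size) as out:
        for value in values:
            out.write(f"{value}\n")  # no out.flush() here: let the buffer fill up
    # Leaving the with-block closes the file, which flushes the remaining data once.


# Usage: write a few values to a temporary file and show the resulting size.
demo_path = os.path.join(tempfile.mkdtemp(), "results.txt")
write_results(demo_path, range(1000))
print(os.path.getsize(demo_path), "bytes written to", demo_path)
```

Opening and closing the file inside the loop, or calling flush() after every write, would instead tend to produce the many small transfers that the paragraph above warns about.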
Last updated: 20.12.2014 (L.)