If you are not familiar with Linux you may find the default editor, vi, intimidating. We have installed two other text editors: nano, which is similar to Windows Notepad, and gedit, which is similarly straightforward but requires you to run an X client on your desktop.
The ICTS High Performance Cluster uses SLURM to schedule jobs. There is one head node that researchers connect to in order to submit jobs. The /home and /scratch partitions on the head node are mounted on all worker nodes, regardless of series. Resources are assigned to partitions which can be thought of as queues.
Partition | Description | Nodes | Cores / node | Max cores / user | Time limit | Priority |
---|---|---|---|---|---|---|
ada | Standard partition | 100-126 | 40 | 120 | 170 hours | 20 |
swan | Large core partition | 119-122 | 40 | 160 | 24 hours | 30 |
curie | Long term partition | 600-609 | 64 | 64 | 750 hours | 20 |
gpuo | GPU partition | 001-004 | 16/12/12/20 | 32 | 150 hours | 20 |
gpumk | Private | 005-008 | 32 | Private | Private | Private |
gpumka | Private | 005-008 | 32 | Private | Private | Private |
a100 | GPU partition | 009-010 | 56 | varies | varies | varies |
grace | High memory | 801-802 | 24 | 24 | 72 hours | 20 |
sadacc-short | Private | 127-134 | 44 | 176 | 1 hour | 20 |
sadacc-long | Private | 127-134 | 44 | 176 | 24 hours | 30 |
Researchers are assigned to an account, which is analogous to a group, normally their department or research group, for instance maths or compsci. A researcher may also be assigned to additional accounts. Accounts may also be limited to specific partitions: for example, a researcher may submit to the ada partition using their maths account, but may only submit to the GPU partition using their mathsgpu account.
Time format in SLURM:
Before starting, it is important to understand the format of the time parameter in order to avoid ambiguity and confusion. Acceptable time formats include “minutes”, “minutes:seconds”, “hours:minutes:seconds”, “days-hours”, “days-hours:minutes” and “days-hours:minutes:seconds”. This option applies to job and step allocations. Jobs will not run unless a wall time is explicitly specified: we force you to enter a wall time rather than relying on a default because we want you to think carefully about the wall time limit for your jobs. This is your responsibility.
Some examples:

- 50 = 50 minutes
- 50:00 = 50 minutes
- 50:00:00 = 50 hours
- 2-2 = 50 hours (2 days and 2 hours)
- 2-2:00 = 50 hours (2 days and 2 hours)
- 2-2:00:00 = 50 hours (2 days and 2 hours)
Submitting jobs
Basic jobs:
Create a shell script with parameters similar to the one below:
    #!/bin/sh
    #SBATCH --account maths
    #SBATCH --partition=ada
    #SBATCH --time=10:00:00
    #SBATCH --nodes=1 --ntasks=4
    #SBATCH --job-name="MyMathsJob"
    #SBATCH --mail-user=MyEmail@uct.ac.za
    #SBATCH --mail-type=ALL

    /opt/exp_soft/softwareX/xyz -o /home/fred/testA15/myfile.txt
There must be no spaces between the start of the line and #SBATCH. There must be no spaces between the # and SBATCH.
All the #SBATCH directives must be at the top of the file with no other commands mixed in between them.
You can then submit the job by typing: sbatch myscript.sh
The directive “nodes” is the number of worker nodes or servers required, and “ntasks” is the total number of cores per job. If you specify nodes > 1 then at least one thread will be assigned to each of the additional servers. You do not need to cd to the directory from which the job is launched. If you wish to run more than one job at a time in the same folder you must ensure that each job’s output is directed to a different file, otherwise the data files will conflict or overwrite one another. In the above example the second job’s output should be directed to myfile2.txt, as in the sketch below.
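As a minimal sketch (the program path and file names are taken from the example script above), only the output destination needs to differ between the two job scripts:

    # First job script: output goes to myfile.txt
    /opt/exp_soft/softwareX/xyz -o /home/fred/testA15/myfile.txt

    # Second job script: identical apart from the output file
    /opt/exp_soft/softwareX/xyz -o /home/fred/testA15/myfile2.txt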
While the job runs on the worker node, standard output and standard error (the screen output you would see if you ran the program on a desktop) are written to a .out file. If the screen output of your software fills up the disk your job will fail. It is best to ensure that your job output is directed to a file in /home or /scratch, possibly with a command line argument or the Linux redirect > operator. In addition, it is recommended that you disable all spurious or unnecessary program output to minimize disk space usage, particularly for long job runs.
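As a minimal sketch (the program path and log file name are placeholders), redirecting both standard output and standard error from within the job script would look like this:

    # Send the program's stdout and stderr to a file in /scratch rather than the job's .out file
    /opt/exp_soft/softwareX/xyz > /scratch/fred/run1.log 2>&1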
Memory control:
Like the CPU cores, memory is a limited resource. The --mem-per-cpu directive allows you to specify how much RAM is needed. If your job exceeds the maximum RAM/core your job will be terminated. The amount of RAM per node can be found on the cluster architecture page.
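For example, to request 4 GB of RAM for each reserved core (the value here is purely illustrative; choose one appropriate for your job and the node's total RAM), add a directive such as the following to your job script:

    #SBATCH --mem-per-cpu=4G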
Parallel jobs:
Parallel jobs write to one file system regardless of which worker node they start on. However, this means that each job a user submits must start in a unique folder if the software being run is not capable of specifying unique data files.
As an example, user fred has a home directory /home/fred/ on the head node, and this directory is also mounted on each worker node. This means that if fred created /home/fred/myfile.txt on the head node, this file is also immediately present on each worker node. Fred now submits a job. The job initially lands on hpc102. SLURM now uses OpenMPI to start parallel copies of this job on, for example, nodes hpc101 and hpc103. Each of the three nodes writes data to /home/fred/myfile.txt.
If fred now submits another job and the software that fred is using cannot distinguish between concurrently running versions, then data written to /home/fred/myfile.txt will be intermingled and/or corrupted. Hence it is critical that software which is not capable of running concurrently be launched from unique directories. If fred wants to run 3 concurrent jobs then the following need to be created: /home/fred/job1, /home/fred/job2 and /home/fred/job3. The shell script that controls each job must contain a change directory command in order to select the correct directory, as in the sketch below.
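A minimal sketch of the relevant lines in the first job's script (the program and its arguments are placeholders; the directory is one of those created above):

    # Change into this job's unique working directory before launching the program
    cd /home/fred/job1
    /opt/exp_soft/softwareX/xyz -o myfile.txt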
The cluster uses OpenMPI to control parallel jobs. To launch a parallel-aware program one generally uses mpirun; however, as SLURM is tightly coupled with OpenMPI, there are some differences from launching MPI jobs manually: one does not need to specify a hostfile/machinefile, nor does one need to specify the number of threads on the command line. SLURM has its own wrapper for mpirun, srun. Also be aware that, unlike Torque/PBS, there is no symmetrical geometry: if you request 2 nodes and 4 cores then SLURM will do the bare minimum to satisfy your request by running 3 threads on one node and 1 thread on the second. To retain symmetry use --ntasks-per-node=X, where X is the number of threads per node you wish to use. It is critical that the shell script specifies how many servers (nodes) and cores will be reserved. This prevents other users' jobs from trying to run on the same cores, which would cause contention and slow down both jobs. Use the #SBATCH directives to specify the nodes and cores.
    #!/bin/sh
    #SBATCH --account maths
    #SBATCH --partition=ada
    #SBATCH --time=10:00:00
    #SBATCH --nodes=2 --ntasks=8 --ntasks-per-node=4
    #SBATCH --job-name="MyMathsJob"
    #SBATCH --mail-user=MyEmail@uct.ac.za
    #SBATCH --mail-type=ALL

    module load mpi/openmpi-4.0.1
    srun /home/fred/mympiprog
This shell script tells SLURM to reserve 2 nodes with exactly 4 cores on each node. Note that if --ntasks-per-node were not specified then the first node would have used 7 cores and the second node 1 core: unless told otherwise, the scheduler will not distribute the threads symmetrically. mpirun is coupled to the scheduler, so it is not necessary to specify a host file.
Please note that if your code uses OMP or PTHREADS: we have set OMP_NUM_THREADS=1 on all worker nodes by default, as some researchers launch OMP jobs without setting this variable, which results in the code grabbing all cores on the worker node. You are welcome to override this if needed with:
export OMP_NUM_THREADS=$SLURM_NTASKS
Or, if running in hybrid mode with an OMP job distributed via MPI:
export OMP_NUM_THREADS=$SLURM_TASKS_PER_NODE
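A minimal sketch of a single-node OMP job script using the first form (the job name and program path are placeholders; adapt the account, partition and core count to your own job):

    #!/bin/sh
    #SBATCH --account maths
    #SBATCH --partition=ada
    #SBATCH --time=10:00:00
    #SBATCH --nodes=1 --ntasks=8
    #SBATCH --job-name="MyOMPJob"

    # Match the number of OMP threads to the cores reserved by SLURM
    export OMP_NUM_THREADS=$SLURM_NTASKS
    /home/fred/myompprog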
If your code is capable of running in parallel but does not use OpenMPI, and instead requires a command line argument for the number of cores or threads, such as -n 30 or -t 30, then you can link the reserved cores to this argument with the $SLURM_NTASKS variable, i.e. -n $SLURM_NTASKS instead of -n 30. For example:
    #!/bin/sh
    #SBATCH --account maths
    #SBATCH --partition=ada
    #SBATCH --time=10:00:00
    #SBATCH --nodes=1 --ntasks=30
    #SBATCH --job-name="MyMathsJob"
    #SBATCH --mail-user=MyEmail@uct.ac.za
    #SBATCH --mail-type=ALL

    myprog -np $SLURM_NTASKS -in data.txt
salloc:
The salloc command is used to obtain a SLURM job allocation interactively. When salloc successfully obtains the requested allocation it runs the command specified by the user, and when that command completes salloc relinquishes the allocation. Entering the following at the head node returns a confirmation and a prompt once resources are available:
    salloc --account maths --partition=ada --time=10:00:00 --nodes=1 --ntasks=1
    salloc: Granted job allocation 2060
    bob@srvcnthpc001:~$
User bob is still logged into the head node but can now use srun to issue commands which will run on the assigned resources, even though the prompt still indicates the head node.
    srvcnthpc001 ~$ srun cat /etc/hostname
    srvcnthpc101.uct.ac.za
Typing exit relinquishes the resources and ends the job.
    bob@srvcnthpc001:~$ exit
    exit
    salloc: Relinquishing job allocation 2060
    salloc: Job allocation 2060 has been revoked.
    bob@srvcnthpc001:~$
It is possible to launch a cluster job directly from the command line (or a script).
srun -A maths --partition=ada --time=1000:00 --nodes=1 --ntasks=1 /home/fred/myprog -o /home/fred/out.txt
The prompt is frozen until the job completes.
Interactive Jobs:
An interactive job gives you command line access to a worker node. From the head node type:
sintx
The cluster will indicate that you are starting an interactive job and your prompt will change to that of a worker node. Now any command you type is executed on that node. If you do not see the text “Starting interactive job” then you are still on the head node and should not run any heavy load processes.
Unlike salloc your commands do not need to be prefaced with srun unless you are running OpenMPI code.
Type exit to end the job and you will return to the head node.
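A typical interactive session might look something like the following (the prompts, worker node name and program are illustrative only):

    andy@srvcnthpc001:~$ sintx
    Starting interactive job
    andy@hpc101:~$ ./myprog
    andy@hpc101:~$ exit
    andy@srvcnthpc001:~$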
You can request additional cluster parameters with the sintx command just as in an sbatch file:
sintx --ntasks=20 --account=maths --partition=ada
Account and partition parameters are not mandatory unless you have access to more than one partition.
In addition, sintx automatically creates a DISPLAY environment variable should you wish to export a graphical display back to your workstation. You will, however, need to be running an X client on your desktop.
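For example, from within the interactive job you could check the display and run a graphical test program (xclock is used purely as an illustration and may not be installed):

    # Confirm that sintx has set the DISPLAY variable
    echo $DISPLAY
    # Run a simple X application to test the display (assumes xclock is available)
    xclock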
Type squeue to see a list of running jobs:
    andy@srvcnthpc001:~$ squeue
     JOBID PARTITION       NAME  USER ST    TIME NODES NODELIST(REASON)
      2143     testl MyBatchJob  andy PD    0:00     1 (resources)
      2144      test MPImemjobA  fred  R 2:25:02     2 hpc106,107
      2150      test MPImemjobB  fred  R 1:15:27     2 hpc108,109
Here user andy wants to see why his job is not running; most likely user fred is consuming all the available resources. Note that in SLURM servers can belong to multiple partitions with different queuing attributes, so it is possible that some researchers' jobs may overtake other queued jobs. This is often the case where a research group has paid for the servers and therefore has priority on them.
To cancel a job, type scancel followed by the job ID, for example scancel 2143.
Other considerations
Some software such as Java or Stata do not behave well in a cluster environment and grab more resources than allocated. If you are submitting Java or Stata jobs please reserve all cores on a node, for example if submitting to the ada partition use:
--ntasks=40
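A hedged sketch of a whole-node job script for such software on the ada partition (the account, job name and program line are placeholders to adapt to your own job):

    #!/bin/sh
    #SBATCH --account maths
    #SBATCH --partition=ada
    #SBATCH --time=10:00:00
    #SBATCH --nodes=1 --ntasks=40
    #SBATCH --job-name="MyWholeNodeJob"

    # Reserving all 40 cores on the ada node stops poorly behaved software
    # from interfering with other researchers' jobs on the same node.
    java -jar /home/fred/myapp.jar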
If you are unsure as to how your software will behave please contact us for advice.