DRAFT DOCUMENT – THE DOCUMENTATION OF THE A100 PARTITION IS SUBJECT TO CHANGE AT SHORT NOTICE.
NB. If you intend to submit jobs to the a100 partition, you need to read this entire document extremely carefully, or your jobs will not run.
The a100 partition is made up of servers srvcntgpu009 and srvcntgpu010. Each server has 56 CPU cores and 4 A100 Nvidia GPU cards. These cards can be partitioned into virtual GPUs or multi-instance GPUs (migs).
Each mig instance has a profile: a fixed amount of GPU compute and GPU memory. These profiles are provided by Nvidia and may not be altered. The profiles are named according to the compute and memory they provide: 1g5gb, 2g10gb, 3g20gb, 4g20gb and 7g40gb. The 7g40gb profile is in effect a full A100 card. The sum of the capabilities of the profiles on a card may not exceed the capabilities of the entire card. For instance, one may divide an A100 into one 4g20gb and three 1g5gb instances, or into two 3g20gb instances, as long as the compute capabilities do not exceed 7 and the memory capabilities do not exceed 40gb.
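The arithmetic above can be sketched as a small shell function. This is only an illustration: the function name mig_layout_fits is hypothetical, and it checks only the two sums described here (compute at most 7, memory at most 40gb), not any further placement rules the driver may enforce.

```shell
# Check whether a set of mig profiles fits on one A100 card, using the
# limits stated above: at most 7 compute units and 40 GB of memory.
# mig_layout_fits is a hypothetical helper, not a cluster command.
mig_layout_fits() {
    compute=0
    memory=0
    for profile in "$@"; do
        # Loose shape check: <compute>g<memory>gb, e.g. 3g20gb
        case $profile in
            [0-9]*g[0-9]*gb) ;;
            *) echo "unrecognised profile: $profile" >&2; return 2 ;;
        esac
        c=${profile%%g*}      # text before the first "g", e.g. 3
        rest=${profile#*g}    # text after the first "g", e.g. 20gb
        m=${rest%gb}          # strip the trailing "gb", e.g. 20
        compute=$((compute + c))
        memory=$((memory + m))
    done
    [ "$compute" -le 7 ] && [ "$memory" -le 40 ]
}

mig_layout_fits 4g20gb 1g5gb 1g5gb 1g5gb && echo "fits"
mig_layout_fits 4g20gb 3g20gb 1g5gb || echo "does not fit"
```

The first layout sums to 7 compute units and 35gb, so it fits. The second sums to 8 compute units and 45gb, so it does not.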
Currently the HPC a100 partition servers are split as follows:
GPU0: 4g20gb 2g10gb 1g5gb
GPU1: 3g20gb 3g20gb
GPU2: ampere ...unpartitioned full card.
GPU3: ampere ...unpartitioned full card.
How to submit jobs
You need to let the scheduler know that you require a GPU resource. This is done via the generic resource (gres) tag. The format is type:label:quantity.
Via batch queue:
#SBATCH --partition=a100
#SBATCH --gres=gpu:a100-1g-5gb:1
#SBATCH --account=mygpugroup
#SBATCH --ntasks=2
Via interactive job:
sintx --partition=a100 --account=mygpugroup --ntasks=2 --gres=gpu:a100-1g-5gb:1
In the above examples the text mygpugroup must be replaced by the GPU group you were granted access to. This is not your user account.
Please be extremely careful with the interactive command as an incorrect request can cause problems with the scheduler.
One card per job: Only one mig may be addressed by a job at a time. This is a CUDA limitation. If you reserve two cards your job will be canceled, for example:
--gres=gpu:a100-1g-5gb:2 <== DON'T DO THIS as the second card would be wasted.
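As a rough guard against the mistake above, a job wrapper could check the quantity field of a gres string before submitting. This is a minimal sketch: gres_requests_one is a hypothetical helper, and it assumes the type:label:quantity format described earlier.

```shell
# Return success only if a gres string of the form type:label:quantity
# requests exactly one instance, e.g. gpu:a100-1g-5gb:1.
# gres_requests_one is a hypothetical helper, not a Slurm command.
gres_requests_one() {
    quantity=${1##*:}    # text after the last colon
    [ "$quantity" = "1" ]
}

for gres in gpu:a100-1g-5gb:1 gpu:a100-1g-5gb:2; do
    if gres_requests_one "$gres"; then
        echo "$gres: ok"
    else
        echo "$gres: does not request exactly one instance"
    fi
done
```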
Starvation via CPUs: In order to submit jobs to a node there must be free cores available on that node. If user jobs consume all available cores then no further jobs may be submitted even if there are free GPUs. Please keep your ntasks parameters as low as possible. Remember that the a100 partition is focused on GPU computation.
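To see how many cores are still free before choosing ntasks, sinfo can report CPUs per node in the form allocated/idle/other/total via its %C format field. The sketch below pulls out the idle count; idle_cpus is a hypothetical helper, and the two sample lines are made-up values standing in for real sinfo output.

```shell
# Extract the idle-CPU count from sinfo's %C field, which is formatted
# as allocated/idle/other/total (for example 40/16/0/56).
# idle_cpus is a hypothetical helper, not a cluster command.
idle_cpus() {
    echo "$1" | cut -d/ -f2
}

# On the cluster the input might come from something like:
#   sinfo --partition=a100 --Node --noheader --format="%n %C"
# Here two made-up sample lines stand in for that output.
printf '%s\n' "srvcntgpu009 40/16/0/56" "srvcntgpu010 56/0/0/56" |
while read -r node cpus; do
    echo "$node: $(idle_cpus "$cpus") idle cores"
done
```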
Starvation via priority: Most clusters present groups of homogeneous resources. Due to their nature, migs are often groups of heterogeneous resources. A high priority job that is waiting for a specific gres to become available may block other jobs from running, even if the gres types those other jobs need are free. If you see your job is queued with reason (Priority) then this is what is happening. The scheduler has a backfill algorithm to deal with these situations and we have also developed scripts to ensure that these queued jobs will run. However, there is a time delay of up to 10 minutes before this will be actioned. Starvation via priority happens most often when very large wall times are selected. If possible, please select a shorter wall time explicitly rather than reverting to the default by omitting it. A shorter wall time ensures your job gets higher priority.
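To check whether any of your own jobs are waiting for this reason, squeue can print each pending job's ID and reason. The following is a minimal sketch: priority_blocked is a hypothetical helper, and the sample lines are made-up values standing in for real squeue output.

```shell
# Print the IDs of jobs whose pending reason is "Priority".
# Input: lines of "<jobid> <reason>", as produced by something like
#   squeue --user="$USER" --noheader --states=PENDING --format="%i %r"
# priority_blocked is a hypothetical helper, not a Slurm command.
priority_blocked() {
    awk '$2 == "Priority" { print $1 }'
}

# Made-up sample lines standing in for squeue output:
printf '%s\n' "1001 Priority" "1002 Resources" "1003 Priority" | priority_blocked
```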
CUDA device selection: Nvidia has changed the way that GPU cards are addressed and detected. SLURM expects to address a numerical instance, but CUDA expects a device ID. The SLURM environment needs to be changed slightly for your job to run correctly. It is mandatory to have the following line in your sbatch script below the #SBATCH directives:
or to run this command as soon as your interactive job starts.
If you do not run this command your code will not detect any GPUs.
If your job fails to run, the most likely cause is that you forgot the above command.
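One way to catch a forgotten export early is to test the variable before launching your application. This is a minimal sketch, not part of the cluster tooling: check_cuda_device is a hypothetical helper, and it only verifies that CUDA_VISIBLE_DEVICES is non-empty, not that its value is correct.

```shell
# Stop early if CUDA_VISIBLE_DEVICES is unset or empty, rather than
# letting the application start and find no GPUs.
# check_cuda_device is a hypothetical helper, not a cluster command.
check_cuda_device() {
    if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
        echo "CUDA_VISIBLE_DEVICES is not set; run the export command first" >&2
        return 1
    fi
    echo "Using CUDA device: $CUDA_VISIBLE_DEVICES"
}

# In a job script this would follow the mandatory export line, e.g.:
#   check_cuda_device || exit 1
```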
As the A100 MIG configuration can only run under CUDA 11.4, that version is installed by default on the A100 nodes and does not need to be invoked. If you are compiling software then you will need to run an interactive job on one of the nodes. Please keep your core reservation to a minimum and do not forget to export the CUDA_VISIBLE_DEVICES variable.
Graphs and monitoring:
Feedback is important to determine how well your job is running. On the head node you may use the a100cores command to determine where your job is running and which GPU instance it is utilizing.
The HPC dashboard displays the GPU utilization percentage. Hover your mouse over the bar graph to see utilization. The dashboard is updated once every 5 minutes.
There are Nagios/Cacti graphs available. They update once every 5 minutes. Use the a100cores script to determine which server and which instance you should be monitoring.
Who may apply for access to this partition?
Groups from the Science and Engineering faculties contributed heavily to the cost of this resource. Members of these groups are granted access to resources proportional to their contribution, these levels being set by a committee of group leaders.
ICTS contributed to the cost of one of the cards and has ‘donated’ the mig instances to the general pool of researchers free of charge. However this pool is limited in the type and number of mig instances available, the wall time as well as the number of jobs that can be queued at any one time.
Will more servers like these be purchased?
The cost of one of these servers is approximately R1 million, understandably higher than the average research group can afford. We would strongly encourage research groups to pool their resources in order to share the cost of these servers/cards. The money would then be transferred to ICTS, who would purchase the server. The servers are housed in the UCT data center and are administered by the HPC staff, who are also responsible for any repairs or replacements that may be required. HPC staff members can also facilitate the discussions around purchase and resource sharing.