As of February 2022, running jobs on the Kraken cluster compute nodes is only possible through the queuing system (SLURM). The compute nodes are not directly accessible from the network. A dedicated administrative node, “kraken”, is available for connecting to the cluster and for preparing and submitting jobs to the queuing system.
The administrative node is not intended for running computations; use it primarily for data manipulation, submitting jobs to the queues, and compiling custom programs. Parallel jobs cannot be run outside the queues. The “NoCompute” queue is intended for computationally inefficient parallel programs (e.g. Paraview); this queue runs only on the administrative node (whose RAM has been increased to 320 GB to allow processing of large data).
In addition to the SLURM commands, the current cluster load can be displayed with the command
freenodes
Users queue a job and let the queuing system run the computation. The job is queued in the order of the system's internal priorities and waits for execution. The queuing system runs the job as soon as the requested compute resources are available. Users do not need to monitor the availability of compute capacity themselves; they can log out of the cluster and wait for the computation to complete. A notification e-mail can optionally be sent upon completion of the job; for details see the Command Overview.
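As a short sketch of this workflow (my_job.sh is a placeholder name for a prepared job script):
sbatch my_job.sh
squeue -u $USER
The first command submits the job and prints its job ID; the second shows its current state (R = running, PD = pending). After submitting, you can log out and the job will still run to completion.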
See below for a basic description of how to work with the queuing system (SLURM); specifics of starting individual applications are described on separate pages.
SLURM queuing system
The queuing system takes care of optimal cluster utilization, providing a number of tools for job submission, job control and parallelization. All tasks are performed by logging into the administrative node “kraken” (ssh username@kraken).
Full documentation can be found at slurm.schedmd.com
Basic commands:
Running jobs
There are two commands to queue a job, srun and sbatch:
srun <parameters> <running_program> <running_program_parameters>
Key command for queuing a job. For parallel jobs it replaces the “mpirun” command (the MPI libraries in the modules therefore do not provide the mpirun command either…). srun in this form requests resources according to <parameters> and runs the program on them. If you are running a non-parallel job, leave the parameter -n 1 (the default); if you choose a higher value, the non-parallel program will run n times!
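For illustration, with placeholder program names (serial_prog and mpi_prog are not part of the cluster installation):
srun -n 1 ./serial_prog
srun -n 24 ./mpi_prog
The first line runs a serial program as a single task; the second requests 24 cores and starts 24 MPI processes.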
sbatch <script_file>
Submits the job to the queue according to a prepared script, see the examples below. The script for a parallel job usually includes a line with the “srun” command (commercial codes are usually run without srun, see the pages of the individual codes). The most common way to queue a job is sbatch + script.
Task management
sinfo
Lists the queues and their current usage.
squeue
Lists information about running jobs in the queue system.
Meaning of abbreviations in squeue (complete list here):
- In the “ST” (status) column: R - running, PD - pending (waiting for allocation of resources), CG - completing (some processes are finished, but some are still active),…
- In the “REASON” column: Priority - a task with higher priority is ahead in the queue, Dependency - the task is waiting for the completion of the task it depends on and will be started afterwards, Resources - the task is waiting for the required resources to be released,…
scancel <number>
Cancels the queued or running job with the given <number> (job ID).
User information
sacct
Lists information about the user's jobs (including history).
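As one possible sketch of its use (the date and the column selection are arbitrary examples):
sacct --starttime 2024-01-01 --format=JobID,JobName,Partition,State,Elapsed
This lists the user's jobs started since the given date with the selected columns.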
Running the tasks
Jobs can run on multiple nodes, but always within one part of the cluster:
- part M - machines kraken-m1 to kraken-m10 (all users)
- part L - machines kraken-l1 to kraken-l4 (limited access)
Jobs can be run:
- directly from the command line with the srun command
- using a script submitted with the sbatch command
Guidelines for running jobs
- A job must always run under a queue (partition). If no queue is specified, Mexpress is used. A list of defined queues is given below.
- Specifying a queue also defines a run time limit.
- Jobs in the express and short queues cannot be given a longer run time using --time. The default time in the long queues is set to 1 week, but they allow running for up to 2 weeks, e.g. 9 days and 5 hours by specifying -p Llong --time=9-05:00:00 (see the example after this list).
- When ordering pending jobs, Slurm gives priority to jobs and users that use the cluster less. It is therefore not advantageous to declare a longer computation time than strictly necessary.
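A sketch of such a submission (slurm.sh stands for a prepared job script such as the one in the example further below):
sbatch -p Mshort --time=1-00:00:00 slurm.sh
Here the job is placed in the Mshort queue with a requested limit of one day, less than the queue's two-day default, in line with the recommendation not to request more time than necessary.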
Predefined queues and time limits
There are 6 queues (“partitions”) on the Kraken cluster, divided by job run length (express, short, long) and cluster partition (“Mxxx” and “Lxxx”). If the user does not specify a queue with the --partition switch, the default value (Mexpress) is used:
cluster part | partition | nodes | time limit (default) | time limit (max) |
---|---|---|---|---|
M (nodes kraken-m[1-10]) | Mexpress | kraken-m[1-10] | 6 hours | |
| Mshort | kraken-m[1-10] | 2 days | 3 days |
| Mlong | kraken-m[3-6], kraken-m8 | 1 week | 2 weeks |
L (nodes kraken-l[1-4]) | Lexpress | kraken-l[1-4] | 6 hours | |
| Lshort | kraken-l[1-4] | 2 days | |
| Llong | kraken-l[1-4] | 1 week | 2 months |
admin node only | NoCompute | kraken | 1 hour | 8 hours |
Details of the settings can also be viewed using the command
scontrol show partition [partition_name]
Parameters for the “srun” and “sbatch” commands
The program run is controlled by parameters. For the srun command they are entered directly on the command line; for the sbatch command they are written into the startup script, where each parameter is preceded by the identifier #SBATCH.
Options can be entered in two forms: either the full form --ntasks=5 (two hyphens and an equal sign) or the abbreviated form -n 5 (one hyphen and a space).
option | description | example |
---|---|---|
-J, --job-name=<jobname> | Job name, shown e.g. in output of squeue | -J my_first_job |
-p, --partition=<partition_names> | Request a specific partition for the resource allocation | -p Mshort |
-n, --ntasks=<number> | Number of resources (~cores) to be allocated for the task | -n 50 |
-N, --nodes=<nodes> | Number of nodes to be used | -N 3 |
--mem | Job memory request | --mem=1gb |
-o, --output=<filename> | Name of the file to which slurm writes standard output | -o out.txt |
-e, --error=<filename> | Name of the file to which slurm writes standard error | -e err.txt |
--mail-user=<user> | User to receive e-mail notification of state changes as defined by --mail-type | --mail-user=my@email |
--mail-type=<type> | Send e-mail on the selected events: BEGIN, END, FAIL, ALL,… | --mail-type=BEGIN,END |
--ntasks-per-node=<ntasks> | Request that ntasks be invoked on each node | |
-t, --time=<time> | Set a limit on the total run time of the job allocation (days-hours:minutes:seconds) | -t 1:12 |
-w, --nodelist=<node_name_list> | Request a specific list of hosts | -w kraken-m2,kraken-m[5-6] |
-x, --exclude=<host1>[,<host2>…] | Request that a specific list of hosts not be included in the resources allocated to this job | --exclude=kraken-m[7-9] |
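A combined sketch of a startup script using several of the options from the table (my_program and the e-mail address are placeholders):
#!/bin/bash
#SBATCH --job-name=example_job
#SBATCH --partition=Mshort
#SBATCH --ntasks=24
#SBATCH --nodes=2
#SBATCH --mem=4gb
#SBATCH --time=1-12:00:00
#SBATCH --output=out.%J.txt
#SBATCH --error=err.%J.txt
#SBATCH --mail-user=my@email
#SBATCH --mail-type=END,FAIL
srun ./my_program
Submitted with sbatch, this requests 24 tasks on 2 nodes in the Mshort queue for at most 1.5 days and sends an e-mail when the job ends or fails.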
Variables such as the node name (%N), the job number (%J), the user name (%u), etc. can be included in the names of the output files (output, error). Standard error output specified in the script as #SBATCH -e slurm.%N.%J.%u.err
will be written to the file slurm.kraken-m123.12345.username.err
The start of a newly submitted job can be made conditional, e.g. on the completion of a job that is already running: sbatch --dependency=after:123456+5 myjob.slurm. Here 123456 is the job number (according to the squeue listing) and “+5” indicates a delay of 5 minutes between the end of the previous job and the start of the new one.
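Similarly, a follow-up job can be chained so that it starts only after a previous job has finished successfully; the job number and the script name below are placeholders:
sbatch --dependency=afterok:123456 postprocess.slurm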
For a complete list of parameters, see e.g. Slurm-sbatch.
Example: compiling a parallel job and queuing it
In the /home/SOFT/modules/HelloMPI/ directory you will find the source code of a simple parallel program that calculates π and a script for submitting it to the SLURM queue system. After copying it to your own directory, you can experiment with the queue settings in the file slurm.sh and with the complexity of the calculation (and thus the accuracy of the result) via the parameter n in the source file pi.c.
Compilation
First we copy the directory to our local directory
cp -r /home/SOFT/modules/HelloMPI/ ./
enter the directory
cd HelloMPI
The directory should contain 2 files, the source “pi.c” and the script “slurm.sh”
ls
To compile the program we will need the “mpicc” command. This is not available from the system, but it is provided by one of the mpi library modules (openmpi, mpich, intel-mpi, …), e.g.
ml openmpi
After loading the module of the selected mpi library, we have the mpicc command available to compile the program
mpicc pi.c -o pi
We now have the executable file “pi” in the directory
ls
Queue the “pi” program using either the “srun” or “sbatch <script>” command:
srun
We specify all the options directly on a single srun command line.
srun -n 6 -N 2 pi
Slurm runs the pi program on 6 cores spread over two nodes. NOTE: srun replaces the mpirun command normally provided by the MPI libraries themselves. For compatibility with the SLURM queues, the MPI libraries in the modules are installed without the mpirun command, so it is not available even after a module is loaded! To run a parallel task, you must use the srun command.
Instead of typing srun with many parameters, it is often more convenient to use a script for sbatch.
Script for sbatch
The switches for srun can be written into a file that is passed to the sbatch command. Only the lines beginning with #SBATCH belong to the queue system; all other lines are ordinary command line commands.
The contents of the file “slurm.sh” from the HelloMPI directory are as follows:
#!/bin/bash
#
#SBATCH --job-name=HelloMPI_Pi
#SBATCH --output=HelloMPI_Pi_log.txt
#SBATCH -n 6
#SBATCH -N 2
srun pi
Pass the job to the system:
sbatch slurm.sh
By editing the slurm.sh file, you can freely test the commands listed in the table above.
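For instance, a slightly modified slurm.sh might add a queue and an e-mail notification (the added lines are only illustrations of the options from the table above; the address is a placeholder):
#!/bin/bash
#
#SBATCH --job-name=HelloMPI_Pi
#SBATCH --output=HelloMPI_Pi_log.txt
#SBATCH --partition=Mshort
#SBATCH --mail-user=my@email
#SBATCH --mail-type=END
#SBATCH -n 6
#SBATCH -N 2
srun pi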
The file for sbatch can also contain generic command line commands. Extending the previous:
#!/bin/bash
#SBATCH -N 2
echo "PI: starts in folder"
pwd
srun pi
echo "PI:finished"
mkdir Folder2
cd Folder2
echo "PI-second run in folder"
pwd
srun ../pi
cd ../
ls
All command line output, both from the pi program itself and from the echo and pwd commands, can be found in the file HelloMPI_Pi_log.txt, specified above using #SBATCH --output