As of February 2022, jobs can be run on the Kraken cluster compute nodes only through the queuing system (SLURM). The compute nodes are not directly accessible from the network. A dedicated administrative node, "kraken", is available for connecting to the cluster, preparing jobs and submitting them to the queuing system.

The administrative node is not intended for running computations; use it primarily for data manipulation, submitting jobs to the queues, and compiling your own programs. Parallel jobs must not be run outside the queues. The "NoCompute" queue is intended for computationally inefficient parallel programs (e.g. Paraview); this queue runs only on the administrative node (whose RAM has been increased to 320 GB to allow processing of large data).
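
For illustration, a hedged sketch of how such a run might be submitted under the NoCompute queue, using the srun command described later on this page (the program name and the time limit are placeholders; see the application-specific pages for the recommended invocation):

srun -p NoCompute -n 1 --time=1:00:00 ./my_postprocessing_tool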

In addition to the standard SLURM commands, the current cluster load can be displayed with the command

freenodes

Users queue a job and let the queuing system run the computation. The job is placed in the queue according to the system's internal priorities and waits for execution. The queuing system starts the job as soon as the requested compute resources become available. Users do not need to monitor the availability of compute capacity themselves; they can log out of the cluster and wait for the computation to complete. A notification email can optionally be sent when the job finishes (for details see Command Overview).
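
For example, the notification can be requested with the mail options of sbatch; a minimal sketch of the relevant lines in a submission script (the address is a placeholder, the options are described in the parameter table below):

#SBATCH --mail-user=name@example.com
#SBATCH --mail-type=END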

A basic description of how to work with the queuing system (SLURM) follows below; details on running individual applications are given on separate pages:

SLURM queuing system

The queuing system takes care of optimal cluster utilization, providing a number of tools for job submission, job control and parallelization. All tasks are performed by logging into the administrative node “kraken” (ssh username@kraken).

Full documentation can be found at slurm.schedmd.com

Basic commands:

Running jobs

There are 2 commands to queue a job, srun and sbatch:

srun <parameters> <running_program> <running_program_parameters>

The key command for queuing a job. For parallel jobs it replaces the "mpirun" command (the MPI libraries in the modules therefore do not provide the mpirun command either…).
srun in this form requests resources according to <parameters> and runs the program on them. If you are running a non-parallel job, leave the parameter -n 1 (the default); if you choose a higher value, the non-parallel program will simply run n times!
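
A minimal sketch of a serial (non-parallel) run, assuming a placeholder executable ./my_serial_prog:

srun -p Mexpress -n 1 ./my_serial_prog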

sbatch <script_file>

Submits the job to the queue according to a prepared script, see the examples below. For parallel jobs the script usually contains a line with the "srun" command (commercial codes are usually run without srun, see the pages for the individual codes). The most common way of queuing a job is simply sbatch plus a script.
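
A minimal sketch of such a script (myprog is a placeholder for your own parallel executable; a complete worked example is given at the end of this page):

#!/bin/bash
#SBATCH --job-name=test_job
#SBATCH -n 4
srun ./myprog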

Task management

sinfo

Lists the queues and their current usage.

squeue

Lists information about running jobs in the queue system.

Meaning of abbreviations in squeue (complete list here):

- In the “ST” (status) column: R - running, PD - pending (waiting for allocation of resources), CG - completing (some processes are finished, but some are still active),…

- In the REASON column: Priority - a task with higher priority is ahead in the queue, Dependency - the task is waiting for the completion of the task it depends on and will start afterwards, Resources - the task is waiting for the required resources to be freed,…
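
The listing can be narrowed with standard squeue switches, for example to your own jobs or to a single partition:

squeue -u $USER      # only my jobs
squeue -p Mshort     # only jobs in the Mshort partition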

scancel <number>

Cancels the queued or running job with number <number>.
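
For example (the job number 123456 is a placeholder taken from the squeue listing):

scancel 123456       # cancel one job
scancel -u $USER     # cancel all of my jobs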

User information

sacct

Lists information about the user's jobs (including history).
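
A couple of illustrative calls (the date and the format fields are just examples):

sacct --format=JobID,JobName,Partition,Elapsed,State
sacct -S 2024-01-01 --format=JobID,JobName,Elapsed,State   # jobs started since the given date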

A list of the commands with their parameters is available as a PDF document.

Running the tasks

Jobs can be run on multiple nodes, but always within one of the two parts of the cluster:

  • part M - machines kraken-m1 to kraken-m10 (all users)
  • part L - machines kraken-l1 to kraken-l4 (limited access)

Jobs can be started:

  • directly from the command line with the srun command
  • using a script with the sbatch command

Guidelines for running jobs

  • A job must always run under a queue (partition). If no queue is specified, Mexpress is used. A list of the defined queues is given below.
  • The chosen queue defines the run-time limit of the job.
  • Jobs in the express and short queues cannot be given a longer run time using --time. The default time in the long queues is set to 1 week, but runs of up to 2 weeks (Mlong) or even longer (Llong, see the table below) are allowed; e.g. 9 days and 5 hours can be requested with -p Llong --time=9-05:00:00 (see the example after this list).
  • When ordering pending jobs, Slurm gives higher priority to jobs and users that have used the cluster less. It is therefore not advantageous to request a longer computation time than strictly necessary.
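
For example, a sketch of submitting a job into a specific queue with an explicit time limit (myjob.sh is a placeholder script; options given on the sbatch command line take precedence over #SBATCH lines inside the script):

sbatch -p Mshort --time=1-12:00:00 myjob.sh   # Mshort queue, limit of 1 day and 12 hours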

Predefined queues and time limits

There are 6 queues ("partitions") on the compute nodes of the Kraken cluster, divided by job run length (express, short, long) and cluster part ("Mxxx" and "Lxxx"), plus the NoCompute queue on the administrative node. If the user does not specify a queue with the --partition switch, the default value (Mexpress) is used:

cluster part | partition | nodes | time limit (default / maximum)
M (nodes kraken-m[1-10]) | Mexpress | kraken-m[1-10] | 6 hours
M (nodes kraken-m[1-10]) | Mshort | kraken-m[1-10] | 2 days / 3 days
M (nodes kraken-m[1-10]) | Mlong | kraken-m[3-6], kraken-m8 | 1 week / 2 weeks
L (nodes kraken-l[1-4]) | Lexpress | kraken-l[1-4] | 6 hours
L (nodes kraken-l[1-4]) | Lshort | kraken-l[1-4] | 2 days
L (nodes kraken-l[1-4]) | Llong | kraken-l[1-4] | 1 week / 2 months
admin node only | NoCompute | kraken | 1 hour / 8 hours

Mexpress is the default partition. Where two time limits are given, the first is the default and the second is the maximum.

Details of the settings can also be viewed using the command

scontrol show partition [partition_name]

Parameters for the "srun" and "sbatch" commands

The program run is controlled by parameters. For the srun command they are entered directly into the command line, for the sbatch command they are written into the startup script. In the script, each parameter is preceded by the identifier #SBATCH.

Options can be entered in two forms, either the full form --ntasks=5 (two hyphens and an equal sign) or the abbreviated form -n 5 (one hyphen and a space).

option | description | example
-J, --job-name=<jobname> | Job name, shown e.g. in the output of squeue | -J my_first_job
-p, --partition=<partition_names> | Request a specific partition for the resource allocation | -p Mshort
-n, --ntasks=<number> | Number of resources (~cores) to be allocated for the job | -n 50
-N, --nodes=<nodes> | Number of nodes to be used | -N 3
--mem=<size> | Job memory request | --mem=1gb
-o, --output=<filename> | File to which slurm writes the standard output | -o out.txt
-e, --error=<filename> | File to which slurm writes the standard error output | -e err.txt
--mail-user=<user> | Address to receive email notifications of state changes as defined by --mail-type | --mail-user=my@email
--mail-type=<type> | Send email on BEGIN, END, FAIL, ALL, … | --mail-type=BEGIN,END
--ntasks-per-node=<ntasks> | Request that ntasks be invoked on each node |
-t, --time=<time> | Limit on the total run time of the job allocation (days-hours:minutes:seconds) | -t 1:12
-w, --nodelist=<node_name_list> | Request a specific list of hosts | -w kraken-m2,kraken-m[5-6]
-x, --exclude=<host1>[,<host2>,…] | Request that a specific list of hosts not be included in the resources allocated to this job | --exclude=kraken-m[7-9]
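
To illustrate, a sketch of a script header combining several of the options above (all values are placeholders to be adapted to the actual job):

#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --partition=Mshort
#SBATCH --ntasks=24
#SBATCH --nodes=2
#SBATCH --mem=16gb
#SBATCH --time=1-00:00:00
#SBATCH --output=my_job.%J.out
srun ./my_program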

Variables such as the node name (%N), the job number (%J), the user name (%u), etc. can be included in the names of the output files (output, error). The standard error output specified in the script as #SBATCH -e slurm.%N.%J.%u.err will end up in the file slurm.kraken-m123.12345.username.err

The start of a newly submitted job can be made conditional, e.g. on a job that is already running: sbatch --dependency=after:123456+5 myjob.slurm. Here 123456 is the job number (according to the squeue listing) and "+5" delays the start of the new job by an additional 5 minutes.
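
A related sketch of chaining two jobs so that the second starts only after the first has finished successfully; job_A.sh and job_B.sh are placeholder scripts, --parsable makes sbatch print just the job number, and afterok (unlike after, which refers to the start of the referenced job) waits for its successful completion:

jid=$(sbatch --parsable job_A.sh)
sbatch --dependency=afterok:$jid job_B.sh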

For a complete list of parameters, see e.g. Slurm-sbatch.



Example: compiling a parallel job and queuing it

In the /home/SOFT/modules/HelloMPI/ directory you will find the source code of a simple parallel program that calculates π, together with a script for submitting it to the SLURM queuing system. After copying it to your own directory, you can experiment with the queue settings in the file slurm.sh and with the cost or accuracy of the calculation, i.e. the parameter n in the source file pi.c.

Compilation

First we copy the directory to our local directory

cp -r /home/SOFT/modules/HelloMPI/ ./

enter the directory

cd HelloMPI

The directory should contain 2 files, the source “pi.c” and the script “slurm.sh”

ls

To compile the program we will need the “mpicc” command. This is not available from the system, but it is provided by one of the mpi library modules (openmpi, mpich, intel-mpi, …), e.g.

ml openmpi

After loading the module of the selected mpi library, we have the mpicc command available to compile the program

mpicc pi.c -o pi  

We now have the executable file “pi” in the directory

ls

Queue the “pi” program using either the “srun” or “sbatch <script>” command:

srun

Here we pass all the options directly on a single srun command line.

srun -n 6 -N 2 pi

Slurm runs the pi program on 6 cores spread over two nodes. NOTE: srun replaces the mpirun command normally provided by the MPI libraries themselves; for compatibility with the SLURM queues, the MPI libraries are installed in the modules without the mpirun command, so it is not available even after the module is loaded! To run a parallel job, you must use the srun command.

Instead of writing srun and many parameters, it is often convenient to use

script for sbatch

The switches for srun can be written into a file that is passed to the sbatch command. Only the options belonging to the queuing system are prefixed with #SBATCH; the remaining lines are ordinary command-line commands.

The contents of the file “slurm.sh” from the HelloMPI directory are as follows:

#!/bin/bash
#
#SBATCH --job-name=HelloMPI_Pi
#SBATCH --output=HelloMPI_Pi_log.txt 
#SBATCH -n 6
#SBATCH -N 2
srun pi

Pass the job to the system:

sbatch slurm.sh

By editing the slurm.sh file, you can freely test the commands listed in the table above.

The file for sbatch can also contain generic command line commands. Extending the previous:

#!/bin/bash
#SBATCH -N 2
echo "PI: starts in folder"
pwd
srun pi
echo "PI:finished"
mkdir Folder2
cd Folder2
echo "PI-second run in folder"
pwd
srun ../pi
cd ../
ls  

All command-line output, both from the pi program itself and from the echo and pwd commands, can be found in the file HelloMPI_Pi_log.txt specified above with #SBATCH --output.
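
While the job runs, its progress can be followed with ordinary commands, for example:

squeue -u $USER               # is the job still pending or running?
tail -f HelloMPI_Pi_log.txt   # watch the output file as it is being written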