Slurm User Guide

Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm consists of a slurmd daemon running on each compute node and a central slurmctld daemon running on a management node. The slurmd daemons provide fault-tolerant hierarchical communications.

The entities managed by these Slurm daemons include

nodes, the compute resource in Slurm,
partitions, which group nodes into logical (possibly overlapping) sets,
jobs, or allocations of resources assigned to a user for a specified amount of time,
and job steps, which are sets of (possibly parallel) tasks within a job.

The partitions can be considered job queues, each of which has an assortment of constraints such as job size limit, job time limit, and the users permitted to use it.
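
As a rough illustration of how these constraints surface in practice, a batch script can name the partition it wants and request resources within that partition's limits. The sketch below uses standard sbatch options; the partition name debug and the requested values are placeholders, not settings taken from this cluster.

```shell
#!/bin/bash
# Hypothetical partition name; replace with one available on your cluster.
#SBATCH --partition=debug
# Requested wall time (HH:MM:SS); should not exceed the partition's time limit.
#SBATCH --time=00:10:00
# Requested node count; should not exceed the partition's job size limit.
#SBATCH --nodes=1

# Slurm exports SLURM_JOB_PARTITION inside a job; fall back when run by hand.
partition="${SLURM_JOB_PARTITION:-unset}"
echo "running in partition: ${partition}"
```

A job that asks for more time or nodes than the partition allows is typically rejected at submission or left pending, depending on how the site is configured.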

Common Slurm Commands

The Slurm system is accessed using the following commands:

interactive – Start an interactive session
sbatch – Submit and run a batch job script
sqstat – An analogue of the PBS qstat command for viewing job status
srun – Typically used inside batch job scripts to run parallel jobs (see examples further down)
scancel – Cancel one or more of your jobs
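
To show how these commands fit together, here is a minimal batch script sketch. The #SBATCH options and SLURM_* variables are standard Slurm; the job name, the file name hello.sh, and the resource values are illustrative assumptions. The fallbacks only exist so the script can also be run outside Slurm as a quick check.

```shell
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --ntasks=2
#SBATCH --time=00:05:00
#SBATCH --output=hello-%j.out

# Slurm sets these inside a job; the defaults apply when run by hand.
job_id="${SLURM_JOB_ID:-none}"
ntasks="${SLURM_NTASKS:-1}"
echo "job ${job_id}: running ${ntasks} task(s)"

# Inside an allocation, srun launches the command once per task as a job step.
if [ "${job_id}" != "none" ] && command -v srun >/dev/null 2>&1; then
    launcher="srun"
else
    launcher=""   # outside Slurm: run the command directly, once
fi

# Each task prints the name of the node it runs on.
${launcher} uname -n
```

Submitting the script with sbatch hello.sh prints a line of the form "Submitted batch job <jobid>"; that job ID can then be passed to scancel to cancel the job.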

******************************************************************************

Materials from the following sites were used:

https://slurm.schedmd.com/
https://support.ceci-hpc.be/doc/_contents/QuickStart/SubmittingJobs/SlurmTutorial.html

CICC

JINR Central Information and Computer Complex
