Slurm User Guide

 

Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm consists of a slurmd daemon running on each compute node and a central slurmctld daemon running on a management node. The slurmd daemons provide fault-tolerant hierarchical communications.

The entities managed by these Slurm daemons include:

  • nodes, the compute resource in Slurm,
  • partitions, which group nodes into logical (possibly overlapping) sets,
  • jobs, or allocations of resources assigned to a user for a specified amount of time,
  • and job steps, which are sets of (possibly parallel) tasks within a job.

Partitions can be thought of as job queues, each with its own set of constraints, such as a job size limit, a job time limit, and the users permitted to use it.
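As a rough illustration of how partition constraints can be inspected, the sketch below uses the standard Slurm client tools sinfo and scontrol. It assumes these tools are installed on the login node; the column selection is only one possible choice, and the slurm_checked variable is a helper introduced for this example.

```shell
#!/bin/bash
# Summarize partitions and their limits, if the Slurm client tools are present.
if command -v sinfo >/dev/null 2>&1; then
    # One line per partition: name, availability, time limit, node count, state
    sinfo --format="%P %a %l %D %t"
    # Full configuration of every partition (time limit, node limits, allowed groups, ...)
    scontrol show partition
    slurm_checked=yes
else
    # Assumption: on a machine without Slurm we only report that fact
    echo "Slurm client tools not found on this machine"
    slurm_checked=no
fi
```

The limits reported here are the ones a job has to respect when it is submitted to that partition.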

Common Slurm Commands

The Slurm system is accessed using the following commands:

  • interactive – Start an interactive session
  • sbatch – Submit and run a batch job script
  • sqstat – Show job and queue status, analogous to the PBS qstat command
  • srun – Typically used inside batch job scripts for running parallel jobs (see examples further down)
  • scancel – Cancel one or more of your jobs
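The commands above can be combined as in the following minimal sketch. The file name hello_job.sh, the job name, and the resource values are hypothetical and chosen only for illustration; no partition is specified, so the site default would apply. The script writes a small batch file and submits it only if sbatch is available.

```shell
#!/bin/bash
# Write a minimal batch script (hypothetical name: hello_job.sh).
cat > hello_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello        # name shown in the queue
#SBATCH --ntasks=2              # number of tasks (processes) to run
#SBATCH --time=00:05:00         # wall-clock limit (HH:MM:SS)
#SBATCH --output=hello_%j.out   # %j expands to the job ID

# srun launches a job step: one copy of the command per task
srun hostname
EOF

# Submit only where Slurm is installed.
if command -v sbatch >/dev/null 2>&1; then
    # --parsable prints the job ID (followed by ";cluster" on multi-cluster setups)
    job_id=$(sbatch --parsable hello_job.sh)
    job_id=${job_id%%;*}          # keep only the numeric job ID
    echo "Submitted job ${job_id}"
    # scancel "${job_id}"         # uncomment to cancel the job again
else
    echo "sbatch not found; wrote hello_job.sh only"
fi
```

The #SBATCH lines are read by sbatch at submission time, while the srun line runs later on the allocated nodes as a job step.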

 

******************************************************************************

 

 

Materials from the following sites were used:

https://slurm.schedmd.com/
https://support.ceci-hpc.be/doc/_contents/QuickStart/SubmittingJobs/SlurmTutorial.html