To get information about queue parameters (queues are called partitions in Slurm), use the following basic commands:
1. SINFO: viewing general partition parameters
The sinfo command shows the state of the nodes and the basic queue settings. It is used to view the overall state of the nodes and the number of processors on them, not the details of individual user jobs.
- sinfo: lists the partitions, their state (up/down), the time limit, and the number of nodes.
- sinfo -p <partition_name>: shows information for a specific queue only.
- sinfo -l: prints an extended table with details for each partition.
ALLOCATED (alloc): Nodes with active jobs.
COMPLETING (comp): Jobs are finishing, no new allocations.
DOWN (down): Unavailable due to failure or admin action.
DRAINED (drain) / DRAINING (drng): Unavailable for new jobs; draining nodes are finishing current jobs.
IDLE (idle): Available for new jobs.
MIXED (mix): Partially allocated, partially idle.
RESERVED (resv): Reserved for specific use.
MAINT (maint): Under maintenance.
UNKNOWN (unk): State undetermined.
To get more specific information about node states, you can use these flags:
sinfo -N -l: Displays detailed information for each node individually.
sinfo -R: Shows the reason why a node is in a down, drained, or failing state.
sinfo -s: Provides a summarized view of the partitions.
sinfo -t <state>: Filters the output to show only nodes in a specific state (e.g., sinfo -t idle).
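The per-node listing is easy to summarize with standard text tools. A minimal sketch, using sample lines that imitate the NODELIST/NODES/PARTITION/STATE columns (on a live cluster you would pipe the real command, e.g. `sinfo -N -h -o "%N %D %P %t"`, into the same awk):

```shell
# Count how many nodes are in each state, using the 4th column (STATE)
# of sinfo's per-node output. The sample data below stands in for the
# live command `sinfo -N -h -o "%N %D %P %t"`.
sinfo_sample='wn000 1 cicc* mixed
wn001 1 cicc* idle
wn002 1 cicc* mixed
wn003 1 cicc* alloc'
state_counts=$(echo "$sinfo_sample" | awk '{count[$4]++} END {for (s in count) print s, count[s]}')
echo "$state_counts"
```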
2. SCONTROL
To get the detailed technical limits (maximum number of nodes, memory, CPUs per job), use:
scontrol show partition: prints the full parameters of all queues on the cluster.
scontrol show partition <partition_name>: prints the detailed configuration of a specific queue.
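Because scontrol prints space-separated key=value pairs, individual limits are easy to extract in a script. A sketch over a sample line in the same format (on a live system, replace the variable with the output of `scontrol -o show partition <partition_name>`):

```shell
# Extract a single limit (MaxTime) from scontrol's key=value output.
# The sample line mirrors the format of `scontrol -o show partition`.
partition_info='PartitionName=cicc MaxNodes=1 MaxTime=4-00:00:00 MinNodes=0 State=UP TotalCPUs=9964'
max_time=$(echo "$partition_info" | tr ' ' '\n' | awk -F= '$1 == "MaxTime" {print $2}')
echo "$max_time"
```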
scontrol show assoc_mgr flags=users is used in Slurm to display the current contents of the slurmctld daemon’s internal cache, specifically focusing on user associations, limits, and Quality of Service (QOS) settings.
Key Aspects of the Output:
Internal State: It provides a detailed look at the active accounting information currently in use on the cluster, rather than just querying the database.
User Limits: It shows per-user and per-account associations, including limits such as maxjobs, maxsubmit, and TRES (Trackable Resources) usage.
QOS Details: It can be used to troubleshoot QOS limits, such as GrpTRESMins or MaxTRESPerUser.
Usage Tips:
Troubleshooting: If jobs are unexpectedly pending due to limits, this command helps verify what slurmctld believes the current usage is.
Privileges: Running this command usually requires elevated privileges (administrator).
Alternative: To see associations for specific users or accounts, it is often combined with other flags or sacctmgr to narrow down the output.
This command is vital for debugging why jobs are not running due to resource limits imposed by accounting or QOS.
- View internal cache: shows the live, cached accounting information rather than querying the database directly, making it faster and useful for diagnosing why jobs are pending due to limits.
- Troubleshooting limits: primarily used to debug QOS and association limits (e.g., MaxTRESPerUser, GrpTRES).
- Usage data: helps display the current consumption of resources by users or accounts.
The output includes fields such as:
- ClusterName
- Account
- User
- QOS
- Usage information (e.g., raw TRES usage)
The scontrol show assoc_mgr command is used in Slurm to view the internal status and "cached" accounting information currently held by the Slurm controller (slurmctld). While tools like sacctmgr query the database directly, scontrol show assoc_mgr shows you what the controller currently thinks is happening, which is critical for troubleshooting why jobs might be pending due to limits.
Key Uses
Verify Resource Limits: Check current resource consumption (e.g., CPUs, nodes, GPUs) against defined limits for specific users, accounts, or Quality of Service (QOS) levels.
Troubleshoot Pending Jobs: If a job is stuck with a reason like QOSMaxNodePerUserLimit, this command shows the actual counts the controller is using to enforce that limit.
Monitor Real-time Usage: See «Raw Usage» and TRES (Trackable Resource) consumption for associations before they are fully processed or decayed in the database.
Common Syntax Variants
Show All: scontrol show assoc_mgr (Displays everything including Users, Associations, and QOS).
Filter by QOS: scontrol show assoc_mgr qos=name (Narrows output to a specific QOS).
Filter by User: scontrol show assoc_mgr user=username.
One-liner Output: scontrol -o show assoc_mgr (useful for scripts and parsing).
sacctmgr: used to manage and modify association records.
sreport: used for generating historical reports from Slurm accounting data.
- Lineage: shows the structure and hierarchy of accounts (replaces the old 'Lft' data).
- User/account associations: lists uid (often the user name), account, partition, and associated QOS.
- QOS details: displays limits like MaxJobs, MaxNodes, MaxWall, and Flags.
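These fields can be pulled out per user with standard tools. A sketch over a sample record (field names and the user here are illustrative; the exact layout of `scontrol -o show assoc_mgr` output varies by Slurm version, so check your cluster's actual output):

```shell
# Find the MaxJobs value in an assoc_mgr-style key=value record.
# The record below is a made-up sample, not real cluster output.
assoc_sample='UserName=someuser(12345) DefAccount=users MaxJobs=100 MaxSubmitJobs=200'
max_jobs=$(echo "$assoc_sample" | tr ' ' '\n' | awk -F= '$1 == "MaxJobs" {print $2}')
echo "$max_jobs"
```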
=================================================================
getent passwd username
scontrol show assoc_mgr flags=users | grep -A2 -B2 username
scontrol reconfigure
sleep 5
scontrol show assoc_mgr flags=users | grep -A2 -B2 username
If, after reconfigure, the assoc_mgr cache still shows username with the wrong uid, restart slurmctld so that it rebuilds the runtime state of the user/account associations:
systemctl restart slurmctld
sleep 5
scontrol show assoc_mgr flags=users | grep -A2 -B2 username
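The uid comparison in the check above can be scripted. The helper below only parses a passwd-format line (the format `getent passwd <username>` returns); the username and uid are made-up examples:

```shell
# Extract the uid (3rd colon-separated field) from a passwd entry.
# On a live system:
#   expected_uid=$(getent passwd "$user" | cut -d: -f3)
# then grep for "UserName=$user($expected_uid)" in the assoc_mgr
# output to confirm the controller's cache holds the right uid.
passwd_line='someuser:x:12345:100:Example User:/home/someuser:/bin/bash'
expected_uid=$(echo "$passwd_line" | cut -d: -f3)
echo "$expected_uid"
```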
======================================================
squeue: views the current state of the queue (the list of running and pending jobs). Use squeue -p <partition_name> for a specific partition only.
* (Asterisk): If a state is followed by an asterisk (e.g., down* or idle*), it means the node is not responding to the Slurm controller.
+ (Plus): Indicates that the node is in a «power save» mode or is being rebooted.
To determine the number of processors (cores) allocated to user jobs in Slurm, use squeue with a custom output format: the %C field shows the CPU count. The main command is squeue -o "%.10i %.10u %.5C %.10R".
- Viewing CPUs for a specific user's jobs:
squeue -u <username> -o "%.10i %.5C %.10R"
- Viewing the history of processor usage (completed jobs):
sacct -j <job_id> --format=JobID,User,NCPUS,State
- Viewing detailed information about a running job:
scontrol show job <job_id> | grep NumCPUs
Key Slurm parameters:
- CPUS or num_cpus: the total number of cores allocated to the job.
- TRES (Trackable Resources): may show resources as cpu=N, if configured on the cluster.
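Per-user CPU totals can then be computed by summing the %C column. A sketch over sample squeue output (user names invented; on a live cluster, pipe `squeue -h -o "%u %C"` into the same awk):

```shell
# Sum allocated CPUs per user from squeue's "%u %C" output.
# Sample lines stand in for `squeue -h -o "%u %C"`.
squeue_sample='alice001 28
alice001 14
cms001 56'
cpu_totals=$(echo "$squeue_sample" | awk '{cpu[$1] += $2} END {for (u in cpu) print u, cpu[u]}')
echo "$cpu_totals"
```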
================================================================
Examples:
lxbsrv1:~/wrk # sreport cluster UserUtilizationByAccount
--------------------------------------------------------------------------------
Cluster/User/Account Utilization 2026-03-26T00:00:00 - 2026-03-27T00:59:59 (90000 secs)
Usage reported in CPU Minutes
--------------------------------------------------------------------------------
  Cluster     Login     Proper Name         Account       Used   Energy
--------- --------- --------------- --------------- ---------- --------
cicc cms001 Pcms user cms_mcore 2837592 0
cicc atlas001 Patlas user atl_mcore 2773792 0
cicc alice001 Palice user alice 1736900 0
cicc zontikov Zontikov Artem+ users 470185 0
cicc mpd001 Pmpd user nica 329914 0
cicc bmn001 Pbmn user nica 311389 0
cicc tatded Dedovich Tatja+ users 128100 0
cicc lhcb001 Plhcb user lhcb 38648 0
cicc itsrin8 Nizamov Rinat + users 6000 0
cicc asmirnov Smirnov Artjom+ users 1709 0
cicc bio001 Pbio user biomed 1500 0
cicc scms001 Scms user etf 112 0
cicc sops001 Sops user etf 42 0
cicc satl001 Satl user etf 41 0
cicc spd001 Pspd user nica 26 0
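Rows like the ones above can be aggregated by account with awk. In the sketch below the multi-word "Proper Name" column is dropped so that the fields are simply login, account, and Used CPU-minutes (with real sreport output, its parsable output options give delimiter-separated columns that are safer to script against):

```shell
# Sum the Used (CPU-minutes) column per account. The sample rows are
# simplified to three fields: login, account, used.
report_sample='cms001 cms_mcore 2837592
mpd001 nica 329914
bmn001 nica 311389
lhcb001 lhcb 38648'
account_totals=$(echo "$report_sample" | awk '{used[$2] += $3} END {for (a in used) print a, used[a]}')
echo "$account_totals"
```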
lxui10:~ > sinfo -aeN | grep cicc
t2-s734 1 cicc* resv
wn000 1 cicc* alloc
…
wn013 1 cicc* alloc
….
Tue Mar 24 16:18:18 2026
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
t2-s734 1 tier2 reserved 8 8:1:1 15500 40000 1 (null) none
t2-s734 1 grid reserved 8 8:1:1 15500 40000 1 (null) none
t2-s734 1 cicc* reserved 8 8:1:1 15500 40000 1 (null) none
wn000 1 tier2 allocated 28 28:1:1 127500 1000000 1 (null) none
wn000 1 grid allocated 28 28:1:1 127500 1000000 1 (null) none
wn000 1 cicc* allocated 28 28:1:1 127500 1000000 1 (null) none
wn001 1 tier2 allocated 28 28:1:1 127500 1000000 1 (null) none
wn001 1 grid allocated 28 28:1:1 127500 1000000 1 (null) none
wn001 1 cicc* allocated 28 28:1:1 127500 1000000 1 (null) none
wn002 1 tier2 allocated 28 28:1:1 127500 1000000 1 (null) none
wn002 1 grid allocated 28 28:1:1 127500 1000000 1 (null) none
Wed Mar 25 13:45:39 2026
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
t2-s734 1 cicc* reserved 8 8:1:1 15500 40000 1 (null) none
wn000 1 cicc* mixed 28 28:1:1 127500 1000000 1 (null) none
wn001 1 cicc* idle 28 28:1:1 127500 1000000 1 (null) none
wn002 1 cicc* mixed 28 28:1:1 127500 1000000 1 (null) none
wn003 1 cicc* mixed 28 28:1:1 127500 1000000 1 (null) none
wn004 1 cicc* mixed 28 28:1:1 127500 1000000 1 (null) none
wn005 1 cicc* mixed 28 28:1:1 127500 1000000 1 (null) none
wn006 1 cicc* mixed 28 28:1:1 127500 1000000 1 (null) none
wn007 1 cicc* mixed 28 28:1:1 127500 1000000 1 (null) none
PartitionName=cicc
AllowGroups=ALL AllowAccounts=users,etf AllowQos=ALL
AllocNodes=ALL Default=YES QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO
MaxNodes=1 MaxTime=4-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
Nodes=t2s734,wn[000-055,060-113,115-329,360-475],wn2a[000-003],wni[020039]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=9964 TotalNodes=466 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
TRES=cpu=9964,mem=55099000M,node=466,billing=69152
TRESBillingWeights=CPU=1.0,Mem=1.1G
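The billing figure in the TRES line can be reproduced from the TRESBillingWeights: billing = CPUs x 1.0 + memory(GB) x 1.1. A quick check, assuming the memory weight applies per GiB (1024 MB):

```shell
# Recompute billing=69152 from TRES=cpu=9964,mem=55099000M and
# TRESBillingWeights=CPU=1.0,Mem=1.1G (assumption: G = 1024 MB).
cpus=9964
mem_mb=55099000
billing=$(awk -v c="$cpus" -v m="$mem_mb" 'BEGIN {printf "%.0f", c * 1.0 + (m / 1024) * 1.1}')
echo "$billing"
```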
Account Descr Org
---------- -------------------- --------------------
alice alice lhc
atl_mcore atl_mcore lhc
bes bes grid
biomed biomed grid
cms_mcore cms_mcore lhc
compass compass grid
etf etf etf
grid grid grid
ilc ilc grid
juno juno grid
lhc lhc lhc
lhcb lhcb lhc
nica nica nica
nova nova grid
root default root account root
users users users
Usage reported in CPU Minutes
--------------------------------------------------------------------------------
  Cluster     Login     Proper Name         Account       Used   Energy
--------- --------- --------------- --------------- ---------- --------
cicc atlas001 Patlas user atl_mcore 2650179 0
cicc cms001 Pcms user cms_mcore 2296634 0
cicc alice001 Palice user alice 1690127 0
cicc mpd001 Pmpd user nica 594625 0
cicc zontikov Zontikov Artem+ users 171858 0
cicc genis Genis Musulman+ users 111000 0
cicc lbavinh Lyong Ba Vin users 63573 0
cicc aivanov Ivanov Arteb V+ users 47451 0
cicc lhcb001 Plhcb user lhcb 39645 0
cicc asmirnov Smirnov Artjom+ users 25416 0
cicc itsrin8 Nizamov Rinat + users 6000 0
cicc timofeev Timofeev Artem+ users 1511 0
cicc bio001 Pbio user biomed 1500 0
cicc kshtejer Katherin Shtej+ users 356 0
cicc bmn001 Pbmn user nica 232 0
cicc scms001 Scms user etf 163 0
cicc spd001 Pspd user nica 63 0
cicc satl001 Satl user etf 52 0
cicc sops001 Sops user etf 49 0
cicc oris Suares Ehng Or+ users 4 0