sinfo — scontrol — sacctmgr

To get information about the parameters of the queues (which Slurm calls partitions), use the following main commands:

1. SINFO - view general partition parameters
The sinfo command shows the state of the nodes and the basic queue settings. It is used to view the overall state of the nodes and the number of CPUs on them, not the details of specific user jobs.

  • sinfo - lists the partitions, their state (up/down), the time limit, and the number of nodes.
  • sinfo -p <partition_name> - shows information only for a specific queue.
  • sinfo -l - prints an extended table with details for each partition.

ALLOCATED (alloc): Nodes with active jobs.
COMPLETING (comp): Jobs are finishing, no new allocations.
DOWN (down): Unavailable due to failure or admin action.
DRAINED (drain) / DRAINING (drng): Unavailable for new jobs; draining nodes are finishing current jobs.
IDLE (idle): Available for new jobs.
MIXED (mix): Partially allocated, partially idle.
RESERVED (resv): Reserved for specific use.
MAINT (maint): Under maintenance.
UNKNOWN (unk): State undetermined

To get more specific information about node states, you can use these flags:

sinfo -N -l: Displays detailed information for each node individually.
sinfo -R: Shows the reason why a node is in a down, drained, or failing state.
sinfo -s: Provides a summarized view of the partitions.
sinfo -t <state>: Filters the output to show only nodes in a specific state (e.g., sinfo -t idle).
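The state codes above can be tallied across the whole cluster with a short pipeline. The sketch below runs awk over captured `sinfo -N -l` output; the here-doc sample stands in for a live cluster (node names and counts are illustrative), and on a real system you would pipe `sinfo -N -l -h` instead (-h suppresses the header).

```shell
# Count nodes per Slurm state from `sinfo -N -l`-style output.
# A captured sample is used instead of a live cluster.
sinfo_sample='wn000 1 cicc* allocated 28
wn001 1 cicc* idle 28
wn002 1 cicc* mixed 28
t2-s734 1 cicc* reserved 8'

# Column 4 is the node state; count occurrences of each state.
printf '%s\n' "$sinfo_sample" |
    awk '{count[$4]++} END {for (s in count) print s, count[s]}' | sort
```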

2. SCONTROL

To get the detailed technical limits (maximum number of nodes, memory, CPUs per job), use:

scontrol show partition - prints the full parameters of all queues on the cluster.

scontrol show partition <partition_name> - prints the detailed configuration of a specific queue.
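Individual limits can be pulled out of the `scontrol show partition` output with grep. The sketch below parses a captured fragment of a partition record (modeled on the `cicc` partition shown later in this document); on a live cluster you would pipe the real command instead.

```shell
# Extract specific limits from `scontrol show partition` output.
# A captured fragment stands in for the live command here.
partition_info='PartitionName=cicc
MaxNodes=1 MaxTime=4-00:00:00 MinNodes=0 LLN=NO
State=UP TotalCPUs=9964 TotalNodes=466'

printf '%s\n' "$partition_info" | grep -oE 'MaxTime=[^ ]+'    # MaxTime=4-00:00:00
printf '%s\n' "$partition_info" | grep -oE 'TotalCPUs=[0-9]+' # TotalCPUs=9964
```

On a real system: `scontrol show partition cicc | grep -oE 'MaxTime=[^ ]+'`.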

scontrol show assoc_mgr flags=users displays the current contents of the slurmctld daemon's internal cache, specifically user associations, limits, and Quality of Service (QOS) settings.

Key Aspects of the Output:

Internal State: It provides a detailed look at the active accounting information currently in use on the cluster, rather than just querying the database.
User Limits: It shows per-user and per-account associations, including limits such as maxjobs, maxsubmit, and TRES (Trackable Resources) usage.
QOS Details: It can be used to troubleshoot QOS limits, such as GrpTRESMins or MaxTRESPerUser.

Usage Tips:

Troubleshooting:  If jobs are unexpectedly pending due to limits, this command helps verify what slurmctld believes the current usage is.
Privileges: Running this command usually requires elevated privileges (administrator).
Alternative: To see associations for specific users or accounts, it is often combined with other flags or sacctmgr to narrow down the output.

This command is vital for debugging why jobs are not running due to resource limits imposed by Accounting or QOS

 Purpose and Functionality
  • View Internal Cache: This command shows the live, cached accounting information rather than querying the database directly, making it faster and useful for diagnosing why jobs are pending due to limits.
  • Troubleshooting Limits: It is primarily used to debug QOS and association limits (e.g., MaxTRESPerUser, GrpTRES).
  • Usage Data: It helps display the current consumption of resources by users or accounts.
 Example Output
When run, it provides a detailed list of associations, including:
    • ClusterName
    • Account
    • User
    • QOS
    • Usage Information (e.g., raw TRES usage)

The command

scontrol show assoc_mgr

is used in Slurm to view the internal status and "cached" accounting information currently held by the Slurm controller (slurmctld). While tools like sacctmgr query the database directly, scontrol show assoc_mgr shows you what the controller currently thinks is happening, which is critical for troubleshooting why jobs might be pending due to limits.

Key Uses

Verify Resource Limits: Check current resource consumption (e.g., CPUs, nodes, GPUs) against defined limits for specific users, accounts, or Quality of Service (QOS) levels.
Troubleshoot Pending Jobs: If a job is stuck with a reason like QOSMaxNodePerUserLimit, this command shows the actual counts the controller is using to enforce that limit.
Monitor Real-time Usage: See "Raw Usage" and TRES (Trackable Resource) consumption for associations before they are fully processed or decayed in the database.

Common Syntax Variants

Show All:   scontrol show assoc_mgr (Displays everything including Users, Associations, and QOS).
Filter by QOS:   scontrol show assoc_mgr qos=name (Narrows output to a specific QOS).
Filter by User:   scontrol show assoc_mgr user=username.
One-liner Output: scontrol -o show assoc_mgr (Useful for scripts and parsing)
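The one-record-per-line form produced by `scontrol -o show assoc_mgr` is convenient for grep. The sketch below filters a simplified sample record for one user; the field layout and names here are an illustration, not verbatim controller output.

```shell
# Filter one-line-per-record assoc_mgr output for a single user.
# The sample records below are a simplified illustration of the layout.
assoc_sample='ClusterName=cicc Account=users UserName=zontikov(5001) Partition= MaxJobs=N(50)
ClusterName=cicc Account=users UserName=tatded(5002) Partition= MaxJobs=N(50)'

printf '%s\n' "$assoc_sample" | grep 'UserName=zontikov'
```

On a live system the equivalent would be `scontrol -o show assoc_mgr flags=users | grep 'UserName=zontikov'`.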

  • sacctmgr: Used to manage and modify association records.
  • sreport: Used for generating historical reports from Slurm accounting data.
Key Information Displayed
  • Lineage: Shows the structure and hierarchy of accounts (replaces old ‘Lft’ data).
  • User/Account Associations: Lists uid (often user name), account, partition, and associated QOS.
  • QOS Details: Displays limits like MaxJobs, MaxNodes, MaxWall, and Flags

=================================================================

getent passwd username
scontrol show assoc_mgr flags=users | grep -A2 -B2 username
scontrol reconfigure
sleep 5
scontrol show assoc_mgr flags=users | grep -A2 -B2  username

If after reconfigure the assoc_mgr output still shows username (with the wrong uid), restart slurmctld so that it rebuilds the runtime state of the user/account associations:

systemctl restart slurmctld
sleep 5
scontrol show assoc_mgr flags=users | grep -A2 -B2  username
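The check above compares the uid reported by getent with the uid that slurmctld has cached. That comparison can be scripted; the sketch below extracts and compares the two uids from sample strings, since a live slurmctld cannot be assumed here ('username', the uids, and the assoc_mgr field layout are illustrative).

```shell
# Compare the uid from `getent passwd` with the uid cached by slurmctld.
# Sample strings stand in for live output of getent and assoc_mgr.
getent_line='username:x:5001:100:Test User:/home/username:/bin/bash'
assoc_line='UserName=username(5001) DefAccount=users'

getent_uid=$(printf '%s' "$getent_line" | cut -d: -f3)
cached_uid=$(printf '%s' "$assoc_line" | sed -n 's/.*(\([0-9]*\)).*/\1/p')

if [ "$getent_uid" = "$cached_uid" ]; then
    echo "uid match: $getent_uid"
else
    echo "uid mismatch: getent=$getent_uid cached=$cached_uid"
fi
```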

======================================================

  • squeue - view the current state of the queue (the list of running and pending jobs).
  • squeue -p <partition_name> - show only jobs in a specific partition.
Special Symbols

* (Asterisk): If a state is followed by an asterisk (e.g., down* or idle*), it means the node is not responding to the Slurm controller.
+ (Plus): Indicates that the node is in a "power save" mode or is being rebooted.
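Non-responding nodes can be spotted by filtering for the trailing asterisk. The sketch below applies the filter to captured sample output (node names are illustrative); on a live cluster you would pipe `sinfo -h -N -o "%N %T"` instead.

```shell
# Find nodes whose state carries a trailing '*' (not responding to slurmctld).
# Captured sample output stands in for a live sinfo call.
state_sample='wn000 allocated
wn001 idle*
wn002 down*
wn003 mixed'

printf '%s\n' "$state_sample" | awk '$2 ~ /\*$/ {print $1, $2}'
```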

To determine the number of CPUs (cores) allocated to user jobs in Slurm, use squeue with a format flag that displays CPUs (%C); %D can be added for the node count. The main approach:

squeue -o "%.10i %.10u %.5C %.10R", where %C shows the number of CPUs.

squeue -u <user_name> -o "%.10i %.5C %.10R"
  • View the CPU usage history (completed jobs):
    sacct -j <job_id> --format=JobID,User,NCPUS,State
  • View detailed information about a running job:
    scontrol show job <job_id> | grep NumCPUs

Key Slurm parameters:
  • CPUS or num_cpus - the total number of cores allocated to the job.
  • TRES (Trackable Resources) - can report resources as cpu=N, if configured on the cluster.
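Per-user CPU totals can be computed from the squeue format output described above. The sketch below sums the %C column over captured sample output (job IDs, user names, and counts are illustrative); on a live cluster you would pipe `squeue -h -o "%i %u %C"` instead.

```shell
# Sum allocated CPUs per user from `squeue -o "%i %u %C"`-style output.
# A captured sample stands in for the live command.
squeue_sample='123 alice001 28
124 alice001 28
125 cms001 8'

# Column 2 is the user, column 3 the CPU count; accumulate per user.
printf '%s\n' "$squeue_sample" |
    awk '{cpus[$2] += $3} END {for (u in cpus) print u, cpus[u]}' | sort
```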

================================================================

Examples:

lxbsrv1:~/wrk # sreport cluster UserUtilizationByAccount
--------------------------------------------------------------------------------
Cluster/User/Account Utilization 2026-03-26T00:00:00 - 2026-03-27T00:59:59 (90000 secs)
Usage reported in CPU Minutes
--------------------------------------------------------------------------------
  Cluster     Login     Proper Name         Account       Used   Energy
--------- --------- --------------- --------------- ---------- --------
cicc cms001 Pcms user cms_mcore 2837592 0
cicc atlas001 Patlas user atl_mcore 2773792 0
cicc alice001 Palice user alice 1736900 0
cicc zontikov Zontikov Artem+ users 470185 0
cicc mpd001 Pmpd user nica 329914 0
cicc bmn001 Pbmn user nica 311389 0
cicc tatded Dedovich Tatja+ users 128100 0
cicc lhcb001 Plhcb user lhcb 38648 0
cicc itsrin8 Nizamov Rinat + users 6000 0
cicc asmirnov Smirnov Artjom+ users 1709 0
cicc bio001 Pbio user biomed 1500 0
cicc scms001 Scms user etf 112 0
cicc sops001 Sops user etf 42 0
cicc satl001 Satl user etf 41 0
cicc spd001 Pspd user nica 26 0

lxui10:~ > sinfo -aeN | grep cicc
t2-s734 1 cicc* resv
wn000 1 cicc* alloc

wn013 1 cicc* alloc

….

 lxui10:~ > sinfo -N -l | more
Tue Mar 24 16:18:18 2026
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
t2-s734 1 tier2 reserved 8 8:1:1 15500 40000 1 (null) none
t2-s734 1 grid reserved 8 8:1:1 15500 40000 1 (null) none
t2-s734 1 cicc* reserved 8 8:1:1 15500 40000 1 (null) none
wn000 1 tier2 allocated 28 28:1:1 127500 1000000 1 (null) none
wn000 1 grid allocated 28 28:1:1 127500 1000000 1 (null) none
wn000 1 cicc* allocated 28 28:1:1 127500 1000000 1 (null) none
wn001 1 tier2 allocated 28 28:1:1 127500 1000000 1 (null) none
wn001 1 grid allocated 28 28:1:1 127500 1000000 1 (null) none
wn001 1 cicc* allocated 28 28:1:1 127500 1000000 1 (null) none
wn002 1 tier2 allocated 28 28:1:1 127500 1000000 1 (null) none
wn002 1 grid allocated 28 28:1:1 127500 1000000 1 (null) none
….
lxbsrv1:~/wrk/bin # sinfo -p cicc -N -l | more
Wed Mar 25 13:45:39 2026
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
t2-s734 1 cicc* reserved 8 8:1:1 15500 40000 1 (null) none
wn000 1 cicc* mixed 28 28:1:1 127500 1000000 1 (null) none
wn001 1 cicc* idle 28 28:1:1 127500 1000000 1 (null) none
wn002 1 cicc* mixed 28 28:1:1 127500 1000000 1 (null) none
wn003 1 cicc* mixed 28 28:1:1 127500 1000000 1 (null) none
wn004 1 cicc* mixed 28 28:1:1 127500 1000000 1 (null) none
wn005 1 cicc* mixed 28 28:1:1 127500 1000000 1 (null) none
wn006 1 cicc* mixed 28 28:1:1 127500 1000000 1 (null) none
wn007 1 cicc* mixed 28 28:1:1 127500 1000000 1 (null) none
….
CPUs:
 sinfo -N -l | grep cicc | grep allocated | awk '{s += $5} END {print s}'
2896
 sinfo -p cicc -N -l | grep allocated | awk '{s += $5} END {print s}'
444
 sinfo -p cicc -N -l | grep mix | awk '{s += $5} END {print s}'
6164
 sinfo -p cicc -N -l | awk '{s += $5} END {print s}'
11990
WNs:
 sinfo -p cicc -N -l | awk '{print $1}' | wc -l
466
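Note that the one-liners above also feed sinfo's header lines (the date and the NODELIST header) through awk and wc. A header-safe variant either runs `sinfo -h` to suppress the header, or filters for rows where the CPU column is actually numeric. The sketch below applies that filter to captured sample output:

```shell
# Header-safe CPU sum and node count from `sinfo -N -l` output.
# The sample includes the date and header lines that the raw one-liners
# would otherwise sweep into the totals.
sample='Tue Mar 24 16:18:18 2026
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY
wn000 1 cicc* allocated 28 28:1:1 127500
wn001 1 cicc* idle 28 28:1:1 127500'

# Keep only full data rows whose 5th column (CPUS) is a number.
printf '%s\n' "$sample" |
    awk 'NF >= 7 && $5 ~ /^[0-9]+$/ {s += $5; n++} END {print s, n}'
# prints "56 2": 56 CPUs across 2 nodes
```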
 lxbsrv1:~/wrk/bin # scontrol show partition cicc
PartitionName=cicc
AllowGroups=ALL AllowAccounts=users,etf AllowQos=ALL
AllocNodes=ALL Default=YES QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO
MaxNodes=1 MaxTime=4-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
Nodes=t2s734,wn[000-055,060-113,115-329,360-475],wn2a[000-003],wni[020039]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=9964 TotalNodes=466 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
TRES=cpu=9964,mem=55099000M,node=466,billing=69152
TRESBillingWeights=CPU=1.0,Mem=1.1G
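The billing figure in the TRES line can be reproduced from TRESBillingWeights (CPU=1.0, Mem=1.1G): billing = CPUs * 1.0 + memory in GiB * 1.1, assuming the MiB value in mem= is converted to GiB by dividing by 1024. A quick check with the partition's own numbers:

```shell
# Reproduce billing=69152 from TotalCPUs=9964 and mem=55099000M
# with weights CPU=1.0 and Mem=1.1 per GiB (truncated to an integer).
awk 'BEGIN { printf "%d\n", 9964 * 1.0 + 55099000 / 1024 * 1.1 }'
# prints 69152, matching billing=69152 in the TRES line above
```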
----------------------
lxui10:~/tmp > sacctmgr show account
   Account                Descr                  Org
---------- -------------------- --------------------
     alice                alice                  lhc
 atl_mcore            atl_mcore                  lhc
       bes                  bes                 grid
    biomed               biomed                 grid
 cms_mcore            cms_mcore                  lhc
   compass              compass                 grid
       etf                  etf                  etf
      grid                 grid                 grid
       ilc                  ilc                 grid
      juno                 juno                 grid
       lhc                  lhc                  lhc
      lhcb                 lhcb                  lhc
      nica                 nica                 nica
      nova                 nova                 grid
      root default root account                 root
     users                users                users
============================
sreport cluster UserUtilizationByAccount
Cluster/User/Account Utilization 2026-03-24T00:00:00 - 2026-03-25T00:59:59 (90000 secs)
Usage reported in CPU Minutes
--------------------------------------------------------------------------------
  Cluster     Login     Proper Name         Account       Used   Energy
--------- --------- --------------- --------------- ---------- --------
cicc atlas001 Patlas user atl_mcore 2650179 0
cicc cms001 Pcms user cms_mcore 2296634 0
cicc alice001 Palice user alice 1690127 0
cicc mpd001 Pmpd user nica 594625 0
cicc zontikov Zontikov Artem+ users 171858 0
cicc genis Genis Musulman+ users 111000 0
cicc lbavinh Lyong Ba Vin users 63573 0
cicc aivanov Ivanov Arteb V+ users 47451 0
cicc lhcb001 Plhcb user lhcb 39645 0
cicc asmirnov Smirnov Artjom+ users 25416 0
cicc itsrin8 Nizamov Rinat + users 6000 0
cicc timofeev Timofeev Artem+ users 1511 0
cicc bio001 Pbio user biomed 1500 0
cicc kshtejer Katherin Shtej+ users 356 0
cicc bmn001 Pbmn user nica 232 0
cicc scms001 Scms user etf 163 0
cicc spd001 Pspd user nica 63 0
cicc satl001 Satl user etf 52 0
cicc sops001 Sops user etf 49 0
cicc oris Suares Ehng Or+ users 4 0