Steps:
1. ssh lxui.jinr.ru
This connects the user to one of the lxui[01-04] machines. These machines are reachable via ssh from all networks, not just the JINR network.
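For example (replace <user> with your JINR account name):
{{{
ssh <user>@lxui.jinr.ru
}}}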
2. Run the command
klist -e
Its output should contain:
{{{
lxui04:~ > klist -e | grep -o aes256-cts
aes256-cts
}}}
If you get a different output, you need to change your password (so that a key with the aes256-cts encryption type is generated).
In addition, the initial Kerberos ticket's validity period ("Expires") should be 24 hours and its renewal limit ("renew until") 100 days. Dates are shown in month/day/year format.
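A hypothetical klist output illustrating these lifetimes (principal, cache name and dates are placeholders, not taken from the original page):
{{{
lxui04:~ > klist
Ticket cache: FILE:/tmp/krb5cc_12345
Default principal: username@JINR.RU

Valid starting     Expires            Service principal
06/01/25 10:00:00  06/02/25 10:00:00  krbtgt/JINR.RU@JINR.RU
        renew until 09/09/25 10:00:00
}}}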
In SLURM the Kerberos ticket is stored on the server for later use by the job: when the job starts on a compute node, the ticket is delivered there and an AFS token is created from it so the job can access its $HOME. The ticket and token are refreshed continuously (for up to 100 days) while the job runs on the compute node; SLURM does this itself, without krenew having to run there.
3. For jobs to run correctly, start the automatic renewal of the Kerberos ticket and AFS token once per login session:
krenew -a -b -K 60 -t &
krenew – refreshes the Kerberos ticket and AFS token in the background, for up to 100 days.
auks – sends the Kerberos ticket to the server, where it is stored, kept up to date and transferred to the compute node when the job starts.
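To verify that the renewal process is actually running in the background, one possible check (a process listing only, not part of the original instructions):
{{{
ps -fu `id -un` | grep [k]renew
}}}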
4. Check that jobs start in SLURM by running an interactive session:
srun -n 1 -N 1 --pty --mem=500M --tmp=5G /bin/bash -i
A prompt from the compute node will appear, for example:
wn410:~ >
Check HOME and ticket/token:
pwd
klist
tokens
5. Then you can launch your program or a test:
srun -vv -n 1 -N 1 --pty --mem=500M --tmp=5G /bin/bash -i
exit
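Instead of an interactive shell, a single command can also be run directly with srun; a minimal sketch (the command shown is illustrative only):
{{{
srun -n 1 -N 1 --mem=500M hostname
}}}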
6. In case of an error, refresh the ticket:
auks -r
auks -a
7. SLURM creates a unique directory for each job; usually the first action in a job script is:
cd $TMPDIR
TMPDIR is an environment variable that points to this directory. The partition in which this directory is created is the largest available on the compute nodes. It is highly desirable to create temporary and output files in this directory and then copy them to or from EOS storage (see the sketch after this step).
NEVER USE the /tmp and /var/tmp directories! You can overfill them, which will lead to system malfunctions.
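A minimal sketch of this pattern (the program name and the EOS path are placeholders, not part of the original instructions):
{{{
#!/bin/bash
# work inside the job-private scratch directory
cd $TMPDIR
# hypothetical program writing its output locally in $TMPDIR
$HOME/bin/myprog --output result.dat
# copy the result to EOS afterwards (replace u/username with your own path)
cp -pv result.dat /eos/user/u/username/result.dat
}}}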
8. Submit the job script to SLURM:
sbatch script1.sh
9. Check the job status:
squeue -u `id -un`
10. Keep in mind the following:
The default limit is 400 cores for running jobs. When your running jobs occupy 400 cores in total, or there are no free cores, new jobs will be queued.
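A hedged example of listing your queued jobs and the reason they are pending (the format string uses standard squeue specifiers and is not taken from the original page):
{{{
squeue -u `id -un` -t PENDING -o "%.10i %.20j %.4C %r"
}}}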
So in short, running a job in SLURM at lxui.jinr.ru should look something like this:
a) ssh/putty to lxui.jinr.ru
b) start ticket/token update
krenew -a -b -K 60 -t
c) check with an interactive task
srun -n 1 -N 1 --pty --mem=500M --tmp=5G /bin/bash -i
exit
d) in case of an error, update the ticket
auks -r
auks -a
e) launch the script in SLURM
sbatch script1.sh
f) check the status
squeue -u `id -un`
Sample script to run with sbatch:
{{{
#!/bin/bash
#SBATCH --job-name=test-001        # Job name
#SBATCH --mail-type=END,FAIL       # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=<usr>@jinr.ru  # Where to send mail
#SBATCH --no-requeue               # Do not automatically requeue on errors
#SBATCH --ntasks=1                 # Run on a single CPU
#SBATCH --cpus-per-task=1          # 1 core per task
#SBATCH --mem=2gb                  # Job memory request
#SBATCH --time=3-00:00:00          # Time limit days-hrs:min:sec
#SBATCH --tmp=100G                 # Max disk space in $TMPDIR
$HOME/slurm/test.sh
}}}
$HOME/slurm/test.sh is the script that runs on the compute node.
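The #SBATCH directives can also be overridden on the sbatch command line; a hypothetical example with illustrative values:
{{{
sbatch --time=1-00:00:00 --mem=4gb --job-name=test-002 script1.sh
}}}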
test.sh script example (test only):
{{{
#!/bin/bash
echo "Start at: "`date`
# first letter of the user name, used in the EOS path /eos/user/<letter>/<user>
U=`echo $USER | cut -c 1-1`
SRCF=file1GB
DSTF=$SRCF.$SLURM_JOB_ID
test X"$SUBMITHOST" = "X" && SUBMITHOST=lxui01.jinr.ru
echo "---- ARGV: 0=\"$0\" 1=\"$1\" 2=\"$2\" 3=\"$3\" 4=\"$4\""
echo "---- current dir"
pwd
OPWD=`pwd`
echo "---- cd $TMPDIR"
cd $TMPDIR
pwd
echo "---- hostname -f"
hostname -f
echo "---- whoami"
whoami
echo "---- id"
id
echo "---- ulimit -a"
ulimit -a
echo "---- klist"
klist
echo "---- tokens"
tokens
echo "---- eos -b whoami"
eos -b whoami
echo "---- cat $HOME/.forward"
cat $HOME/.forward
echo "---- environment"
env | grep -E "^[A-Z]" | grep -v LS_COLORS | sort
# create a ~1 GB test file in $TMPDIR
echo "---- dd if=/dev/urandom of=$SRCF bs=10M count=100"
dd if=/dev/urandom of=$SRCF bs=10M count=100
# copy the test file to EOS by several methods and verify each copy
for i in `seq -w 1 3`; do
date
echo "###################################################################### step $i"
echo "---- /bin/rm -f /eos/user/$U/$USER/$DSTF"
/bin/rm -f /eos/user/$U/$USER/$DSTF
echo "---- /bin/time -p cp -pv $SRCF /eos/user/$U/$USER/$DSTF"
/bin/time -p cp -pv $SRCF /eos/user/$U/$USER/$DSTF
echo "---- cmp $SRCF /eos/user/$U/$USER/$DSTF"
cmp $SRCF /eos/user/$U/$USER/$DSTF
echo "---- /bin/rm -f /eos/user/$U/$USER/$DSTF"
/bin/rm -f /eos/user/$U/$USER/$DSTF
echo "---- eoscp -n $SRCF /eos/user/$U/$USER/$DSTF"
eoscp -n $SRCF /eos/user/$U/$USER/$DSTF
echo "---- cmp $SRCF /eos/user/$U/$USER/$DSTF"
cmp $SRCF /eos/user/$U/$USER/$DSTF
echo "---- /bin/rm -f /eos/user/$U/$USER/$DSTF"
/bin/rm -f /eos/user/$U/$USER/$DSTF
echo "---- eos cp -n $SRCF /eos/user/$U/$USER/$DSTF"
eos cp -n $SRCF /eos/user/$U/$USER/$DSTF
echo "---- cmp $SRCF /eos/user/$U/$USER/$DSTF"
cmp $SRCF /eos/user/$U/$USER/$DSTF
echo "---- /bin/rm -f /eos/user/$U/$USER/$DSTF"
/bin/rm -f /eos/user/$U/$USER/$DSTF
echo "---- eos cp -n $SRCF root://eos.jinr.ru//eos/user/$U/$USER/$DSTF"
eos cp -n $SRCF root://eos.jinr.ru//eos/user/$U/$USER/$DSTF
echo "---- cmp $SRCF /eos/user/$U/$USER/$DSTF"
cmp $SRCF /eos/user/$U/$USER/$DSTF
echo "---- /bin/rm -f /eos/user/$U/$USER/$DSTF"
/bin/rm -f /eos/user/$U/$USER/$DSTF
echo "---- xrdcp -f $SRCF \"root://eos.jinr.ru//eos/user/$U/$USER/$DSTF?xrd.wantprot=krb5,unix\""
xrdcp -f $SRCF "root://eos.jinr.ru//eos/user/$U/$USER/$DSTF?xrd.wantprot=krb5,unix" 2>&1 | tr "\r" "\n" | tail -1
echo "---- cmp $SRCF /eos/user/$U/$USER/$DSTF"
cmp $SRCF /eos/user/$U/$USER/$DSTF
echo "---- id"
id
echo "---- klist"
klist
echo "---- tokens"
tokens
echo "---- eos -b whoami"
eos -b whoami
echo "---- ps auxwww | grep -E $USER | grep -v [g]rep"
ps auxwww | grep -E "$USER" | grep -v [g]rep
echo "---- ssh -x $SUBMITHOST /bin/true"
ssh -x $SUBMITHOST /bin/true
echo "---- sleep 600"
sleep 600
done
echo "Done "`date`
exit 0
}}}
The stdout/stderr logs will be written to the current directory under the name
slurm-<n>.out, where <n> is the SLURM job ID.
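A hedged example of submitting the script, capturing the job ID and following the log (sbatch --parsable is standard SLURM; the script name matches the example above):
{{{
JOBID=`sbatch --parsable script1.sh`
tail -f slurm-$JOBID.out
}}}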
Look up where it’s running:
sqstat --user=$USER