SLURM. Running jobs on lxui[01-04].jinr.ru

Steps:

1. ssh lxui.jinr.ru

The user lands on one of the lxui[01-04] machines. These machines are reachable via ssh from any network, not only from the JINR network.

2. Run the command

klist -e

The output should contain the aes256-cts encryption type:

{{{

lxui04:~ > klist -e | grep -o aes256-cts
aes256-cts
aes256-cts

}}}

If it is not, the user needs to change the password. In addition, the Kerberos ticket lifetime ("Expires") should be 24 hours and its renewable lifetime ("renew until") 100 days (dates are in month/day/year format).
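If a new ticket is needed, a minimal sketch of obtaining and checking one is shown below (the renewable lifetime actually granted is decided by the Kerberos server policy):

{{{
# request a new ticket with the maximum renewable lifetime
kinit -r 100d

# verify lifetimes and encryption types
klist -e
}}}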

In SLURM the Kerberos ticket is stored on a server (via AUKS) for later use: when the job starts on a worker node the ticket is delivered there, and at that moment an AFS token is also created to give the job access to its $HOME. The ticket and the token are renewed continuously (for up to 100 days) while the job runs on the worker node. SLURM does this by itself, without the user having to run krenew on the worker node.

3. For jobs to launch correctly, i.e. to automatically extend the lifetime of the Kerberos ticket and the AFS token in an interactive session, run once in the current session:

krenew -a -b -K 60 -t &

 

  • krenew – refreshes the Kerberos ticket and the AFS token in the background, for up to 100 days (a quick check is shown below).
  • auks – sends the Kerberos ticket to a server, where it is stored, renewed and passed to the worker node when the job starts.
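A quick check that the background renewal is in place (a sketch; the exact output differs per user):

{{{
# the krenew process started above should be listed
ps -fu `id -un` | grep [k]renew

# the ticket's "renew until" date should be about 100 days ahead
klist | grep -i renew
}}}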

4. Check the launch in SLURM by running an interactive session:

srun -n 1 -N 1 --pty --mem=500M --tmp=5G /bin/bash -i

A prompt from the worker node will appear:

wn410:~ >

check $HOME and the ticket/token there:

pwd

klist

tokens

or, with verbose srun output:  srun -vv -n 1 -N 1 --pty --mem=500M --tmp=5G /bin/bash -i

You can run your program…

exit

6. In case of an error, update the ticket:

auks -r

auks -a
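One possible recovery sequence, assuming the local ticket is still within its renewable lifetime (kinit -R renews the ticket already in the cache before it is pushed back to AUKS):

{{{
kinit -R    # renew the local Kerberos ticket
auks -r     # drop the credential currently stored on the AUKS server
auks -a     # store the renewed credential on the AUKS server
}}}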

 

7. SLURM creates a unique scratch directory for each job; usually the first action in the job script is:

cd $TMPDIR

TMPDIR is an environment variable that points to this directory. The partition in which this directory is created is the largest available on the worker nodes. It is highly desirable to create temporary and result files in this directory and then copy the results from there to your space in EOS (a minimal sketch is given after the warning below).

NEVER USE THE /tmp AND /var/tmp DIRECTORIES! You can overfill them, which will lead to system malfunctions.
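A minimal sketch of this pattern inside a job script (the EOS path follows the test.sh example further below; myprogram is a placeholder for your actual application):

{{{
#!/bin/bash
cd $TMPDIR                                       # work in the job's scratch directory
U=`echo $USER | cut -c 1-1`
myprogram > result.$SLURM_JOB_ID                 # placeholder: write output into $TMPDIR
cp -p result.$SLURM_JOB_ID /eos/user/$U/$USER/   # then copy the result to EOS
}}}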

8. Run the job script in SLURM:

sbatch script1.sh

9. Check the job status:

squeue -u `id -un`

10. Keep in mind the following:

Default limit: 400 cores for running jobs. When your running jobs occupy all 400 cores, or there are no free cores, new jobs will wait in the queue.
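For a more detailed view of your jobs, including why a job is still pending, the squeue output format can be customized; the format string here is just one possible choice:

{{{
# columns: job id, partition, name, state, elapsed time, node count, reason/node list
squeue -u `id -un` -o "%.10i %.9P %.20j %.2t %.10M %.6D %R"
}}}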

 

So in short, running a task in SLURM at lxui should look something like this:

 a). ssh/putty to lxui.jinr.ru

b). start ticket/token update

krenew -a -b -K 60 -t

c). check with an interactive task

srun -n 1 -N 1 --pty --mem=500M --tmp=5G /bin/bash -i

exit

d). in case of an error, update the ticket

auks -r

auks -a

e). run the task script in SLURM

sbatch script1.sh

f). check the status

squeue -u `id -un`

 

Sample script to run with sbatch:

{{{
#!/bin/bash
#SBATCH --job-name=test-001 # Job name
#SBATCH --mail-type=END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=<usr>@jinr.ru # Where to send mail
#SBATCH --no-requeue # do not auto requeue on errors
#SBATCH --ntasks=1 # Run on a single CPU
#SBATCH --cpus-per-task=1 # 1 core per task
#SBATCH --mem=2gb # Job memory request
#SBATCH --time=3-00:00:00 # Time limit days-hrs:min:sec
#SBATCH --tmp=100G # max disk space in $TMPDIR
$HOME/slurm/test.sh
}}}

$HOME/slurm/test.sh is the script that runs on the worker node.
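A typical submit-and-monitor sequence, assuming the sbatch file above is saved as, for example, $HOME/slurm/job1.sbatch (the file name is only illustrative):

{{{
sbatch $HOME/slurm/job1.sbatch   # prints: Submitted batch job <n>
squeue -u `id -un`               # check whether the job is pending or running
scontrol show job <n>            # detailed job information
scancel <n>                      # cancel the job if necessary
}}}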

 

test.sh script example (test only):

 

{{{
#!/bin/bash
echo "Start at: "`date`
U=`echo $USER | cut -c 1-1`
SRCF=file1GB
DSTF=$SRCF.$SLURM_JOB_ID
test X"$SUBMITHOST" = "X" && SUBMITHOST=lxui01.jinr.ru
echo "------ ARGV: 0=\"$0\" 1=\"$1\" 2=\"$2\" 3=\"$3\" 4=\"$4\""
echo "------ current dir"
pwd
OPWD=`pwd`
echo "------ cd $TMPDIR"
cd $TMPDIR
pwd
echo "------ hostname -f"
hostname -f
echo "------ whoami"
whoami
echo "------ id"
id
echo "------ ulimit -a"
ulimit -a
echo "------ klist"
klist
echo "------ tokens"
tokens
echo "------ eos -b whoami"
eos -b whoami
echo "------ cat $HOME/.forward >/dev/null"
cat $HOME/.forward
echo "------ environment"
env | grep -E "^[A-Z]" | grep -v LS_COLORS | sort
echo "------ dd if=/dev/urandom of=$SRCF bs=10M count=100"
dd if=/dev/urandom of=$SRCF bs=10M count=100
for i in `seq -w 1 3`; do
date
echo "###################################################################### step $i"
echo "------ /bin/rm -f /eos/user/$U/$USER/$DSTF"
/bin/rm -f /eos/user/$U/$USER/$DSTF
echo "------ /bin/time -p cp -pv $SRCF /eos/user/$U/$USER/$DSTF"
/bin/time -p cp -pv $SRCF /eos/user/$U/$USER/$DSTF
echo "------ cmp $SRCF /eos/user/$U/$USER/$DSTF"
cmp $SRCF /eos/user/$U/$USER/$DSTF
echo "------ /bin/rm -f /eos/user/$U/$USER/$DSTF"
/bin/rm -f /eos/user/$U/$USER/$DSTF
echo "------ eoscp -n $SRCF /eos/user/$U/$USER/$DSTF"
eoscp -n $SRCF /eos/user/$U/$USER/$DSTF
echo "------ cmp $SRCF /eos/user/$U/$USER/$DSTF"
cmp $SRCF /eos/user/$U/$USER/$DSTF
echo "------ /bin/rm -f /eos/user/$U/$USER/$DSTF"
/bin/rm -f /eos/user/$U/$USER/$DSTF
echo "------ eos cp -n $SRCF /eos/user/$U/$USER/$DSTF"
eos cp -n $SRCF /eos/user/$U/$USER/$DSTF
echo "------ cmp $SRCF /eos/user/$U/$USER/$DSTF"
cmp $SRCF /eos/user/$U/$USER/$DSTF
echo "------ /bin/rm -f /eos/user/$U/$USER/$DSTF"
/bin/rm -f /eos/user/$U/$USER/$DSTF
echo "------ eos cp -n $SRCF root://eos.jinr.ru//eos/user/$U/$USER/$DSTF"
eos cp -n $SRCF root://eos.jinr.ru//eos/user/$U/$USER/$DSTF
echo "------ cmp $SRCF /eos/user/$U/$USER/$DSTF"
cmp $SRCF /eos/user/$U/$USER/$DSTF
echo "------ /bin/rm -f /eos/user/$U/$USER/$DSTF"
/bin/rm -f /eos/user/$U/$USER/$DSTF
echo "------ xrdcp -f $SRCF \"root://eos.jinr.ru//eos/user/$U/$USER/$DSTF?xrd.wantprot=krb5,unix\""
xrdcp -f $SRCF "root://eos.jinr.ru//eos/user/$U/$USER/$DSTF?xrd.wantprot=krb5,unix" 2>&1 | tr "\r" "\n" | tail -1
echo "------ cmp $SRCF /eos/user/$U/$USER/$DSTF"
cmp $SRCF /eos/user/$U/$USER/$DSTF
echo "------ id"
id
echo "------ klist"
klist
echo "------ tokens"
tokens
echo "------ eos -b whoami"
eos -b whoami
echo "------ ps auxwww | grep -E $USER | grep -v [g]rep"
ps auxwww | grep -E "$USER" | grep -v [g]rep
echo "------ ssh -x $SUBMITHOST /bin/true"
ssh -x $SUBMITHOST /bin/true
echo "------ sleep 600"
sleep 600
done
echo "Done "`date`
exit 0
}}}

Logs (stdout/stderr) will be written to the directory from which the job was submitted, under the name
slurm-<n>.out, where <n> is the SLURM job number.
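To follow the log while the job is running, and to inspect the accounting record afterwards (sacct works only if job accounting is enabled on the cluster):

{{{
tail -f slurm-<n>.out
sacct -j <n> --format=JobID,JobName,State,Elapsed,MaxRSS
}}}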

Look up where it’s running:

sqstat --user=$USER