SLURM: Running jobs at lxui[01-04].jinr.ru

Steps:

1. ssh lxui.jinr.ru

You will land on one of the lxui[01-04] machines. These machines are reachable via ssh from any network, not only the JINR network.

2. Run the command

klist -e

Its output should contain:

{{{
lxui04:~ > klist -e | grep -o aes256-cts
aes256-cts
}}}

If you get a different output, you need to change your password.

In addition, the validity period of the initial Kerberos ticket ("Expires") should be 24 hours and its renewal limit ("renew until") should be 100 days (dates are in month/day/year format).
SLURM stores the Kerberos ticket on a server for later use by the job: when the job starts on a compute node, the ticket is delivered there and an AFS token is created so the job can access your $HOME. Both the ticket and the token are renewed continuously (for up to 100 days) while the job runs on the compute node. SLURM does this itself, without you having to run krenew inside the job.
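
You can check both values directly. A minimal sketch, assuming the ticket-granting ticket is named krbtgt (the grep pattern is only an assumption): the span between "Valid starting" and "Expires" should be about 24 hours, and "renew until" should lie about 100 days in the future.

{{{
# Sketch: show only the TGT line and its renewal limit
klist | grep -E 'krbtgt|renew until'
}}}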

3. For jobs to launch correctly, you need to start the automatic renewal of the Kerberos ticket and AFS token once in the current session:

krenew -a -b -K 60 -t &

krenew – refreshes the Kerberos ticket and AFS token in the background, for up to 100 days.

auks – sends the Kerberos ticket to the server, where it is stored, kept up to date, and passed to the compute node when the job starts.
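
To verify that the renewal is actually in place, a quick sanity check might look like the sketch below; it only confirms that a krenew process exists in this session and shows the current ticket dates.

{{{
# Sketch: is krenew running in the background?
pgrep -fl krenew
# Are the ticket's expiry and renewal dates as expected?
klist
}}}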

4. Check that launching works in SLURM by running an interactive job:

 srun -n 1 -N 1 --pty --mem=500M --tmp=5G /bin/bash -i

A prompt from the compute node will appear:

wn410:~

Check HOME and ticket/token:

pwd
klist
tokens

5. Then you can launch your program or a test:

srun -vv -n 1 -N 1 --pty --mem=500M --tmp=5G /bin/bash -i

exit

6. In case of an error, refresh the ticket:

auks -r

auks -a

7. SLURM creates a unique directory for each job; changing into it is usually the first action in a job script:

cd $TMPDIR

TMPDIR is an environment variable that points to this directory. The partition in which it is created is the largest one available on the compute nodes. It is highly desirable to create temporary and result files in this directory, and to copy input from and results to your space on EOS (see the sketch below).

NEVER USE the /tmp or /var/tmp directories! You can overfill them, which will lead to system malfunctions.
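
Following this rule, a typical job-script body might look like the sketch below; my_program and the /eos/user/... paths are placeholders for illustration only.

{{{
#!/bin/bash
# Sketch only: my_program and the EOS paths are hypothetical, adjust to your own area.
cd $TMPDIR                                   # work in the job's scratch directory
cp /eos/user/u/username/input.dat .          # stage input in from EOS
./my_program input.dat > result.out          # produce results locally in $TMPDIR
cp -p result.out /eos/user/u/username/       # copy results back to EOS
}}}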

8. Submit the job script to SLURM:

sbatch script1.sh

9. Check the job status:

squeue -u `id -un`
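
If you want more detail than the default squeue listing, the variants below may help (a sketch; replace <jobid> with the number reported by sbatch).

{{{
squeue -u `id -un` -l        # long listing: time used, node list, reason
scontrol show job <jobid>    # full details of one job
}}}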

10. Keep in mind the following:

The default limit is 400 cores for running jobs. When your running jobs occupy all 400 cores, or when there are no free cores, further jobs will wait in the queue.
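
To see how many cores your running jobs currently occupy, something like the sketch below can be used (%C is squeue's requested-CPU column).

{{{
squeue -u `id -un` -t RUNNING -h -o "%C" | awk '{s+=$1} END {print s+0}'
}}}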


So in short, running a job in SLURM at lxui.jinr.ru should look something like this:

a) ssh/putty to lxui.jinr.ru

b) start the ticket/token renewal

krenew -a -b -K 60 -t

c) check with an interactive task

srun -n 1 -N 1 --pty --mem=500M --tmp=5G /bin/bash -i

exit

d) in case of an error, update the ticket

auks -r

auks -a

e) launch the script in SLURM

sbatch script1.sh

f) check the status

squeue -u `id -un`


 

Sample script to run with sbatch:

{{{
 #!/bin/bash
 #SBATCH --job-name=test-001 # Job name
 #SBATCH --mail-type=END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)
 #SBATCH --mail-user=<usr>@jinr.ru # Where to send mail
 #SBATCH --no-requeue # do not auto requeue on errors
 #SBATCH --ntasks=1 # Run a single task
 #SBATCH --cpus-per-task=1 # 1 core per task
 #SBATCH --mem=2gb # Job memory request
 #SBATCH --time=3-00:00:00 # Time limit days-hrs:min:sec
 #SBATCH --tmp=100G # max disk space in $TMPDIR
 $HOME/slurm/test.sh
 }}}

$HOME/slurm/test.sh – the script that runs on the compute node

 

Example test.sh script (for testing only): it creates a 1 GB file of random data and copies it to EOS by several methods, comparing the copy with the original each time.

 

{{{
#!/bin/bash
echo "Start at: "`date`
U=`echo $USER | cut -c 1-1`
SRCF=file1GB
DSTF=$SRCF.$SLURM_JOB_ID
test X"$SUBMITHOST" = "X" && SUBMITHOST=lxui01.jinr.ru
echo "---- ARGV: 0=\"$0\" 1=\"$1\" 2=\"$2\" 3=\"$3\" 4=\"$4\""
echo "---- current dir"
pwd
OPWD=`pwd`
echo "---- cd $TMPDIR"
cd $TMPDIR
pwd
echo "---- hostname -f"
hostname -f
echo "---- whoami"
whoami
echo "---- id"
id
echo "---- ulimit -a"
ulimit -a
echo "---- klist"
klist
echo "---- tokens"
tokens
echo "---- eos -b whoami"
eos -b whoami
echo "---- cat $HOME/.forward >/dev/null"
cat $HOME/.forward
echo "---- environment"
env | grep -E "^[A-Z]" | grep -v LS_COLORS | sort
echo "---- dd if=/dev/urandom of=$SRCF bs=10M count=100"
dd if=/dev/urandom of=$SRCF bs=10M count=100
for i in `seq -w 1 3`; do
date
echo "###################################################################### step $i"
echo "---- /bin/rm -f /eos/user/$U/$USER/$DSTF"
/bin/rm -f /eos/user/$U/$USER/$DSTF
echo "---- /bin/time -p cp -pv $SRCF /eos/user/$U/$USER/$DSTF"
/bin/time -p cp -pv $SRCF /eos/user/$U/$USER/$DSTF
echo "---- cmp $SRCF /eos/user/$U/$USER/$DSTF"
cmp $SRCF /eos/user/$U/$USER/$DSTF
echo "---- /bin/rm -f /eos/user/$U/$USER/$DSTF"
/bin/rm -f /eos/user/$U/$USER/$DSTF
echo "---- eoscp -n $SRCF /eos/user/$U/$USER/$DSTF"
eoscp -n $SRCF /eos/user/$U/$USER/$DSTF
echo "---- cmp $SRCF /eos/user/$U/$USER/$DSTF"
cmp $SRCF /eos/user/$U/$USER/$DSTF
echo "---- /bin/rm -f /eos/user/$U/$USER/$DSTF"
/bin/rm -f /eos/user/$U/$USER/$DSTF
echo "---- eos cp -n $SRCF /eos/user/$U/$USER/$DSTF"
eos cp -n $SRCF /eos/user/$U/$USER/$DSTF
echo "---- cmp $SRCF /eos/user/$U/$USER/$DSTF"
cmp $SRCF /eos/user/$U/$USER/$DSTF
echo "---- /bin/rm -f /eos/user/$U/$USER/$DSTF"
/bin/rm -f /eos/user/$U/$USER/$DSTF
echo "---- eos cp -n $SRCF root://eos.jinr.ru//eos/user/$U/$USER/$DSTF"
eos cp -n $SRCF root://eos.jinr.ru//eos/user/$U/$USER/$DSTF
echo "---- cmp $SRCF /eos/user/$U/$USER/$DSTF"
cmp $SRCF /eos/user/$U/$USER/$DSTF
echo "---- /bin/rm -f /eos/user/$U/$USER/$DSTF"
/bin/rm -f /eos/user/$U/$USER/$DSTF
echo "---- xrdcp -f $SRCF \"root://eos.jinr.ru//eos/user/$U/$USER/$DSTF?xrd.wantprot=krb5,unix\""
xrdcp -f $SRCF "root://eos.jinr.ru//eos/user/$U/$USER/$DSTF?xrd.wantprot=krb5,unix" 2>&1 | tr "\r" "\n" | tail -1
echo "---- cmp $SRCF /eos/user/$U/$USER/$DSTF"
cmp $SRCF /eos/user/$U/$USER/$DSTF
echo "---- id"
id
echo "---- klist"
klist
echo "---- tokens"
tokens
echo "---- eos -b whoami"
eos -b whoami
echo "---- ps auxwww | grep -E $USER | grep -v [g]rep"
ps auxwww | grep -E "$USER" | grep -v [g]rep
echo "---- ssh -x $SUBMITHOST /bin/true"
ssh -x $SUBMITHOST /bin/true
echo "---- sleep 600"
sleep 600
done
echo "Done "`date`
exit 0
}}}

Logs (stdout/stderr) will be written to the directory from which the job was submitted, under the name
slurm-<n>.out, where <n> is the SLURM job ID.
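
If you prefer explicit log names instead of the default slurm-<n>.out, sbatch's --output/--error options can be added to the batch script. A sketch (the file names are arbitrary; %j expands to the job ID):

{{{
#SBATCH --output=test-001-%j.out   # stdout log
#SBATCH --error=test-001-%j.err    # stderr log (if omitted, stderr goes to the stdout file)
}}}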

Look up where it’s running:

 sqstat --user=$USER