Working with PBS (Portable Batch System).

PBS (Portable Batch System) is a batch job processing system designed to manage the resources of high-performance computing clusters.

PBS commands are used to submit jobs to a queue for execution, to run, account for, monitor, modify and delete jobs, and to return their results.

For most users of the LIT farm, the following queues are created:
ib – a queue for the parallel farm with InfiniBand;
common – the common queue.

All jobs are submitted to the common queue, from which they are redistributed between the common and ib queues. Distribution criterion:

– if 2 to 80 CPUs are requested, the job goes to the ib queue and runs over InfiniBand;
– if 1 CPU or more than 80 CPUs are requested, the job goes to the common queue and runs over 1 Gb Ethernet.
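
The criterion above can be sketched as a small shell function. This is only an illustration of the stated rule, not the scheduler's actual configuration; the function name predict_queue is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: predict the destination queue from the requested
# CPU count, following the rule stated above
# (2 to 80 CPUs: ib; 1 CPU or more than 80 CPUs: common).
predict_queue() {
    ncpus=$1
    if [ "$ncpus" -ge 2 ] && [ "$ncpus" -le 80 ]; then
        echo ib
    else
        echo common
    fi
}

# Show the result at the boundaries of the rule.
for n in 1 2 80 81; do
    echo "$n CPUs -> $(predict_queue "$n")"
done
```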

During batch execution, a number of environment variables are initialized automatically and can be used in the job.
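
As an illustration, the sketch below prints a few of these variables from inside a job. PBS_JOBID, PBS_JOBNAME, PBS_QUEUE, PBS_O_WORKDIR and PBS_NODEFILE are standard PBS names, but which ones are set can depend on the PBS version installed; the function name job_summary and the fallback values are assumptions made so that the snippet also runs outside the batch system.

```shell
#!/bin/sh
# Print a one-line summary of the job environment.
# The ${VAR:-default} fallbacks let the function run outside a batch job too.
job_summary() {
    echo "job=${PBS_JOBID:-none} name=${PBS_JOBNAME:-none} queue=${PBS_QUEUE:-none} workdir=${PBS_O_WORKDIR:-$PWD}"
}

job_summary

# PBS_NODEFILE names a file that lists the nodes allocated to the job.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    echo "distinct allocated nodes: $(sort -u "$PBS_NODEFILE" | wc -l)"
fi
```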

Some commands. Usage and checking of results.

To run jobs in the batch system, you must:

1) Attention! Run the command "pbspwstore" and enter your AFS password three times, once in response to each of the 3 prompts.
 Be extremely careful when entering the password: if it is wrong, your jobs will not be able to finish normally.

The command "pbspwstore" needs to be repeated only when the AFS password is changed;

2) create a job file pbs_script that defines the parameters required for the job;

Example pbs_script:

#!/bin/sh
#PBS -l walltime=10:00:00,cput=01:00:00
#PBS -m abe
#PBS -q defq
#PBS -M username@lxpub01
#PBS -r n

# change to the directory from which the job was submitted
cd "$PBS_O_WORKDIR"
# compile and run the program
cc -o myprog myprog.c
./myprog

Where:

-l list of resources (separated by ",", without spaces)
 walltime maximum wall-clock execution time
 cput maximum CPU time
 -q queue name
 -m events to be reported by e-mail:
 b - the job begins, e - the job ends, a - the job is aborted
 -M e-mail address to which all status messages are sent
 -r (y / n) whether the job may be rerun when the nodes are rebooted
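
When several similar jobs are needed, the job file can be generated rather than edited by hand. The following is a minimal sketch under stated assumptions: the function make_pbs_script and its argument order are hypothetical, and the fixed lines simply repeat the example above.

```shell
#!/bin/sh
# Hypothetical helper: print a job file built from a few parameters.
# Arguments: walltime, cput, e-mail address, command to run.
make_pbs_script() {
    cat <<EOF
#!/bin/sh
#PBS -l walltime=$1,cput=$2
#PBS -m abe
#PBS -q defq
#PBS -M $3
#PBS -r n

cd "\$PBS_O_WORKDIR"
$4
EOF
}

# Print the generated job file; add "> pbs_script" to save it.
make_pbs_script 10:00:00 01:00:00 username@lxpub01 ./myprog
```

The backslash before $PBS_O_WORKDIR keeps the variable unexpanded in the generated file, so it is resolved when the job runs rather than when the file is written.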

3) submit the job to the system with the qsub command.

Example 1:

qsub pbs_script

Example 2: pass the required parameters on the command line:

qsub -l walltime=10:00:00 -m abe -M username@lxpub01 -r n mpiexec $PBS_O_WORKDIR/program_name

When the job finishes, 2 files appear in the working directory:

JobName.oIdentifier – contains the standard output (stdout),
JobName.eIdentifier – contains the error messages (stderr).
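
The file names can be derived from the job identifier that qsub prints on submission. The sketch below assumes the usual "number.server" form of the identifier; the function name output_files and the identifier 12345.lxpub01 are made up for illustration.

```shell
#!/bin/sh
# Derive the default output file names from a job name and a job ID.
# Assumes the ID has the form "sequence_number.server".
output_files() {
    jobname=$1
    seq=${2%%.*}    # keep the part before the first dot, e.g. 12345
    echo "$jobname.o$seq $jobname.e$seq"
}

# Typical use (needs a PBS installation, so it is left commented out):
#   jobid=$(qsub pbs_script)
#   output_files pbs_script "$jobid"
output_files pbs_script 12345.lxpub01
```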

For a detailed description of the parameters and variables, see: man qsub;
job resources: man pbs_resources;
job attributes: man pbs_job_attributes.

Job monitoring: qstat.

The state of jobs is monitored with the qstat command. It prints a table whose columns have the following meanings:

Job ID               unique identifier of the job;
 Username        owner of the job;
 Queue              name of the queue in which the job is located;
 Jobname          job name;
 SessID             session identifier (if the job is running);
 NDS                 number of nodes (CPUs) used;
 TSK                  number of tasks;
 Req'd Time      requested (planned) run time of the job;
 Elap Time        total processor time used by the job so far;
 S                      job status:
 Q - waiting in the queue,
 R - running,
 E - exiting after having run.
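
In scripts it is sometimes convenient to turn the status letter into words. The sketch below covers only the three states listed above; the wording for E follows the qstat manual page, which describes it as a job exiting after having run. Both function names are hypothetical, and job_state relies on the "job_state = X" line of the full listing printed by qstat -f.

```shell
#!/bin/sh
# Translate the one-letter status from the S column into words.
describe_state() {
    case $1 in
        Q) echo "queued" ;;
        R) echo "running" ;;
        E) echo "exiting" ;;
        *) echo "other ($1)" ;;
    esac
}

# Hypothetical wrapper: read the state of one job from the full listing,
# which contains a line of the form "job_state = R".
job_state() {
    qstat -f "$1" | awk '$1 == "job_state" { print $3 }'
}

describe_state R
# On the farm: describe_state "$(job_state 12345.lxpub01)"
```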

Example: parameters of the ib queue.

lxpub01: > qstat -Qf ib

Queue: ib
resources_max.cput = 50000:00:00
resources_max.nodect = 560
resources_max.walltime = 101:00:00
resources_min.nodect = 2
resources_available.nodect = 560
resources_default.cput = 50000:00:00
resources_default.walltime = 101:00:00

resources_max.nodect – the maximum number of CPUs for a parallel job (for the ib queue);
resources_max.walltime – the maximum wall-clock (astronomical) time of a job;

cput – the maximum amount of CPU time used by all processes in the job.
Units: time.
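
A single limit can be picked out of this listing with a short filter. The sketch below reads "qstat -Qf" style text on standard input; the function name queue_attr is hypothetical, and the sample lines are copied from the listing above so that the snippet runs without PBS.

```shell
#!/bin/sh
# Extract the value of one attribute from "qstat -Qf" style output on stdin.
queue_attr() {
    awk -v key="$1" '
        { sub(/^[ \t]+/, "") }    # drop leading indentation, if any
        index($0, key " = ") == 1 { print substr($0, length(key) + 4) }
    '
}

# On the farm one could run:
#   qstat -Qf ib | queue_attr resources_max.walltime
queue_attr resources_max.walltime <<'EOF'
Queue: ib
resources_max.cput = 50000:00:00
resources_max.nodect = 560
resources_max.walltime = 101:00:00
EOF
```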

These queue parameters are set by the system administrator and may change. When balancing the requested values, take into account that the following relation holds approximately for walltime, cput and nodect:

wall-clock time (walltime) * number of processes (nodect) ≈ CPU time (cput)

(In the ib queue example above, the maximum computation time of the job is set higher in view of the planned increase in the number of processors.)
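
As a worked example of this relation, the sketch below multiplies a wall-clock time by a number of processes and prints the result in hours:minutes:seconds form. The function name estimate_cput is hypothetical, and the result is only the rough estimate that the relation gives, not a value computed by PBS.

```shell
#!/bin/sh
# Estimate cput as walltime (HH:MM:SS) multiplied by the number of processes.
estimate_cput() {
    echo "$1" | awk -F: -v n="$2" '{
        total = ($1 * 3600 + $2 * 60 + $3) * n    # total CPU seconds
        printf "%02d:%02d:%02d\n", int(total / 3600), int((total % 3600) / 60), total % 60
    }'
}

# A 10-hour job on 8 processes needs roughly 80 hours of CPU time.
estimate_cput 10:00:00 8
```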

The required values of these parameters are specified in the script file, and the priority and waiting time in the queue are determined from the requested resources. A job is not queued if it exceeds the maximum values allowed for the queue. Requesting far more CPU time or wall-clock time than is actually needed makes no sense: the job will most likely just wait in the queue longer.
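
A request can be compared with a queue limit before submission. The following is a minimal sketch, assuming both values are written as HH:MM:SS; the function names to_seconds and within_limit are hypothetical, and the limit is the resources_max.walltime value from the ib listing above.

```shell
#!/bin/sh
# Convert HH:MM:SS to seconds.
to_seconds() {
    echo "$1" | awk -F: '{ printf "%d\n", $1 * 3600 + $2 * 60 + $3 }'
}

# Succeed (exit status 0) if the requested time does not exceed the limit.
within_limit() {
    [ "$(to_seconds "$1")" -le "$(to_seconds "$2")" ]
}

if within_limit 10:00:00 101:00:00; then
    echo "walltime request fits the queue limit"
else
    echo "walltime request exceeds the queue limit"
fi
```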

Removing a job from the queue: qdel JobID.

Below are links to script files (author: V.V. Mitsin) that can be used as test or diagnostic jobs in the following cases:

when something in the batch system does not work for the user;

a test of parallel computing on multiple processors when all files are in AFS;

a test of parallel computing on multiple processors when files must be copied from /scr or another location outside AFS.