ACES queue system
To run jobs on ACES you must use PBS.
Examples of PBS jobs can be found here.
Each ACES site has access to a set of queues that, by default, send jobs to
the 32-bit hardware groups at that site.
To send jobs to machines other than the default set of 32-bit machines, you
need to specify additional attribute information during job submission.
It is also possible to send jobs to other ACES sites by specifying an
alternate site in the job submission command. To do this you must specify the
"head node" for that site, as listed in the
hardware groups table.
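As a rough sketch of what such a submission might look like: standard PBS lets the -q option of qsub take a destination of the form queue@server. Whether ACES expects exactly this form is an assumption, and the head node name below is a placeholder, not a real ACES host. The "gigabit" attribute is the one used in the batch script example on this page.

```shell
# Build a PBS destination string of the form queue@server.
# Both arguments are supplied by the caller; nothing here contacts PBS.
pbs_destination() {
    printf '%s@%s\n' "$1" "$2"
}

# Example (not run here): send pbs_script to queue "four" at another site,
# asking for two nodes that carry the "gigabit" attribute.
# head-node.example.org is a placeholder for a head node from the
# hardware groups table.
# qsub -q "$(pbs_destination four head-node.example.org)" -l nodes=2:gigabit pbs_script
```

The helper only assembles the string, so it can be tried on any machine before being used in a real qsub command.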
ACES Portable Batch System (PBS)
The PBS resource management system handles the management and monitoring
of the computational workload on the ACESGrid. Users submit "jobs" to the
resource management system where they are queued up until the system is
ready to run them. PBS selects which jobs to run, when, and where,
according to a predetermined site policy meant to balance competing user
needs and to maximize efficient use of the cluster resources.
It is important that all users learn about and cooperate with the
queueing system in order to avoid system "hogging" or unnecessary resource
contention (e.g. two or more people trying to use the same CPU at the same
time). The queue systems described here should meet the needs of the
majority of users. However, they are not "graven in stone", and
flexibility is possible to accommodate special needs. Please contact us if you have
special queueing requirements.
To use PBS, you create a batch job command file which you submit to the
PBS server to run on the ACESGrid. A batch job file is simply a shell
script containing the set of commands you want run on some set of cluster
compute nodes. It also contains directives that specify the
characteristics (attributes) and resource requirements (e.g. number of
compute nodes and maximum runtime) that your job needs. Once you create
your PBS job file, you can reuse it if you wish or modify it for
subsequent runs.
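A minimal sketch of such a job file is shown here, written for /bin/sh rather than the csh used in the longer example on this page. The job name is hypothetical, the queue name "four" comes from the queue summary on this page, and the nodes and walltime values are illustrative choices within that queue's limits. PBS_O_WORKDIR is the variable PBS sets to the directory the job was submitted from; when the script is run outside PBS the sketch falls back to the current directory.

```shell
#!/bin/sh
# Directives: lines beginning with #PBS are read by qsub and ignored by the shell.
#PBS -N example_job
#PBS -q four
#PBS -l nodes=2
#PBS -l walltime=01:30:00

# Commands: everything below runs on the first node allocated to the job.
# Move to the directory the job was submitted from (or stay put outside PBS).
cd "${PBS_O_WORKDIR:-.}" || exit 1

# Record which host the job script is executing on.
job_host=$(uname -n)
echo "Job running on ${job_host}"
```

Because the directives are ordinary shell comments, the same file can be run directly with sh to check the commands before it is submitted with qsub.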
PBS also provides a special kind of batch job called
interactive-batch. An interactive-batch job is treated just like a regular
batch job, in that it is placed into the queue and must wait for resources
to become available before it can run. Once it is started, however, the
user's terminal input and output are connected to the job in what appears
to be an rlogin session to one of the compute nodes. Many users find this
useful for debugging their applications or for computational
steering.
The ACESGrid is a heterogeneous computing environment made up of
several sites, each with different types of
hardware. The PBS queue system controls which hardware a particular job
will be executed on.
Viewing ACESGrid PBS Queues
The qstat command displays the status of PBS queues and jobs. Useful
options include:
| qstat -a | Lists all of the jobs within the PBS cluster. |
| qstat -an | Lists all of the jobs within the PBS cluster and their respective execution hosts. |
| qstat -q | Lists all of the queues within the PBS cluster (including resource limits). |
| qstat -s | Lists all of the jobs within the PBS cluster with their respective status comments. |
| qstat -Qf queue | Lists all information about a specific queue. |
| qstat -f jobid | Lists detailed information about a specific job. |
Additional options are available; see the qstat man page (man qstat).
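One common use of qstat is to wait for a job to finish. The sketch below polls qstat with a job identifier until the command fails. It assumes that qstat exits with a non-zero status once the server no longer knows the job, which depends on how the site retains completed jobs. The function name and the default polling interval are illustrative.

```shell
# Poll qstat until the given job is no longer reported by the PBS server.
# $1 = job identifier (e.g. 1354.itrda), $2 = seconds between polls (default 30).
wait_for_job() {
    job_id=$1
    interval=${2:-30}
    # Assumption: qstat returns non-zero when the job is unknown to the server.
    while qstat "$job_id" >/dev/null 2>&1; do
        sleep "$interval"
    done
    echo "Job ${job_id} is no longer listed by qstat"
}

# Example (not run here):
# wait_for_job 1354.itrda 60
```

A long interval keeps the load on the PBS server low; polling every few seconds is rarely necessary.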
Summary of available ACESGrid PBS Queues and their attributes
| Queue name | Max running jobs in queue | Max nodes per job | Max running jobs per user | Max walltime per job | Notes |
| one | 1024 | 1 | 64 | 2 hours | |
| four | 1024 | 16 | 8 | 2 hours | default queue |
| four-twelve | 1024 | 26 | 4 | 12 hours | |
| long | 1024 | 16 | 8 | 24 hours | |
| toolong | 1024 | 4 | 4 | 168 hours | |
| all | not listed | 1024 | 4 | infinite | CNH's private queue containing all available ACES resources! |
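The walltime limits above suggest a simple rule for choosing a queue: pick the first public queue whose limit covers the runtime you need. The helper below is a sketch of that rule only. The function name is hypothetical, it works in whole hours, and it ignores the per-job node limits, so a multi-node job must still be checked against the queue it returns (for example, toolong allows at most 4 nodes). The single-node queue "one" and the private queue "all" are deliberately left out.

```shell
# Print the name of the first public queue whose walltime limit covers
# the requested number of whole hours. Returns non-zero if none does.
pick_queue() {
    hours=$1
    if [ "$hours" -le 2 ]; then
        echo four            # default queue, 2 hour limit
    elif [ "$hours" -le 12 ]; then
        echo four-twelve     # 12 hour limit
    elif [ "$hours" -le 24 ]; then
        echo long            # 24 hour limit
    elif [ "$hours" -le 168 ]; then
        echo toolong         # 168 hour limit
    else
        echo "no public queue allows ${hours} hours" >&2
        return 1
    fi
}

# Example (not run here): submit a job expected to need about 20 hours.
# qsub -q "$(pick_queue 20)" pbs_script
```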
Job submission
The PBS qsub command is used to submit job command files for scheduling
and execution. For example, to submit your job using a PBS command file
called "pbs_script", the syntax would be
$ qsub pbs_script
1354.itrda
Notice that upon successful submission of a job, PBS returns a job
identifier of the form jobid.itrda, where jobid is an integer
assigned by PBS to that job. You'll need the job identifier for any
actions involving the job, such as checking its status or deleting it.
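Since qsub prints the job identifier on its standard output, a script can capture it for later use. The following is a minimal sketch under that assumption; the function and variable names are illustrative.

```shell
# Submit a PBS command file and remember the identifier qsub prints.
# Sets job_id (e.g. 1354.itrda) and job_number (e.g. 1354) in the calling shell.
submit_job() {
    job_id=$(qsub "$1") || return 1
    # Strip everything from the first dot onward to get the integer part.
    job_number=${job_id%%.*}
    echo "Submitted $1 as ${job_id} (job number ${job_number})"
}

# Example (not run here):
# submit_job pbs_script
# qstat -f "$job_id"
```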
A simple example of a PBS command file is given below.
The qsub command has many options, which you can list by typing man
qsub at the command prompt on itrda.acesgrid.org. In general, jobs are
submitted using qsub in either "batch" mode (above) or
"interactive" mode using the -I option (below). The -I option declares
that the job is to be run interactively; the -l option allows resource
requirements to be listed as part of the qsub command.
$ qsub -I -l nodes=2
qsub: waiting for job 46167.itrda to start
qsub: job 46167.itrda ready
aE34-500-036:simon <501>:
Notice that once the interactive job starts, you are automatically logged
into the first of the requested nodes. Type exit in this
shell to end the interactive session.
PBS batch script example
#!/bin/csh
#
#filename: pbs_script
#
# Example PBS script to run a job on the myrinet-3 cluster.
# The lines beginning #PBS set various queuing parameters.
#
# o -N Job Name
#PBS -N pbs_script
#
#
# o -l resource lists that control where job goes
# here we ask for 3 nodes, each with the attribute "gigabit".
#PBS -l nodes=3:gigabit
#
# o Where to write output
#PBS -e stderr
#
#PBS -o stdout
#
#
# o Export all my environment variables to the job
#PBS -V
#
echo $PBS_NODEFILE
cat $PBS_NODEFILE
echo 'The list above shows the nodes this job has exclusive access to.'
echo 'The list can be found in the file named in the variable $PBS_NODEFILE'
Submit the file using the command:
$ qsub pbs_script
You should see output something like:
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
/var/spool/PBS/aux/22703.itrda
aE34-500-036
aE34-500-037
aE34-500-038
The list above shows the nodes this job has exclusive access to.
The list can be found in the file named in the variable $PBS_NODEFILE
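A job script often needs to know how many nodes it was given. The sketch below counts the distinct hostnames in the node file, assuming the file holds one hostname per line as in the output above. It is written for /bin/sh rather than csh, and the function name is illustrative.

```shell
# Count the distinct hosts listed in a PBS node file (one hostname per line).
count_nodes() {
    sort -u "$1" | wc -l | tr -d ' '
}

# PBS sets PBS_NODEFILE inside a running job; outside a job it is unset,
# so this report is skipped.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    echo "This job was allocated $(count_nodes "$PBS_NODEFILE") node(s)"
fi
```

Duplicate hostnames are collapsed with sort -u because some PBS configurations list a host once per allocated processor.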
Killing PBS Jobs
If you need to kill a job for any reason (for example, a job submitted in
error), use the qdel command followed by the job identifier:
$ qdel 46784.itrda
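When several jobs were submitted by mistake, the same command can be applied to each identifier in turn. The loop below is only a sketch: the function name is illustrative, and it relies on qdel returning a non-zero status when a deletion fails.

```shell
# Delete each job identifier given on the command line, reporting failures
# individually instead of stopping at the first one.
cancel_jobs() {
    for job_id in "$@"; do
        qdel "$job_id" || echo "could not delete ${job_id}" >&2
    done
}

# Example (not run here):
# cancel_jobs 46784.itrda 46785.itrda
```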