Example ACES queue Charm++ jobs
The Charm++ system is installed on the ACES clusters. It offers both a native parallel programming environment and Adaptive MPI (AMPI), a way of converting MPI codes to Charm++ codes so that they can use Charm++ features such as dynamic load balancing. For more details, refer to the Adaptive MPI manual.
Parallel jobs
By default, PBS parallel jobs use TCP-based MPI communication (over Gigabit or Fast Ethernet, depending on the node).
Script file: example-charm.csh
#!/bin/csh
# invoking Charm++/AMPI on ACES Linux clusters
#
# All PBS options start as "#PBS " and can be specified on the command line
# after qsub instead of being embedded in the script file.
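# For example, the -q, -N and -l settings below could instead be given as
#   qsub -q four -N MyNameCharm -l nodes=16:ppn=1,walltime=00:10:00 example-charm.csh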
#----------------------------------------------
# o Queue name
# -q queue
# Parallel queues available on itrda are:
# four (2hours,16nodes),four-twelve (12hours,26nodes),long (168hours,64nodes)
#PBS -q four
#----------------------------------------------
# o Job name instead of the PBS script filename
# -N Job name (use a distinguishing name)
#PBS -N MyNameCharm
#----------------------------------------------
# o Resource lists
# -l resource lists, separated by a ","
# To ask for N nodes use "nodes=N"
# To ask for 2 processors per node append ":ppn=2" (otherwise ":ppn=1")
# to nodes=N. Preferably use ppn=2 and ask for fewer nodes.
# To ask for Myrinet append ":myrinet", for Gigabit Ethernet ":gigabit",
# after nodes=N:ppn=M
# To specify total wallclock time use "walltime=hh:mm:ss"
#PBS -l nodes=16:ppn=1,walltime=00:10:00
#----------------------------------------------
# o stderr/out combination
# -j {eo|oe}
# Causes the standard error and standard output to be combined in one file.
# For standard output to be added to standard error use "eo"
# For standard error to be added to standard output use "oe"
#
# o stderr/out file names (specify them to avoid the default script.[oe]$PBS_JOBID files)
# -e standard error file
# -o standard output file
# You can append ${PBS_JOBID} to ensure distinct filenames
#PBS -e myrunCharm.stderr
#PBS -o myrunCharm.stdout
#----------------------------------------------
# o Starting time
# -a time
# Declares the time after which the job is eligible for execution.
#----------------------------------------------
# o User notification
# -m {a|b|e}
# Send mail to the user when:
# job aborts: "a", job begins running: "b", job ends: "e"
#PBS -m ae
#----------------------------------------------
# o Exporting of environment
# -V export all my environment variables
#PBS -V
#----------------------------------------------
# Begin execution
#
# Check the environment variables
#
#printenv
#
# Get the right Charm variant module
# net-linux : UDP comms
# net-linux-smp: SMP on a node, UDP outside, higher latency for off-node comms
# net-linux-tcp: TCP comms, more reliable, lower performance
# net-linux-gm : Myrinet
#module add charm/5.9/net-linux
module add charm/5.9/net-linux-smp
#module add charm/5.9/net-linux-tcp
#module add charm/5.9/net-linux-gm
#
# get PBS node info
#
echo $PBS_NODEFILE
cat $PBS_NODEFILE
#----------------------------------------------
# cd to the working directory from which the job was submitted
#
cd $PBS_O_WORKDIR
# How many procs do I have?
setenv NP `wc -l $PBS_NODEFILE | awk '{print $1}'`
# Get a file with unique hostnames
uniq $PBS_NODEFILE > machinefile.uniq.$PBS_JOBID
# How many nodes do I have?
setenv NPU `wc -l machinefile.uniq.$PBS_JOBID | awk '{print $1}'`
# charmrun reads its node list from the file named by $NODELIST
setenv NODELIST nodelist.$PBS_JOBID
# List of unique hosts for the foreach loop
# (setenv cannot hold a multi-word list, so use a csh list variable)
set NODES = (`cat machinefile.uniq.$PBS_JOBID`)
# Create the Charm++ nodefile
echo group main > $NODELIST
# Processors per node granted by PBS (the ppn value)
@ PPN = $NP / $NPU
# Add one host line per node
foreach node ($NODES)
echo host $node ++cpus $PPN >> $NODELIST
end
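# Example (hypothetical hosts node01 and node02, allocated with nodes=2:ppn=2):
# the resulting $NODELIST file would read
#   group main
#   host node01 ++cpus 2
#   host node02 ++cpus 2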
#
# Run the AMPI code called "executable", provided it is in PBS_O_WORKDIR
# and your PATH includes "."
#
ampirun -np $NP ./executable
# Alternatively, run it directly with charmrun (uncomment to use):
#charmrun +p$NP ./executable
# Or run twice as many AMPI virtual processors as physical processors:
#@ NP2 = 2 * $NP
#charmrun +p$NP ./executable +vp$NP2
# Cleanup
# Remove the unique machinefiles
rm $PBS_O_WORKDIR/machinefile.uniq.$PBS_JOBID
rm $PBS_O_WORKDIR/$NODELIST
#
# Exit (not strictly necessary)
#
exit
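The node-file handling in the script above can also be sketched in POSIX sh for users whose job scripts are not written in csh. This is a hedged illustration, not part of the ACES setup: the function names `make_nodelist`, `count_slots` and `count_hosts` are hypothetical, and the sketch assumes, as the script does, that `$PBS_NODEFILE` lists one hostname per allocated processor slot. It derives each host's `++cpus` value from the number of slots PBS listed for that host.

```shell
#!/bin/sh
# Sketch (POSIX sh, not csh): build a Charm++ nodelist from a PBS node file.
# Assumption: the node file has one hostname per allocated processor slot.

# make_nodelist NODEFILE OUTFILE
# Writes "group main" followed by one "host <name> ++cpus <n>" line per host.
make_nodelist() {
    nodefile=$1
    outfile=$2
    echo "group main" > "$outfile"
    # sort | uniq -c prints "<count> <host>" for each distinct host;
    # sorting first also merges hosts that are not listed consecutively.
    sort "$nodefile" | uniq -c | while read -r count host; do
        echo "host $host ++cpus $count" >> "$outfile"
    done
}

# count_slots NODEFILE: total processor slots (one line per slot)
count_slots() {
    wc -l < "$1" | tr -d ' '
}

# count_hosts NODEFILE: number of distinct hosts
count_hosts() {
    sort -u "$1" | wc -l | tr -d ' '
}
```

Inside an sh job script it could be called as `make_nodelist "$PBS_NODEFILE" "nodelist.$PBS_JOBID"`, followed by `NODELIST=nodelist.$PBS_JOBID; export NODELIST` and `charmrun +p"$(count_slots "$PBS_NODEFILE")" ./executable`, on the assumption (as in the csh script) that charmrun picks up the file named by `$NODELIST`. Sorting means the host order can differ from `$PBS_NODEFILE`.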