Running AstroBEAR

NOTE: This tutorial assumes that you have successfully compiled AstroBEAR and set up a problem directory (denoted here by run_dir). If you have not done so, please return to the User Guide for instructions on building your code and your run directory.


Running A Job

There are two ways to run a job: directly and through a queueing system. The direct method is basically like any other Linux/Unix operation—the user starts an n-processor run at the command line and lets it go. Most personal desktop machines such as grass use the direct method—their user-base is manageable, they're simply too small to really warrant a scheduler, and they're seldom used for production runs anyway.

Large shared clusters such as bluehive usually use some kind of scheduling system. Rather than simply running AstroBEAR on one of these systems, you create a job script and submit it to the queue. The job then runs on one of the cluster's compute nodes. Most scheduling systems have an interactive mode you can use for short debugging sessions, but you still have to submit an interactive job to the queue.

IMPORTANT: DO NOT run a job using the direct method if your cluster uses a job scheduling system. Otherwise, you will start your job on the cluster's head-node, the head-node will slow to a crawl, other users will get upset and you will get a cross e-mail from the cluster administrators. If you need/prefer the direct method, then submit an interactive job or stick to an unscheduled cluster.


Grass

One of the aforementioned multicore desktops, grass does not have a scheduler. To run a job on grass, simply move to your run_dir and enter the following command:

mpiexec -np <nprocs> astrobear > outfile.out &

where <nprocs> is the number of processors for your job. This starts an AstroBEAR run and redirects the standard output to outfile.out. The & makes it a background process, so you can watch the output with the command

tail -f outfile.out
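
For example, a hypothetical 8-processor run launched from run_dir might look like this (the processor count and the output file name are only illustrative):

mpiexec -np 8 astrobear > outfile.out &
tail -f outfile.out

Press Ctrl-C to stop watching the output; the background AstroBEAR run keeps going.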


Bluehive

  1. AstroBEAR is distributed with a sample bluehive job script that you can modify to suit your needs. Execute the following commands to copy the required files to your run directory run_dir:
    cp <astrobear_directory>/sumbission_scripts/bluehive.pbs <run_dir>
    cp <astrobear_directory>/sumbission_scripts/launch.s <run_dir>
    
    This copies over the PBS script bluehive uses to start the job, as well as the launch.s script we use to set up certain environment variables.
  1. Move to <run_dir>, and open the bluehive.pbs file. The resource-request line beginning with #PBS -l has five parameters that you will need to customize (a sketch of how these lines might look once edited appears after this list):
    • nodes: The number of nodes your job will use. The number of processors you request is equal to nodes * ppn, and the most processors you can use on bluehive at any given time is 64. This is true whether you are running one 64-processor job or two 32-processor jobs.
    • ppn: The number of processors per node that you will use. Bluehive has eight processors per node, so if you need more than eight processors you will have to vary the nodes parameter.
    • pvmem: The amount of system memory reserved for each processor. The easiest thing to do is just set this to 2000m. If you do this, be sure to set vmem to -1.
    • vmem: The total amount of system memory reserved for the job. This is a more flexible option than pvmem, in that it creates a single pool of memory that all of the job's processors draw from. This should be set to <2000*nodes*ppn>m. If you do this, be sure to set pvmem to -1.
    • walltime: The amount of wall-clock time your job should run. In general, request the shortest walltime you can get away with, because shorter jobs tend to start sooner. The standard queue on bluehive has a maximum walltime of 5 days (120 hours), but a 119-hour job runs almost as long as a 120-hour job and the scheduler will often start it sooner.
  1. Find the line in bluehive.pbs starting with #PBS -M and change the e-mail address on that line to yours.
  1. Find the line in bluehive.pbs starting with #PBS -N. This is the name your job will display in the queue, so change it to something you will find useful/descriptive.
  1. Find the following line in bluehive.pbs:
    mpiexec -np <nprocs> ./launch.s > mpi_outfile.opt.$PBS_JOBID
    
    This is the command to run AstroBEAR; change <nprocs> to match the number of processors your job will use.
  1. Save your changes to bluehive.pbs and enter the following command at the prompt (make sure you are in run_dir):
    qsub bluehive.pbs
    
    If everything is okay in your PBS script, bluehive will return a line of the following form:
    <PBS_JOBID>.bhsn-int.bluehive.crc.private
    
    where <PBS_JOBID> is a numerical ID assigned to your job. You can check up on your job(s) using the command
    showq -u <username>
    
    To get an estimate of your job's start time, use the command
    showstart <PBS_JOBID>
    
    The showstart command doesn't always give an accurate prediction, but it is useful for a rough estimate.
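
To make the above concrete, here is a sketch of how the customized lines in bluehive.pbs might look for a hypothetical 2-node, 16-processor job using pvmem and a 119-hour walltime (the values, e-mail address, and job name are placeholders, and the distributed sample script remains the authoritative starting point):

# resource request: 2 nodes x 8 ppn = 16 processors,
# 2000m per processor (vmem disabled), 119-hour walltime
#PBS -l nodes=2:ppn=8,pvmem=2000m,vmem=-1,walltime=119:00:00
# where job notifications are sent
#PBS -M your_email@rochester.edu
# the job name shown by showq
#PBS -N my_astrobear_run

# start 16 AstroBEAR processes through launch.s
mpiexec -np 16 ./launch.s > mpi_outfile.opt.$PBS_JOBID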

For more information about using the bluehive cluster, check out the CRC's Bluehive wiki page. For those who prefer to read offline, they also have a tutorial in PDF form.


BlueGene

  1. AstroBEAR is distributed with a sample bluegene job script that you can modify to suit your needs. Execute the following commands to copy the required files to your run directory run_dir:
    cp <astrobear_directory>/sumbission_scripts/bluegene.cmd <run_dir>
    cp <astrobear_directory>/sumbission_scripts/launch.s <run_dir>
    
    This copies over the LoadLeveler script bluegene uses to start the job, as well as the launch.s script we use to set up certain environment variables.
  1. Move to the run_dir directory, open the bluegene.cmd file, and find the line that starts with #@ job_name =. This is the name the job will display in the queue; change it to something you find more useful/descriptive.
  1. Find the line in bluegene.cmd that starts with #@ wall_clock_limit. This sets the wall clock time for your job. Change this to the time you want (remembering that BlueGene's maximum wall clock time is 3 days).
  1. Find the line in bluegene.cmd that starts with #@ bg_size. This is the number of nodes that will be allocated to the job; change this to the number of nodes you want. BlueGene's minimum node allocation is 64 nodes, and the CRC recommends you allocate blocks in powers of two (64, 128, 256, etc.) for optimal performance. Each node has four cores on it, so the value you pick for bg_size should be nprocs / 4.
  1. In bluegene.cmd, find the line that starts with mpirun -exp_env BG_MAXALIGNEXP -np <nprocs> and is not commented out (i.e., does not begin with a #). In that line, change <nprocs> to the number of processors you want (a sketch showing how these customized lines fit together appears at the end of this section).
  1. Save your changes to bluegene.cmd and enter the command
    llsubmit bluegene.cmd
    

This will submit your job to the queue. If all goes well, BlueGene will return the output:

llsubmit: The job "bgpfen.crc.rochester.edu.<JOB_ID>" has been submitted.

where <JOB_ID> is a numerical ID that the LoadLeveler queue will use to identify your job. You can check up on your job(s) using the command

llq -b

To get an estimate of your job's start time, use the command

llq -s bgpfen.<JOB_ID>

The llq -s command doesn't always give an accurate prediction, but it is useful for a rough estimate.
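
Putting the pieces together, the customized lines in bluegene.cmd for a hypothetical 256-processor job (64 nodes, the minimum allocation, at 4 cores each) might look like the sketch below; the directive values are illustrative, and the rest of the distributed sample script should be left as-is:

# job name shown in the LoadLeveler queue
#@ job_name = my_astrobear_run
# wall clock limit (BlueGene's maximum is 3 days)
#@ wall_clock_limit = 71:00:00
# 64 nodes x 4 cores per node = 256 processors
#@ bg_size = 64

# in the mpirun line, change only the -np value; leave the rest of the line as distributed
mpirun -exp_env BG_MAXALIGNEXP -np 256 <rest of line unchanged>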

For more information about BlueGene, visit the CRC's BlueGene wiki page.


Itasca

  1. AstroBEAR is distributed with a sample itasca job script that you can modify to suit your needs. Execute the following commands to copy the required files to your run directory run_dir:
    cp <astrobear_directory>/sumbission_scripts/itasca.pbs <run_dir>
    cp <astrobear_directory>/sumbission_scripts/launch.s <run_dir>
    
    This copies over the Torque script that itasca uses to start the job, as well as the launch.s script we use to set up certain environment variables.
  1. Move to <run_dir>, and open the itasca.pbs file. The line beginning with #PBS -l has four parameters that you will need to customize (a sketch of how this line might look once edited appears after this list):
    • walltime: The amount of wall-clock time you intend for your job to take. Itasca's standard queue has a maximum walltime of 24 hours, but shorter jobs generally start sooner. The University of Minnesota group gets a certain number of CPU hours from MSI that they're willing to share with us, so don't be extravagant.
    • mem: Like the vmem parameter on bluehive, this represents the total amount of memory the job will use. Itasca has 3 GB per core, so a mem value of <3000*nprocs>m should be about right (e.g., for 64 processors, use mem=192000m).
    • nodes: The number of nodes your job will use. The number of processors you request is equal to nodes * ppn. While itasca has no limit on the number of nodes you can request (beyond the hard cluster limit of 1088 nodes), smaller jobs nonetheless run sooner. Plus, large jobs eat into the UMN group's CPU-hour allocation, so please be considerate.
    • ppn: The number of processors per node that you will use. Itasca has eight processors per node, so if you need more than eight processors you will have to vary the nodes parameter.
  1. Find the line in itasca.pbs starting with #PBS -N. This is the name your job will display in the queue, so change it to something you will find useful/descriptive.
  1. Look for the cd <directory> line in itasca.pbs and change the directory path to that of your problem directory run_dir.
  1. Save your changes to itasca.pbs and enter the following command at the prompt (make sure you are in the same directory as itasca.pbs):
    qsub itasca.pbs
    
    You can check up on your job(s) using the command
    showq -u <username>
    
    To get an estimate of your job's start time, use the command
    showstart <PBS_JOBID>
    
    where <PBS_JOBID> is a numerical ID assigned to your job (the job ID is returned upon submission and can also be retrieved with the showq -u <username> command). The showstart command doesn't always give an accurate prediction, but it is useful for a rough estimate.
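
As with bluehive, here is a sketch of how the customized lines in itasca.pbs might look for a hypothetical 8-node, 64-processor job (the values and the problem-directory path are placeholders, and the distributed sample script remains the starting point):

# resource request: 23-hour walltime, 64 * 3000m = 192000m of memory,
# 8 nodes x 8 ppn = 64 processors
#PBS -l walltime=23:00:00,mem=192000m,nodes=8:ppn=8
# the job name shown by showq
#PBS -N my_astrobear_run

# change this path to your own problem directory
cd /path/to/run_dir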

For more information about using the itasca cluster, check out MSI's itasca page.
