RTC Cluster
NOTE: The information on this page was last updated in August, 2007. I am fairly certain that their systems have changed since then. This page needs the attention of someone who knows what's going on. --KY
Introduction
Rice's current cluster is a (nominally) 128-node, 262-processor system of Intel Itanium 2 (64-bit), 900MHz processors. Most nodes are dual-processor with 4GB of RAM; there are also 4 quad-processor nodes with 8GB (possibly 16GB) of RAM. Communication between nodes may go over Myrinet or gigabit Ethernet; however, the latter is not recommended because it's honkin' SLOW. The system has (again, nominally) 92 nodes which can support Myrinet.
As of July, 2007, on a processor-to-processor basis RTC is about 1/3rd the speed of the Nova cluster.
Connecting to RTC
One cannot directly connect to the cluster from outside of the Rice University network. In order to connect, one must first SSH to gw.rcsg.rice.edu (usage policy on SSH gateway: http://rcsg.rice.edu/rcsg/policies/gateway.html).
From there, one may login to the cluster at rtc.rice.edu.
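For example, assuming your Rice NetID is netid (a placeholder; substitute your own), the two-hop login looks like this:
ssh netid@gw.rcsg.rice.edu
ssh netid@rtc.rice.edu
The second command is run from the gateway's shell prompt.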
If it is your first time on the system, you will be prompted on both the gateway and the cluster to change the password you specified when filling out the webform. The two new passwords may be the same as each other.
Cluster documentation is located at http://rcsg.rice.edu/rtc.
Compiling on Rice
Details on compiling libraries for AstroBEAR (HDF4, HDF5, PGPLOT) may be found at CompilingOnRice.
Useful Commands
Several commands may be used with the MAUI scheduler.
- showq or qstat: These commands display the current job schedule. They show almost the same information, but showq sorts the Running Jobs by Estimated Finish Time (jobs ending soonest at top) and qstat sorts the Running Jobs by Job ID. Additional information about running jobs can be obtained with showq -r.
- checkjob <JOB ID>: This command will return information about job number <JOB ID>, such as what resources are allocated.
- showstart <JOB ID>: For a job <JOB ID> in the idle queue, this command will give the estimated start and finish times.
- showbf: Show the "backfill", namely, the number of processors that are free.
- qdel <JOB ID>: This command will delete job <JOB ID>, whether it is running, idle, or blocked.
- diagnose -f: This command returns the "fair use" statistics of the cluster and is of purely academic interest. diagnose has other interesting arguments which may be passed; typing diagnose without any arguments shows the list. Of note are -n (node diagnostics) and -j (job diagnostics). I (Kris) suspect this command is supposed to be admins-only. Related commands include showstats and showgrid.
As with most clusters, you do not need to supply the complete <JOB ID>; that is, while most <JOB ID>'s are of the form 72320.management, simply providing 72320 to the above commands is sufficient.
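For example, a typical session using the job ID above (72320 is just the illustration from this page; use your own job's ID):
showq
checkjob 72320
showstart 72320
qdel 72320
This lists the schedule, checks what resources job 72320 has been allocated, estimates when it will start if it is still idle, and finally removes it from the queue.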
Queue System
The RTC queuing system is largely automatic. The only time when one must specify a queue is for Interactive or Super queue jobs.
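For example, an interactive session would presumably be requested by naming the queue with qsub's -q flag along with -I for an interactive job; the queue name below is a guess, so check the cluster documentation for the actual name:
qsub -I -q interactive -l nodes=1:ppn=2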
Displaying Realtime Program Output
The PBSPro system does not allow direct viewing of program output while a program is running. Instead, it captures all program output and, at the end of execution, writes it to a file specified in the job script. Therefore, one cannot keep track of how a job is progressing (apart from analyzing the contents of the out/ directory) until the job has finished.
This behavior can be thwarted by redirecting the output to a file. For example, within a PBS job script
... mpiexec ./mpibear > /shared.scratch/username/output.out
This file may then be tail'd. However, as of 8/6/07, this method produces a file which has an address error at the end, meaning that the usual method,
tail -f /shared.scratch/username/output.out
DOES NOT WORK. You can work around this by using watch:
watch -n 10 tail -n 30 /shared.scratch/username/output.out
This tells watch to display the last 30 lines of the output file (tail -n 30) every 10 seconds (-n 10). It will still report the address error, but it will continue to function.
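Putting the pieces together, a minimal job script might look like the following sketch. The walltime, job name, and output path are placeholders, and mpibear is the AstroBEAR binary name used elsewhere on this page; adjust them to your own run:
#!/bin/bash
#PBS -l nodes=2:ppn=2:myrinet
#PBS -l walltime=24:00:00
#PBS -N mpibear
cd $PBS_O_WORKDIR
mpiexec ./mpibear > /shared.scratch/username/output.out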
Quad Jobs
Just as the option to use myrinet is specified in the job script, for example,
#PBS -l nodes=2:ppn=2:myrinet
the quad-processor nodes (n127-n130) should be accessible via
#PBS -l nodes=2:ppn=4:quad
However, using checknode, it appears that all of the quad-processor nodes have been reserved for Super queue jobs.
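For example, querying one of those nodes directly (n127 is taken from the range above):
checknode n127
shows its state and any reservations against it.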