Meeting update - Erica, 2/10/14
Group-related stuff:
I make a motion to move the group meeting to a less busy day. If this motion passes, I volunteer to help organize another time that works for everyone. The reason I am proposing the switch is that Monday is quite stressful/hectic with back-to-back commitments: 11:30-12 grad students meet with the speaker, 12-1 Pizza Lunch, 1:30-3 group meeting, and 3:30-5 lecture. I personally find it hard to prepare well for our Monday meeting under these circumstances. And on another personal note, I think a Friday meeting would be best: we could wrap up the week with updates, and there are no other meetings or classes that day.
Science Stuff:
Colliding Flows
I am attempting to optimize the wall time of the runs on the different local clusters. The two runs I have made decent progress on are the 15 and 30 shear-angle cases. These have been restarted and moved between bluehive and bluestreak to find a sweet spot. Here are some findings:
- In 1 day on bluestreak (512 processors), the restart produced 7 new frames.
- The same number of frames was produced on bluehive (64 cores, standard queue), but over 3 days.
- The restart frame is 32.
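To put these numbers on a common footing, here is a quick back-of-envelope throughput comparison using the frame and core counts above (treating core-days per frame as a cost metric assumes the run keeps all cores busy for the full wall time, which I have not verified):

```python
# Rough throughput comparison for the colliding-flows restarts,
# using the frame counts and core counts quoted above.
runs = {
    "bluestreak": {"cores": 512, "frames": 7, "days": 1.0},
    "bluehive":   {"cores": 64,  "frames": 7, "days": 3.0},
}

for name, r in runs.items():
    frames_per_day = r["frames"] / r["days"]
    # Crude "cost" per frame; assumes all cores were busy the whole time.
    core_days_per_frame = r["cores"] * r["days"] / r["frames"]
    print(f"{name}: {frames_per_day:.1f} frames/day, "
          f"{core_days_per_frame:.0f} core-days/frame")
```

By this measure bluestreak gives about 3x the wall-clock throughput (7 vs. ~2.3 frames/day) but costs roughly 2.7x more core-days per frame (~73 vs. ~27); BG/Q cores are also generally slower per core than commodity cluster cores, so the per-core comparison across machines is only approximate.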
Comparing frame 33 on bluehive and bluestreak:
Mesh on bluehive:
Mesh on bluestreak:
Rho on bluehive:
Rho on bluestreak:
Computing
Worked on getting a baseline memory estimate of astrobear:
http://astrobear.pas.rochester.edu/trac/astrobear/wiki/u/erica/memoryBaseline
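As a very rough illustration of the kind of estimate that page is after (every constant below is a placeholder I made up, not a measured astrobear value), memory should scale as a fixed per-process baseline plus bytes per cell times the total cell count:

```python
# Hypothetical memory estimate for an AMR run. The constants are
# illustrative placeholders, NOT measured astrobear numbers -- see
# the memoryBaseline wiki page above for the real measurements.
BASE_MB_PER_PROCESS = 300.0   # assumed fixed overhead per MPI process
BYTES_PER_CELL = 8 * 20       # 8-byte reals x ~20 field/workspace arrays (assumed)

def estimate_total_gb(cells_per_level, n_processes):
    """Estimate total memory (GB) from cell counts on each AMR level."""
    total_cells = sum(cells_per_level)
    data_gb = total_cells * BYTES_PER_CELL / 1024**3
    base_gb = n_processes * BASE_MB_PER_PROCESS / 1024
    return data_gb + base_gb

# Example: 256^3 base grid, plus two refined levels each covering
# ~10% of the base volume at 2x and 4x the base resolution.
levels = [256**3, int(0.1 * 8 * 256**3), int(0.1 * 64 * 256**3)]
print(f"~{estimate_total_gb(levels, n_processes=64):.1f} GB across 64 processes")
```

The point of a baseline measurement is to pin down the two constants here empirically; the per-process overhead in particular is hard to predict from first principles.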
Found out that only 96 cores are available per user at any given time on bluehive, and that bluestreak jobs are limited to 24 hrs (48 hrs max with a reservation). I asked CIRC about bluestreak and got this email, which I thought was relevant to share with the group:
Erica,
The queue time limits on BlueStreak are to ensure that everyone gets a share of the machine, and a little bit to force people to checkpoint their code. Month-long run times run the risk of something happening before the job finishes. If the job is not checkpointed and restartable, all the run time is lost if the job and/or the node crashes. Also, if the job is in an infinite loop, the whole month's worth of run time would be lost. So the time limits are short to ensure higher turnover of jobs and good programming practices.
The scheduler is configured to give all users a fair share of time on the BG/Q. So people that have only occasional jobs don't get shut out by people that continually shove lots of jobs into the queue. The share is based on the number of nodes multiplied by the run time of jobs over the past 2 weeks. The fair share and wait time in the queue are the primary factors in setting priorities in the scheduler.
Your jobs have been an exception that I am still trying to figure out. They should be starting in less than a day, but they have been taking days to get started. We recently updated to a new version of the scheduler, so I would be interested to see if this fixes the problems with the 256-node jobs.
All of this can be overridden with a reservation, but there are two problems with that. One is that the smallest reservation we can make is half the machine. So your code would need to scale to 512 nodes, or maybe run as multiple jobs in parallel to make use of all the nodes.
The bigger problem is how to justify dedicating half the machine to a single person for a month or two. This is a shared resource for the UR and URMC research communities and we try to be fair in allocating time to all the users.
I hope this explains some of our reasoning behind our policies with regard to the BlueStreak. Feel free to contact us if you still have questions.
Carl
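For concreteness, the fair-share bookkeeping Carl describes, charging each user the sum of nodes times run time over a two-week window, works out to something like this (a sketch of my reading of the email, not CIRC's actual scheduler code; the job records are invented):

```python
# Sketch of the fair-share usage metric from Carl's email: usage is
# the sum of (nodes x run time) over jobs from the past two weeks,
# and higher usage means lower queue priority. Job records invented.
from datetime import datetime, timedelta

now = datetime(2014, 2, 10)
window = timedelta(weeks=2)

jobs = [
    # (finish time, nodes, run time in hours)
    (datetime(2014, 2, 8), 128, 24.0),
    (datetime(2014, 2, 5), 128, 24.0),
    (datetime(2014, 1, 20), 256, 24.0),  # older than 2 weeks: ignored
]

usage_node_hours = sum(nodes * hours
                       for finished, nodes, hours in jobs
                       if now - finished <= window)
print(f"node-hours counted against fair share: {usage_node_hours:.0f}")
# -> 6144: only the two recent jobs count; the January job has aged out.
```

If this reading is right, it may also explain why large restarts temporarily sink one's queue priority: a single 256-node day adds 6144 node-hours to the two-week tally.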
Documentation Stuff:
When do you all want to get together and talk about documentation?
Attachments (14)
- rhoCF15.gif (1.6 MB)
- 15rho21.png (62.7 KB)
- 15rho21.2.png (62.7 KB)
- 30rho21.png (66.3 KB)
- rhoCF30.gif (2.4 MB)
- sh2.png (50.4 KB)
- sht1.png (46.8 KB)
- wtf.png (96.9 KB)
- bluestreak_rho.png (43.4 KB)
- bluehive_rho.png (44.2 KB)
- bluehive_mesh.png (26.9 KB)
- bluestreak_mesh.png (26.9 KB)
- bluehive_mesh2.png (23.8 KB)
- bluestreak_mesh2.png (24.2 KB)
Comments
It looks like frame 39 was from a separate run on bluestreak?