Performance comparison between BlueStreak and Grass

I ran a fixed-grid colliding flow setup (with self-gravity, cooling, etc.) for a few time steps to compare the performance of BlueStreak using 1, 8, or 16 cores on a single node, as well as a single core on grass. I also compared different levels of optimization (-O3 and -O4). I tried -O5, but the code dies soon after outputting the first frame. I have tried compiling with traceback enabled to find out where the code is dying, but I can't seem to get any useful information from the core dumps, so I'm not sure what is going wrong there.

The figure below shows the cell updates/core/second over the first 10 time steps for various numbers of cores on BlueStreak, at optimization levels 3 and 4, as well as a single-core run on grass with level-3 optimization (1_3g).

In any event, the performance is about what you would expect. The code seems to slow down after a few time steps, due either to CPU temperature or perhaps to the cooling source terms. Optimization does not make much of a difference, less than 1% (at least going from -O3 to -O4). Also, in going from 1 to 16 cores per node, the efficiency only drops by around 10%, which is surprisingly good. Overall, the speed per core is about 25% that of grass. Given that grass has a 2.5 GHz processor and that each core on BlueStreak is supposedly running at 1.6 GHz, this is a little surprising: a single BlueStreak core appears to be performing at only 39% of what you would expect after accounting for clock speed (see the quick check below).
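
For reference, here is the arithmetic behind that 39% figure as a quick sketch in Python. The 2.5 GHz and 1.6 GHz clock speeds and the ~25% per-core ratio are the numbers quoted above (the 25% is read off the plot, so the result is approximate):

```python
# Quick check of the clock-speed argument above.
grass_clock = 2.5        # GHz, grass core
bluestreak_clock = 1.6   # GHz, BlueStreak (BG/Q) core

# Expected per-core ratio if performance scaled purely with clock speed
expected_ratio = bluestreak_clock / grass_clock   # 0.64

# Measured per-core ratio (BlueStreak vs grass cell updates/core/second), from the figure
measured_ratio = 0.25

# Fraction of the clock-scaled expectation actually achieved (~39%)
achieved = measured_ratio / expected_ratio
print(f"expected ratio {expected_ratio:.2f}, achieved fraction {achieved:.0%}")
```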


Comments

1. Baowei Liu -- 12 years ago

Now I'm not quite sure "cell updates per second" is a good measure of the performance. I ran MultiClumps on different numbers of cores on blue streak. Even though the cell updates/sec got lower on more cores, the total run times showed AstroBEAR ran faster (the speedups are worked out in the sketch after the table). On 16 cores, AstroBEAR ran at about the same speed as on grass (about 2400 seconds). On 32 cores, it ran much faster.

Nodes  Cores  Cell Updates/Sec  Total Run Time
1      8      ~4000             4387 secs
1      16     ~3500             2689 secs
2      32     ~2800             1620 secs
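
To make the run-time point concrete, here is a small sketch that turns the timings in the table above into speedups and parallel efficiencies, using the 8-core run as the baseline (the per-core cell-update rate drops, but the total time still scales down reasonably well):

```python
# Speedup and parallel efficiency from the MultiClumps timings above,
# using the 8-core run as the reference point.
runs = [
    # (cores, total run time in seconds)
    (8, 4387),
    (16, 2689),
    (32, 1620),
]

base_cores, base_time = runs[0]
for cores, seconds in runs:
    speedup = base_time / seconds
    efficiency = speedup / (cores / base_cores)
    print(f"{cores:3d} cores: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

With these numbers the efficiency relative to 8 cores comes out to roughly 82% at 16 cores and 68% at 32 cores.
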
2. Jonathan -- 12 years ago

The statistic that AstroBEAR gives you is cell updates per second per core. Run time should decrease when you double the number of cores, even if each core is effectively doing fewer cell updates per second. I think we should try to understand whether we are getting the correct performance for a single core before we do multiple-core runs.

3. Baowei Liu -- 12 years ago

That makes sense. Brendan suggested I check the performance of libraries like fftw3, hdf5 and hypre on blue streak; he mentioned there's an optimized version of the fftw3 library in IBMmath installed on blue streak. I also made a reservation on blue gene/p to do some performance testing with AstroBEAR before it is decommissioned.
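
As a first pass at that, something like the following single-core timing harness could be adapted. This is only a rough sketch that uses numpy's FFT as a stand-in and a hypothetical grid size; the real check would time the fftw3/IBMmath, hdf5 and hypre calls that AstroBEAR actually links against:

```python
import time
import numpy as np

# Rough single-core FFT timing sketch (assumes numpy is available).
# The real test would exercise AstroBEAR's own fftw3/hdf5/hypre calls,
# but this gives a quick baseline for one blue streak core vs. one grass core.
n = 128                              # cells per dimension (hypothetical size)
data = np.random.rand(n, n, n)

t0 = time.perf_counter()
reps = 10                            # repeat to average out timer noise
for _ in range(reps):
    np.fft.fftn(data)
elapsed = (time.perf_counter() - t0) / reps
print(f"{n}^3 FFT: {elapsed * 1e3:.1f} ms per transform")
```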

4. Baowei Liu -- 12 years ago

Here are my results from trying AstroBEAR on Blue Gene/P. It seems to run a little slower on BGP than on blue streak, but very close (see the comparison after the table):

Nodes  Cores  Cell Updates/Sec  Total Run Time
1      4      4636              8280 secs
2      8      3951              4784 secs
4      16     3431              2730 secs
8      32     2790              1658 secs
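
One way to see the "a little slower but very close" claim is to compare total run times at matched core counts against the blue streak numbers from comment 1 (a quick sketch using the two tables above):

```python
# Total run times (seconds) at matched core counts, taken from the tables
# in comments 1 (blue streak) and 4 (Blue Gene/P).
bluestreak = {8: 4387, 16: 2689, 32: 1620}
bgp = {8: 4784, 16: 2730, 32: 1658}

for cores in sorted(bluestreak):
    slowdown = bgp[cores] / bluestreak[cores] - 1.0
    print(f"{cores:2d} cores: BG/P slower by {slowdown:.1%}")
```

This works out to roughly 9% slower at 8 cores and about 2% slower at 16 and 32 cores.
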
5. Baowei Liu -- 12 years ago

Update: information from other groups:

Yes, we used namd, although we only compared with the BGP.  Per core, it seems to be slightly slower.  The benefit is that it scales better than the P (the plateau is further to the right, using more nodes).  Supposedly, the trick for good performance is to utilize the extra hardware threads per core, so you have 64 threads per node.  In this case, we do get good performance.  For namd, however, this configuration is not stable and does not scale well (by which I mean it may run for 16 nodes but will SEGV for 32 nodes).
Also, here is a link to scaling/performance studies done by another group in Canada using NAMD on a Q:
http://wiki.scinethpc.ca/wiki/index.php/Namd_on_BGQ
I didn't do a per-core comparison either, but the per-node performance of Gromacs on BGQ (64 threads) is about 40% slower than on our dept. Linux cluster (8 cores).