Cleaning up portable thread implementation
I have a reservation for an 8-hour interactive session on 32 of the afrank nodes tomorrow, during which I will be doing some final testing of the new dynamic load balancing scheme and making some minor tweaks. Today I am cleaning up the code changes made for the portable threads and checking the implementation, both so that I will be ready for tomorrow and so that the changes can be checked into the main repo. I am also working on reformatting the paper into the proceedings format for the AstroNUM 2011 conference.
Encouraging results
While the reservation was for 4 nodes with 8 procs per node, the afrank nodes had hyperthreading enabled, which caused mpirun to put 16 processes on a single node, or two processes per processor. The idea behind hyperthreading is that you oversubscribe the processors so that they can switch back and forth between jobs. By giving each processor two instances of the code, when one process is stuck waiting for a message the other process can still work. The processor time per cell update would then appear to double, while the actual core time per cell update should remain constant, or even drop slightly if hyperthreading gives an improvement. Going from 8 to 16 processes, however, the cpu time per cell update more than doubles, indicating that hyperthreading actually hurts performance. It also makes it more difficult to schedule tasks at the application level, since the processor is switching back and forth between two instances of the code. The takeaway is that hyperthreading should not be used with AstroBEAR 2.0. Notice that the dynamic load balancing with the portable threads still gives better performance whether or not hyperthreading is enabled.
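Before tomorrow's runs it may be worth double checking where mpirun actually places the ranks. Below is a minimal sketch (generic C/MPI, not part of AstroBEAR) in which rank 0 counts how many ranks landed on each node, which makes this kind of oversubscription easy to spot; depending on the MPI version, an option like Open MPI's `-npernode 8` can then be used to force 8 processes per node.

```c
/* Minimal placement check (not AstroBEAR code): every rank reports its host
 * name and rank 0 counts how many ranks share each node. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    memset(name, 0, sizeof(name));
    MPI_Get_processor_name(name, &len);

    /* Gather every rank's host name on rank 0. */
    char *all = NULL;
    if (rank == 0)
        all = malloc((size_t)size * MPI_MAX_PROCESSOR_NAME);
    MPI_Gather(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
               all,  MPI_MAX_PROCESSOR_NAME, MPI_CHAR, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        /* Print each distinct host once, with the number of ranks on it. */
        for (int i = 0; i < size; i++) {
            int count = 0, first = 1;
            for (int j = 0; j < size; j++) {
                if (strcmp(&all[i * MPI_MAX_PROCESSOR_NAME],
                           &all[j * MPI_MAX_PROCESSOR_NAME]) == 0) {
                    count++;
                    if (j < i) first = 0;
                }
            }
            if (first)
                printf("%s: %d ranks\n",
                       &all[i * MPI_MAX_PROCESSOR_NAME], count);
        }
        free(all);
    }
    MPI_Finalize();
    return 0;
}
```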
CPU time per cell update for 8, 16, and 32 processes under each threading mode:

| | No Threading | Scheduling | | Portable Threads | |
|---|---|---|---|---|---|
| Level Balanced | T | T | F | T | F |
| 8 | 3.0360618888 | 3.5354251260 | 3.9989763675 | 2.4490875460 | 2.3749391790 |
| 16 (HyperThreading) | 7.1959148653 | 7.1250127389 | 5.3294605455 | | |
| 32 (HyperThreading?) | 12.1093259489 | 14.4666908237 | 9.2395890449 | | |
(Not sure if the 32 core run used both nodes or not…)
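As a quick check of the claim above: in the first column, 7.1959 / 3.0361 ≈ 2.4, whereas ideal sharing of each core between two hyperthreaded processes would give a ratio of exactly 2, so the runs are spending roughly 18% more core time per cell update than perfect sharing would predict.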
I was also somewhat surprised to see that global load balancing finally improved performance. Global load balancing is the opposite of balancing every level independently. Previously, performance was better when each level was balanced by splitting grids: while this created more, smaller grids and therefore increased the overall workload, the improved distribution outweighed the cost. Perhaps that was only because the old load balancing scheme was not correctly compensating for imbalances.
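As a concrete illustration of that fragmentation overhead, here is a toy calculation (not AstroBEAR code; the 64² patch size and ghost-zone width of 2 are just assumptions) showing how splitting one patch into four quarters inflates the number of cells that have to be stored and updated:

```c
/* Toy ghost-zone overhead estimate: one 64x64 patch versus four 32x32
 * patches covering the same region.  The ghost-zone width is an assumption. */
#include <stdio.h>

/* Cells in an nx-by-ny patch once 'ng' ghost cells are added on each side. */
static long padded_cells(long nx, long ny, long ng)
{
    return (nx + 2 * ng) * (ny + 2 * ng);
}

int main(void)
{
    const long ng  = 2;                            /* assumed ghost width */
    long whole     = padded_cells(64, 64, ng);     /* one 64x64 patch     */
    long split     = 4 * padded_cells(32, 32, ng); /* four 32x32 patches  */

    printf("one 64x64 patch:    %ld cells (with ghost zones)\n", whole);
    printf("four 32x32 patches: %ld cells (with ghost zones)\n", split);
    printf("overhead from splitting: %.1f%%\n",
           100.0 * (split - whole) / whole);
    return 0;
}
```

For these particular numbers the four quarter patches carry about 12% more cells than the single patch, all of it ghost-zone overhead.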
Here is an image of the distribution, showing the additional fragmentation on the right when Level Balanced is true. The top panels show every patch, while the bottom panels show the mpi_id of the patches.
Finished looking at results from the 90² cells per proc per level test… Tomorrow I have a reservation and will generate similar data for 16², 32², 64², 128², and 256² if there is time. As the number of cells per proc per level increases, all of the lines will flatten out, but I should have all the necessary data for the paper.
Attachments (2)
- LoadBalancing.png (29.2 KB)
- ScalingResults.png (48.3 KB)