timinginfo
Added timers to every do loop in the hyperbolic update.
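The instrumentation itself is simple: each do loop is bracketed with wall-clock timers keyed by the loop id shown in the table below. A minimal sketch, assuming an OpenMP build (`accumulate_time` and the loop body are illustrative stand-ins, not the actual routines in the code):

```fortran
subroutine timed_loop_sketch(q, n)
   use omp_lib
   implicit none
   integer, intent(in)    :: n
   real(8), intent(inout) :: q(n)
   real(8) :: t0, t1
   integer :: i

   t0 = omp_get_wtime()
   !$omp parallel do private(i)
   do i = 1, n
      q(i) = 2d0*q(i)                     ! stand-in for the real hyperbolic update work
   end do
   !$omp end parallel do
   t1 = omp_get_wtime()

   ! call accumulate_time(82, t1 - t0)    ! hypothetical accumulator keyed by loop id
end subroutine timed_loop_sketch
```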
Here is the breakdown in terms of percent of total time and threaded speedup (using 32 threads):
id | serial time (s) | speedup | 32-thread time (s) | percent of total | description
---|---|---|---|---|---
82 | 3.231047 | 9.128274 | 0.353960343 | 9.89% | predictor riemann solves |
80 | 3.121115 | 8.762267 | 0.356199486 | 9.55% | predictor riemann solves |
84 | 2.997481 | 9.587972 | 0.312629303 | 9.17% | predictor riemann solves |
69 | 2.919109 | 22.452476 | 0.130012788 | 8.93% | eigen system |
213 | 2.752028 | 9.994905 | 0.275343087 | 8.42% | final riemann solves |
210 | 2.63407 | 9.832083 | 0.26790559 | 8.06% | final riemann solves |
216 | 2.617439 | 9.956359 | 0.262891183 | 8.01% | final riemann solves |
23 | 1.504078 | 23.787341 | 0.063230186 | 4.60% | characteristic tracing |
25 | 1.499858 | 23.942636 | 0.062643812 | 4.59% | characteristic tracing |
20 | 1.495662 | 24.030097 | 0.062241197 | 4.58% | characteristic tracing |
19 | 1.38434 | 23.166956 | 0.059754937 | 4.24% | 2 spatial reconstruction |
18 | 0.689309 | 22.116419 | 0.031167297 | 2.11% | spatial reconstruction |
2 | 0.128533 | 19.941572 | 0.00644548 | 0.39% | cons_to_prim |
97 | 0.110934 | 20.652223 | 0.005371528 | 0.34% | upwinded emfs predictor |
96 | 0.099056 | 19.557486 | 0.005064864 | 0.30% | upwinded emfs predictor |
95 | 0.098898 | 19.709309 | 0.005017832 | 0.30% | upwinded emfs predictor |
225 | 0.091149 | 19.994068 | 0.004558802 | 0.28% | upwinded emfs |
159 | 0.090894 | 20.499695 | 0.00443392 | 0.28% | prim_to_cons2 |
147 | 0.090303 | 20.1791 | 0.004475076 | 0.28% | prim_to_cons2 |
And here is a nifty plot showing the time spent in each loop vs. threading efficiency.
The lower-right cluster is the 6 different Riemann solve loops (3 dimensions each for the predictor and final steps), which show the most promise for improving the speedup. The upper right is the calculation of the eigensystem, which seems to scale quite well. In the center are the spatial reconstruction and characteristic tracing loops.
The weighted average speedup over these loops works out to about 12 (roughly 38% efficiency on 32 threads), mostly dragged down by the Riemann solves.
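For reference, summing the serial and 32-thread columns of the table above gives the aggregate behind that estimate:

$$
\bar{S} \;=\; \frac{\sum_i t_i^{\text{serial}}}{\sum_i t_i^{\text{32}}} \;\approx\; \frac{27.6\ \text{s}}{2.27\ \text{s}} \;\approx\; 12,
\qquad
\bar{E} \;=\; \frac{\bar{S}}{32} \;\approx\; 38\%.
$$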
So the problem was the allocation of temporary variables inside the Riemann solvers. Different threads would allocate these arrays at the same time, and they would end up close together in memory, which presumably led to false sharing. In any event, the scaling of the Riemann solvers is now much better, and they are a small portion of the runtime. The threaded bits now get an average speedup of 21. However, the non-threaded parts now make up 90% of the time…
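A sketch of the change, with illustrative names and a trivial stand-in for the actual solve, assuming the fix was simply to stop heap-allocating the scratch arrays inside the threaded region:

```fortran
subroutine riemann_sketch(nvars, qL, qR, flux)
   implicit none
   integer, intent(in)  :: nvars
   real(8), intent(in)  :: qL(nvars), qR(nvars)
   real(8), intent(out) :: flux(nvars)
   ! old pattern: real(8), allocatable :: work(:) with allocate(work(nvars))
   ! on every call - concurrent allocations from different threads can land
   ! in adjacent heap blocks and share cache lines
   real(8) :: work(nvars)            ! new: automatic array on the thread's own stack
   work = 0.5d0*(qL + qR)            ! stand-in for the actual Riemann solve
   flux = work
end subroutine riemann_sketch
```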
Here is an updated graph showing the scaling performance of the same do loops.
I also looked at reducing the time spent in calc-eigens. I suspected that the manipulation of a MaxVars by MaxVars static array might be responsible, so I reduced MaxVars from 20 to 10. The result was a significant speedup.
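A hypothetical illustration of the storage involved (the actual array names in calc-eigens may differ): halving MaxVars quarters the amount of eigenvector data each thread has to stream through per cell.

```fortran
subroutine calc_eigens_sketch()
   implicit none
   integer, parameter :: MaxVars = 10                          ! reduced from 20
   real(8) :: lEvecs(MaxVars, MaxVars), rEvecs(MaxVars, MaxVars)
   ! each eigen decomposition touches MaxVars*MaxVars reals per array
   lEvecs = 0d0
   rEvecs = 0d0
end subroutine calc_eigens_sketch
```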
I also added timers for the source routines (even though there were no source terms), and they seemed responsible for the 90% of the time in the threaded runs. If I comment those out, I get a 16-fold speedup with threads on a Blue Gene node for a fixed-grid computation.
I next tried an AMR run, but that seems to be running slower due to extra refinement. See attached screenshot.
Evidently the threading is producing noise which is triggering the extra refinement.
So it turns out I was attempting to use allocatable variables with module scope as private within a threaded do loop… but apparently this does not work with OpenMP. I changed them to fixed-size arrays declared at subroutine scope, and it seems to work fine now. On a 162+2 field loop advection run, the single-node threaded version was 20% faster than the single-node MPI version.
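A minimal sketch of the change, with names assumed for illustration (the old pattern is shown only in the comment block; the privatization strategy is the point here):

```fortran
! Old pattern (problematic): scratch arrays declared as module-level
! allocatables and listed in the PRIVATE clause, e.g.
!
!    module hyperbolic_scratch
!       real(8), allocatable :: qLeft(:), qRight(:)
!    end module
!    ...
!    !$omp parallel do private(i, qLeft, qRight)
!
! which did not behave correctly with the OpenMP setup used here.
! Working pattern: fixed-size arrays declared at subroutine scope, so each
! thread gets its own private copy with no heap allocation involved:
subroutine sweep_sketch(nCells)
   implicit none
   integer, parameter  :: MaxVars = 10
   integer, intent(in) :: nCells
   real(8) :: qLeft(MaxVars), qRight(MaxVars)
   integer :: i
   !$omp parallel do private(i, qLeft, qRight)
   do i = 1, nCells
      qLeft  = real(i, 8)          ! stand-in for the reconstruction / Riemann solve
      qRight = real(i, 8)
   end do
   !$omp end parallel do
end subroutine sweep_sketch
```

Fixed-size subroutine-local scratch also sidesteps the allocator contention and false sharing noted above.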
Attachments (4)
- Screen Shot 2013-06-04 at 2.27.01 PM.png (65.5 KB) - added 11 years ago.
- Screen Shot 2013-06-04 at 5.08.35 PM.png (40.2 KB) - added 11 years ago.
- Screen Shot 2013-06-05 at 4.34.15 PM.png (169.2 KB) - added 11 years ago.
- Screen Shot 2013-06-05 at 4.44.36 PM.png (52.9 KB) - added 11 years ago.