timinginfo

Added timers to every do loop in the hyperbolic update.
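
The timers themselves are nothing fancy - just wall-clock stamps around each loop. A minimal sketch of that kind of instrumentation (the loop_times array and the loop id are made-up bookkeeping names, not the actual ones in the code):

{{{
SUBROUTINE timed_loop_example(mx, loop_times)
   USE omp_lib, ONLY: omp_get_wtime
   INTEGER, INTENT(IN) :: mx
   REAL(KIND=8), INTENT(INOUT) :: loop_times(:)   ! accumulated time per loop id
   INTEGER :: i
   REAL(KIND=8) :: t0

   t0 = omp_get_wtime()
   !$OMP PARALLEL DO PRIVATE(i)
   DO i = 1, mx
      ! ... body of one of the hyperbolic update loops ...
   END DO
   !$OMP END PARALLEL DO
   loop_times(82) = loop_times(82) + (omp_get_wtime() - t0)   ! 82 = this loop's id
END SUBROUTINE timed_loop_example
}}}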

Here is the breakdown in terms of percent of total time and threaded speedup (using 32 threads):

 id  | serial time | speedup   | 32 thread time | percent of total | description
-----+-------------+-----------+----------------+------------------+--------------------------
  82 | 3.231047    |  9.128274 | 0.353960343    | 9.89%            | predictor riemann solves
  80 | 3.121115    |  8.762267 | 0.356199486    | 9.55%            | predictor riemann solves
  84 | 2.997481    |  9.587972 | 0.312629303    | 9.17%            | predictor riemann solves
  69 | 2.919109    | 22.452476 | 0.130012788    | 8.93%            | eigen system
 213 | 2.752028    |  9.994905 | 0.275343087    | 8.42%            | final riemann solves
 210 | 2.63407     |  9.832083 | 0.26790559     | 8.06%            | final riemann solves
 216 | 2.617439    |  9.956359 | 0.262891183    | 8.01%            | final riemann solves
  23 | 1.504078    | 23.787341 | 0.063230186    | 4.60%            | characteristic tracing
  25 | 1.499858    | 23.942636 | 0.062643812    | 4.59%            | characteristic tracing
  20 | 1.495662    | 24.030097 | 0.062241197    | 4.58%            | characteristic tracing
  19 | 1.38434     | 23.166956 | 0.059754937    | 4.24%            | 2 spatial reconstruction
  18 | 0.689309    | 22.116419 | 0.031167297    | 2.11%            | spatial reconstruction
   2 | 0.128533    | 19.941572 | 0.00644548     | 0.39%            | cons_to_prim
  97 | 0.110934    | 20.652223 | 0.005371528    | 0.34%            | upwinded emfs predictor
  96 | 0.099056    | 19.557486 | 0.005064864    | 0.30%            | upwinded emfs predictor
  95 | 0.098898    | 19.709309 | 0.005017832    | 0.30%            | upwinded emfs predictor
 225 | 0.091149    | 19.994068 | 0.004558802    | 0.28%            | upwinded emfs
 159 | 0.090894    | 20.499695 | 0.00443392     | 0.28%            | prim_to_cons2
 147 | 0.090303    | 20.1791   | 0.004475076    | 0.28%            | prim_to_cons2
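
For reference, the columns are related as you'd expect - the 32-thread time is just the serial time divided by the speedup, and the percent column is that loop's share of the total serial runtime. Using the first row:

{{{
32 thread time   = serial time / speedup = 3.231047 / 9.128274 ≈ 0.353960
percent of total = 3.231047 / total serial time ≈ 9.89%
}}}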

And here is a nifty plot showing the time spent on each loop vs. threading efficiency.

The lower right cluster is the 6 different riemann solve loops (3 dimensions each for the predictor and final passes), which show the most promise for improving the speedup. The upper right is the calculation of the eigen system, which seems to scale quite well. In the center are the spatial reconstruction and characteristic tracing loops.

The weighted average speedup works out to only around 12 on 32 threads - mostly held down by the riemann solves.
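
That figure is just the total serial time of the listed loops divided by their total 32-thread time:

{{{
sum(serial time)    ≈ 27.56
sum(32 thread time) ≈  2.27
weighted speedup    ≈ 27.56 / 2.27 ≈ 12
}}}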

So the problem was the allocation of temporary variables inside the riemann solvers. Different threads would allocate these arrays at the same time, and they would end up close together in memory, which presumably led to false sharing. In any event, the scaling of the riemann solvers is now much better - and they are a small portion of the runtime. The threaded bits now get an average speedup of 21. However, the non-threaded parts now make up 90% of the time…
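
For what it's worth, here is a sketch of the pattern and of the kind of change that avoids it (the array names and sizes are made up; this is not the literal code):

{{{
! Before (sketch): heap temporaries allocated inside the riemann solver on
! every call.  With many threads allocating at once, the arrays tend to end
! up adjacent on the heap, so different threads write to the same cache
! lines (false sharing) and also contend for the allocator.
SUBROUTINE riemann_before(nvars)
   INTEGER, INTENT(IN) :: nvars
   REAL(KIND=8), DIMENSION(:), ALLOCATABLE :: lambda, alpha
   ALLOCATE(lambda(nvars), alpha(nvars))
   lambda = 0d0; alpha = 0d0        ! stand-in for the actual solve
   DEALLOCATE(lambda, alpha)
END SUBROUTINE riemann_before

! After (sketch): fixed-size stack temporaries.  Each thread's copies live
! on its own stack, well separated in memory, and the allocator is never
! touched inside the threaded loop.
SUBROUTINE riemann_after()
   INTEGER, PARAMETER :: NMax = 8   ! illustrative fixed maximum system size
   REAL(KIND=8), DIMENSION(NMax) :: lambda, alpha
   lambda = 0d0; alpha = 0d0        ! same work as above
END SUBROUTINE riemann_after
}}}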

Here is an updated graph showing the scaling performance of the same do loops.

I also looked at reducing the time spent in calc-eigens. I suspected that the manipulation of a MaxVars by MaxVars static array might be responsible, so I reduced MaxVars from 20 to 10. The result was a significant speedup.
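
The idea being that if the eigen system routine sweeps over the full declared MaxVars by MaxVars array rather than just the variables actually in use, the per-cell work and cache traffic scale as MaxVars**2, so halving MaxVars cuts it roughly by four. A sketch of that pattern (the routine and array names are illustrative):

{{{
SUBROUTINE eigen_sketch()
   INTEGER, PARAMETER :: MaxVars = 10   ! was 20
   REAL(KIND=8), DIMENSION(MaxVars, MaxVars) :: lEigenVecs, rEigenVecs
   INTEGER :: i, j
   ! Zeroing and looping over the declared size rather than the number of
   ! active variables makes the per-cell cost go as MaxVars**2.
   lEigenVecs = 0d0
   rEigenVecs = 0d0
   DO j = 1, MaxVars
      DO i = 1, MaxVars
         lEigenVecs(i, j) = rEigenVecs(j, i)   ! stand-in for the real work
      END DO
   END DO
END SUBROUTINE eigen_sketch
}}}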

I also added timers for the source routines (even though there were no source terms), and they seemed to be responsible for that 90% of the time in the threaded runs. If I comment them out, I get a 16-fold speedup with threads on a BlueGene node for a fixed-grid computation.

I next tried an AMR run but that seems to be running slower due to extra refinement. See attached screenshot.

Obviously the threading is producing noise which is triggering refinement.

So it turns out I was attempting to use allocatable arrays declared at module scope as private variables within a threaded do loop… but apparently this does not work with OpenMP. I changed them to a static size and declared them at subroutine scope, and it seems to work fine now. On a 16^2+2 field loop advection run, the single-node threaded version was 20% faster than the single-node MPI version.
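
Here is a sketch of the before and after (the module and array names are made up; the point is the pattern of a module-scope allocatable listed as private versus a fixed-size local):

{{{
! Before (sketch): the scratch array is an allocatable at module scope and
! is listed as PRIVATE on the parallel do.  With the OpenMP support in use
! this did not give each thread an independent, allocated copy, so the
! threads were quietly sharing (and clobbering) the same array.
MODULE scratch_mod
   REAL(KIND=8), DIMENSION(:), ALLOCATABLE :: qChar
END MODULE scratch_mod

SUBROUTINE sweep_before(mx)
   USE scratch_mod
   INTEGER, INTENT(IN) :: mx
   INTEGER :: i
   IF (.NOT. ALLOCATED(qChar)) ALLOCATE(qChar(20))
   !$OMP PARALLEL DO PRIVATE(i, qChar)
   DO i = 1, mx
      qChar = REAL(i, 8)         ! stand-in for the per-cell work
   END DO
   !$OMP END PARALLEL DO
END SUBROUTINE sweep_before

! After (sketch): a fixed-size array declared at subroutine scope.  A
! private copy per thread now behaves as expected (each thread just gets
! its own stack copy) and the results are reproducible.
SUBROUTINE sweep_after(mx)
   INTEGER, INTENT(IN) :: mx
   INTEGER, PARAMETER :: MaxVars = 10
   INTEGER :: i
   REAL(KIND=8), DIMENSION(MaxVars) :: qChar
   !$OMP PARALLEL DO PRIVATE(i, qChar)
   DO i = 1, mx
      qChar = REAL(i, 8)         ! each thread has its own copy
   END DO
   !$OMP END PARALLEL DO
END SUBROUTINE sweep_after
}}}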
