AstroBEAR 2.0 Scaling
Empirical Results
2D
AstroBEAR 2.0 has been tested in 2D with MaxLevels of 0, 1, 2, & 4 and with base grids and refinement tolerances chosen to give 8², 16², 32², 64², 128², & 256² cells per processor per level. Both the threaded algorithm and the non-threaded algorithm have been tested. The results are summarized in the following two figures.
3D
AstroBEAR 2.0 has been tested from 8 to 1024 processors on MSI's Itasca cluster. Our test problem is a three-dimensional field cylinder advection problem with no source terms or elliptic calculations. These are weak-scaling, fixed-grid tests where the workload is kept constant at 64³ cells per processor.
The 8-processor run is a cubic grid (128 cells to a side). To preserve the spatial scaling of the problem, the 16- and 32-processor runs are stretched along the z-axis (the cylinder doubles in length, and the domain doubles in size). For the 64-processor run, however, the domain is returned to its original spatial dimensions, but with double the resolution (256 cells on each side). This cycle repeats every three doublings of the processor count (a short sketch after the list below illustrates the pattern).
This arrangement is complex, but it achieves three things:
- The workload per processor is kept constant.
- The scale of the problem within the plane remains constant, as does the cylinder radius.
- The characteristic time of the problem remains constant (this is the time it takes for a cylinder of radius r to travel a distance d).
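The sketch below is my own illustration in Python, not AstroBEAR code (the helper name domain_cells is invented); it reproduces the cell counts implied by this arrangement:

```python
# Sketch (illustration only, not AstroBEAR code) of the weak-scaling layout
# described above: 64^3 cells per processor, the grid stretched along z for
# two doublings, then returned to a cube at double the in-plane resolution.

CELLS_PER_PROC = 64  # cells per processor along each dimension (64^3 total)

def domain_cells(nprocs):
    """Return (nx, ny, nz) cell counts for an nprocs run (8 procs = 128^3)."""
    doublings = 0
    n = 8
    while n < nprocs:
        n *= 2
        doublings += 1
    cube_doublings, stretch = divmod(doublings, 3)
    side = 128 * 2 ** cube_doublings        # in-plane resolution
    return side, side, side * 2 ** stretch  # z stretched for one or two doublings

for p in (8, 16, 32, 64, 128, 256, 512, 1024):
    nx, ny, nz = domain_cells(p)
    assert nx * ny * nz == p * CELLS_PER_PROC ** 3  # constant work per processor
    print(p, (nx, ny, nz))
```

Every run keeps 64³ cells per processor, while the in-plane resolution doubles only on every third doubling of the processor count.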
As the resolution increases, so does the number of steps required to reach the characteristic time. Consequently, the wall time doubles when the resolution does. Our metrics (described below) take this into account.
Metrics
We calculate the efficiency of an $N$-processor run using the following normalization formula:

$$\mathrm{efficiency}(N) = \frac{R_N \, T_8}{T_N}$$

where $N$ is the number of processors, $T_N$ is the runtime for the $N$-processor run, and $R_N$ is the resolution factor at $N$ procs (the factor by which the step count grows relative to the 8-processor baseline).

The average step time is simply the advance time (i.e., the time between the problem setup and the program shutdown) divided by the number of steps taken in the run:

$$\text{average step time} = \frac{\text{advance time}}{\text{number of steps}}$$
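A minimal sketch of these two metrics in Python (my own helper functions, not AstroBEAR code; the names and the dictionary-style inputs T and R are assumptions for illustration):

```python
# Minimal sketch of the metrics above; not AstroBEAR source code.
# T[n] is the measured runtime (seconds) of the n-processor run and
# R[n] is its resolution factor relative to the 8-processor baseline.

def efficiency(T, R, n, baseline=8):
    """Normalized weak-scaling efficiency of the n-processor run."""
    return R[n] * T[baseline] / T[n]

def average_step_time(advance_time, n_steps):
    """Advance time (setup to shutdown) divided by the steps taken."""
    return advance_time / n_steps
```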
Results
The reason for the hyper-efficiency at 16 and 32 processors is unknown at the moment. The efficiency at 1024 processors is 90.4%. The startup time stays constant at around 0.5% of the total, so there was no compelling reason to show the advance-time efficiency as well.
The average step time is the metric used by the CASTRO group. Currently our average step times range from 11.6 seconds at 8 processors to 12.8 seconds at 1024 processors. Since MHD codes usually run about half as fast as hydro codes (because they manage twice as much data), our results are competitive with CASTRO's range of 5-6 seconds per timestep on pure hydro.
Processor Fluctuations
For a fixed-grid simulation, the time it takes a processor to update a grid will in general fluctuate about some mean with some standard deviation. If the standard deviation becomes a significant fraction of the mean, then running on more and more processors causes trouble: each step proceeds at the pace of the slowest processor, and the expected maximum over more and more samples drifts farther and farther above the mean, so serious performance degradation can ensue. I ran a Monte Carlo simulation to create curves of expected efficiencies assuming that the standard deviation for a single update was 1%, 2%, 5%, 10%, or 20% of the mean. The results are in the attached figure; the x-axis is the number of processors and the y-axis is the efficiency. If the standard deviation for the runs on Itasca is of order 15%, then it is not surprising that our scaling at 1024 processors is only 51%.
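Below is my own re-creation of that kind of estimate, not the original script; per-update times are drawn from a normal distribution with the stated relative standard deviation, and the processor counts and step count are placeholders:

```python
# Monte Carlo estimate of weak-scaling efficiency under per-update fluctuations.
# Each step runs at the pace of the slowest processor, so the efficiency is
# the ideal step time (the unit mean) over the mean per-step maximum.

import numpy as np

def expected_efficiency(nprocs, rel_std, nsteps=1000, seed=0):
    rng = np.random.default_rng(seed)
    times = rng.normal(1.0, rel_std, size=(nsteps, nprocs))
    per_step = times.max(axis=1)      # each step waits on the slowest processor
    return 1.0 / per_step.mean()

for sigma in (0.01, 0.02, 0.05, 0.10, 0.20):
    print(sigma, [round(expected_efficiency(p, sigma), 3) for p in (8, 64, 1024)])
```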
Fluctuation Statistics
We ran a fixed-grid simulation on Itasca (thanks, Brandon) and stored the wall time each processor took to advance its grid each time step. Here we have a pseudocolor plot showing the advance times by processor. First, notice the striations from left to right, indicating that different processors perform consistently at different speeds. Also, notice that at the right we have plotted the maximum value across all processors; it tends to be higher than the general mean values by about 20%.
Now we create a histogram of the data, which shows a bi-modal distribution of advance times, with each mode having a standard deviation of order 10%.
Now, if we just look at the mean time for each processor and plot that distribution, we see a slightly narrower bi-modal distribution, but the width is still surprising: having averaged over 160 time steps, the deviation should shrink by a factor of √160 ≈ 12, i.e., to about 1/12th of the un-averaged value.
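A rough sketch of that comparison, assuming the recorded wall times are available as a 2-D array advance[step, processor] (an assumed layout, not the actual analysis script):

```python
import numpy as np

def spread_summary(advance):
    """advance[step, proc]: compare raw spread with the spread of per-processor means."""
    nsteps, nprocs = advance.shape
    raw_spread = advance.std() / advance.mean()          # ~10% in the Itasca data
    proc_means = advance.mean(axis=0)                    # one mean per processor
    mean_spread = proc_means.std() / proc_means.mean()   # still surprisingly wide
    expected = raw_spread / np.sqrt(nsteps)              # ~1/12 of raw for 160 steps
    return raw_spread, mean_spread, expected
```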
Out of curiosity, I performed an FFT of the mean times to see if there were any trends in the performance (i.e., every 8 or 16 processors), but it looks fairly featureless…
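For reference, a sketch of that periodicity check (assumed data layout, not the actual script), where proc_means holds one mean advance time per processor:

```python
import numpy as np

def mean_time_spectrum(proc_means):
    signal = np.asarray(proc_means) - np.mean(proc_means)  # drop the DC component
    power = np.abs(np.fft.rfft(signal)) ** 2               # power per spatial frequency
    freqs = np.fft.rfftfreq(len(signal))                    # cycles per processor index
    return freqs, power  # a peak near 1/8 or 1/16 would indicate node-level structure
```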
Finally, if we compensate the advance times for each processor by that processor's mean speed, the bi-modal distribution reduces to a normal distribution; if we had identical processors, our efficiency would improve from 82% to 88%.
We've also plotted the distribution of maximum advance times.
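A sketch of the compensation step (assumed array layout, not the actual analysis): divide each processor's advance times by that processor's own mean, then recompute the efficiency from the per-step maxima.

```python
import numpy as np

def efficiency_from_times(advance):
    """advance[step, proc]: mean update time over the mean per-step maximum."""
    return advance.mean() / advance.max(axis=1).mean()

def compensated_efficiency(advance):
    normalized = advance / advance.mean(axis=0)  # remove per-processor speed offsets
    return efficiency_from_times(normalized)
```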
Attachments (12)
- runtime_vs_amrtime_efficiency.png (45.2 KB) - added 14 years ago.
- startup_fractional_time.png (39.9 KB) - added 14 years ago.
- MonteCarloEfficiencies.png (32.7 KB) - added 14 years ago.
- AdvanceTimes_1.jpg (397.1 KB) - added 14 years ago.
- AdvanceTimes_2.jpg (55.1 KB) - added 14 years ago.
- AdvanceTimes_3.jpg (27.1 KB) - added 14 years ago.
- AdvanceTimes_4.jpg (73.1 KB) - added 14 years ago.
- AdvanceTimes_5.jpg (72.2 KB) - added 14 years ago.
- efficiency64_3_492.png (44.7 KB) - added 14 years ago.
- avgtimestep64_3_492.png (42.0 KB) - added 14 years ago.
- LevelComparison.png (46.1 KB) - added 13 years ago.
- GridSizeComparison.png (53.3 KB) - added 13 years ago.