Efficient Parallelization for AMR MHD Multiphysics Calculations: Implementation in AstroBEAR
Abstract
Current and future AMR simulations will require algorithms that are highly parallelized and manage memory efficiently. In the redesign of AstroBEAR we have attempted to employ new techniques to achieve both of these goals. Patch-based AMR often employs ghost cells to decouple the hyperbolic advances of each grid, and this decoupling allows each patch to be advanced independently. In AstroBEAR we utilize this independence to allow patches across multiple refinement levels to be advanced independently, with preference going to the finer level patches. This allows for global load balancing instead of level-by-level load balancing, and for greater parallelization across both space and AMR level, which can improve performance in deep simulations with many levels of refinement. With regard to memory management, we have employed a distributed tree algorithm so that processors no longer need to store the entire tree structure. We have also implemented a sweep method that pipelines the computations required for updating the fluid variables with unsplit algorithms. This can dramatically reduce the memory overhead required for intermediate variables.
Introduction
Larger and larger clusters alone will not allow larger and larger simulations to be performed in the same wall time without either an increase in CPU speed or a decrease in the workload of each processor. Since CPU speeds are not expected to keep pace with the requirements of large simulations, the only option is to decrease the workload of each processor. This, however, requires highly parallelized and efficient algorithms for managing the AMR infrastructure and the necessary computations. AstroBEAR, unlike many other AMR tree management codes (Paramesh, BoxLib), uses a distributed tree so that each processor stores only the part of the AMR structure it needs in order to carry out its computations and perform the necessary communications. While this additional memory is currently small compared to the resources typically available to a CPU, future clusters will likely have much less memory per processor, similar to what is already seen in GPUs. AstroBEAR also uses a distributed control structure that mirrors the nested AMR grid hierarchy. Processors have child processors just as AMR grids have child grids. These child processors receive new grids and all necessary tree information from their parent processor(s). This eliminates the need for global sharing of tree data. Additionally, the allocation of resources among child grids is done using a Hilbert space-filling curve. This allows neighboring grids to reside on processors that are physically close on the network (or on the same core) and allows AstroBEAR to take advantage of the network topology.
AstroBEAR also utilizes interlevel threading to allow advances on different levels to occur independently. This allows for total load balancing across all refinement levels instead of balancing each level independently. This becomes especially important for deep simulations (simulations with low filling fractions but many levels of AMR) as opposed to shallow simulations (high filling fractions and only a few levels of AMR). Processors with coarse grids can advance their grids while processors with finer grids simultaneously advance theirs. Without this capability, base grids would need to be large enough to be distributed across all of the processors, and for simulations with such large base grids to finish in a reasonable wall time, often only a few levels of AMR can be used. With interlevel threading this restriction is lifted.
Distributed Tree-Control Algorithm
Many current AMR codes are built upon either BoxLib or Paramesh, both of which opt to store the entire AMR structure on each processor. This, however, requires additional memory and communication (7 floats per patch compared to 256 floats for the data of a 2D 4x4 hydro patch with 2 ghost zones). While this additional memory requirement is negligible for problems run on typical CPUs across hundreds of processors, it can become considerable on thousands or tens of thousands of processors. Since it is expected that efficient memory use and management will be required for HPC (high performance computing) down the road, AstroBEAR uses a distributed tree algorithm in which each processor is only aware of the nodes 'surrounding' its own nodes. This includes its coarser parent nodes and finer child nodes, as well as physically adjacent neighbors and temporally adjacent preceding and succeeding overlaps. When new nodes are created, the parent nodes determine what information needs to be shared and with whom, and then populate the child's local connections. The basic algorithm for updating local trees is as follows (a sketch of the per-node connection data is given after the table):
First Step
| Step | New Nodes | Old Nodes |
| 1 | Receive new nodes along with their parents, neighbors, and preceding overlaps from parent processors | Receive new succeeding overlaps from parent processors |
| 2 | Create new children and determine to which child processors they will go | |
| 3 | Determine which remote preceding nodes might have children that would overlap with its own children and send the relevant children info | Determine which remote succeeding nodes might have created children that would overlap with its own children and send the relevant children info |
| 4 | Determine which remote neighboring nodes might have children that would neighbor its own children and send the relevant children info | |
| 5 | Determine which local neighboring nodes have neighboring children | |
| 6 | Determine which local preceding nodes have children that overlap with its own | Determine which local succeeding nodes have children that overlap with its own |
| 7 | Receive new children from remote neighboring nodes and determine which of the neighbors' children neighbor its own children | |
| 8 | Receive children from remote preceding nodes and determine which of those nodes' children overlap with its own | Receive children from remote succeeding nodes and determine which of those nodes' children overlap with its own |
| 9 | For each remote child, send the child's info as well as information about its parents, neighbors, & preceding overlaps | For each remote child, send the child's succeeding overlaps |

Successive Steps
| Step | New Nodes | Old Nodes |
| 10 | Create new children and determine to which child processor they will go | |
| 11 | Determine which remote neighboring nodes might have old/new children that would overlap/neighbor its own new children and send the relevant children info | |
| 12 | Determine which local neighboring nodes might have old/new children that would overlap/neighbor its own new children | |
| 13 | Receive new children from remote neighboring nodes and determine which of the neighbors' children neighbor/overlap its new/old children | |
| 14 | For each new remote child, send the child's information, and the information about its parent, neighbors, & preceding overlaps | |
| 15 | For each old remote child, send the child's succeeding overlaps | |

*Note that physically disjoint grids can overlap with each other's ghost zones, so the 1st generation of a node's neighbor's children can overlap with the node's 2nd generation of children in steps 11-13.
*See the section on distribution for the calculation of parent and child processors.
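To make the 'local connections' each processor keeps concrete, a minimal sketch of a per-node data structure is shown below. The type and field names are illustrative assumptions, not AstroBEAR's actual declarations.
{{{
! Hypothetical sketch of the per-node connection data a processor keeps for
! each of its own nodes.  Only the immediate surroundings of a node are
! stored; no processor holds the full AMR tree.
module local_tree_sketch
   implicit none

   type :: NodeInfo
      integer :: level            ! AMR level of this node
      integer :: mpi_rank         ! processor that owns the node's data
      integer :: bounds(3,2)      ! bounding box of the patch in level cells
   end type NodeInfo

   type :: NodeInfoPtr
      type(NodeInfo), pointer :: p => null()
   end type NodeInfoPtr

   type :: LocalNode
      type(NodeInfo) :: info
      type(NodeInfoPtr) :: parent                      ! coarser parent node
      type(NodeInfoPtr), allocatable :: children(:)    ! finer child nodes
      type(NodeInfoPtr), allocatable :: neighbors(:)   ! physically adjacent nodes on the same level
      type(NodeInfoPtr), allocatable :: preceding(:)   ! overlapping nodes from the previous time step
      type(NodeInfoPtr), allocatable :: succeeding(:)  ! overlapping nodes from the next time step
   end type LocalNode
end module local_tree_sketch
}}}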
Threaded Multi-Level Advance
Many if not all current AMR codes perform grid updates across all levels in a prescribed order (0, 1, 2, 2, 1, 2, 2, 0, ...). Good performance then requires each level update to be balanced across all processors. Load balancing each level, however, often leads to artificial fragmentation of grids and requires each processor to have grids on every level. This limits the depth of the AMR tree, and the resulting simulations are often performed with large base grids and only 1 or 2 levels of AMR. The benefit of AMR, however, is best realized for scenarios with large dynamic ranges, and having only 2 levels of AMR severely limits the dynamic range that can be followed. In AstroBEAR we allow each level's grid advances to happen independently, so that processors with coarse grids can update those grids while other processors with fine grids update theirs.
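For contrast, the prescribed ordering above is what the usual recursive level advance produces; a minimal sketch (assuming a time refinement ratio of 2) follows. AstroBEAR's interlevel threading does not impose this ordering.
{{{
! Sketch, for contrast: the conventional recursive level advance that
! produces the prescribed order 0, 1, 2, 2, 1, 2, 2, ...
recursive subroutine advance_level(level, max_level)
   implicit none
   integer, intent(in) :: level, max_level
   integer :: substep
   ! ... advance all grids on this level by one level time step ...
   if (level < max_level) then
      do substep = 1, 2          ! two finer steps per coarse step
         call advance_level(level + 1, max_level)
      end do
   end if
end subroutine advance_level
}}}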
Load Balancing
As was mentioned before, threading the level advances removes the need for balancing each level separately and instead allows for global load balancing. When distributing the children of a level's nodes, each processor first calculates the average remaining workload on each coarser level and the number of remaining child-level advances that can be completed before the coarser levels complete. Each processor then calculates its workload imbalance per child-level step as well as its new child load per child-level step, and these two numbers are globally shared. Each processor then calculates its share of the new child-level workload from these shared values. The workloads per processor are then partitioned so that each processor can determine from which processors it should expect to receive new child grids as well as to which processors it should give new grids. It then sorts its children in a Hilbert order before distributing them, in order, among its child processors. If the load of the next child grid is larger than the load assigned to the next child processor, then the algorithm can optionally attempt to split (artificially fragment) the child grid into proportionate pieces before assigning them to the child processors. Usually this is avoided and the first child processor is given the entire child grid, since future distributions of finer grids can compensate for this inequality. If, however, we are distributing finest-level grids, then this splitting is used. The splitting is also used on the coarsest root grid since its size can in general be very large.
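A minimal sketch of this global balancing step is shown below; the routine name, the use of MPI collectives, and the exact form of each processor's share are illustrative assumptions rather than AstroBEAR's actual implementation.
{{{
! Illustrative sketch of the global balancing step.  The form of each rank's
! share is an assumption for illustration: ranks that still have more
! coarse-level work than average are assigned less of the new child-level
! work, and vice versa.
subroutine balance_new_children(local_remaining_work, local_child_load, my_share, my_start)
   use mpi
   implicit none
   real(8), intent(in)  :: local_remaining_work  ! remaining coarser-level work on this rank
   real(8), intent(in)  :: local_child_load      ! load of the new child grids created on this rank
   real(8), intent(out) :: my_share              ! new child-level work this rank should receive
   real(8), intent(out) :: my_start              ! start of this rank's segment of the Hilbert-ordered work
   real(8) :: locals(2), totals(2), avg_remaining, avg_child
   integer :: ierr, nprocs, rank

   call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

   ! Globally share the two numbers (remaining work and new child load)
   locals = (/ local_remaining_work, local_child_load /)
   call MPI_Allreduce(locals, totals, 2, MPI_DOUBLE_PRECISION, MPI_SUM, MPI_COMM_WORLD, ierr)
   avg_remaining = totals(1) / nprocs
   avg_child     = totals(2) / nprocs

   ! This rank's share of the new work, reduced by how far it is already behind
   my_share = max(0d0, avg_child + (avg_remaining - local_remaining_work))

   ! Exclusive prefix sum partitions the Hilbert-ordered list of new child
   ! grids into segments [my_start, my_start + my_share) per rank.
   my_start = 0d0
   call MPI_Exscan(my_share, my_start, 1, MPI_DOUBLE_PRECISION, MPI_SUM, MPI_COMM_WORLD, ierr)
   if (rank == 0) my_start = 0d0     ! MPI_Exscan leaves rank 0's result undefined
end subroutine balance_new_children
}}}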
Hyperbolic engine
Currently AstroBEAR's hyperbolic solver uses a Godunov-type unsplit integrator that utilizes the CTU+CT integration scheme (Stone & Gardiner 08). Unsplit integrators, however, often require that many intermediate variables be stored globally during a grid update, which can require 10-50x the space needed for the array of conservative variables. For GPU-type systems, this would considerably restrict the size of a grid that could be updated. In AstroBEAR we implement a sweep method that essentially pipelines any sequence of calculations into a one dimensional pass across the grid, where variables are only stored as long as they are needed. This is ideally suited for GPU calculations in which a CPU could constantly be sending in new field values and retrieving updated field values while the GPU uses a minimum of memory.
The pipelining is done automatically provided that the dependencies between stencil pieces are explicitly stated. For example, consider a simple 2D 1st order Godunov method in which the initial state is stored in q and the updated fields are stored in Q. The x and y fluxes are stored in fx and fy, and the left and right interface states are stored in qLx, qLy and qRx, qRy respectively. We also adopt the convention that stencil pieces stored on cell edges (i.e. qLx, qRx, fx) at position i-1/2 are stored in their respective arrays with the index i. The stencil dependencies can then be expressed as:
| Stated dependency | Actual calculation |
| Set_Dependency(qLx, q, (/-1,-1,0,0/)) | qLx%data(i,j) = q%data(i-1,j) |
| Set_Dependency(qRx, q, (/0,0,0,0/)) | qRx%data(i,j) = q%data(i,j) |
| Set_Dependency(qLy, q, (/0,0,-1,-1/)) | qLy%data(i,j) = q%data(i,j-1) |
| Set_Dependency(qRy, q, (/0,0,0,0/)) | qRy%data(i,j) = q%data(i,j) |
| Set_Dependency(fx, qLx, (/0,0,0,0/)) | fx%data(i,j) = riemann(qLx%data(i,j), qRx%data(i,j)) |
| Set_Dependency(fx, qRx, (/0,0,0,0/)) | |
| Set_Dependency(fy, qLy, (/0,0,0,0/)) | fy%data(i,j) = riemann(qLy%data(i,j), qRy%data(i,j)) |
| Set_Dependency(fy, qRy, (/0,0,0,0/)) | |
| Set_Dependency(Q, fx, (/0,1,0,0/)) | Q%data(i,j) = q%data(i,j) + fx%data(i,j) - fx%data(i+1,j) |
| Set_Dependency(Q, fy, (/0,0,0,1/)) | + fy%data(i,j) - fy%data(i,j+1) |
| Set_Dependency(Q, q, (/0,0,0,0/)) | |
Now, given the size of Q we want updated, we can work backwards to determine over what range of indices we need to calculate fx, fy, qRx, qLx, qRy, qLy, and q. To pipeline the calculation we then construct a sliding window for each stencil piece that passes across the grid from left to right. The dependencies determine the window's lead (i.e. how far in front of updating Q we must calculate the stencil's value) as well as the window's width (i.e. how long we need to hold on to values for other stencil pieces to use). For example, in the following figure we have a stencil piece that leads the update of Q by two columns and whose window is 3 cells wide.
For the 2D example above, this procedure yields a window (a lead and a width) for each of fx, fy, qLx, qRx, qLy, qRy, and q.
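The leads and widths can be derived mechanically from the registered dependencies. A minimal sketch of that bookkeeping is shown below; the dependency storage and routine are illustrative assumptions rather than AstroBEAR's actual sweep implementation.
{{{
! Illustrative sketch: derive a stencil piece's window lead and width along
! the sweep direction from the registered x-offsets.
! deps(c,p,1:2) holds the (min,max) x-offset of piece p needed by consumer c,
! or huge(1) in deps(c,p,2) if consumer c does not use piece p.
recursive subroutine get_window(piece, deps, lead, width)
   implicit none
   integer, intent(in)  :: piece
   integer, intent(in)  :: deps(:,:,:)
   integer, intent(out) :: lead, width
   integer :: c, clead, cwidth, hi_reach, lo_reach
   logical :: consumed

   consumed = .false.
   hi_reach = -huge(1)
   lo_reach =  huge(1)
   do c = 1, size(deps, 1)
      if (deps(c, piece, 2) == huge(1)) cycle     ! consumer c does not use this piece
      consumed = .true.
      call get_window(c, deps, clead, cwidth)     ! how far ahead consumer c itself runs
      hi_reach = max(hi_reach, clead + deps(c, piece, 2))
      lo_reach = min(lo_reach, clead + deps(c, piece, 1))
   end do

   if (.not. consumed) then
      lead = 0; width = 1              ! the final update (Q) has zero lead
   else
      lead  = hi_reach                 ! columns ahead of the Q update this piece must be computed
      width = hi_reach - lo_reach + 1  ! columns of this piece that must be retained
   end if
end subroutine get_window
}}}
With the dependencies tabulated above, this bookkeeping gives fx, for example, a lead of one column and a window two cells wide.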
Source engine
AstroBEAR's source engine uses a two-step Runge-Kutta scheme to try to update values within error bounds. If the error is too large, it switches to a Runge-Kutta 4-5 scheme with adaptive time stepping to achieve the desired accuracy.
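A minimal sketch of this kind of error-controlled source update is shown below. The names are assumptions, and for brevity the fallback uses recursive step halving with the same second-order scheme rather than an embedded Runge-Kutta 4-5 pair.
{{{
! Illustrative sketch of an error-controlled source-term update.
module source_sketch
   implicit none
   abstract interface
      function rate_fn(y) result(dydt)
         real(8), intent(in) :: y
         real(8) :: dydt
      end function rate_fn
   end interface
contains
   function rk2_step(q, dt, src) result(qnew)
      real(8), intent(in) :: q, dt
      procedure(rate_fn) :: src
      real(8) :: qnew, qhalf
      qhalf = q + 0.5d0*dt*src(q)       ! predictor to the half time
      qnew  = q + dt*src(qhalf)         ! corrector using the midpoint slope
   end function rk2_step

   recursive function source_update(q, dt, src, tol) result(qnew)
      real(8), intent(in) :: q, dt, tol
      procedure(rate_fn) :: src
      real(8) :: qnew, qfull, qtwo
      qfull = rk2_step(q, dt, src)                                  ! one full step
      qtwo  = rk2_step(rk2_step(q, 0.5d0*dt, src), 0.5d0*dt, src)   ! two half steps
      if (abs(qtwo - qfull) <= tol) then
         qnew = qtwo                    ! within error bounds: accept
      else
         ! error too large: subdivide the step and integrate each half adaptively
         qnew = source_update(source_update(q, 0.5d0*dt, src, tol), 0.5d0*dt, src, tol)
      end if
   end function source_update
end module source_sketch
}}}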
Elliptic engine
Solving elliptic equations across an unstructured mesh requires several iterations of communication interleaved with computation, so we decided to use the HYPRE library. It operates on both structured and semi-structured meshes; however, our use is limited to structured meshes. In order to maintain level advance independence for threading purposes, it was necessary to use forward Euler time integration of coarse elliptic variables as the boundary conditions for finer level grids.
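A minimal sketch of this forward Euler treatment is given below; the function and variable names, and the use of a stored time derivative of the coarse solution, are assumptions for illustration.
{{{
! Illustrative sketch: forward Euler extension of a coarse-level elliptic
! variable (e.g. a potential) to the intermediate times at which a finer
! level needs boundary values.
pure function coarse_phi_at(phi_n, dphi_dt_n, t_n, t_fine) result(phi_bc)
   implicit none
   real(8), intent(in) :: phi_n      ! coarse elliptic variable at coarse time t_n
   real(8), intent(in) :: dphi_dt_n  ! its time derivative evaluated at t_n
   real(8), intent(in) :: t_n        ! time of the last completed coarse solve
   real(8), intent(in) :: t_fine     ! intermediate time of the fine-level solve
   real(8) :: phi_bc
   ! Forward Euler: no data from the next coarse solve is needed, so the
   ! fine level can proceed without waiting on the coarse level.
   phi_bc = phi_n + (t_fine - t_n) * dphi_dt_n
end function coarse_phi_at
}}}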
The Super-Gridding experiment
One of the goals in designing AstroBEAR was to keep the algorithms simple and to have all data manipulation routines operate on individual patches (or pairs of patches). However, once the stencil dependencies are explicitly stated, it becomes possible to modify the hyperbolic advances to be much more sensitive to the available data. This can allow each processor to perform computations required by neighboring patches only once instead of twice, and allows processors to skip computations needed to update coarse cells whose values would only be replaced with data from fine cells. For small adjacent patches or completely refined grids this can reduce the amount of computation by 50-100%. To this end we implemented a super-gridding scheme which was basically as follows:
- Collect physically adjacent grids on each processor into super-grids.
- For each supergrid, flag the cells that need to be updated.
- Using the stencil dependencies, work backwards to flag the locations where each stencil piece needs to be calculated.
- Then sweep across the supergrid performing only the necessary computations.
Of course storing global mask arrays over the entire supergrid for each stencil piece is memory intensive, so instead we implemented a sparse mask storage that was essentially a collection of non-intersecting boxes marking the regions to calculate.
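A minimal sketch of such a sparse mask is shown below. The type, the fixed capacity, and the add routine (which splits an incoming box against the boxes already stored so that the collection stays non-intersecting) are illustrative assumptions.
{{{
! Illustrative sketch of a sparse mask stored as a list of non-intersecting
! boxes.  Boxes are (/xlo, xhi, ylo, yhi/) in cell indices.
module sparse_mask_sketch
   implicit none
   integer, parameter :: MAX_BOXES = 1024
   type :: BoxMask
      integer :: nboxes = 0
      integer :: boxes(4, MAX_BOXES)
   end type BoxMask
contains
   recursive subroutine add_box(mask, box)
      type(BoxMask), intent(inout) :: mask
      integer, intent(in) :: box(4)
      integer :: k, ov(4), frag(4,4), f
      if (box(1) > box(2) .or. box(3) > box(4)) return      ! empty box
      do k = 1, mask%nboxes
         ov(1) = max(box(1), mask%boxes(1,k)); ov(2) = min(box(2), mask%boxes(2,k))
         ov(3) = max(box(3), mask%boxes(3,k)); ov(4) = min(box(4), mask%boxes(4,k))
         if (ov(1) <= ov(2) .and. ov(3) <= ov(4)) then
            ! Split the incoming box around the overlap so the stored
            ! collection stays non-intersecting, then add each fragment.
            frag(:,1) = (/ box(1), ov(1)-1, box(3), box(4) /)   ! left slab
            frag(:,2) = (/ ov(2)+1, box(2), box(3), box(4) /)   ! right slab
            frag(:,3) = (/ ov(1), ov(2), box(3), ov(3)-1 /)     ! bottom of middle
            frag(:,4) = (/ ov(1), ov(2), ov(4)+1, box(4) /)     ! top of middle
            do f = 1, 4
               call add_box(mask, frag(:,f))
            end do
            return
         end if
      end do
      if (mask%nboxes >= MAX_BOXES) stop 'sparse mask overflow'
      mask%nboxes = mask%nboxes + 1
      mask%boxes(:, mask%nboxes) = box
   end subroutine add_box
end module sparse_mask_sketch
}}}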
We then modified the above algorithm to allow processors to prioritize computations and better overlap communication with computation, as follows:
- Collect physically adjacent grids on each processor into super-grids.
- For each supergrid, flag the cells that need to be updated.
- Using the stencil dependencies, work backwards to flag the locations where each stencil piece needs to be calculated.
- Then, using the old patches with the previous level's data, determine which of those calculations can be performed prior to receiving data from overlaps and begin performing those while waiting for overlap data.
- Determine which fluxes and EMFs will need to be synchronized with neighboring processors.
- Work backwards to determine which remaining calculations need to be performed prior to sending data to neighbors and perform those calculations.
- Send fluxes and then perform remaining calculations that can be done before receiving data from children while waiting for child data.
- Using data from children continue performing calculations that can be done before receiving data from neighbors while waiting for neighbor data.
- Using neighbor data complete all required calculations to advance patches.
Unfortunately the computational cost associated with keeping track of all of these logical arrays, as well as the additional shuffling of data back and forth, became comparable to the savings from the reduced number of stencil computations. It may be possible, however, to improve the algorithms for managing the sparse logical arrays and to design an efficient algorithm that avoids redundant computations on the same processor for unsplit integration schemes.
Performance Results
For our weak scaling tests we advected a magnetized cylinder across the domain until it was displaced by 1 cylinder radius. The size of the cylinder was chosen to give a filling fraction of approximately 12.5%, so that in the AMR run the work load for the first refined level was comparable to the base level. The resolution of the base grid was adjusted to maintain 64^3 cells per processor, and we found that our weak scaling for both fixed grid and AMR runs is reasonable out to 2048 processors.
AMR information about other codes
| Code | AMR Library | Type | Globally stored tree | Globally communicated tree | Distribution algorithm |
| Ramses | | cell based | yes | | |
| Flash/Ram | Paramesh | fixed patch based | no | yes | counting |
| Orion/Pluto | Chombo | patch based | yes | yes | knapsack |
| Enzo | | patch based | yes | yes | |
| Nirvana | independent | fixed patch based | maybe not | maybe not (reTrieval) | strict data locality (counting), box sharing |
| AMRVAC | | fixed patch | think so | think so | counting |