
Efficient Parallelization for AMR MHD Multiphysics Calculations: Implementation in AstroBEAR

Abstract

Current and future AMR simulations will require algorithms that are highly parallelized and manage memory efficiently. In the redesign of AstroBEAR we have employed new techniques aimed at achieving both of these goals. Patch-based AMR often employs ghost cells to decouple the hyperbolic advances of each grid, allowing each patch to be advanced independently. In AstroBEAR we exploit this independence to advance patches across multiple refinement levels concurrently, with preference given to the finer-level patches. This permits global load balancing instead of level-by-level load balancing and allows for greater parallelization across both space and AMR level, which can improve performance in deep simulations with many levels of refinement. With regard to memory management, we have employed a distributed tree algorithm so that processors no longer need to store the entire tree structure. We have also implemented a sweep method that pipelines the computations required for updating the fluid variables with unsplit algorithms, which can dramatically reduce the memory overhead required for intermediate variables.

Introduction

Larger and larger clusters alone will not allow larger and larger simulations to be performed in the same wall time without either an increase in CPU speed or a decrease in the workload of each processor. Since CPU speeds are not expected to keep pace with the requirements of large simulations, the only option is to decrease the workload of each processor. This, however, requires highly parallelized and efficient algorithms for managing the AMR infrastructure and the necessary computations. AstroBEAR, unlike many other AMR tree management codes (Paramesh, BoxLib), uses a distributed tree, so that each processor stores only the portion of the AMR structure it needs in order to carry out its computations and perform the necessary communications. While the memory required to store the full tree is currently small compared to the resources typically available to a CPU, future clusters will likely have much less memory per processor, similar to what is already seen in GPUs. AstroBEAR also uses a distributed control structure that mirrors the nested AMR grid hierarchy: processors have child processors just as AMR grids have child grids. These child processors receive new grids and all necessary tree information from their parent processor(s), which eliminates the need for global sharing of tree data. Additionally, the allocation of resources among child grids is done using a Hilbert space-filling curve. This keeps neighboring grids on processors that are physically close on the network (or on the same core) and allows AstroBEAR to take advantage of the network topology.

AstroBEAR also utilizes interlevel threading to allow advances on different levels to occur independently. This allows for total load balancing across all refinement levels instead of balancing each level independently. This becomes especially important for deep simulations (simulations with low filling fractions but many levels of AMR) as opposed to shallow simulations (high filling fractions and only a few levels of AMR). Processors with coarse grids can advance their grids while processors with finer grids simultaneously advance theirs. Without this capability, base grids would need to be large enough to be distributed across all of the processors, and for simulations with such large base grids to finish in a reasonable wall time, often only a few levels of AMR can be used. With interlevel threading this restriction is lifted.

Distributed Tree-Control Algorithm

Many if not all current AMR codes are built upon either BoxLib or Paramesh, both of which opt to store the entire AMR structure on each processor. This, however, requires additional memory and communication (7 floats of tree data per patch, compared to 256 floats for the field data of a 2D 4x4 hydro patch with 2 ghost zones). While this additional memory requirement is negligible for problems run on typical CPUs on hundreds of processors, it can become considerable on thousands or tens of thousands of processors. Since it is expected that efficient memory use and management will be required for HPC (high performance computing) down the road, AstroBEAR uses a distributed tree algorithm in which each processor is only aware of the nodes 'surrounding' its own nodes. This includes each node's coarser parent and finer children, as well as its physically adjacent neighbors and its temporally adjacent preceding and succeeding overlaps. When new nodes are created, the parent nodes determine what information needs to be shared and with whom, and then populate each child's local connections. The basic algorithm for updating the local trees is as follows:

Step | New Nodes | Old Nodes

First Step
1 | Receive new nodes along with their parents, neighbors, and preceding overlaps from the parent processors | Receive new succeeding overlaps from the parent processors
2 | Create new children and determine on which child processors they will go |
3 | Determine which remote preceding nodes might have children that would overlap with its own children and send the relevant children info | Determine which remote succeeding nodes might have created children that would overlap with its own children and send the relevant children info
4 | Determine which remote neighboring nodes might have children that would neighbor its own children and send the relevant children info |
5 | Determine which local neighboring nodes have neighboring children |
6 | Determine which local preceding nodes have children that overlap with its own | Determine which local succeeding nodes have children that overlap with its own
7 | Receive new children from remote neighboring nodes and determine which of the neighbors' children neighbor its own children |
8 | Receive children from remote preceding nodes and determine which of those children overlap with its own | Receive children from remote succeeding nodes and determine which of those children overlap with its own
9 | For each remote child, send the child's info as well as information about its parents, neighbors, and preceding overlaps | For each remote child, send the child's succeeding overlaps

Successive Steps
10 | Create new children and determine on which child processors they will go |
11 | Determine which remote neighboring nodes might have old/new children that would overlap/neighbor its own new children and send the relevant children info |
12 | Determine which local neighboring nodes might have old/new children that would overlap/neighbor its own new children |
13 | Receive new children from remote neighboring nodes and determine which of the neighbors' children neighbor/overlap its new/old children |
14 | For each new remote child, send the child's information along with information about its parent, neighbors, and preceding overlaps |
15 | For each old remote child, send the child's succeeding overlaps |

* Note that physically disjoint grids can overlap with each other's ghost zones, so the first generation of a node's neighbor's children can overlap with the node's second generation of children in steps 11-13.
* See the section on distribution for the calculation of parent and child processors.
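
The local connections each processor must retain can be summarized in a per-node data structure. The following is a minimal sketch of the connection data implied by the algorithm above; the type and field names are illustrative, not AstroBEAR's actual declarations:

  module local_tree
     implicit none

     ! A lightweight reference to a node that may live on another processor.
     type node_ref
        integer :: rank            ! MPI rank that owns the node
        integer :: id              ! owner-local identifier
        integer :: mGlobal(3,2)    ! extent of the node's patch in global cell indices
     end type node_ref

     ! The portion of the AMR tree a processor stores for each of its own nodes:
     ! only the immediate 'surroundings' -- parent, children, neighbors, and the
     ! temporally adjacent preceding/succeeding overlaps.
     type node
        integer :: level                              ! refinement level
        integer :: mGlobal(3,2)                       ! patch extent
        type(node_ref) :: parent                      ! coarser parent node
        type(node_ref), allocatable :: children(:)    ! finer child nodes
        type(node_ref), allocatable :: neighbors(:)   ! physically adjacent nodes
        type(node_ref), allocatable :: preceding(:)   ! overlaps from the previous step
        type(node_ref), allocatable :: succeeding(:)  ! overlaps for the next step
     end type node

  end module local_tree

Because each node stores only references to its immediate surroundings, the per-processor tree memory scales with the number of local patches rather than with the global patch count.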

Threaded Multi-Level Advance

Many if not all current AMR codes perform grid updates across all levels in a prescribed order (0, 1, 2, 2, 1, 2, 2, 0, ...). Good performance then requires each level's update to be balanced across all processors. Load balancing each level independently, however, often leads to artificial fragmentation of grids and requires each processor to have grids on every level. This limits the depth of the AMR tree, and the resulting simulations are often performed with large base grids and only 1 or 2 levels of AMR. The benefit of AMR, however, is best realized in scenarios with large dynamic ranges, and having only 2 levels of AMR severely limits the dynamic range that can be followed. In AstroBEAR we allow each level's grid advances to happen independently, so that processors with coarse grids can update those grids while other processors with fine grids update theirs.

http://www.pas.rochester.edu/~johannjc/Papers/Carroll2011/figure1.png
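
For comparison, a minimal sketch of the conventional prescribed-order advance quoted above (the order 0, 1, 2, 2, 1, 2, 2, 0, ... for a refinement ratio of 2) is given below; the module, routine, and parameter names are illustrative placeholders. Interlevel threading replaces this strictly nested recursion with independent, overlapping level advances.

  module prescribed_order_advance
     implicit none
     integer, parameter :: MaxLevel = 2          ! illustrative AMR depth
     integer, parameter :: RefinementRatio = 2   ! time/space refinement ratio
  contains
     ! Conventional recursive AMR advance: level n takes one step, then level n+1
     ! takes RefinementRatio substeps before level n is synchronized. This yields
     ! the order 0,1,2,2,1,2,2,0,... and forces every level update to be balanced
     ! across all processors on its own.
     recursive subroutine AdvanceLevel(n)
        integer, intent(in) :: n
        integer :: iSubStep
        call AdvanceGridsOnLevel(n)         ! placeholder: update level-n patches
        if (n < MaxLevel) then
           do iSubStep = 1, RefinementRatio
              call AdvanceLevel(n + 1)      ! finer levels subcycle in time
           end do
           call SyncLevel(n)                ! placeholder: flux correction & restriction
        end if
     end subroutine AdvanceLevel

     subroutine AdvanceGridsOnLevel(n)
        integer, intent(in) :: n
        print *, 'advance level', n
     end subroutine AdvanceGridsOnLevel

     subroutine SyncLevel(n)
        integer, intent(in) :: n
     end subroutine SyncLevel
  end module prescribed_order_advance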

Load Balancing

As mentioned before, threading the level advances removes the need to balance each level separately and instead allows for global load balancing. When distributing the children of a level's nodes, each processor first calculates the average remaining workload on each coarser level and the number of remaining advances of the current level that can be completed before the coarser levels complete. Each processor then calculates its workload imbalance per level step as well as its new child load per level step, and these two numbers are globally shared. Each processor then calculates its share of the new level's workload, and the per-processor workloads are partitioned so that each processor can determine from which processors it should expect to receive new child grids as well as to which processors it should give new grids. It then sorts its children in Hilbert order before distributing them, in order, among its child processors. If the load of the next child grid is larger than the load assigned to the next child processor, the algorithm can optionally attempt to split (artificially fragment) the child grid into proportionate pieces before assigning them to the child processors. Usually this is avoided and the first child processor is given the entire child grid, since future distributions of finer grids can compensate for this inequality. If, however, we are distributing finest-level grids, then this splitting is used. The splitting is also used on the coarsest root grid, since its size can in general be very large.
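
The share calculation can be sketched as follows. This is an illustrative reconstruction rather than AstroBEAR's exact formula, and all names are hypothetical: each processor's share of the newly created workload is taken to be the global average of (existing plus new) load minus the processor's existing load, clamped at zero.

  ! Illustrative global load-balancing share calculation (hypothetical names).
  ! Each processor shares two numbers: its current workload per level step and the
  ! load of its newly created child grids per level step. The new work is then
  ! apportioned so every processor is brought toward the global average load.
  subroutine calc_new_load_share(myLoad, myNewChildLoad, myShare)
     use mpi
     implicit none
     real(8), intent(in)  :: myLoad           ! existing workload per level step
     real(8), intent(in)  :: myNewChildLoad   ! workload of newly created children
     real(8), intent(out) :: myShare          ! portion of the new workload to assign here
     real(8) :: myVals(2), totals(2), targetLoad
     integer :: nProcs, ierr

     myVals = (/ myLoad, myNewChildLoad /)
     call MPI_Allreduce(myVals, totals, 2, MPI_DOUBLE_PRECISION, MPI_SUM, &
                        MPI_COMM_WORLD, ierr)
     call MPI_Comm_size(MPI_COMM_WORLD, nProcs, ierr)

     ! Target load is the global average of existing plus new work; processors that
     ! are already above the target receive no new grids.
     targetLoad = (totals(1) + totals(2)) / real(nProcs, 8)
     myShare = max(0.0d0, targetLoad - myLoad)
  end subroutine calc_new_load_share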

Hyperbolic engine

Currently AstroBEAR's hyperbolic solver uses a Godunov-type unsplit integrator based on the CTU+CT scheme (Stone & Gardiner 08). Unsplit integrators, however, often require many intermediate variables to be stored globally during a grid update, which can require 10-50x the space needed to store the array of conservative variables. For GPU-type systems, this would considerably restrict the size of a grid that could be updated. In AstroBEAR we implement a sweep method that pipelines any sequence of calculations into a one-dimensional pass across the grid, in which intermediate variables are stored only as long as they are needed. This is ideally suited for GPU calculations, in which a CPU could continuously send in new field values and retrieve updated field values while the GPU uses a minimum of memory.

The pipelining is done automatically provided that the dependencies between stencil pieces are explicitly stated. For example, consider a simple 2D first-order Godunov method in which the initial state is stored in q and the updated fields are stored in Q. The x and y fluxes are stored in fx and fy, and the left and right interface states are stored in qLx, qLy and qRx, qRy respectively. We also adopt the convention that stencil pieces stored on cell edges (i.e., qLx, qRx, fx) at position i-1/2 are stored in their respective arrays with index i. The stencil dependencies can then be expressed as:

Stated dependency                     | Actual calculation
Set_Dependency(qLx, q, (/-1,-1,0,0/)) | qLx%data(i,j) = q%data(i-1,j)
Set_Dependency(qRx, q, (/0,0,0,0/))   | qRx%data(i,j) = q%data(i,j)
Set_Dependency(qLy, q, (/0,0,-1,-1/)) | qLy%data(i,j) = q%data(i,j-1)
Set_Dependency(qRy, q, (/0,0,0,0/))   | qRy%data(i,j) = q%data(i,j)
Set_Dependency(fx, qLx, (/0,0,0,0/))  | fx%data(i,j) = riemann(qLx%data(i,j), qRx%data(i,j))
Set_Dependency(fx, qRx, (/0,0,0,0/))  |
Set_Dependency(fy, qLy, (/0,0,0,0/))  | fy%data(i,j) = riemann(qLy%data(i,j), qRy%data(i,j))
Set_Dependency(fy, qRy, (/0,0,0,0/))  |
Set_Dependency(Q, fx, (/0,1,0,0/))    | Q%data(i,j) = q%data(i,j) + fx%data(i,j) - fx%data(i+1,j)
Set_Dependency(Q, fy, (/0,0,0,1/))    |                           + fy%data(i,j) - fy%data(i,j+1)
Set_Dependency(Q, q, (/0,0,0,0/))     |

Now, given the region of Q we want updated, we can work backwards to determine over what range of indices we need to calculate fx, fy, qRx, qLx, qRy, qLy, and q. To pipeline the calculation we then construct a sliding window for each stencil piece that passes across the grid from left to right. The dependencies determine the window's lead (i.e., how far ahead of the update of Q the stencil's value must be calculated) as well as the window's width (i.e., how long we need to hold on to values for other stencil pieces to use). For example, in the following figure we have a stencil piece s that leads the update of Q by two columns and whose window is 3 cells wide.

http://www.pas.rochester.edu/~johannjc/Papers/Carroll2011/SweepWindow.png

For the 2D example above this would give the following windows:

http://www.pas.rochester.edu/~johannjc/Papers/Carroll2011/SweepDemo.png
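
The storage for such a window can be implemented as a circular buffer whose width equals the number of columns over which the stencil piece must be retained. The following is a minimal sketch under that assumption; the module, type, and routine names are illustrative, not AstroBEAR's actual interface:

  ! Minimal sketch of a sliding-window buffer for one stencil piece.
  ! Only 'width' columns are kept in memory; column i of the full grid is stored
  ! in slot mod(i, width) + 1, so old columns are overwritten once no remaining
  ! stencil piece needs them.
  module sweep_window
     implicit none
     type window
        integer :: lead      ! how many columns this piece is computed ahead of Q
        integer :: width     ! number of columns that must be retained
        integer :: ny        ! number of cells in the transverse (y) direction
        real(8), allocatable :: buf(:,:)   ! buf(slot, j)
     end type window
  contains
     subroutine init_window(w, lead, width, ny)
        type(window), intent(inout) :: w
        integer, intent(in) :: lead, width, ny
        w%lead  = lead
        w%width = width
        w%ny    = ny
        allocate(w%buf(width, ny))
        w%buf = 0d0
     end subroutine init_window

     ! Map a global column index i to its slot in the circular buffer.
     pure integer function slot(w, i)
        type(window), intent(in) :: w
        integer, intent(in) :: i
        slot = mod(i, w%width) + 1
     end function slot
  end module sweep_window

As the sweep advances from column i to i+1, each stencil piece's value at column i + lead is computed into slot(w, i + lead), reusing the storage of the column that has just fallen out of every remaining dependency.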

Source engine

AstroBEAR's source engine uses a two-step Runge-Kutta scheme to update values within specified error bounds. If the estimated error is too large, it switches to a Runge-Kutta 4-5 scheme with variable time stepping to achieve the desired accuracy.
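
A hedged sketch of this strategy for a single source term dq/dt = S(q) is given below; the error estimate, tolerance, placeholder source term, and the rk45 fallback stub are illustrative assumptions rather than AstroBEAR's actual implementation.

  module source_engine_sketch
     implicit none
  contains
     ! Placeholder source terms dq/dt = S(q); a real problem would supply its own.
     function source_terms(q) result(s)
        real(8), intent(in) :: q(:)
        real(8) :: s(size(q))
        s = -q            ! simple linear decay, for illustration only
     end function source_terms

     ! Two-step (midpoint) Runge-Kutta update with an embedded error estimate and
     ! a fallback to adaptive RK45 when the estimate exceeds the tolerance.
     subroutine source_update(q, dt, tol)
        real(8), intent(inout) :: q(:)
        real(8), intent(in)    :: dt, tol
        real(8), dimension(size(q)) :: k1, k2, qEuler, qRK2
        real(8) :: err

        k1 = source_terms(q)
        qEuler = q + dt*k1                   ! first-order (Euler) result
        k2 = source_terms(q + 0.5d0*dt*k1)
        qRK2 = q + dt*k2                     ! second-order (midpoint) result

        ! Difference between the two orders serves as the local error estimate.
        err = maxval(abs(qRK2 - qEuler) / max(abs(q), 1d-12))

        if (err <= tol) then
           q = qRK2
        else
           call rk45_adaptive(q, dt, tol)    ! placeholder for the variable-step RK45 path
        end if
     end subroutine source_update

     subroutine rk45_adaptive(q, dt, tol)    ! stub standing in for the RK45 integrator
        real(8), intent(inout) :: q(:)
        real(8), intent(in)    :: dt, tol
     end subroutine rk45_adaptive
  end module source_engine_sketch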

Elliptic engine

Solving elliptic equations across an unstructured mesh requires several iterations of communication interleaved with computation, so we decided to use the HYPRE library. It operates on both structured and semi-structured meshes; however, our use is limited to structured meshes. In order to maintain level advance independence for threading purposes, it was necessary to use forward Euler time integration of the coarse elliptic variables to supply the boundary conditions for finer-level grids.
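
The fine-grid boundary condition can thus be obtained by advancing the coarse solution forward in time rather than waiting on the coarse level's next elliptic solve. The sketch below is one possible reading of this, with a finite-difference time derivative integrated forward to the fine-level time; all names are hypothetical and the scheme is illustrative, not AstroBEAR's exact implementation.

  ! Illustrative forward-Euler advance of a coarse-level elliptic variable
  ! (e.g. the potential) to supply boundary values for a fine-level solve at an
  ! intermediate time, without waiting on the next coarse solve.
  function coarse_phi_boundary(phiOld, phiNew, tCoarseOld, tCoarseNew, tFine) result(phiBC)
     implicit none
     real(8), intent(in) :: phiOld(:,:), phiNew(:,:)   ! coarse solution at the last two coarse times
     real(8), intent(in) :: tCoarseOld, tCoarseNew, tFine
     real(8) :: phiBC(size(phiOld,1), size(phiOld,2))
     real(8) :: dphidt(size(phiOld,1), size(phiOld,2))

     ! Time derivative estimated from the previous coarse step...
     dphidt = (phiNew - phiOld) / (tCoarseNew - tCoarseOld)
     ! ...then integrated forward (Euler) to the fine-level time.
     phiBC = phiNew + dphidt * (tFine - tCoarseNew)
  end function coarse_phi_boundary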

The Super-Gridding experiment

One of the goals in designing AstroBEAR was to keep the algorithms simple and to have all data manipulation routines operate on individual patches (or pairs of patches). However, once the stencil dependencies are explicitly stated, it becomes possible to make the hyperbolic advances much more sensitive to the data that is actually available. This allows each processor to perform computations required by neighboring patches only once instead of twice, and allows processors to skip computations that would update coarse cells whose values are about to be replaced with data from fine cells. For small adjacent patches or completely refined grids this can reduce the amount of computation by 50-100%. To this end we implemented a super-gridding scheme that worked basically as follows:

  • Collect physically adjacent grids on each processor into super-grids.
  • For each super-grid, flag the cells that need to be updated.
  • Using the stencil dependencies, work backwards to flag the locations where each stencil piece needs to be calculated (see the sketch after this list).
  • Then sweep across the super-grid performing only the necessary computations.
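
Working backwards from the cells flagged for update, each stencil piece's required region is obtained by growing the dependent region by the stated dependency offsets. A minimal sketch of that dilation step for a single box-shaped region, using the (/xlo,xhi,ylo,yhi/) offset convention from the table above (routine and argument names are illustrative):

  ! Given a box over which a stencil piece 'child' must be available and a
  ! dependency child <- parent with offsets (/xlo,xhi,ylo,yhi/), the parent must
  ! be calculated over the dependent box grown by those offsets.
  subroutine required_parent_box(childLo, childHi, offsets, parentLo, parentHi)
     implicit none
     integer, intent(in)  :: childLo(2), childHi(2)  ! box where 'child' is needed
     integer, intent(in)  :: offsets(4)              ! (/xlo, xhi, ylo, yhi/) dependency offsets
     integer, intent(out) :: parentLo(2), parentHi(2)

     parentLo(1) = childLo(1) + offsets(1)   ! e.g. qLx <- q with (/-1,-1,0,0/) extends one cell left
     parentHi(1) = childHi(1) + offsets(2)   ! e.g. Q <- fx with (/0,1,0,0/) extends one cell right
     parentLo(2) = childLo(2) + offsets(3)
     parentHi(2) = childHi(2) + offsets(4)
  end subroutine required_parent_box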

Of course, storing global mask arrays over the entire super-grid for each stencil piece is memory intensive, so instead we implemented a sparse mask storage that is essentially a collection of non-intersecting boxes marking the regions to calculate.
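
A minimal sketch of such a sparse mask, representing the flagged region as a list of non-intersecting boxes (the module, type, and routine names are illustrative, not AstroBEAR's own):

  ! Sparse mask: a flagged region stored as a list of non-intersecting boxes
  ! rather than a full logical array over the super-grid.
  module sparse_mask
     implicit none
     type box
        integer :: lo(2), hi(2)              ! inclusive cell bounds in 2D
     end type box
     type mask
        type(box), allocatable :: boxes(:)   ! non-intersecting boxes covering the flagged cells
     end type mask
  contains
     ! Query whether cell (i,j) is flagged for calculation (assumes boxes is allocated).
     logical function is_flagged(m, i, j)
        type(mask), intent(in) :: m
        integer, intent(in) :: i, j
        integer :: b
        is_flagged = .false.
        do b = 1, size(m%boxes)
           if (i >= m%boxes(b)%lo(1) .and. i <= m%boxes(b)%hi(1) .and. &
               j >= m%boxes(b)%lo(2) .and. j <= m%boxes(b)%hi(2)) then
              is_flagged = .true.
              return
           end if
        end do
     end function is_flagged
  end module sparse_mask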

We then modified the above algorithm so that processors could prioritize computations to better overlap communication with computation, as follows:

  • Collect physically adjacent grids on each processor into super-grids.
  • For each super-grid, flag the cells that need to be updated.
  • Using the stencil dependencies, work backwards to flag the locations where each stencil piece needs to be calculated.
  • Using the old patches containing the previous level's data, determine which of those calculations can be performed prior to receiving data from overlaps, and begin performing those while waiting for the overlap data.
  • Determine which fluxes and EMFs will need to be synchronized with neighboring processors.
  • Work backwards to determine which remaining calculations need to be performed prior to sending data to neighbors, and perform those calculations.
  • Send fluxes, and then perform the remaining calculations that can be done before receiving data from children while waiting for the child data.
  • Using the data from children, continue performing the calculations that can be done before receiving data from neighbors while waiting for the neighbor data.
  • Using the neighbor data, complete all calculations required to advance the patches.

Unfortunately, the computational cost of keeping track of all of these logical arrays, together with the additional shuffling of data back and forth, became comparable to the savings from the reduced number of stencil computations. It may be possible, however, to improve the algorithms for managing the sparse logical arrays and to design an efficient algorithm that avoids redundant computations on the same processor for unsplit integration schemes.

Performance Results

For our weak scaling tests we advected a magnetized cylinder across the domain until it was displaced by one cylinder radius. The size of the cylinder was chosen to give a filling fraction of approximately 12.5%, so that in the AMR run the workload on the first refined level was comparable to that of the base level. The resolution of the base grid was adjusted to maintain 64^3 cells per processor, and we found that the weak scaling for both fixed grid and AMR runs is reasonable out to 2048 processors.

http://www.pas.rochester.edu/~johannjc/Papers/Carroll2011/ScalingResults.png

AMR information about other codes

Code        | AMR Library | Type              | Globally stored tree | Globally communicated tree | Distribution algorithm
Ramses      |             | cell based        | yes                  |                            |
Flash/Ram   | Paramesh    | fixed patch based | no                   | yes                        | counting
Orion/Pluto | Chombo      | patch based       | yes                  | yes                        | knapsack
Enzo        |             | patch based       | yes                  | yes                        |
Nirvana     | independent | fixed patch based | maybe not            | maybe not (reTrieval)      | strict data locality (counting) box sharing
AMRVAC      |             | fixed patch       | think so             | think so                   | counting


