
Version 10 (modified by Jonathan, 12 years ago)

Load Balancing

Alternate distribution algorithms — A summary of the communication/distribution systems used by other codes.

Load balancing without global communication is difficult, but not impossible. Scrambler accomplishes this by storing, for each node, a subtree computational cost map calculated from the previous generation of descendants, which is used to predict the subtree computational cost of each new child the node generates. This allows nodes to have pre-allocated resources for their children. The global communication associated with refinement is essentially reduced to a restriction of the costmap down to the root node. Of course, if the regions of AMR refinement change dramatically between time steps, then this predictive approach will need to be modified, but for most typical simulations it should allow for reasonable load balancing, with efficiencies of roughly 80%.

As well as having a costmap, each node also has a resource list of processors and a list of allocations for each of those processors. These resources and allocations are available to advance the node's children and their descendants. After creating its children, each node has to distribute its resources and allocations among them. Currently each node rescales its resources based on the predicted resources after creating its children.
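As a rough illustration (the dict-based node, processor, and allocation representations here are hypothetical, not Scrambler's actual data structures), splitting a node's per-processor allocations among its children in proportion to their predicted costs might look like:

```python
# Hypothetical sketch: split a node's per-processor allocations among
# its children in proportion to each child's predicted cost.

def distribute(allocations, child_costs):
    """Return one allocation dict per child, scaled by predicted cost."""
    total = sum(child_costs)
    return [{proc: alloc * cost / total for proc, alloc in allocations.items()}
            for cost in child_costs]

# A node holding all of processor 0 and half of processor 1,
# with two children predicted to cost 2.0 and 6.0 units of work:
shares = distribute({0: 1.0, 1: 0.5}, [2.0, 6.0])
print(shares)  # child 0 gets 1/4 of each allocation, child 1 gets 3/4
```

Each child would then recursively apply the same split to its own children, which is what keeps the scheme free of global communication.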

Estimating Work Load By Grid

Of course, accurately estimating the work load can be a little difficult, since it will not in general scale with just the number of cells. Ghost cell calculations and sub-cycling in source calculations can increase the work load associated with a grid. If we ignore source calculations, then the work load for any given grid will be a simple multilinear function of the grid size and the number of ghost zones. A fairly general form would be something like

t = A (mx + 2 b_c)(my + 2 b_c)(mz + 2 b_c)

Although a slightly more general form would be

t = A (mx + 2 b_c)(my + 2 b_c)(mz + 2 b_c) + B (mx + 2 b_c)(my + 2 b_c) + C (my + 2 b_c)(mz + 2 b_c) + D (mx + 2 b_c)(mz + 2 b_c) + E (mx + 2 b_c) + F (my + 2 b_c) + G (mz + 2 b_c) + H

Or if we treat b_c as a constant (so it can be absorbed into the coefficients)

t = A mx my mz + B mx my + C my mz + D mx mz + E mx + F my + G mz + H

We would still need to modify mx, my, and mz on the first step to include what will be the size of the updated grid, but b_c is then just solver%mbc. Note we have also ignored the effect of the size of the L2 cache…

These 9 constants (or 8 if we treat b_c as a constant) will in general depend on the number of dimensions and the choice of integration schemes/riemann solvers etc… Additionally they could depend on the compiler flags and the architecture and the size of the L2 cache even… For that reason it would be a good idea to profile each processor at the beginning of the simulation to calculate these constants.

For example, on grass in 2D, if we run tests with mx from 1 to 32 and my from 1 to 32 and fit to the slightly more generic function t = A mx my + B mx + C my + D, we get

A =  1.7415725275934024E-04
B =  8.5668721103086300E-04
C =  7.1076103230157414E-04
D =  3.9957714592475328E-03
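The profiling fit itself is a small linear least-squares problem. The sketch below shows how the four constants could be recovered at startup; since the raw advance times are not reproduced here, the "measured" timings are synthesized from the unoptimized coefficients quoted above, so the fit recovers them exactly.

```python
import numpy as np

# Sketch of the startup profiling fit: recover A, B, C, D in
# t = A*mx*my + B*mx + C*my + D from (mx, my, advance-time) samples.
A, B, C, D = 1.7415725e-4, 8.5668721e-4, 7.1076103e-4, 3.9957715e-3

mx, my = np.meshgrid(np.arange(1, 33), np.arange(1, 33))
mx, my = mx.ravel().astype(float), my.ravel().astype(float)
t = A*mx*my + B*mx + C*my + D      # synthetic advance times

# One column per term in the model; lstsq solves for the coefficients.
M = np.column_stack([mx*my, mx, my, np.ones_like(mx)])
coeffs, *_ = np.linalg.lstsq(M, t, rcond=None)
print(coeffs)  # ~ [A, B, C, D]
```

In practice each processor would time a sweep of small grid advances and run this fit on the measured times, giving per-processor constants that also absorb architecture and compiler effects.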

[Figure: data with fit to advance times for the general function]

[Figure: histogram of relative error for the general function]

If we instead assume the simple functional form f(mx,my) = A mx my, we get much larger errors:

[Figure: error histogram for the simple functional form f(x,y) = A x y]

And if we turn on optimization flags we get

A =  1.5968758191034787E-05
B =  8.7672454209933808E-05
C =  6.0130874797035275E-05
D =  4.5258365576023019E-04

which are factors of (10.9, 9.8, 11.8, & 8.8) lower than the un-optimized coefficients.
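As a quick sanity check on those speed-up factors, the ratios of the two coefficient sets quoted above can be computed directly:

```python
# Ratios of un-optimized to optimized fit coefficients (values quoted above).
unopt = [1.7415725275934024e-04, 8.5668721103086300e-04,
         7.1076103230157414e-04, 3.9957714592475328e-03]
opt = [1.5968758191034787e-05, 8.7672454209933808e-05,
       6.0130874797035275e-05, 4.5258365576023019e-04]

ratios = [round(u / o, 1) for u, o in zip(unopt, opt)]
print(ratios)  # [10.9, 9.8, 11.8, 8.8]
```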

Another quick thing to check is what happens if nDim, MaintainAuxArrays, and lMHD are turned into parameters instead of variables. Ideally the compiler would evaluate the conditional expressions at compile time instead of at run time and produce smaller, faster code. From these two runs it looks like this is ~1% faster, although more runs would need to be performed to see if this is statistically significant.

A =  1.5805271251806859E-05
B =  8.7947065595570862E-05
C =  5.9609249326644886E-05
D =  4.4828285580678106E-04

We could also cast the expression as

t = A' [ (mx + B')(my + C') + D' ]

where B', C', & D' are in units of computational cells (D' is in cells^2).

By expanding the above expression and comparing it term by term with the previous expression, we can express the primed coefficients in terms of the unprimed coefficients:

A' = A,  B' = C/A,  C' = B/A,  D' = (D - B C / A) / A

which yield

A' =  1.5968758191034787E-05
B' =  3.765532302
C' =  5.490248719 
D' =  7.668110194
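Plugging the optimized coefficients into those relations reproduces the primed values quoted above:

```python
# Optimized fit coefficients (quoted above).
A = 1.5968758191034787e-05
B = 8.7672454209933808e-05
C = 6.0130874797035275e-05
D = 4.5258365576023019e-04

# Primed coefficients from expanding t = A'[(mx + B')(my + C') + D']
# and matching terms against t = A*mx*my + B*mx + C*my + D.
Ap = A
Bp = C / A                  # coefficient of my  ->  A'*B' = C
Cp = B / A                  # coefficient of mx  ->  A'*C' = B
Dp = (D - B * C / A) / A    # constant term      ->  A'*(B'*C' + D') = D

print(Ap, Bp, Cp, Dp)  # ~ 1.597e-05, 3.766, 5.490, 7.668
```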

We could also write the previous expression as

t = A' (mx + B')(my + C') + A' D'

where we can now think of A' as the cost per cell, B' as the effective number of ghost zones in the x direction, C' as the effective number of ghost zones in the y direction, and D' as the effective number of overhead cells.

Estimating Work Load By Region

Currently the costmap for a grid on a given level stores the number of expected internal cell updates for each cell. Previously unrefined cells would have a costmap of 0. With a refinement ratio of 2, a singly refined cell would have a costmap of 2^(d+1) (2^d fine cells, each advanced twice per coarse step) and a doubly completely refined cell would have a costmap of 2^(d+1) + 4^(d+1), where d is the problem dimension.
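That geometric growth can be written as a short helper. The refinement ratio of 2 and the assumption that each level sub-cycles r times per parent step are illustrative here, not necessarily what the costmap code hard-codes:

```python
# Expected internal cell updates per coarse cell for a cell that is
# completely refined to depth n, with refinement ratio r in d dimensions,
# assuming each refinement level sub-cycles r times per parent step.
# Level k below the coarse cell holds r**(k*d) cells, each advanced
# r**k times per coarse step, contributing r**(k*(d+1)) updates.
def costmap_value(n, r=2, d=2):
    return sum(r ** (k * (d + 1)) for k in range(1, n + 1))

print(costmap_value(1), costmap_value(2))  # 8 72 in 2D
```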

What we would like to predict is the cost required for a child grid as well as all of its descendants… Since we know the size of the child grid, we can calculate the child cost directly. However, determining the cost associated with a child's descendants is not nearly as easy. Ideally we could construct a list of the number of expected grand child grids and their sizes, as well as great grandchildren grids, and so on. However the sizes of the descendant grids can vary from step to step based on the amr patch generating algorithm.

While the actual sizes and number of grids can vary, the total internal area covered by descendant grids should be a function of the region and the filling fraction, and should not change as quickly with time. In particular, the number of cells flagged for refinement should not change drastically from step to step.

Let's start by storing, on level ℓ, the costs associated with the region covered by each cell (i,j,k): the costs for grandchild grids and higher go in costmap(i,j,k,1), and the costs for child grids go in costmap(i,j,k,2), both over a single level step.

When coarsening costmaps for parent grids, we should combine S*costmap([i],[j],[k],1:2) into parent%costmap(I,J,K,1), where S is the number of our steps per parent step (i.e. the refinement ratio) and [i],[j],[k] represents the set of cells that lie within the parent cell (I,J,K). (We could accumulate the costmap over each of our steps, but since this is a time-varying quantity, it is better to just scale the most recent value by S instead of using a time average. We could also anticipate the costs by tracking the time derivative, but since this is unlikely to be a smooth function, that could create problems.)
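A minimal numpy sketch of that restriction step, assuming 2D arrays and a refinement ratio of 2 (the array names are illustrative, not Scrambler's):

```python
import numpy as np

# Restrict a fine-level costmap onto its parent: take S times the sum of
# the two cost components, then sum each S x S patch of fine cells into
# the parent cell that contains it.
S = 2                                                # fine steps per parent step
fine = np.arange(32, dtype=float).reshape(4, 4, 2)   # costmap(i, j, 1:2)

scaled = S * fine.sum(axis=-1)                       # S * costmap(:, :, 1:2)
nI, nJ = scaled.shape[0] // S, scaled.shape[1] // S
parent = scaled.reshape(nI, S, nJ, S).sum(axis=(1, 3))   # parent costmap(I, J, 1)

# Total cost is conserved (up to the factor S):
assert parent.sum() == S * fine.sum()
print(parent.shape)  # (2, 2)
```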

Each level thus receives costmap(:,:,:,1) from child grids directly (or indirectly via overlaps), so that only leaves the calculation of costmap(:,:,:,2).

Each grid at some point knows where its current children are, and can calculate the cost associated with each. However, what we would like to create is a costmap that can predict child costs for future grids that will physically overlap it. The sum of costmap(:,:,:,2) should equal sum(child%costpergrid), so that if nothing were to change before the next step, our parents would allocate an appropriate amount of resources for us to assign to our children, etc… But we do have some flexibility in how we distribute the actual child costs over our costmap(:,:,:,2).

Because of ghost cell costs, it no longer makes sense to think about a cost per cell. The smallest quantity we can define is really the cost per grid. There are then three different ways we could map the cost per grid for each grid into our costmap.

  • A.) We could average the cost per grid over the grid's cells to get an effective cost per cell. This would then be higher for regions covered by small child grids and lower for regions covered by large child grids.
    for each child with bounds ic(:,:)
      costmap(ic(1,1):ic(1,2),ic(2,1):ic(2,2),ic(3,1):ic(3,2),2)=child%costpergrid/product(ic(:,2)-ic(:,1)+1) 
    end
    
  • B.) We could average the cost for all of our child grids over all of the regions covered by child grids.
    for each child with bounds ic(:,:)
      totalcost=totalcost+child%costpergrid
      totalcells=totalcells+product(ic(:,2)-ic(:,1)+1) 
    end
    for each child with bounds ic(:,:)
      costmap(ic(1,1):ic(1,2),ic(2,1):ic(2,2),ic(3,1):ic(3,2),2)=totalcost/totalcells 
    end
    
    
  • C.) We could average the cost for all of our child grids over the entire costmap
    for each child with bounds ic(:,:)
      totalcost=totalcost+child%costpergrid
    end
    costmap(1:mX(1),1:mX(2),1:mX(3),2)=totalcost/product(mX)
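The three options can be sketched side by side in Python (2D, with hypothetical child grids; the child bounds are inclusive 1-based index ranges, mirroring the ic(:,:) arrays in the pseudocode above):

```python
import numpy as np

mX = (8, 8)                              # costmap extent
# Each child: (costpergrid, (i1, i2, j1, j2)) with inclusive 1-based bounds.
children = [(10.0, (1, 2, 1, 4)),        # small 2x4 grid
            (24.0, (5, 8, 3, 8))]        # larger 4x6 grid

def ncells(b):
    i1, i2, j1, j2 = b
    return (i2 - i1 + 1) * (j2 - j1 + 1)

def option_a():
    # A: each child's cost spread over that child's own footprint.
    cm = np.zeros(mX)
    for cost, (i1, i2, j1, j2) in children:
        cm[i1-1:i2, j1-1:j2] = cost / ncells((i1, i2, j1, j2))
    return cm

def option_b():
    # B: total child cost spread uniformly over all covered cells.
    total = sum(cost for cost, _ in children)
    covered = sum(ncells(b) for _, b in children)
    cm = np.zeros(mX)
    for _, (i1, i2, j1, j2) in children:
        cm[i1-1:i2, j1-1:j2] = total / covered
    return cm

def option_c():
    # C: total child cost spread uniformly over the entire costmap.
    total = sum(cost for cost, _ in children)
    return np.full(mX, total / (mX[0] * mX[1]))

# All three conserve the total predicted child cost of 34.0:
for cm in (option_a(), option_b(), option_c()):
    assert np.isclose(cm.sum(), 34.0)
```

They differ only in how sharply the cost is localized: A follows individual grids, B smears it over the refined region, and C smears it over everything, which is consistent with C performing worst in the AMR runs below.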
    

For a 2D field loop advection problem with a 16x8 base grid and 4 additional refinement levels, the three approaches gave the following balancing efficiencies (%) on 4 processors.

Option   AMR    SMR*
A        74.9   93.6
B        74.1   93.2
C        65.2   93.0

*In the SMR runs the grids were static, so it was easier for the scheduler to 'predict' the future computational loads. In fact the load balancer's expected efficiency was 97%. The ~4% mismatch between the expected and actual efficiencies is most likely due to the inability of the advance algorithm to return at a precise time. Threading would allow much greater flexibility, as it would allow the advance algorithms to be suspended at any time.
