Load balancing

A coupled code is only as fast as its slowest submodel, so there is unlikely to be any speed-up from giving extra PEs to the faster submodels. However, determining which submodels should receive any extra PEs is something of an art form.

I believe that up until now, code has been added to the submodels to determine which submodels are waiting for one another (see Richard's documentation). Jean-Christophe has introduced a paper describing a possible analytical solution, which should predict how fast each submodel runs as a function of the number of processing elements (PEs), and, for a given total number of PEs, the best way of sharing them between the submodels.

Any prediction from an analytical solution is still likely to require the techniques described by Richard, but it should provide a very good first guess and reduce the amount of fine tuning needed.

Analytic prediction for submodel time versus number of PEs

According to the paper forwarded by JC, the time taken for a given submodel to run is given by

T(n) = T_scalar + T_serial + T_non-linear

where

  • n is the number of PEs,
  • T_scalar is the time spent in the part of the code which scales perfectly with the number of PEs, and equals a/n,
  • T_serial is the time spent in the non-parallelized portion of the code, and equals d, and
  • T_non-linear is the time spent in the partially parallelized portion of the code - so everything not covered by the first two terms. The authors of this paper have made this equal to b*n^c. I think this function is fairly arbitrarily chosen. The term must be positive, so b must be greater than zero; and since this is the only term that can increase with the number of PEs - which we know the total time eventually does as n increases - c must also be positive.

Hence

T(n) = a/n + b*n^c + d

where I've switched the order of T_serial and T_non-linear from the equation above to be consistent with the paper, and I've chosen T(n) to be the time in days for one model year.

Clearly this equation takes no account of grid-mapping. I think this is OK provided a semi-sensible grid-mapping is chosen: the impact of grid-mapping on model time is largely secondary to the choice of the number of PEs.
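As a quick illustration of the equation, here is a minimal Python sketch (the function names are my own, not anything from calcConst.py or modelPred.py) which evaluates T(n) and the corresponding speed in model years per day, using the Offline Oxidants at N216 constants from the table further down.

def model_time(n, a, b, c, d):
    # T(n) = a/n + b*n^c + d, in days per model year
    return a / n + b * n**c + d

def model_speed(n, a, b, c, d):
    # model years per day is just the reciprocal of T(n)
    return 1.0 / model_time(n, a, b, c, d)

# Offline Oxidants at N216 constants (see the 12 December 2014 table below)
print(model_speed(1924, a=393.3, b=3.209e-11, c=2.756, d=0.2975))  # about 1.86 yrs/day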

Solving this equation requires choosing a, b, c and d to minimise the sum of the squared errors over all the measurements available, where the number of time measurements must be four or greater (with exactly four measurements the fit is exact, so there is no error to minimise). This is a non-linear least-squares problem, which I solve using the Gauss-Newton algorithm. The non-linear form of the equations means that this is an iterative approach, and while each iteration produces a vector in the descent direction, I found that only a proportion of it can be used on each iteration to improve the solution (see the section on `Improved versions' in Gauss-Newton algorithm).
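For reference, a minimal sketch of a damped Gauss-Newton fit of this kind is given below. This is not the actual contents of calcConst.py (described next); it assumes numpy is available, and the step-damping factor alpha plays the role of the proportion of the descent vector mentioned above.

import numpy as np

def fit_constants(n_pes, times, x0=(100.0, 1.0, 0.5, 0.0), max_iter=200, tol=1e-10):
    # Fit T(n) = a/n + b*n^c + d to measured times (days per model year)
    # by damped Gauss-Newton. x0 is an arbitrary starting guess.
    x = np.array(x0, dtype=float)
    n = np.asarray(n_pes, dtype=float)
    t = np.asarray(times, dtype=float)

    def residuals(p):
        a, b, c, d = p
        return t - (a / n + b * n**c + d)

    for _ in range(max_iter):
        a, b, c, d = x
        r = residuals(x)
        # Jacobian of the residuals with respect to (a, b, c, d)
        J = np.column_stack([-1.0 / n,
                             -n**c,
                             -b * n**c * np.log(n),
                             -np.ones_like(n)])
        # Gauss-Newton step: least-squares solution of J*dx = -r
        dx = np.linalg.lstsq(J, -r, rcond=None)[0]
        # Only use a proportion (alpha) of the step: halve alpha until the
        # sum of squared errors actually decreases
        alpha = 1.0
        while alpha > 1e-12 and np.sum(residuals(x + alpha * dx)**2) >= np.sum(r**2):
            alpha *= 0.5
        x = x + alpha * dx
        if np.linalg.norm(alpha * dx) < tol:
            break
    return x  # the fitted (a, b, c, d)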

I've written calcConst.py in /net/home/h06/mstringe/python/modelPred to calculate the constants (a, b, c and d). The input times can be found in *.times files in /net/home/h06/mstringe/python/modelPred/data, and the constants for, say, offline oxidants at N96 are calculated with

python calcConst.py data/offOxN96.times

which should produce a plot (if successful) of a curve matching the data and store the constants in data/offOxN96.const. (This is the first code I've written in Python; it seems fairly easy to manipulate matrices in it.)

Timings on 12 December 2014

The constants calculated from these timings are

Submodel | a | b | c | d
Offline Oxidants at N96 +* | 78.22 | 84.13 | 7.04e-4 | -84.41
Full chemistry at N96 * | 218.9 | 20.83 | 5.721e-3 | -21.47
MEDUSA on ORCA025 (got by assuming this is the slowest component in Yongming's UKESM0.1 runs) | 300.2 | N/A x | N/A x | 0.3433
Offline Oxidants at N216 | 393.3 | 3.209e-11 | 2.756 | 0.2975
Full chemistry at N216 | 1281 | 4.565e-4 | 1.025 | 0.2995
+ Didn't reach the convergence criterion before being unable to find a small enough alpha.
* Constant c is so small for these models that it would probably be better to say n^c is 1 and create a new d equal to (b+d), i.e. remove the non-linear term. The reason c is so small is probably that we don't have times for high n, so we haven't really explored the non-linear part of the code.
x Only three timings, so it is impossible to solve a four-constant equation; the non-linear term was therefore set to zero.

Determining the decomposition

I can't say that I really understood the part of the paper that JC gave me which discusses how to determine how many PEs to give to each submodel. However, it seems to me that for a given number of PEs and set of submodels, it's largely just a case of finding the greatest number of model years per day, speed_best, that can be achieved. If we ignore the part of our equation beyond the peak performance - because we're not interested in using more PEs than this anyway - the curves are just monotonically increasing. Hence we can find the greatest speed that all submodels can achieve by:

  • assign PEs to each submodel,
  • use our equation to determine the speed of each component,
  • transfer PEs from the fastest component to the slowest component (I've used a Taylor expansion of our equation to guess how many PEs to transfer),
  • return to the first item until the speeds of all components have converged.

It's possible that we might reach the peak speed of one of the submodels - in which case this is the best speed that can be achieved for the coupled code, and there's no point giving the other submodels more PEs than are needed to reach it. Once the loop has converged, we've got the greatest speed which can be achieved by all submodels involved according to our equation. A rough sketch of this loop is given below.
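The sketch below assumes the fitted constants are to hand; the function names are illustrative rather than taken from modelPred.py, and for simplicity it ignores the case where a submodel hits its peak speed.

def balance_pes(total_pes, consts, n_iter=1000, tol=1e-9):
    # consts: dict of submodel name -> (a, b, c, d)
    # Start from an even split and repeatedly move PEs from the fastest
    # submodel to the slowest until their times (days per model year) agree.
    names = list(consts)
    pes = {m: total_pes / float(len(names)) for m in names}

    def T(m):
        a, b, c, d = consts[m]
        return a / pes[m] + b * pes[m]**c + d

    def dTdn(m):
        # dT/dn, used for the Taylor-expansion guess of how many PEs to move
        a, b, c, d = consts[m]
        return -a / pes[m]**2 + b * c * pes[m]**(c - 1)

    for _ in range(n_iter):
        slowest = max(names, key=T)
        fastest = min(names, key=T)
        gap = T(slowest) - T(fastest)
        if gap < tol:
            break
        # First-order estimate of the transfer dn that equalises the two:
        #   T(slowest) + dTdn(slowest)*dn = T(fastest) - dTdn(fastest)*dn
        dn = -gap / (dTdn(slowest) + dTdn(fastest))
        # don't take more than half the fastest submodel's PEs in one go
        dn = min(dn, 0.5 * pes[fastest])
        pes[slowest] += dn
        pes[fastest] -= dn
    return pes  # (generally non-integer) PEs per submodel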

While the number of PEs for each submodel should add up to the total PEs available (unless one of the submodels has peaked), the PEs assigned to each submodel are unlikely to be integers, or to correspond to a decomposition which is possible for that submodel. Hence it is necessary to round each submodel down to its nearest possible decomposition, and then rank the submodels with the slowest component first. We can then see if there are enough spare PEs to round the slowest submodel up to its next available decomposition. If not, there is no point speeding up the other submodels, because this will still be the slowest component. If yes, we can assign the rounded-up decomposition to this submodel and move on to the next slowest submodel to see if the same can be done for it. This rounding step is sketched below.
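In the sketch, the valid_counts argument (the list of PE counts that correspond to possible decompositions for each submodel) and the function names are my own illustrations, not the interface of modelPred.py.

def snap_to_decompositions(ideal_pes, valid_counts, total_pes, time_at):
    # ideal_pes:    submodel -> (non-integer) PE count from the balancing loop
    # valid_counts: submodel -> sorted list of PE counts with a possible decomposition
    # time_at:      function (submodel, n) -> days per model year, i.e. our T(n)
    # First round every submodel down to its nearest possible decomposition.
    chosen = {m: max(n for n in valid_counts[m] if n <= ideal_pes[m])
              for m in ideal_pes}
    spare = total_pes - sum(chosen.values())
    # Then work through the submodels slowest first, rounding up where possible.
    for m in sorted(chosen, key=lambda s: time_at(s, chosen[s]), reverse=True):
        bigger = [n for n in valid_counts[m] if n > chosen[m]]
        if not bigger:
            continue
        step = bigger[0] - chosen[m]
        if step > spare:
            # can't speed up the slowest remaining component, so there's no
            # point rounding any of the faster ones up either
            break
        chosen[m] += step
        spare -= step
    return chosen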

Examples of running code

The Python code can be found in /home/h06/mstringe/python/modelPred

Given 4000 PEs, determine how best to divide them between Offline Oxidants at N216, Full chemistry at N96 and MEDUSA on ORCA025

python modelPred.py --npes=4000 --submodels=offOxN216,fullChemN96,medOrca025
...
*** Decomposition stats ***
The coupled code is expected to run at 1.858yrs/day on 3996PEs, which are shared across
 - medOrca025 at 1.858yrs/day on 1540PEs (20x77,22x70,28x55,44x35,70x22)
 - offOxN216 at 1.859yrs/day on 1924PEs (26x74,52x37,74x26)
 - fullChemN96 at 1.860yrs/day on 532PEs (28x19)

Determine how many PEs are needed to run at 1.5 model years/day for Offline Oxidants at N216, Full chemistry at N96 and MEDUSA on ORCA025

python modelPred.py --speed=1.5 --submodels=offOxN216,fullChemN96,medOrca025
...
*** Speed stats ***
Number of PEs needed to reach 1.500yrs/day are
 - offOxN216
  - 1.500yrs/day for 1088PEs (32x34,34x32,64x17)
  - alternatively, 1.495yrs/day for 1080PEs (18x60,20x54,24x45,30x36,36x30,40x27,54x20,60x18)
 - fullChemN96
  - 1.497yrs/day for 374PEs (22x17)
  - alternatively, 1.508yrs/day for 378PEs (18x21)
 - medOrca025
  - 1.500yrs/day for 928PEs (16x58,32x29,58x16)
  - alternatively, 1.501yrs/day for 930PEs (30x31)