A coupled code is only as fast as its slowest submodel, so throwing extra PEs at the faster submodels is unlikely to give any speed-up. However, determining which submodels should receive any extra PEs is something of an art form.
I believe that, up until now, code has been added to the submodels to determine which submodels are waiting for one another (see Richard's documentation). Jean-Christophe has introduced a paper describing a possible analytical solution, which should predict how fast each submodel should run as a function of the number of processing elements (PEs), and, for a given total number of PEs, the best way of sharing them out.
Any prediction from an analytical solution is still likely to need checking with the techniques described by Richard, but it should provide a very good first guess and reduce the amount of fine tuning those techniques require.
According to the paper forwarded by JC, the time taken for a given submodel to run is

T(n) = T_scalar + T_serial + T_non-linear

where T_scalar = a/n is the part of the code that scales with the number of PEs, T_serial = d is the part that takes the same time regardless of the number of PEs, and T_non-linear = b n^c is the part whose cost grows with the number of PEs. Hence

T(n) = a/n + b n^c + d

where I've switched the order of T_serial and T_non-linear from the equation above to be consistent with the paper, and I've chosen T(n) to be the time in days for one model year.
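Written as code, this is simply the following (the variable names here are mine):

```python
def model_time(n, a, b, c, d):
    """Predicted time, in days, for one model year on n PEs."""
    return (a / n        # T_scalar: the part that speeds up as PEs are added
            + b * n**c   # T_non-linear: the part that grows as PEs are added (c > 0)
            + d)         # T_serial: the part that doesn't change with n
```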
Clearly this equation takes no account of grid-mapping. I think this is OK provided a semi-sensible grid-mapping is chosen: its impact on model time is largely secondary to the choice of the number of PEs.
Solving this equation requires choosing the a, b, c and d which minimise the sum of the squared errors over all the available time measurements, of which there must be at least four (with exactly four measurements the fit is exact, so there is no error to minimise). This is a non-linear least-squares problem, which I solve with the Gauss-Newton algorithm. The non-linear form of the equations means the approach is iterative, and while each iteration produces a vector in the descent direction, I found that only a proportion of it can be used on each iteration to improve the solution (see the 'Improved versions' section of the Gauss-Newton algorithm).
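For illustration, a damped Gauss-Newton fit of this form might look something like the sketch below (the starting values and step fraction are assumptions of mine; this is a reconstruction of the approach, not a copy of calcConst.py):

```python
import numpy as np

def fit_constants(n, t, p0=(100.0, 1.0, 0.5, 1.0), step=0.3, tol=1e-9, max_iter=500):
    """Fit a, b, c, d of T(n) = a/n + b*n**c + d to measured times t by least
    squares, using Gauss-Newton steps damped by 'step' (only a fraction of
    each descent vector is applied per iteration).  p0 is a starting guess
    and in practice needs to be vaguely sensible."""
    n, t = np.asarray(n, float), np.asarray(t, float)
    a, b, c, d = p0
    for _ in range(max_iter):
        pred = a / n + b * n**c + d
        r = t - pred                                  # residuals
        # Jacobian of the model with respect to (a, b, c, d)
        J = np.column_stack([1.0 / n,                 # dT/da
                             n**c,                    # dT/db
                             b * n**c * np.log(n),    # dT/dc
                             np.ones_like(n)])        # dT/dd
        dp, *_ = np.linalg.lstsq(J, r, rcond=None)    # Gauss-Newton descent vector
        a, b, c, d = np.array([a, b, c, d]) + step * dp
        if np.linalg.norm(dp) < tol:                  # converged
            break
    return a, b, c, d
```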
I've written calcConst.py in /net/home/h06/mstringe/python/modelPred to calculate the constants (a, b, c and d). The input times can be found in the *.times files in /net/home/h06/mstringe/python/modelPred/data, and the constants are calculated for, say, offline oxidants at N96 with
python calcConst.py data/offOxN96.times
which, if successful, should produce a plot of the fitted curve against the data and store the constants in data/offOxN96.const. (This is the first code I've written in Python, and it seems fairly easy to manipulate matrices in it.)
The constants calculated from this are
Submodel | a | b | c | d |
---|---|---|---|---|
Offline Oxidants at N96+* | 78.22 | 84.13 | 7.04e-4 | -84.41 |
Full chemistry at N96* | 218.9 | 20.83 | 5.721e-3 | -21.47 |
MEDUSA on ORCA025 (obtained by assuming this is the slowest component in Yongming's UKESM0.1 runs) | 300.2 | N/A | 0.3433 | |
Offline Oxidants at N216 | 393.3 | 3.209e-11 | 2.756 | 0.2975 |
Full chemistry at N216 | 1281 | 4.565e-4 | 1.025 | 0.2995 |
I can't say that I really understood the part of the paper that JC gave me which discusses how to determine how many PEs to assign to each submodel. However, it seems to me that for a given number of PEs and set of submodels, it's largely just a case of finding the greatest number of model years per day, speed_best, that can be achieved. If we ignore the part of our equation beyond the peak performance (we're not interested in using more PEs than that anyway), the speed curves are just monotonically increasing with the number of PEs. Hence we can find the greatest speed that all submodels can achieve by iterating on a trial speed: work out how many PEs each submodel needs to reach it, and adjust the speed until the total matches the number of PEs available.
This loop is only exited once we have convergence. It's possible that we might reach the peak speed of one of the submodels, in which case this is the best speed that can be achieved by the coupled code, and there's no point giving the other submodels more PEs than they need to reach it. Once we've got convergence, we've got the greatest speed that can be achieved by all the submodels involved, according to our equation.
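As a rough sketch of how such a search could be coded, here is a bisection on the trial speed using the fitted constants above (the names and structure are mine and not necessarily how modelPred.py does it):

```python
# Fitted constants from the table above: T(n) = a/n + b*n**c + d (days per model year)
CONSTS = {
    "offOxN216":   (393.3, 3.209e-11, 2.756, 0.2975),
    "fullChemN96": (218.9, 20.83, 5.721e-3, -21.47),
}

def time_per_year(n, a, b, c, d):
    """Days needed for one model year on n PEs."""
    return a / n + b * n**c + d

def pes_needed(speed, a, b, c, d):
    """Smallest (real-valued) number of PEs giving at least 'speed' model
    years/day, or None if the speed is beyond this submodel's peak.
    Only the monotonically decreasing branch of T(n) is used, i.e. PE
    counts below the peak at n = (a/(b*c))**(1/(c+1))."""
    target = 1.0 / speed                          # days per model year
    n_peak = (a / (b * c)) ** (1.0 / (c + 1.0))   # where dT/dn = 0
    if time_per_year(n_peak, a, b, c, d) > target:
        return None                               # unreachable even at the peak
    lo, hi = 1.0, n_peak
    for _ in range(100):                          # bisection on n
        mid = 0.5 * (lo + hi)
        if time_per_year(mid, a, b, c, d) > target:
            lo = mid
        else:
            hi = mid
    return hi

def best_speed(total_pes, submodels, tol=1e-6):
    """Greatest common speed (model years/day) such that the PEs needed by
    all submodels fit within total_pes, ignoring decomposition rounding."""
    lo, hi = tol, 50.0                            # bracket on the trial speed
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        needed = [pes_needed(mid, *CONSTS[s]) for s in submodels]
        if None in needed or sum(needed) > total_pes:
            hi = mid                              # too ambitious: a peak was hit or PEs ran out
        else:
            lo = mid                              # feasible: try faster
    return lo, {s: pes_needed(lo, *CONSTS[s]) for s in submodels}
```

The MEDUSA entry is left out of the illustrative dictionary because its constants in the table above are incomplete; the result of this search also still needs rounding to valid decompositions, as described next.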
While the number of PEs for each submodel should add up to the total PEs available (unless one of the submodels has peaked), the PEs assigned to each submodel are unlikely to be integers, or to correspond to a decomposition that is possible for that submodel. Hence it is necessary to round each submodel down to its nearest possible decomposition, and then rank the submodels with the slowest first. We can then see whether there are enough spare PEs to round the slowest submodel up to its next available decomposition. If not, there is no point speeding up the other submodels, because this will still be the slowest component. If yes, we can assign the rounded-up decomposition to this submodel and move on to the next slowest to see if the same can be done for it.
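A rough sketch of that rounding pass is given below (the decomposition generator is an illustrative stand-in, since the decompositions actually available depend on the submodel, as the nx x ny lists in the examples further down show):

```python
def valid_pe_counts(max_pes, nx_range=range(2, 121), ny_range=range(2, 121)):
    """All PE counts reachable as an nx x ny decomposition, up to max_pes.
    Real submodels have extra constraints on nx and ny, so treat this as
    an illustrative stand-in."""
    return sorted({nx * ny for nx in nx_range for ny in ny_range if nx * ny <= max_pes})

def round_to_decompositions(ideal, total_pes, time_per_year):
    """ideal: {submodel: ideal (float) PE count from the speed search},
    time_per_year: {submodel: function n -> days per model year}.
    Round each submodel down to a valid decomposition, then try to round
    the slowest submodels back up while spare PEs remain."""
    # 1. Round each submodel down to its nearest possible decomposition.
    pes = {}
    for s, n in ideal.items():
        counts = valid_pe_counts(int(n))
        pes[s] = counts[-1] if counts else 1
    spare = total_pes - sum(pes.values())

    # 2. Slowest submodel first (largest time per model year on its rounded PEs).
    for s in sorted(pes, key=lambda m: time_per_year[m](pes[m]), reverse=True):
        bigger = [c for c in valid_pe_counts(total_pes) if c > pes[s]]
        if not bigger or bigger[0] - pes[s] > spare:
            break                # can't speed up the slowest component, so stop
        spare -= bigger[0] - pes[s]
        pes[s] = bigger[0]       # bump up to the next available decomposition
    return pes
```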
The Python code can be found in /home/h06/mstringe/python/modelPred. Some examples:
Given 4000 PEs, determine how best to divide them between Offline Oxidants at N216, Full chemistry at N96 and MEDUSA on ORCA025:
python modelPred.py --npes=4000 --submodels=offOxN216,fullChemN96,medOrca025
...
*** Decomposition stats ***
The coupled code is expected to run at 1.858yrs/day on 3996PEs, which are shared across
 - medOrca025 at 1.858yrs/day on 1540PEs (20x77,22x70,28x55,44x35,70x22)
 - offOxN216 at 1.859yrs/day on 1924PEs (26x74,52x37,74x26)
 - fullChemN96 at 1.860yrs/day on 532PEs (28x19)
Determine how many PEs are needed to run at 1.5 model years/day for Offline Oxidants at N216, Full chemistry at N96 and MEDUSA on ORCA025:
python modelPred.py --speed=1.5 --submodels=offOxN216,fullChemN96,medOrca025
...
*** Speed stats ***
Number of PEs needed to reach 1.500yrs/day are
 - offOxN216
   - 1.500yrs/day for 1088PEs (32x34,34x32,64x17)
   - alternatively, 1.495yrs/day for 1080PEs (18x60,20x54,24x45,30x36,36x30,40x27,54x20,60x18)
 - fullChemN96
   - 1.497yrs/day for 374PEs (22x17)
   - alternatively, 1.508yrs/day for 378PEs (18x21)
 - medOrca025
   - 1.500yrs/day for 928PEs (16x58,32x29,58x16)
   - alternatively, 1.501yrs/day for 930PEs (30x31)