We can also save time by improving the sharing of work across PEs, so here we look at the routines where the disparity between the PE doing the most work and the PE doing the least is largest.

After adding the fixes for tracer advection and the reading of offline oxidants, the most imbalanced routines are:
Routines ordered by the difference in self time (Backward-Euler run):

Routine | Min (s) | Mean (s) | Max (s) | Max-Min (s) |
---|---|---|---|---|
HALO_EXCHANGE:SWAP_BOUNDS_EW_H1_DP@1 | 109.151 (PE 124) | 250.96 | 532.441 (PE 82) | 423.29 |
HALO_EXCHANGE:SWAP_BOUNDS_NS_DP@1 | 155.161 (PE 67) | 320.53 | 538.596 (PE 5) | 383.44 |
UKCA_RADAER_BAND_AVERAGE@2 | 165.107 (PE 1) | 276.33 | 375.012 (PE 61) | 209.91 |
UKCA_RADAER_BAND_AVERAGE@1 | 171.179 (PE 1) | 272.65 | 372.948 (PE 84) | 201.77 |
UKCA_SYNC@1 | 22.303 (PE 76) | 88.27 | 222.959 (PE 2) | 200.66 |
SWAP_BOUNDS_MV@1 | 27.977 (PE 84) | 111.87 | 213.918 (PE 81) | 185.94 |
GLOBAL_2D_SUMS@1 | 25.577 (PE 25) | 46.40 | 176.974 (PE 112) | 151.40 |
UKCA_ABDULRAZZAK_GHAN@1 | 28.258 (PE 4) | 125.41 | 162.971 (PE 68) | 134.71 |
LS_PPNC@1 | 30.077 (PE 82) | 112.30 | 155.691 (PE 84) | 125.61 |
QWIDTH@2 | 4.545 (PE 82) | 58.41 | 119.03 (PE 68) | 114.48 |
QWIDTH@1 | 22.369 (PE 82) | 77.46 | 128.116 (PE 68) | 105.75 |
NI_CONV_CTL@1 | 24.152 (PE 84) | 40.95 | 88.027 (PE 49) | 63.88 |
GLUE_CONV_6A@1 | 34.565 (PE 81) | 64.06 | 98.055 (PE 58) | 63.49 |
GLUE_CONV_6A@2 | 36.772 (PE 82) | 63.42 | 97.206 (PE 58) | 60.43 |
SWAP_BOUNDS_2D_MV@1 | 4.24 (PE 61) | 24.12 | 60.442 (PE 15) | 56.20 |
RAD_CTL@1 | 3.352 (PE 64) | 14.50 | 54.104 (PE 52) | 50.75 |
LSP_SUBGRID@2 | 1.983 (PE 82) | 24.03 | 47.455 (PE 68) | 45.47 |
LSP_SUBGRID@1 | 3.061 (PE 82) | 25.78 | 45.598 (PE 68) | 42.54 |
ATMOS_PHYSICS1@1 | 294.226 (PE 82) | 320.08 | 333.631 (PE 105) | 39.40 |
LSP_QCLEAR@2 | 1.567 (PE 82) | 19.37 | 39.51 (PE 68) | 37.94 |
Routines ordered by the difference in self time (CLASSIC run):

Routine | Min (s) | Mean (s) | Max (s) | Max-Min (s) |
---|---|---|---|---|
HALO_EXCHANGE:SWAP_BOUNDS_NS_DP@1 | 164.724 (PE 70) | 221.56 | 380.92 (PE 81) | 216.20 |
HALO_EXCHANGE:SWAP_BOUNDS_EW_H1_DP@1 | 91.908 (PE 117) | 153.02 | 300.158 (PE 82) | 208.25 |
SWAP_BOUNDS_MV@1 | 26.138 (PE 76) | 103.19 | 180.133 (PE 126) | 154 |
LS_PPNC@1 | 20.426 (PE 82) | 68.08 | 89.015 (PE 84) | 68.59 |
GLOBAL_2D_SUMS@1 | 22.411 (PE 91) | 37.61 | 82.137 (PE 113) | 59.73 |
SWAP_BOUNDS_2D_MV@1 | 2.861 (PE 68) | 23.05 | 57.839 (PE 14) | 54.98 |
NI_CONV_CTL@1 | 22.148 (PE 76) | 35.83 | 69.558 (PE 54) | 47.41 |
GLUE_CONV_5A@2 | 28.764 (PE 82) | 46.44 | 75.073 (PE 58) | 46.31 |
GLUE_CONV_5A@1 | 29.182 (PE 81) | 46.99 | 73.66 (PE 58) | 44.48 |
ATMOS_PHYSICS1@1 | 312.037 (PE 82) | 339.89 | 354.112 (PE 68) | 42.08 |
LSP_FALL@2 | 1.358 (PE 82) | 16.69 | 30.722 (PE 68) | 29.36 |
CLD_GENERATOR_MOD:CLD_GENERATOR@1 | 1.628 (PE 82) | 15.39 | 26.526 (PE 11) | 24.90 |
LSP_DEPOSITION@2 | 0.686 (PE 82) | 11.98 | 25.562 (PE 11) | 24.88 |
LSP_DEPOSITION@1 | 1.312 (PE 82) | 12.23 | 25.062 (PE 11) | 23.75 |
LSP_FALL@1 | 2.303 (PE 82) | 16.27 | 25.769 (PE 68) | 23.47 |
QSAT@1 | 1.291 (PE 5) | 9.07 | 23.131 (PE 84) | 21.84 |
QSAT@2 | 1.07 (PE 4) | 8.39 | 20.685 (PE 51) | 19.61 |
LSP_INIT@2 | 1.358 (PE 82) | 11.31 | 20.875 (PE 68) | 19.52 |
SHALLOW_CONV_5A@1 | (PE 0) | 4.43 | 19.102 (PE 58) | 19.10 |
SHALLOW_CONV_5A@2 | (PE 2) | 4.02 | 18.781 (PE 58) | 18.78 |
The results above show that the time spent in SWAP_BOUNDS_EW_H1_DP is very imbalanced for the Backward-Euler run, where the difference between the maximum and minimum time is 169% of the mean time (423.29/250.96 ≈ 1.69), compared to 136% for CLASSIC (208.25/153.02 ≈ 1.36). The workload in the SWAP_BOUNDS_* routines, which generally just pass haloes, is unlikely to be that imbalanced. The most likely reason for the imbalanced time spent in these routines is that they include barriers, which means that PEs have to wait for the slowest PE to catch up; so it is probably imbalances before these routines that cause the large imbalances in self time. To try to determine where this imbalance comes from, a run (facee) has been created with extra wait routines added.
Note that the Backward-Euler run above was done before QWIDTH was "inlined" into LSP_SUBGRID and LSP_QCLEAR. This has reduced the mean total time in LSP_SUBGRID to about 9s and the maximum total time to about 18s, so these routines are no longer that significant.
The wait times are produced by calling a routine which contains something like

```fortran
IF (lhook) CALL dr_hook(routineName,zhook_in,zhook_handle)
CALL GC_GSYNC(nproc,ierr)
IF (lhook) CALL dr_hook(routineName,zhook_out,zhook_handle)
```

so that all it's doing is timing how long it takes all the PEs to reach the barrier GC_GSYNC. (I'd expect the minimum time in these routines to always be about 0s, because once the slowest PE reaches GC_GSYNC all the PEs should be free to proceed. However, this isn't the case and I don't know why.)
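For reference, a minimal sketch of what such a wait routine could look like is given below, following the standard UM DrHook pattern. The module and subroutine names (wait_mod, wait_barrier) are invented for illustration; the parkind1/yomhook modules and the GC_GSYNC call are assumed to be available as in the usual UM/GCOM build.

```fortran
MODULE wait_mod

! Hypothetical module: times how long each PE waits at a barrier so
! that DrHook reports the wait as a separate 'WAIT:...' entry.

USE parkind1, ONLY: jpim, jprb      ! DrHook integer/real kinds
USE yomhook,  ONLY: lhook, dr_hook  ! DrHook switch and timer

IMPLICIT NONE

CONTAINS

SUBROUTINE wait_barrier(routineName, nproc)

CHARACTER(LEN=*), INTENT(IN) :: routineName   ! e.g. 'WAIT:SW_RAD'
INTEGER,          INTENT(IN) :: nproc         ! number of PEs

INTEGER                       :: ierr
INTEGER(KIND=jpim), PARAMETER :: zhook_in  = 0
INTEGER(KIND=jpim), PARAMETER :: zhook_out = 1
REAL(KIND=jprb)               :: zhook_handle

IF (lhook) CALL dr_hook(routineName, zhook_in, zhook_handle)

! All PEs block here until the slowest one arrives, so the time
! recorded against routineName is purely wait time.
CALL gc_gsync(nproc, ierr)

IF (lhook) CALL dr_hook(routineName, zhook_out, zhook_handle)

END SUBROUTINE wait_barrier

END MODULE wait_mod
```

Calling it immediately after a suspect routine, e.g. CALL wait_barrier('WAIT:UKCA_ABDULRAZZAK_GHAN', nproc), then produces a DrHook entry whose min/mean/max times show how unevenly the PEs arrive at that point.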
It's relatively simple to put one of these calls after UKCA_ABDULRAZZAK_GHAN to produce the following stats:
Total time:

Routine | Min (s) | Mean (s) | Max (s) | Max-Min (s) |
---|---|---|---|---|
UKCA_ABDULRAZZAK_GHAN@1 | 28.112 (PE 4) | 125.38 | 164.952 (PE 69) | 136.84 |
WAIT:UKCA_ABDULRAZZAK_GHAN@1 | 18.847 (PE 76) | 86.78 | 215.892 (PE 4) | 197.04 |
There are a few known sources of imbalance here, such as more reactions for grid points in sunlight and less aerosol over Antarctica.
It's much less simple to call one of these wait subroutines after UKCA_RADAER_BAND_AVERAGE, because it turns out that different PEs call it a different number of times. Introducing a barrier there just causes the PEs with a higher number of calls to wait for the others indefinitely - or, in reality, until the wall-clock limit closes them all down (a toy illustration of this deadlock is given further below). It is possible, however, to put wait subroutines after the short-wave and long-wave code in RAD_CTL - whose total times include the time spent in UKCA_RADAER_BAND_AVERAGE - to produce:
Total time:

Routine | Min (s) | Mean (s) | Max (s) | Max-Min (s) |
---|---|---|---|---|
RAD_CTL@1 | 727.402 (PE 76) | 750.22 | 808.317 (PE 82) | 80.91 |
LW_RAD@1 | 281.47 (PE 0) | 359.75 | 445.176 (PE 68) | 163.71 |
WAIT:LW_RAD@1 | 4.793 (PE 68) | 85.93 | 170.023 (PE 2) | 165.23 |
SW_RAD@1 | 77.24 (PE 1) | 120.42 | 149.582 (PE 84) | 72.34 |
WAIT:SW_RAD@1 | 117.404 (PE 68) | 166.40 | 227.158 (PE 81) | 109.75 |
The routine LW_RAD is called the same number of times by all PEs, so any imbalance there is not caused by an imbalance in calls. The mean wait time after this code is 86s, which isn't terrible for code that takes a mean time of 360s to run. The very large wait time for SW_RAD suggests that the imbalance there is largely caused by the uneven number of calls to SW_RAD. The number of calls to SW_RAD over the whole run varies between 1,375 for PE 0 and 2,143 for PE 120, and depends on how many grid points are in sunlight. As the run covers a whole month, the imbalance for a given time step is probably worse than these numbers suggest.
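As an aside, the deadlock described above (a barrier inside a routine that different PEs call a different number of times) is easy to reproduce in a toy program. The sketch below is purely illustrative and uses plain MPI rather than GCOM: rank 0 makes one extra pass through the loop, so it waits at a barrier the other ranks never reach.

```fortran
PROGRAM barrier_mismatch

! Standalone illustration of the deadlock: rank 0 passes through the
! "routine" twice, the other ranks once, so rank 0 hangs at a second
! MPI_Barrier that nobody else ever calls (until the wall-clock limit
! kills the job).

USE mpi
IMPLICIT NONE

INTEGER :: rank, nproc, ierr, ncalls, i

CALL MPI_Init(ierr)
CALL MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
CALL MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)

IF (rank == 0) PRINT *, 'Running on', nproc, 'ranks'

ncalls = 1
IF (rank == 0) ncalls = 2   ! mimic a PE with more sunlit segments

DO i = 1, ncalls
  ! (work that only some ranks repeat would go here)
  CALL MPI_Barrier(MPI_COMM_WORLD, ierr)   ! rank 0 hangs on i = 2
END DO

CALL MPI_Finalize(ierr)

END PROGRAM barrier_mismatch
```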
Can we determine from this how much time we might be able to save with better load balancing? Adding the mean times of WAIT:UKCA_ABDULRAZZAK_GHAN and WAIT:SW_RAD suggests up to ~250s - or is it more or less than that?
The table below shows that the wait routines have drastically cut the self times in some of the SWAP_BOUNDS* routines - especially SWAP_BOUNDS_EW_H1_DP and SWAP_BOUNDS_NS_DP.
Mean self times (s):

Routine | Backward-Euler (facef) | Backward-Euler with wait routines (facee) |
---|---|---|
SWAP_BOUNDS_NS_DP | 287 | 178 |
SWAP_BOUNDS_EW_H1_DP | 220 | 103 |
SWAP_BOUNDS_EW_DP | 94 | 94 |
SWAP_BOUNDS_MV | 112 | 112 |
Total | 713 | 487 |