We can also save time by improving the sharing of work across PEs, so here we look at the routines where the disparity between the PE doing the most work and the PE doing the least is largest.

After adding the fixes for tracer advection and the reading of offline oxidants, the most imbalanced routines are:
Routines ordered by the difference in self time (Backward-Euler run):

Routine | Min (s) | Mean (s) | Max (s) | Max-Min (s) |
---|---|---|---|---|
HALO_EXCHANGE:SWAP_BOUNDS_EW_H1_DP@1 | 109.151 (PE 124) | 250.96 | 532.441 (PE 82) | 423.29 |
HALO_EXCHANGE:SWAP_BOUNDS_NS_DP@1 | 155.161 (PE 67) | 320.53 | 538.596 (PE 5) | 383.44 |
UKCA_RADAER_BAND_AVERAGE@2 | 165.107 (PE 1) | 276.33 | 375.012 (PE 61) | 209.91 |
UKCA_RADAER_BAND_AVERAGE@1 | 171.179 (PE 1) | 272.65 | 372.948 (PE 84) | 201.77 |
UKCA_SYNC@1 | 22.303 (PE 76) | 88.27 | 222.959 (PE 2) | 200.66 |
SWAP_BOUNDS_MV@1 | 27.977 (PE 84) | 111.87 | 213.918 (PE 81) | 185.94 |
GLOBAL_2D_SUMS@1 | 25.577 (PE 25) | 46.40 | 176.974 (PE 112) | 151.40 |
UKCA_ABDULRAZZAK_GHAN@1 | 28.258 (PE 4) | 125.41 | 162.971 (PE 68) | 134.71 |
LS_PPNC@1 | 30.077 (PE 82) | 112.30 | 155.691 (PE 84) | 125.61 |
QWIDTH@2 | 4.545 (PE 82) | 58.41 | 119.03 (PE 68) | 114.48 |
QWIDTH@1 | 22.369 (PE 82) | 77.46 | 128.116 (PE 68) | 105.75 |
NI_CONV_CTL@1 | 24.152 (PE 84) | 40.95 | 88.027 (PE 49) | 63.88 |
GLUE_CONV_6A@1 | 34.565 (PE 81) | 64.06 | 98.055 (PE 58) | 63.49 |
GLUE_CONV_6A@2 | 36.772 (PE 82) | 63.42 | 97.206 (PE 58) | 60.43 |
SWAP_BOUNDS_2D_MV@1 | 4.24 (PE 61) | 24.12 | 60.442 (PE 15) | 56.20 |
RAD_CTL@1 | 3.352 (PE 64) | 14.50 | 54.104 (PE 52) | 50.75 |
LSP_SUBGRID@2 | 1.983 (PE 82) | 24.03 | 47.455 (PE 68) | 45.47 |
LSP_SUBGRID@1 | 3.061 (PE 82) | 25.78 | 45.598 (PE 68) | 42.54 |
ATMOS_PHYSICS1@1 | 294.226 (PE 82) | 320.08 | 333.631 (PE 105) | 39.40 |
LSP_QCLEAR@2 | 1.567 (PE 82) | 19.37 | 39.51 (PE 68) | 37.94 |
Routines ordered by the difference in self time (CLASSIC run):

Routine | Min (s) | Mean (s) | Max (s) | Max-Min (s) |
---|---|---|---|---|
HALO_EXCHANGE:SWAP_BOUNDS_NS_DP@1 | 164.724 (PE 70) | 221.56 | 380.92 (PE 81) | 216.20 |
HALO_EXCHANGE:SWAP_BOUNDS_EW_H1_DP@1 | 91.908 (PE 117) | 153.02 | 300.158 (PE 82) | 208.25 |
SWAP_BOUNDS_MV@1 | 26.138 (PE 76) | 103.19 | 180.133 (PE 126) | 154 |
LS_PPNC@1 | 20.426 (PE 82) | 68.08 | 89.015 (PE 84) | 68.59 |
GLOBAL_2D_SUMS@1 | 22.411 (PE 91) | 37.61 | 82.137 (PE 113) | 59.73 |
SWAP_BOUNDS_2D_MV@1 | 2.861 (PE 68) | 23.05 | 57.839 (PE 14) | 54.98 |
NI_CONV_CTL@1 | 22.148 (PE 76) | 35.83 | 69.558 (PE 54) | 47.41 |
GLUE_CONV_5A@2 | 28.764 (PE 82) | 46.44 | 75.073 (PE 58) | 46.31 |
GLUE_CONV_5A@1 | 29.182 (PE 81) | 46.99 | 73.66 (PE 58) | 44.48 |
ATMOS_PHYSICS1@1 | 312.037 (PE 82) | 339.89 | 354.112 (PE 68) | 42.08 |
LSP_FALL@2 | 1.358 (PE 82) | 16.69 | 30.722 (PE 68) | 29.36 |
CLD_GENERATOR_MOD:CLD_GENERATOR@1 | 1.628 (PE 82) | 15.39 | 26.526 (PE 11) | 24.90 |
LSP_DEPOSITION@2 | 0.686 (PE 82) | 11.98 | 25.562 (PE 11) | 24.88 |
LSP_DEPOSITION@1 | 1.312 (PE 82) | 12.23 | 25.062 (PE 11) | 23.75 |
LSP_FALL@1 | 2.303 (PE 82) | 16.27 | 25.769 (PE 68) | 23.47 |
QSAT@1 | 1.291 (PE 5) | 9.07 | 23.131 (PE 84) | 21.84 |
QSAT@2 | 1.07 (PE 4) | 8.39 | 20.685 (PE 51) | 19.61 |
LSP_INIT@2 | 1.358 (PE 82) | 11.31 | 20.875 (PE 68) | 19.52 |
SHALLOW_CONV_5A@1 | (PE 0) | 4.43 | 19.102 (PE 58) | 19.10 |
SHALLOW_CONV_5A@2 | (PE 2) | 4.02 | 18.781 (PE 58) | 18.78 |
The results above show that the time spent in SWAP_BOUNDS_EW_H1_DP is very imbalanced for the Backward-Euler run, where the difference between the maximum and minimum time is 169% of the mean time (423.29/250.96 ≈ 1.69), compared to 136% for CLASSIC (208.25/153.02 ≈ 1.36). The workload in the SWAP_BOUNDS_* routines, which generally just pass haloes, is unlikely to be that imbalanced. The most likely reason for the imbalanced time spent in these routines is that they include barriers, which means that PEs have to wait for the slowest PE to catch up; so it is probably imbalances before these routines that cause the large imbalances in self time. To try to determine where this imbalance comes from, a run (facee) has been created with extra wait routines added.
Note that the Backward-Euler run above was done before QWIDTH was "inlined" into LSP_SUBGRID and LSP_QCLEAR. This has reduced the mean total time in LSP_SUBGRID to about 9s and the maximum total time to about 18s, so these routines are no longer that significant.
The wait times are produced by calling a routine which contains something like

```fortran
IF (lhook) CALL dr_hook(routineName,zhook_in,zhook_handle)
CALL GC_GSYNC(nproc,ierr)
IF (lhook) CALL dr_hook(routineName,zhook_out,zhook_handle)
```

so that all it's doing is timing how long it takes all the PEs to reach the barrier GC_GSYNC. (I'd expect the minimum time in these routines to always be about 0s, because once the slowest PE reaches GC_GSYNC all the PEs should be free to proceed. However, this isn't the case and I don't know why.)
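For reference, a minimal sketch of what such a wait routine could look like is given below, following the standard UM DrHook pattern. The module and subroutine names (wait_mod, wait_barrier) are invented for illustration; the parkind1/yomhook modules and the GC_GSYNC call are assumed to be available as in the usual UM/GCOM build.

```fortran
MODULE wait_mod

! Hypothetical module: times how long each PE waits at a barrier so
! that DrHook reports the wait as a separate 'WAIT:...' entry.

USE parkind1, ONLY: jpim, jprb      ! DrHook integer/real kinds
USE yomhook,  ONLY: lhook, dr_hook  ! DrHook switch and timer

IMPLICIT NONE

CONTAINS

SUBROUTINE wait_barrier(routineName, nproc)

CHARACTER(LEN=*), INTENT(IN) :: routineName   ! e.g. 'WAIT:SW_RAD'
INTEGER,          INTENT(IN) :: nproc         ! number of PEs

INTEGER                       :: ierr
INTEGER(KIND=jpim), PARAMETER :: zhook_in  = 0
INTEGER(KIND=jpim), PARAMETER :: zhook_out = 1
REAL(KIND=jprb)               :: zhook_handle

IF (lhook) CALL dr_hook(routineName, zhook_in, zhook_handle)

! All PEs block here until the slowest one arrives, so the time
! recorded against routineName is purely wait time.
CALL gc_gsync(nproc, ierr)

IF (lhook) CALL dr_hook(routineName, zhook_out, zhook_handle)

END SUBROUTINE wait_barrier

END MODULE wait_mod
```

Calling it immediately after a suspect routine, e.g. CALL wait_barrier('WAIT:UKCA_ABDULRAZZAK_GHAN', nproc), then produces a DrHook entry whose min/mean/max times show how unevenly the PEs arrive at that point.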
It's relatively simple to put one of these calls after UKCA_ABDULRAZZAK_GHAN to produce the following stats:
Total time:

Routine | Min (s) | Mean (s) | Max (s) | Max-Min (s) |
---|---|---|---|---|
UKCA_ABDULRAZZAK_GHAN@1 | 28.112 (PE 4) | 125.38 | 164.952 (PE 69) | 136.84 |
WAIT:UKCA_ABDULRAZZAK_GHAN@1 | 18.847 (PE 76) | 86.78 | 215.892 (PE 4) | 197.04 |
There are a few known sources of imbalance here, such as more reactions for grid points in sunlight and less aerosol over Antarctica.
It's much less simple to call one of these wait subroutines after UKCA_RADAER_BAND_AVERAGE, because it turns out that different PEs call it a different number of times. Introducing a barrier there just causes the PEs with a higher number of calls to wait for the others indefinitely - or, in reality, until the wall-clock limit closes them all down (a toy illustration of this deadlock is given further below). It is possible, however, to put wait subroutines after the short-wave and long-wave code in RAD_CTL - whose total times include the time spent in UKCA_RADAER_BAND_AVERAGE - to produce:
Total time:

Routine | Min (s) | Mean (s) | Max (s) | Max-Min (s) |
---|---|---|---|---|
RAD_CTL@1 | 727.402 (PE 76) | 750.22 | 808.317 (PE 82) | 80.91 |
LW_RAD@1 | 281.47 (PE 0) | 359.75 | 445.176 (PE 68) | 163.71 |
WAIT:LW_RAD@1 | 4.793 (PE 68) | 85.93 | 170.023 (PE 2) | 165.23 |
SW_RAD@1 | 77.24 (PE 1) | 120.42 | 149.582 (PE 84) | 72.34 |
WAIT:SW_RAD@1 | 117.404 (PE 68) | 166.40 | 227.158 (PE 81) | 109.75 |
The routine LW_RAD is called the same number of times by all PEs, so any imbalance there is not caused by an imbalance in calls. The mean wait time after this code is 86s, which isn't terrible for code that takes a mean time of 360s to run. The very large wait time for SW_RAD suggests that the imbalance there is largely caused by the uneven number of calls to SW_RAD. The number of calls to SW_RAD over the whole run varies between 1,375 for PE 0 and 2,143 for PE 120, and depends on how many grid points are in sunlight. As the run covers a whole month, the imbalance for a given time step is probably worse than these numbers suggest.
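As an aside, the deadlock described above (a barrier inside a routine that different PEs call a different number of times) is easy to reproduce in a toy program. The sketch below is purely illustrative and uses plain MPI rather than GCOM: rank 0 makes one extra pass through the loop, so it waits at a barrier the other ranks never reach.

```fortran
PROGRAM barrier_mismatch

! Standalone illustration of the deadlock: rank 0 passes through the
! "routine" twice, the other ranks once, so rank 0 hangs at a second
! MPI_Barrier that nobody else ever calls (until the wall-clock limit
! kills the job).

USE mpi
IMPLICIT NONE

INTEGER :: rank, nproc, ierr, ncalls, i

CALL MPI_Init(ierr)
CALL MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
CALL MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)

IF (rank == 0) PRINT *, 'Running on', nproc, 'ranks'

ncalls = 1
IF (rank == 0) ncalls = 2   ! mimic a PE with more sunlit segments

DO i = 1, ncalls
  ! (work that only some ranks repeat would go here)
  CALL MPI_Barrier(MPI_COMM_WORLD, ierr)   ! rank 0 hangs on i = 2
END DO

CALL MPI_Finalize(ierr)

END PROGRAM barrier_mismatch
```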
Can we determine from this how much time we might be able to save with better load balancing? Adding the mean times of WAIT:UKCA_ABDULRAZZAK_GHAN and WAIT:SW_RAD suggests up to ~250s - or is it more or less than that?
The table below shows that the wait routines have drastically cut the self times in some of the SWAP_BOUNDS* routines - especially SWAP_BOUNDS_EW_H1_DP and SWAP_BOUNDS_NS_DP.
Mean self times (s):

Routine | Backward-Euler (facef) | Backward-Euler with wait routines (facee) |
---|---|---|
SWAP_BOUNDS_NS_DP | 287 | 178 |
SWAP_BOUNDS_EW_H1_DP | 220 | 103 |
SWAP_BOUNDS_EW_DP | 94 | 94 |
SWAP_BOUNDS_MV | 112 | 112 |
Total | 713 | 487 |