GLOMAP

Classic vs Backward-Euler and N96 vs N216

If we look at the last fix on the LSP_SUBGRID, LSP_QCLEAR & QWIDTH page, we see that the offline-oxidants run using the Backward-Euler solver at N96 was about 1.71 times slower than the Classic scheme at N96. At N96 the timestep is 20 minutes and, as UKCA is called every hour, UKCA is called every third timestep. At N216 the timestep is 15 minutes, so UKCA is called every fourth timestep. As UKCA is called less often for a given number of atmosphere timesteps, we might expect the slow-down factor of Backward-Euler relative to Classic at N216 to be smaller than 1.71. However, it looks to be significantly larger than this. Here I try to find out why.
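The call-frequency arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name steps_per_ukca_call is hypothetical and is not part of the UM or UKCA code.

```python
def steps_per_ukca_call(timestep_minutes: int, ukca_interval_minutes: int = 60) -> int:
    """Number of atmosphere timesteps between successive UKCA calls.

    Assumes (as stated in the text) that UKCA is called once per hour.
    """
    if ukca_interval_minutes % timestep_minutes != 0:
        raise ValueError("UKCA interval must be a whole number of timesteps")
    return ukca_interval_minutes // timestep_minutes


# Timesteps quoted in the text: 20 minutes at N96, 15 minutes at N216.
for resolution, timestep in {"N96": 20, "N216": 15}.items():
    n = steps_per_ukca_call(timestep)
    print(f"{resolution}: UKCA called every {n} timesteps")
```

So, per atmosphere timestep, UKCA is called 3/4 as often at N216 as at N96.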

To do this, I've chosen one month at 16 * 8 PEs for N96 and 10 days at 16 * 16 PEs for N216, because they take about the same amount of time, which makes the times easier to compare.

The run IDs

Scheme | N96 - 1 month - 16 * 8 PEs | N216 - 10 days - 16 * 16 PEs
Classic | faced | gadga
Backward-Euler | facef | facep

Time in UM_SHELL

Scheme | N96 | N216 | (Time for N216)/(Time for N96)
Classic | 2,568s | 2,349s | 0.915
Backward-Euler | 4,416s | 4,392s | 0.995
(Time for B-E)/(Time for Classic) | 1.72 | 1.87
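As a quick check, the ratios in the table can be recomputed from the four UM_SHELL times. The sketch below uses only the numbers quoted above; the variable and function names are illustrative.

```python
# UM_SHELL wall-clock times in seconds, copied from the table above.
um_shell = {
    "Classic": {"N96": 2568, "N216": 2349},
    "Backward-Euler": {"N96": 4416, "N216": 4392},
}


def resolution_ratio(scheme: str) -> float:
    """(Time for N216) / (Time for N96) for one scheme."""
    return um_shell[scheme]["N216"] / um_shell[scheme]["N96"]


def slowdown(resolution: str) -> float:
    """(Time for Backward-Euler) / (Time for Classic) at one resolution."""
    return um_shell["Backward-Euler"][resolution] / um_shell["Classic"][resolution]


for scheme in um_shell:
    print(scheme, "N216/N96:", round(resolution_ratio(scheme), 3))
for resolution in ("N96", "N216"):
    print(resolution, "B-E/Classic:", round(slowdown(resolution), 2))
```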

The top profile tree

Backward-Euler at N96

Routines
UM_SHELL (4,416s)
U_MODEL_4A (4,413s)
ATM_STEP_4A* (3,007s) | UKCA_MAIN1 (1,248s)
ATMOS_PHYSICS1 (1,299s) | EG_CORRECT_TRACERS (28s) | ATMOS_PHYSICS2 (443s) | EG_SL_HELMHOLTZ (229s) | TR_SET_PHYS_4A* (76s) | EG_CORRECT_TRACERS_UKCA (168s) | SL_TRACER1_4A (161s) | EG_SL_MOISTURE (80s) | EG_SL_FULL_WIND (136s) | ⇓ | UPDATE_M_STAR (59s) | ATM_STEP_STASH (60s) | ⇓ | ⇓ | See profiling for UKCA_MAIN1
See profile for ATMOS_PHYSICS1 and EG_CORRECT_TRACERS below | See profile for ATMOS_PHYSICS2 and EG_SL_HELMHOLTZ below | EG_SL_WIND_U, EG_SL_WIND_V & EG_SL_WIND_W (39 + 38 + 45 = 122s) | ⇓ | EG_Q_TO_MIX (60s) | ⇓ | STASH (171s)
Itself (120s) | EG_INTERPOLATION_ETA (237s) | DEPARTURE_POINT_ETA (83s) | EG_SWAP_BOUNDS_DP (148s) | STWORK (171s)
EG_CUBIC_LAGRANGE (98s, itself) | EG_VERT_WEIGHTS_ETA (19s, itself) | MONO_ENFORCE (19s, itself) | Itself (36s) | See profile for SWAP_BOUNDS_DP below | SPATIAL (64s) | PP_HEAD (54s) | EXPPXI (34s, itself)

CLASSIC at N96

Routines
UM_SHELL (2,568s)
U_MODEL_4A (2,566s)
ATM_STEP_4A* (2,498s)
ATMOS_PHYSICS1 (949s) | EG_CORRECT_TRACERS (112s) | ATMOS_PHYSICS2 (404s) | EG_SL_HELMHOLTZ (240s) | TR_SET_PHYS_4A* (48s) | EG_SISL_INIT (50s) | SL_TRACER1_4A (102s) | EG_SL_MOISTURE (79s) | EG_SL_FULL_WIND (135s) | EG_Q_TO_MIX (21s) | ATM_STEP_STASH (112s) | ⇓
See profile for ATMOS_PHYSICS1 and EG_CORRECT_TRACERS below | See profile for ATMOS_PHYSICS2 and EG_SL_HELMHOLTZ below | EG_SISL_INIT_UVW (47s) | EG_SL_WIND_U, EG_SL_WIND_V & EG_SL_WIND_W (39 + 38 + 45 = 122s) | EG_SWAP_BOUNDS_DP (107s) | STASH (130s)
Itself (33s) | EG_INTERPOLATION_ETA (177s) | DEPARTURE_POINT_ETA (83s) | See profile for SWAP_BOUNDS_DP below | STWORK (130s)
EG_CUBIC_LAGRANGE (80s, itself) | EG_VERT_WEIGHTS_ETA (19s, itself) | MONO_ENFORCE (14s, itself) | Itself (36s) | SPATIAL (33s) | PP_HEAD (46s) | EXPPXI (30s, itself)

Backward-Euler at N216

Routines
UM_SHELL (4,392s)
U_MODEL_4A (4,387s)
ATM_STEP_4A* (2,984s) | UKCA_MAIN1 (1,163s)
ATMOS_PHYSICS1 (1,051s) | EG_CORRECT_TRACERS (40s) | ATMOS_PHYSICS2 (473s) | EG_SL_HELMHOLTZ (359s) | TR_SET_PHYS_4A* (53s) | EG_CORRECT_TRACERS_UKCA (247s) | SL_TRACER1_4A (139s) | EG_SL_MOISTURE (75s) | EG_SL_FULL_WIND (153s) | ⇓ | UPDATE_M_STAR (111s) | ATM_STEP_STASH (29s) | ⇓ | ⇓ | See profiling for UKCA_MAIN1
See profile for ATMOS_PHYSICS1 and EG_CORRECT_TRACERS below | See profile for ATMOS_PHYSICS2 and EG_SL_HELMHOLTZ below | EG_SL_WIND_U, EG_SL_WIND_V & EG_SL_WIND_W (46 + 38 + 51 = 135s) | ⇓ | EG_Q_TO_MIX (113s) | ⇓ | STASH (118s)
Itself (183s) | EG_INTERPOLATION_ETA (262s) | DEPARTURE_POINT_ETA (102s) | EG_SWAP_BOUNDS_DP (221s) | STWORK (118s)
EG_CUBIC_LAGRANGE (113s, itself) | EG_VERT_WEIGHTS_ETA (25s, itself) | MONO_ENFORCE (18s, itself) | Itself (40s) | See profile for SWAP_BOUNDS_DP below | SPATIAL (39s) | PP_HEAD (38s) | EXPPXI (24s, itself)

CLASSIC at N216

Routines
UM_SHELL (2,349s)
U_MODEL_4A (2,345s)
ATM_STEP_4A* (2,280s)
ATMOS_PHYSICS1 (632s) | EG_CORRECT_TRACERS (122s) | ATMOS_PHYSICS2 (431s) | EG_SL_HELMHOLTZ (368s) | TR_SET_PHYS_4A* (33s) | EG_SISL_INIT (69s) | SL_TRACER1_4A (89s) | EG_SL_MOISTURE (75s) | EG_SL_FULL_WIND (155s) | EG_Q_TO_MIX (31s) | ATM_STEP_STASH (58s) | ⇓
See profile for ATMOS_PHYSICS1 and EG_CORRECT_TRACERS below | See profile for ATMOS_PHYSICS2 and EG_SL_HELMHOLTZ below | EG_SISL_INIT_UVW (66s) | EG_SL_WIND_U, EG_SL_WIND_V & EG_SL_WIND_W (48 + 47 + 51 = 146s) | EG_SWAP_BOUNDS_DP (133s) | STASH (68s)
Itself (39s) | EG_INTERPOLATION_ETA (206s) | DEPARTURE_POINT_ETA (109s) | See profile for SWAP_BOUNDS_DP below | STWORK (68s)
EG_CUBIC_LAGRANGE (93s, itself) | EG_VERT_WEIGHTS_ETA (25s, itself) | MONO_ENFORCE (13s, itself) | Itself (40s) | SPATIAL (21s) | PP_HEAD (19s) | EXPPXI (12s, itself)
*should also link to SWAP_BOUNDS_DP, like many other routines.

Summary for top level profile

The four profiles are remarkably similar. The main differences are

  • At N96, the Backward-Euler run is 1,848s slower than the Classic run. At N216 this number is 2,043s.
  • UKCA_MAIN1 explains most of these differences. At N96 it's 1,248s and at N216 it's 1,163s. More on UKCA_MAIN1 in its profiling below.
  • The extra total time in the rest of the code, namely ATM_STEP_4A, for the Backward-Euler run at N96 is 509s and at N216 it's 704s.
  • The largest contributions to the extra total time in ATM_STEP_4A for the Backward-Euler runs come from the total time in
    • ATMOS_PHYSICS1. The Backward-Euler uses 350s more at N96 and 419s more at N216.
    • EG_CORRECT_TRACERS* routines. The combined time in EG_CORRECT_TRACERS and EG_CORRECT_TRACERS_UKCA for the Backward-Euler compared to the time in EG_CORRECT_TRACERS for Classic is 84s more at N96 and 165s more at N216.
    • SL_TRACER1_4A and STASH are also significantly larger for the Backward-Euler runs, but to a lesser extent than the two factors above.
  • The time in ATMOS_PHYSICS1 is greatly reduced at N216 compared to N96: for Backward-Euler from 1,299s to 1,051s (a reduction of 248s) and for Classic from 949s to 632s (a reduction of 317s). Presumably this is because this routine and the routines below it have fewer MPI calls, so they scale better than the other routines. See the ATMOS_PHYSICS1 profiling for more details.
  • The time in EG_SL_HELMHOLTZ is about 130s greater at N216 than at N96, for both Backward-Euler (229s to 359s) and Classic (240s to 368s).
  • Above I stated that compared to Classic the total time in ATM_STEP_4A for the Backward-Euler is 509s more at N96 and 704s more at N216. So compared to N96 the Backward-Euler run at N216 has gained an extra 195s on its Classic equivalent. This is because of the total time in
    • EG_CORRECT_TRACERS_UKCA is 247s at N216 but 168s at N96, so a gain of 79s.
    • ATMOS_PHYSICS1. The extra time for the Backward-Euler at N96 is 350s and at N216 it's 419s, so that's 69s extra at N216.
  • The number of fields in both runs at N96 is similar, whereas the Backward-Euler run at N216 has about twice as many fields as the Classic run at N216, which explains the larger discrepancy in total time in STASH at N216.
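The differences quoted in the bullets above can be reproduced from the top-level profile numbers. The following is a minimal sketch using only times copied from the four profile trees; the dictionary layout and function name are illustrative.

```python
# Total mean times (seconds) copied from the top-level profile trees above.
atm_step_4a = {
    "N96": {"Backward-Euler": 3007, "Classic": 2498},
    "N216": {"Backward-Euler": 2984, "Classic": 2280},
}
atmos_physics1 = {
    "N96": {"Backward-Euler": 1299, "Classic": 949},
    "N216": {"Backward-Euler": 1051, "Classic": 632},
}
# EG_CORRECT_TRACERS plus EG_CORRECT_TRACERS_UKCA for Backward-Euler,
# EG_CORRECT_TRACERS only for Classic.
correct_tracers = {
    "N96": {"Backward-Euler": 28 + 168, "Classic": 112},
    "N216": {"Backward-Euler": 40 + 247, "Classic": 122},
}


def extra_for_backward_euler(times):
    """Backward-Euler minus Classic, per resolution."""
    return {res: t["Backward-Euler"] - t["Classic"] for res, t in times.items()}


extra_atm_step = extra_for_backward_euler(atm_step_4a)
extra_physics1 = extra_for_backward_euler(atmos_physics1)
extra_tracers = extra_for_backward_euler(correct_tracers)

for name, extra in [
    ("ATM_STEP_4A", extra_atm_step),
    ("ATMOS_PHYSICS1", extra_physics1),
    ("EG_CORRECT_TRACERS*", extra_tracers),
]:
    growth = extra["N216"] - extra["N96"]
    print(f"{name}: extra at N96 {extra['N96']}s, at N216 {extra['N216']}s, growth {growth}s")
```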

Profiling for ATMOS_PHYSICS1 and EG_CORRECT_TRACERS

Backward-Euler at N96

ATMOS_PHYSICS1 (1,299s) | EG_CORRECT_TRACERS (28s)
RAD_CTL (495s) | MICROPHYS_CTL (189s) | NI_GWD_CTL (273s) | ⇓ | ⇓ | EG_MASS_CONSERVATION (33s)
LW_RAD (357s) | SW_RAD (120s) | LS_PPN (181s) | G_WAVE_5A (230s) | GW_USSP (41s) | GLOBAL_2D_SUMS (44s, itself) | Itself (25s)
RADIANCE_CALC (470s) | LS_PPNC (172s) | SWAP_BOUNDS (see table below)
UKCA_RADAER_BAND_AVERAGE (276s, itself) | SOLVE_BAND_K_EQV (153s) | UKCA_RADAER_COMPUTE_AOD (21s, itself) | LSP_ICE (102s) | Itself (70s)
MCICA_SAMPLE (122s) | SCALE_ABSORB (20s) | LSP_INIT (24s) | LSP_FALL (19s)
MONOCHROMATIC_RADIANCE (106s) | Itself (15s) | Itself (12s) | Itself (17s)
MONOCHROMATIC_RADIANCE_TSEQ (93s)
MCICA_COLUMN (93s)
TWO_COEFF (67s)
TRANS_SOURCE_COEFF (37s) | Itself (12s)
Itself (23s)
  

CLASSIC at N96

ATMOS_PHYSICS1 (949s) | EG_CORRECT_TRACERS (112s)
RAD_CTL (215s) | MICROPHYS_CTL (186s) | NI_GWD_CTL (176s) | ⇓ | ⇓ | EG_MASS_CONSERVATION (90s)
LW_RAD (153s) | SW_RAD (53s) | LS_PPN (174s) | G_WAVE_5A (150s) | GW_USSP (25s) | GLOBAL_2D_SUMS (38s, itself) | Itself (63s)
RADIANCE_CALC (200s) | LS_PPNC (163s) | SWAP_BOUNDS (see table below)
SOLVE_BAND_K_EQV (155s) | LSP_ICE (95s) | Itself (68s)
MCICA_SAMPLE (122s) | SCALE_ABSORB (20s) | LSP_INIT (23s) | LSP_FALL (19s)
MONOCHROMATIC_RADIANCE (107s) | Itself (15s) | Itself (11s) | Itself (16s)
MONOCHROMATIC_RADIANCE_TSEQ (94s)
MCICA_COLUMN (94s)
TWO_COEFF (67s)
TRANS_SOURCE_COEFF (37s) | Itself (12s)
Itself (23s)
  

Backward-Euler at N216

ATMOS_PHYSICS1 (1,051s) | EG_CORRECT_TRACERS (40s)
RAD_CTL (332s) | MICROPHYS_CTL (121s) | NI_GWD_CTL (308s) | ⇓ | ⇓ | EG_MASS_CONSERVATION (37s)
LW_RAD (240s) | SW_RAD (79s) | LS_PPN (113s) | G_WAVE_5A (250s) | GW_USSP (58s) | GLOBAL_2D_SUMS (70s, itself) | Itself (28s)
RADIANCE_CALC (313s) | LS_PPNC (106s) | SWAP_BOUNDS (see table below)
UKCA_RADAER_BAND_AVERAGE (163s, itself) | SOLVE_BAND_K_EQV (119s) | UKCA_RADAER_COMPUTE_AOD (13s, itself) | LSP_ICE (79s) | Itself (27s)
MCICA_SAMPLE (92s) | SCALE_ABSORB (17s) | LSP_INIT (18s) | LSP_FALL (15s)
MONOCHROMATIC_RADIANCE (80s) | Itself (13s) | Itself (10s) | Itself (13s)
MONOCHROMATIC_RADIANCE_TSEQ (69s)
MCICA_COLUMN (69s)
TWO_COEFF (48s)
TRANS_SOURCE_COEFF (27s) | Itself (7s)
Itself (17s)
  

CLASSIC at N216

ATMOS_PHYSICS1 (632s) | EG_CORRECT_TRACERS (122s)
RAD_CTL (170s) | MICROPHYS_CTL (122s) | NI_GWD_CTL (161s) | ⇓ | ⇓ | EG_MASS_CONSERVATION (96s)
LW_RAD (123s) | SW_RAD (41s) | LS_PPN (109s) | G_WAVE_5A (132s) | GW_USSP (28s) | GLOBAL_2D_SUMS (44s, itself) | Itself (72s)
RADIANCE_CALC (157s) | LS_PPNC (101s) | SWAP_BOUNDS (see table below)
SOLVE_BAND_K_EQV (119s) | LSP_ICE (76s) | Itself (25s)
MCICA_SAMPLE (92s) | SCALE_ABSORB (17s) | LSP_INIT (18s) | LSP_FALL (15s)
MONOCHROMATIC_RADIANCE (80s) | Itself (13s) | Itself (10s) | Itself (13s)
MONOCHROMATIC_RADIANCE_TSEQ (69s)
MCICA_COLUMN (69s)
TWO_COEFF (47s)
TRANS_SOURCE_COEFF (27s) | Itself (7s)
Itself (17s)
  

Summary of ATMOS_PHYSICS1 profiling

By total time, there are three main routines below ATMOS_PHYSICS1

  • RAD_CTL. Two points
    • The total time is greatly reduced at N216 compared to N96. Presumably it has fewer MPI calls than other routines, so scales better. The fact that OpenMP can be used here would seem to confirm this (I don't think MPI calls exist in thread 2, so MPI calls need to be done outside the threading).
    • UKCA_RADAER_BAND_AVERAGE is only present for Backward-Euler runs and is a success story at N216: its time of 276s at N96 is reduced to 163s at N216, saving 113s.
  • MICROPHYS_CTL. The total time at N96 is about 187s and it's about 121s at N216 for both Backward-Euler and Classic. Presumably it has fewer MPI calls than other routines, and so scales better than them.
  • NI_GWD_CTL. I suspect that any extra time here comes from MPI barriers, either in the local routines or in the SWAP_BOUNDS* routines (see below). In that case, it's a consequence of imbalance in run time for other routines, which means that some PEs have to wait for other PEs to catch up. The extra time for the Backward-Euler runs would then be because some of its routines, such as the ones in UKCA_MAIN1, are more imbalanced.
  • If we look at the time in ATMOS_PHYSICS1 not accounted for by the total time in the three subroutines above, it's
    • 372s for Classic at N96,
    • 342s for Backward-Euler at N96,
    • 179s for Classic at N216, and
    • 290s for Backward-Euler at N216
    We know a significant chunk of this time comes from the SWAP_BOUNDS* routines, STASH and especially COSP_MAIN (which doesn't have Dr Hook code included). TIMER output for a previous Backward-Euler run suggests that the total time in COSP_MAIN is about 300s at N96. From the times above, this must be much smaller for N216.
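The unaccounted times listed above are simply the ATMOS_PHYSICS1 total minus the sum of RAD_CTL, MICROPHYS_CTL and NI_GWD_CTL. A small sketch of that subtraction, using the numbers from the ATMOS_PHYSICS1 profiles (names are illustrative):

```python
# (ATMOS_PHYSICS1 total, RAD_CTL, MICROPHYS_CTL, NI_GWD_CTL), in seconds,
# copied from the four ATMOS_PHYSICS1 profiles above.
profiles = {
    ("Classic", "N96"): (949, 215, 186, 176),
    ("Backward-Euler", "N96"): (1299, 495, 189, 273),
    ("Classic", "N216"): (632, 170, 122, 161),
    ("Backward-Euler", "N216"): (1051, 332, 121, 308),
}


def unaccounted(total, *children):
    """Time in the parent routine not covered by the listed child routines."""
    return total - sum(children)


residual = {run: unaccounted(*times) for run, times in profiles.items()}

for (scheme, resolution), seconds in residual.items():
    print(f"{scheme} at {resolution}: {seconds}s")
```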

Profiling for SWAP_* routines

Backward-Euler at N96

Routines | Total mean time
EG_SWAP_BOUNDS_DP (148s) | ATMOS_PHYSICS1, ATMOS_PHYSICS2, G_WAVE_5A, ... | 148 + ...
SWAP_BOUNDS & SWAP_BOUNDS_DP (434 + 171 = 605s) | SWAP_BOUNDS_MV (112s, itself) | 717s
SWAP_BOUNDS_EW_DP (316s) | SWAP_BOUNDS_NS_DP (288s, itself) | 716s
SWAP_BOUNDS_EW_H1_DP (220s, itself) | Itself (95s) | 715s

CLASSIC at N96

Routines | Total mean time
EG_SWAP_BOUNDS_DP (107s) | ATMOS_PHYSICS1, ATMOS_PHYSICS2, G_WAVE_5A, ... | 107 + ...
SWAP_BOUNDS & SWAP_BOUNDS_DP (339 + 118 = 457s) | SWAP_BOUNDS_MV (103s, itself) | 560s
SWAP_BOUNDS_EW_DP (234s) | SWAP_BOUNDS_NS_DP (222s, itself) | 559s
SWAP_BOUNDS_EW_H1_DP (153s, itself) | Itself (81s) | 559s

Backward-Euler at N216

Routines | Total mean time
EG_SWAP_BOUNDS_DP (221s) | ATMOS_PHYSICS1, ATMOS_PHYSICS2, G_WAVE_5A, ... | 221 + ...
SWAP_BOUNDS & SWAP_BOUNDS_DP (424 + 257 = 681s) | SWAP_BOUNDS_MV (115s, itself) | 796s
SWAP_BOUNDS_EW_DP (319s) | SWAP_BOUNDS_NS_DP (361s, itself) | 795s
SWAP_BOUNDS_EW_H1_DP (263s, itself) | Itself (56s) | 795s

CLASSIC at N216

Routines | Total mean time
EG_SWAP_BOUNDS_DP (133s) | ATMOS_PHYSICS1, ATMOS_PHYSICS2, G_WAVE_5A, ... | 133 + ...
SWAP_BOUNDS & SWAP_BOUNDS_DP (300 + 148 = 448s) | SWAP_BOUNDS_MV (133s, itself) | 581s
SWAP_BOUNDS_EW_DP (212s) | SWAP_BOUNDS_NS_DP (235s, itself) | 580s
SWAP_BOUNDS_EW_H1_DP (163s, itself) | Itself (48s) | 579s

Summary of SWAP_BOUNDS* routines

More time is taken in these routines at N216, probably because the heavy MPI communication means that they don't scale as well as other routines. Considerably more time is taken for the Backward-Euler runs, which will be because of

  • the extra fields to pass
  • MPI barriers, meaning some PEs have to wait for other PEs delayed by imbalances in the code, probably in the UKCA code.
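To make the comparison concrete, the sketch below adds up SWAP_BOUNDS, SWAP_BOUNDS_DP and SWAP_BOUNDS_MV for each run, using the numbers from the tables above. EG_SWAP_BOUNDS_DP is left out because the tables list it separately; the names used in the code are illustrative.

```python
# Total mean times (seconds) from the SWAP_* tables above.
swap_times = {
    ("Backward-Euler", "N96"): {"SWAP_BOUNDS": 434, "SWAP_BOUNDS_DP": 171, "SWAP_BOUNDS_MV": 112},
    ("Classic", "N96"): {"SWAP_BOUNDS": 339, "SWAP_BOUNDS_DP": 118, "SWAP_BOUNDS_MV": 103},
    ("Backward-Euler", "N216"): {"SWAP_BOUNDS": 424, "SWAP_BOUNDS_DP": 257, "SWAP_BOUNDS_MV": 115},
    ("Classic", "N216"): {"SWAP_BOUNDS": 300, "SWAP_BOUNDS_DP": 148, "SWAP_BOUNDS_MV": 133},
}

# Sum the three routines for each run.
swap_totals = {run: sum(parts.values()) for run, parts in swap_times.items()}

for resolution in ("N96", "N216"):
    be = swap_totals[("Backward-Euler", resolution)]
    classic = swap_totals[("Classic", resolution)]
    print(f"{resolution}: Backward-Euler {be}s, Classic {classic}s, difference {be - classic}s")
```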

Profiling for UKCA_MAIN1

Backward-Euler at N96

Routines | Total mean time
UKCA_MAIN* (1,248s) | 1,248s
UKCA_AERO_CTL (885s) | UKCA_ACTIVATE (143s) | 1,028s
UKCA_AERO_STEP (835s) | UKCA_ABDULRAZZAK_GHAN (132s) | 978s
UKCA_COAGWITHNUCL (378s) | UKCA_CONDEN (141s) | UKCA_CHECK_MD_ND (70s, itself) | UKCA_CALCNUCRATE (69s) | UKCA_VOLUME_MODE (51s) | Itself (126s) | 835s
Itself (315s) | UKCA_SOLVECOAGNUCL_V (63s, itself) | UKCA_COND_COFF_V (92s, itself) | Itself (49s) | UKCA_BINAPARA (65s, itself) | Itself (26s) | 806s

Backward-Euler at N216

Routines | Total mean time
UKCA_MAIN* (1,163s) | 1,163s
UKCA_AERO_CTL (813s) | UKCA_ACTIVATE (106s) | 919s
UKCA_AERO_STEP (767s) | UKCA_ABDULRAZZAK_GHAN (97s) | 864s
UKCA_COAGWITHNUCL (355s) | UKCA_CONDEN (133s) | UKCA_CHECK_MD_ND (63s, itself) | UKCA_CALCNUCRATE (59s) | UKCA_VOLUME_MODE (45s) | Itself (92s) | 747s
Itself (294s) | UKCA_SOLVECOAGNUCL_V (61s, itself) | UKCA_COND_COFF_V (83s, itself) | Itself (50s) | UKCA_BINAPARA (55s, itself) | Itself (23s) | 721s

Backward-Euler at N216 multiplied by 4/3+

Routines | Total mean time
UKCA_MAIN* (1,550s) | 1,550s
UKCA_AERO_CTL (1,084s) | UKCA_ACTIVATE (141s) | 1,225s
UKCA_AERO_STEP (1,022s) | UKCA_ABDULRAZZAK_GHAN (129s) | 1,151s
UKCA_COAGWITHNUCL (473s) | UKCA_CONDEN (177s) | UKCA_CHECK_MD_ND (84s, itself) | UKCA_CALCNUCRATE (79s) | UKCA_VOLUME_MODE (60s) | Itself (123s) | 1,108s
Itself (392s) | UKCA_SOLVECOAGNUCL_V (81s, itself) | UKCA_COND_COFF_V (111s, itself) | Itself (67s) | UKCA_BINAPARA (73s, itself) | Itself (31s) | 962s
*UKCA_MAIN also calls STASH
+You might expect that if UKCA_MAIN1 were called every timestep it would be three times more expensive for N96 (it's called every third step) and four times more expensive for N216 (it's called every fourth step). For both Backward-Euler runs, the rest of the code, namely ATM_STEP_4A, takes about 3,000s, so if the assumption above is correct, it should be OK to simply scale the N216 times by 4/3 to compare with the N96 times.
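The 4/3 scaling can be sketched as follows for the two main routines below UKCA_MAIN1. This is only an illustration of the arithmetic, under the assumption stated in the footnote; results are rounded to the nearest second, so other entries may differ by a second from the table.

```python
from fractions import Fraction

# UKCA is called every 3rd timestep at N96 and every 4th at N216, so the N216
# times are scaled by 4/3 to estimate their cost if UKCA were called as often
# (relative to atmosphere timesteps) as at N96.
scale = Fraction(4, 3)

# Total mean times (seconds) from the Backward-Euler UKCA_MAIN1 profiles above.
n96 = {"UKCA_AERO_CTL": 885, "UKCA_ACTIVATE": 143}
n216 = {"UKCA_AERO_CTL": 813, "UKCA_ACTIVATE": 106}

n216_scaled = {name: round(seconds * scale) for name, seconds in n216.items()}

for name in n96:
    print(f"{name}: N96 {n96[name]}s, N216 x 4/3 {n216_scaled[name]}s")
```

After scaling, UKCA_ACTIVATE at N216 is close to its N96 value, whereas UKCA_AERO_CTL is noticeably larger than at N96.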

Summary of UKCA_MAIN1 profiles

The profiles show that, based on total time, there are two main subroutines called from UKCA_MAIN1

  • UKCA_AERO_CTL. This has about the same total time for both N96 and N216, so even though it's called less often relative to the main code, the time at N216 is not much less.
  • UKCA_ACTIVATE. The total time here does look to scale roughly with how frequently it is called, so it is less for the N216 run.
  • As the total time spent in UKCA_AERO_CTL, which doesn't seem to scale with the number of calls, greatly exceeds that in UKCA_ACTIVATE, the reduction in total time in UKCA_MAIN1 is only 85s.

Overall summary

I've profiled four runs, in which the time spent in most of the routines is similar, to try to determine why the slow-down factor for the Backward-Euler code compared to Classic is greater at N216 than at N96. The main points are

  • The times in the routines below UKCA_AERO_CTL, which account for most of the total time in UKCA_MAIN1, aren't reduced by much at N216 compared to N96, even though UKCA is called less often. The total time in UKCA_MAIN1 is only 85s less at N216, when we'd hope for a lot more.
  • EG_CORRECT_TRACERS_UKCA doesn't seem to scale well and is 79s more at N216 than at N96.
  • The one success story is that UKCA_RADAER_BAND_AVERAGE takes 113s less at N216 compared to N96.
  • Unfortunately, the above gain at N216 compared to N96 is lost in other parts of ATMOS_PHYSICS1. I suspect that imbalances elsewhere, such as in UKCA_MAIN1, are causing some PEs to arrive earlier in ATMOS_PHYSICS1, where they are then held by MPI barriers, consequently increasing the time spent in these routines.
  • The total time in the SWAP_BOUNDS routines at N216 compared to N96 for the Backward-Euler runs is about 80s more (about 796s compared to 717s).
  • The number of fields in both runs at N96 is similar, whereas the Backward-Euler run at N216 has about twice as many fields as the Classic run at N216, which explains the larger discrepancy in total time in STASH at N216.