GLOMAP
Classic vs Backward-Euler and N96 vs N216
If we look at the last fix on the
LSP_SUBGRID, LSP_QCLEAR & QWIDTH page,
we see that the offline oxidants run using the Backward-Euler solver at
N96 was about
1.71 times slower than the Classic scheme at N96. At N96 the timestep is
20 minutes and, as UKCA is called every hour, UKCA is called every
third timestep. At N216 the timestep is 15 minutes and so UKCA
is called every fourth timestep. As UKCA is called less often for
a given number of atmosphere timesteps at N216, we might expect the
Backward-Euler run at N216 to be less than 1.71 times slower than
Classic at N216. However, it looks to be significantly
slower than this. Here I try to find out why.
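The call-frequency argument above can be checked with a few lines of arithmetic. This is only a sketch: the helper name `ukca_call_stride` is made up for illustration, and it assumes nothing beyond the timestep lengths and the hourly UKCA call interval quoted in the text.

```python
from fractions import Fraction

UKCA_INTERVAL_MIN = 60  # UKCA is called once per model hour (from the text)

def ukca_call_stride(timestep_min):
    """Atmosphere timesteps between successive UKCA calls (hypothetical helper)."""
    return UKCA_INTERVAL_MIN // timestep_min

n96_stride = ukca_call_stride(20)   # N96: 20-minute timestep
n216_stride = ukca_call_stride(15)  # N216: 15-minute timestep

# UKCA calls per atmosphere timestep at N216, relative to N96.
relative_call_rate = Fraction(1, n216_stride) / Fraction(1, n96_stride)

print(n96_stride, n216_stride, relative_call_rate)
```

If the cost of a UKCA call relative to an atmosphere timestep were the same at both resolutions, the UKCA share of the run would therefore shrink by a quarter at N216.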
To do this, I've chosen one month on 16 * 8 PEs for N96 and
10 days on 16 * 16 PEs for N216, because these runs take about the
same amount of time, which makes the times easier to
compare.
The run IDs
| N96 - 1 month - 16 * 8 PEs | N216 - 10 days - 16 * 16 PEs |
Classic | faced | gadga |
Backward-Euler | facef | facep |
Time in UM_SHELL
| N96 | N216 | (Time for N216)/(Time for N96) |
Classic | 2,568s | 2,349s | 0.915 |
Backward-Euler | 4,416s | 4,392s | 0.995 |
(Time for B-E)/(Time for Classic) | 1.72 | 1.87 |
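The ratios in the table can be reproduced directly from the four UM_SHELL times. A minimal sketch, with the dictionary layout and function names chosen here purely for illustration:

```python
# Time in UM_SHELL in seconds, copied from the table above.
um_shell = {
    "Classic": {"N96": 2568, "N216": 2349},
    "Backward-Euler": {"N96": 4416, "N216": 4392},
}

def slowdown(resolution):
    """Backward-Euler time divided by Classic time at one resolution."""
    return um_shell["Backward-Euler"][resolution] / um_shell["Classic"][resolution]

def resolution_ratio(scheme):
    """N216 time divided by N96 time for one scheme."""
    return um_shell[scheme]["N216"] / um_shell[scheme]["N96"]

for res in ("N96", "N216"):
    print(f"B-E / Classic at {res}: {slowdown(res):.2f}")
for scheme in um_shell:
    print(f"N216 / N96 for {scheme}: {resolution_ratio(scheme):.3f}")
```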
The top profile tree
Backward-Euler at N96
Routines |
UM_SHELL (4,416s) |
U_MODEL_4A (4,413s) |
ATM_STEP_4A* (3,007s) |
UKCA_MAIN1 (1,248s) |
ATMOS_PHYSICS1 (1,299s) |
EG_CORRECT_TRACERS (28s) |
ATMOS_PHYSICS2 (443s) |
EG_SL_HELMHOLTZ (229s) |
TR_SET_PHYS_4A* (76s) |
EG_CORRECT_TRACERS_UKCA (168s) |
SL_TRACER1_4A (161s) |
EG_SL_MOISTURE (80s) |
EG_SL_FULL_WIND (136s) |
⇓ |
UPDATE_M_STAR (59s) |
ATM_STEP_STASH (60s) |
⇓ |
⇓ |
See profiling for UKCA_MAIN1 |
See profile for ATMOS_PHYSICS1 and EG_CORRECT_TRACERS below |
See profile for ATMOS_PHYSICS2 and EG_SL_HELMHOLTZ below |
EG_SL_WIND_U, EG_SL_WIND_V & EG_SL_WIND_W (39 + 38 + 45 = 122s) |
⇓ |
EG_Q_TO_MIX (60s) |
⇓ |
STASH (171s) |
Itself (120s) |
EG_INTERPOLATION_ETA (237s) |
DEPARTURE_POINT_ETA (83s) |
EG_SWAP_BOUNDS_DP (148s) |
STWORK (171s) |
EG_CUBIC_LAGRANGE (98s, itself) |
EG_VERT_WEIGHTS_ETA (19s, itself) |
MONO_ENFORCE (19s, itself) |
Itself (36s) |
See profile for SWAP_BOUNDS_DP below |
SPATIAL (64s) |
PP_HEAD (54s) |
EXPPXI (34s, itself) |
CLASSIC at N96
Routines |
UM_SHELL (2,568s) |
U_MODEL_4A (2,566s) |
ATM_STEP_4A* (2,498s) |
ATMOS_PHYSICS1 (949s) |
EG_CORRECT_TRACERS (112s) |
ATMOS_PHYSICS2 (404s) |
EG_SL_HELMHOLTZ (240s) |
TR_SET_PHYS_4A* (48s) |
EG_SISL_INIT (50s) |
SL_TRACER1_4A (102s) |
EG_SL_MOISTURE (79s) |
EG_SL_FULL_WIND (135s) |
EG_Q_TO_MIX (21s) |
ATM_STEP_STASH (112s) |
⇓ |
See profile for ATMOS_PHYSICS1 and EG_CORRECT_TRACERS below |
See profile for ATMOS_PHYSICS2 and EG_SL_HELMHOLTZ below |
EG_SISL_INIT_UVW (47s) |
EG_SL_WIND_U, EG_SL_WIND_V & EG_SL_WIND_W (39 + 38 + 45 = 122s) |
EG_SWAP_BOUNDS_DP (107s) |
STASH (130s) |
Itself (33s) |
EG_INTERPOLATION_ETA (177s) |
DEPARTURE_POINT_ETA (83s) |
See profile for SWAP_BOUNDS_DP below |
STWORK (130s) |
EG_CUBIC_LAGRANGE (80s, itself) |
EG_VERT_WEIGHTS_ETA (19s, itself) |
MONO_ENFORCE (14s, itself) |
Itself (36s) |
|
|
|
SPATIAL (33s) |
PP_HEAD (46s) |
EXPPXI (30s, itself) |
Backward-Euler at N216
Routines |
UM_SHELL (4,392s) |
U_MODEL_4A (4,387s) |
ATM_STEP_4A* (2,984s) |
UKCA_MAIN1 (1,163s) |
ATMOS_PHYSICS1 (1,051s) |
EG_CORRECT_TRACERS (40s) |
ATMOS_PHYSICS2 (473s) |
EG_SL_HELMHOLTZ (359s) |
TR_SET_PHYS_4A* (53s) |
EG_CORRECT_TRACERS_UKCA (247s) |
SL_TRACER1_4A (139s) |
EG_SL_MOISTURE (75s) |
EG_SL_FULL_WIND (153s) |
⇓ |
UPDATE_M_STAR (111s) |
ATM_STEP_STASH (29s) |
⇓ |
⇓ |
See profiling for UKCA_MAIN1 |
See profile for ATMOS_PHYSICS1 and EG_CORRECT_TRACERS below |
See profile for ATMOS_PHYSICS2 and EG_SL_HELMHOLTZ below |
EG_SL_WIND_U, EG_SL_WIND_V & EG_SL_WIND_W (46 + 38 + 51 = 135s) |
⇓ |
EG_Q_TO_MIX (113s) |
⇓ |
STASH (118s) |
Itself (183s) |
EG_INTERPOLATION_ETA (262s) |
DEPARTURE_POINT_ETA (102s) |
EG_SWAP_BOUNDS_DP (221s) |
STWORK (118s) |
EG_CUBIC_LAGRANGE (113s, itself) |
EG_VERT_WEIGHTS_ETA (25s, itself) |
MONO_ENFORCE (18s, itself) |
Itself (40s) |
See profile for SWAP_BOUNDS_DP below |
SPATIAL (39s) |
PP_HEAD (38s) |
EXPPXI (24s, itself) |
CLASSIC at N216
Routines |
UM_SHELL (2,349s) |
U_MODEL_4A (2,345s) |
ATM_STEP_4A* (2,280s) |
ATMOS_PHYSICS1 (632s) |
EG_CORRECT_TRACERS (122s) |
ATMOS_PHYSICS2 (431s) |
EG_SL_HELMHOLTZ (368s) |
TR_SET_PHYS_4A* (33s) |
EG_SISL_INIT (69s) |
SL_TRACER1_4A (89s) |
EG_SL_MOISTURE (75s) |
EG_SL_FULL_WIND (155s) |
EG_Q_TO_MIX (31s) |
ATM_STEP_STASH (58s) |
⇓ |
See profile for ATMOS_PHYSICS1 and EG_CORRECT_TRACERS below |
See profile for ATMOS_PHYSICS2 and EG_SL_HELMHOLTZ below |
EG_SISL_INIT_UVW (66s) |
EG_SL_WIND_U, EG_SL_WIND_V & EG_SL_WIND_W (48 + 47 + 51 = 146s) |
EG_SWAP_BOUNDS_DP (133s) |
STASH (68s) |
Itself (39s) |
EG_INTERPOLATION_ETA (206s) |
DEPARTURE_POINT_ETA (109s) |
See profile for SWAP_BOUNDS_DP below |
STWORK (68s) |
EG_CUBIC_LAGRANGE (93s, itself) |
EG_VERT_WEIGHTS_ETA (25s, itself) |
MONO_ENFORCE (13s, itself) |
Itself (40s) |
|
|
|
SPATIAL (21s) |
PP_HEAD (19s) |
EXPPXI (12s, itself) |
*should also link to SWAP_BOUNDS_DP, like many other
routines.
Summary for top level profile
The four profiles are remarkably similar. The main differences
are
- At N96, the Backward-Euler run is 1,848s slower than the
Classic run. At N216 this number is 2,043s
- UKCA_MAIN1 explains most of these differences. At
N96 it's 1,248s and for N216 it's 1,163s. More on
UKCA_MAIN1 in its profiling below.
- The extra total time in the rest of the code,
namely ATM_STEP_4A, for the Backward-Euler run is 509s at N96
and 704s at N216.
- The largest contributions to the extra total time in
ATM_STEP_4A for the Backward-Euler runs come from
the total time in
- ATMOS_PHYSICS1. The Backward-Euler uses 350s more
at N96 and 419s more at N216.
- EG_CORRECT_TRACERS* routines. The combined time
in EG_CORRECT_TRACERS and EG_CORRECT_TRACERS_UKCA for
the Backward-Euler compared to the time in
EG_CORRECT_TRACERS for Classic is 84s more at N96
and 165s more at N216.
- SL_TRACER1_4A and STASH are also significantly
larger for the Backward-Euler runs, but to a lesser
extent than the two factors above.
- The time in ATMOS_PHYSICS1 is greatly reduced at N216
compared to N96: for Backward-Euler from 1,299s to 1,051s
(a reduction of 248s) and for Classic from 949s to 632s
(a reduction of 317s). Presumably this is because this routine and
the routines below it have fewer MPI calls, so they scale
better than the other routines. See the ATMOS_PHYSICS1
profiling for more details.
- The time in EG_SL_HELMHOLTZ is greatly increased in
N216 compared to N96 by about 130s.
- Above I stated that compared to Classic the total time in
ATM_STEP_4A for the Backward-Euler is 509s more at N96
and 704s more at N216. So compared to N96 the Backward-Euler
run at N216 has gained an extra 195s on its Classic
equivalent. This is because of the total time in
- EG_CORRECT_TRACERS_UKCA is 247s at N216 but
168s at N96, so a gain of 79s.
- ATMOS_PHYSICS1. The extra time for the
Backward-Euler at N96 is 350s and at N216 it's
419s, so that's 69s extra at N216
- The number of fields in both runs at N96 is similar, whereas
the Backward-Euler run at N216 has about twice as many fields as
the Classic run at N216, which explains the larger discrepancy in
total time in STASH at N216.
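The differences quoted in this summary are simple subtractions of the profile totals. The sketch below recomputes them; the dictionary layout and the `extra` helper are illustrative only, and treating the missing EG_CORRECT_TRACERS_UKCA entry for Classic as 0s is an assumption.

```python
# Total times in seconds, copied from the four top-level profile trees.
# Classic has no EG_CORRECT_TRACERS_UKCA entry; treating it as 0 is an assumption.
times = {
    ("Classic", "N96"): {"ATM_STEP_4A": 2498, "ATMOS_PHYSICS1": 949,
                         "EG_CORRECT_TRACERS": 112, "EG_CORRECT_TRACERS_UKCA": 0},
    ("Backward-Euler", "N96"): {"ATM_STEP_4A": 3007, "ATMOS_PHYSICS1": 1299,
                                "EG_CORRECT_TRACERS": 28, "EG_CORRECT_TRACERS_UKCA": 168},
    ("Classic", "N216"): {"ATM_STEP_4A": 2280, "ATMOS_PHYSICS1": 632,
                          "EG_CORRECT_TRACERS": 122, "EG_CORRECT_TRACERS_UKCA": 0},
    ("Backward-Euler", "N216"): {"ATM_STEP_4A": 2984, "ATMOS_PHYSICS1": 1051,
                                 "EG_CORRECT_TRACERS": 40, "EG_CORRECT_TRACERS_UKCA": 247},
}

def extra(resolution, *routines):
    """Backward-Euler minus Classic total time, summed over the given routines."""
    be = sum(times[("Backward-Euler", resolution)][r] for r in routines)
    classic = sum(times[("Classic", resolution)][r] for r in routines)
    return be - classic

tracers = ("EG_CORRECT_TRACERS", "EG_CORRECT_TRACERS_UKCA")
for res in ("N96", "N216"):
    print(res,
          "ATM_STEP_4A:", extra(res, "ATM_STEP_4A"),
          "ATMOS_PHYSICS1:", extra(res, "ATMOS_PHYSICS1"),
          "EG_CORRECT_TRACERS*:", extra(res, *tracers))
```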
Profiling for ATMOS_PHYSICS1 and EG_CORRECT_TRACERS
Backward-Euler at N96
ATMOS_PHYSICS1 (1,299s) |
EG_CORRECT_TRACERS (28s) |
RAD_CTL (495s) |
MICROPHYS_CTL (189s) |
NI_GWD_CTL (273s) |
⇓ |
⇓ |
EG_MASS_CONSERVATION (33s) |
LW_RAD (357s) |
SW_RAD (120s) |
LS_PPN (181s) |
G_WAVE_5A (230s) |
GW_USSP (41s) |
GLOBAL_2D_SUMS (44s, itself) |
Itself (25s) |
RADIANCE_CALC (470s) |
LS_PPNC (172s) |
SWAP_BOUNDS (see table below) |
UKCA_RADAER_BAND_AVERAGE (276s, itself) |
SOLVE_BAND_K_EQV (153s) |
UKCA_RADAER_COMPUTE_AOD (21s, itself) |
LSP_ICE (102s) |
Itself (70s) |
MCICA_SAMPLE (122s) |
SCALE_ABSORB (20s) |
LSP_INIT (24s) |
LSP_FALL (19s) |
MONOCHROMATIC_RADIANCE (106s) |
Itself (15s) |
Itself (12s) |
Itself (17s) |
MONOCHROMATIC_RADIANCE_TSEQ (93s) |
MCICA_COLUMN (93s) |
TWO_COEFF (67s) |
TRANS_SOURCE_COEFF (37s) |
Itself (12s) |
Itself (23s) |
|
|
CLASSIC at N96
ATMOS_PHYSICS1 (949s) |
EG_CORRECT_TRACERS (122s) |
RAD_CTL (215s) |
MICROPHYS_CTL (186s) |
NI_GWD_CTL (176s) |
⇓ |
⇓ |
EG_MASS_CONSERVATION (90s) |
LW_RAD (153s) |
SW_RAD (53s) |
LS_PPN (174s) |
G_WAVE_5A (150s) |
GW_USSP (25s) |
GLOBAL_2D_SUMS (38s, itself) |
Itself (63s) |
RADIANCE_CALC (200s) |
LS_PPNC (163s) |
SWAP_BOUNDS (see table below) |
SOLVE_BAND_K_EQV (155s) |
LSP_ICE (95s) |
Itself (68s) |
MCICA_SAMPLE (122s) |
SCALE_ABSORB (20s) |
LSP_INIT (23s) |
LSP_FALL (19s) |
MONOCHROMATIC_RADIANCE (107s) |
Itself (15s) |
Itself (11s) |
Itself (16s) |
MONOCHROMATIC_RADIANCE_TSEQ (94s) |
MCICA_COLUMN (94s) |
TWO_COEFF (67s) |
TRANS_SOURCE_COEFF (37s) |
Itself (12s) |
Itself (23s) |
|
|
Backward-Euler at N216
ATMOS_PHYSICS1 (1,051s) |
EG_CORRECT_TRACERS (40s) |
RAD_CTL (332s) |
MICROPHYS_CTL (121s) |
NI_GWD_CTL (308s) |
⇓ |
⇓ |
EG_MASS_CONSERVATION (37s) |
LW_RAD (240s) |
SW_RAD (79s) |
LS_PPN (113s) |
G_WAVE_5A (250s) |
GW_USSP (58s) |
GLOBAL_2D_SUMS (70s, itself) |
Itself (28s) |
RADIANCE_CALC (313s) |
LS_PPNC (106s) |
SWAP_BOUNDS (see table below) |
UKCA_RADAER_BAND_AVERAGE (163s, itself) |
SOLVE_BAND_K_EQV (119s) |
UKCA_RADAER_COMPUTE_AOD (13s, itself) |
LSP_ICE (79s) |
Itself (27s) |
MCICA_SAMPLE (92s) |
SCALE_ABSORB (17s) |
LSP_INIT (18s) |
LSP_FALL (15s) |
MONOCHROMATIC_RADIANCE (80s) |
Itself (13s) |
Itself (10s) |
Itself (13s) |
MONOCHROMATIC_RADIANCE_TSEQ (69s) |
MCICA_COLUMN (69s) |
TWO_COEFF (48s) |
TRANS_SOURCE_COEFF (27s) |
Itself (7s) |
Itself (17s) |
|
|
CLASSIC at N216
ATMOS_PHYSICS1 (632s) |
EG_CORRECT_TRACERS (122s) |
RAD_CTL (170s) |
MICROPHYS_CTL (122s) |
NI_GWD_CTL (161s) |
⇓ |
⇓ |
EG_MASS_CONSERVATION (96s) |
LW_RAD (123s) |
SW_RAD (41s) |
LS_PPN (109s) |
G_WAVE_5A (132s) |
GW_USSP (28s) |
GLOBAL_2D_SUMS (44s, itself) |
Itself (72s) |
RADIANCE_CALC (157s) |
LS_PPNC (101s) |
SWAP_BOUNDS (see table below) |
SOLVE_BAND_K_EQV (119s) |
LSP_ICE (76s) |
Itself (25s) |
MCICA_SAMPLE (92s) |
SCALE_ABSORB (17s) |
LSP_INIT (18s) |
LSP_FALL (15s) |
MONOCHROMATIC_RADIANCE (80s) |
Itself (13s) |
Itself (10s) |
Itself (13s) |
MONOCHROMATIC_RADIANCE_TSEQ (69s) |
MCICA_COLUMN (69s) |
TWO_COEFF (47s) |
TRANS_SOURCE_COEFF (27s) |
Itself (7s) |
Itself (17s) |
|
|
Summary of ATMOS_PHYSICS1 profiling
By total time, there are three main routines
below ATMOS_PHYSICS1
- RAD_CTL. Two points:
- The total time is greatly reduced at
N216 compared to N96. Presumably it has fewer MPI
calls than other routines, so scales better. The fact
that OpenMP can be used here would seem to confirm
this (I don't think MPI calls exist in thread 2, so
MPI calls need to be done outside the threading).
- UKCA_RADAER_BAND_AVERAGE is only present for
Backward-Euler runs and is a success story at N216:
its time of 276s at N96 is reduced to 163s
at N216, saving 113s.
- MICROPHYS_CTL. The total time is about 187s at N96
and about 121s at N216 for both Backward-Euler and
Classic. Presumably it has fewer MPI calls than other
routines, and so scales better than them.
- NI_GWD_CTL. I suspect that any extra time here comes
from MPI barriers, either in the local routines or in the
SWAP_BOUNDS* routines (see below). In that case, it's
a consequence of imbalance in run time for other routines,
which means that some PEs have to wait for other
PEs to catch up. The extra time for the Backward-Euler
runs would then be because some of its routines, such as
the ones in UKCA_MAIN1, are more imbalanced.
- If we look at the time in ATMOS_PHYSICS1 not accounted for
by the total time in the three subroutines above, it's
- 372s for Classic at N96,
- 342s for Backward-Euler at N96,
- 179s for Classic at N216, and
- 290s for Backward-Euler at N216.
We know a significant chunk of this time comes from the SWAP_BOUNDS*
routines, STASH and especially COSP_MAIN (which doesn't have Dr
Hook code included). TIMER output for a previous Backward-Euler
run suggests that the total time in COSP_MAIN is about 300s at N96.
From the times above, this must be much smaller for N216.
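The unaccounted times listed above are the ATMOS_PHYSICS1 total minus the totals for RAD_CTL, MICROPHYS_CTL and NI_GWD_CTL. A minimal sketch of that subtraction, using the numbers from the four ATMOS_PHYSICS1 profiles (the variable and function names are illustrative only):

```python
# Total times in seconds from the ATMOS_PHYSICS1 profiles above.
atmos_physics1 = {
    ("Classic", "N96"): {"total": 949, "RAD_CTL": 215, "MICROPHYS_CTL": 186, "NI_GWD_CTL": 176},
    ("Backward-Euler", "N96"): {"total": 1299, "RAD_CTL": 495, "MICROPHYS_CTL": 189, "NI_GWD_CTL": 273},
    ("Classic", "N216"): {"total": 632, "RAD_CTL": 170, "MICROPHYS_CTL": 122, "NI_GWD_CTL": 161},
    ("Backward-Euler", "N216"): {"total": 1051, "RAD_CTL": 332, "MICROPHYS_CTL": 121, "NI_GWD_CTL": 308},
}

def unaccounted(run):
    """Time in ATMOS_PHYSICS1 not covered by the three main subroutines."""
    entry = atmos_physics1[run]
    return entry["total"] - sum(t for name, t in entry.items() if name != "total")

for run in atmos_physics1:
    print(run, f"{unaccounted(run)}s")
```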
Profiling for SWAP_* routines
Backward-Euler at N96
Routines |
Total mean time |
EG_SWAP_BOUNDS_DP (148s) |
ATMOS_PHYSICS1, ATMOS_PHYSICS2, G_WAVE_5A, ... |
148 + ... |
SWAP_BOUNDS & SWAP_BOUNDS_DP (434 + 171 = 605s)
|
SWAP_BOUNDS_MV (112s, itself) |
717s |
SWAP_BOUNDS_EW_DP (316s) |
SWAP_BOUNDS_NS_DP (288s, itself) |
716s |
SWAP_BOUNDS_EW_H1_DP (220s, itself) |
Itself (95s) |
715s |
CLASSIC at N96
Routines |
Total mean time |
EG_SWAP_BOUNDS_DP (107s) |
ATMOS_PHYSICS1, ATMOS_PHYSICS2, G_WAVE_5A, ... |
107 + ... |
SWAP_BOUNDS & SWAP_BOUNDS_DP (339 + 118 = 457s)
|
SWAP_BOUNDS_MV (103s, itself) |
560s |
SWAP_BOUNDS_EW_DP (234s) |
SWAP_BOUNDS_NS_DP (222s, itself) |
559s |
SWAP_BOUNDS_EW_H1_DP (153s, itself) |
Itself (81s) |
559s |
Backward-Euler at N216
Routines |
Total mean time |
EG_SWAP_BOUNDS_DP (221s) |
ATMOS_PHYSICS1, ATMOS_PHYSICS2, G_WAVE_5A, ... |
221 + ... |
SWAP_BOUNDS & SWAP_BOUNDS_DP (424 + 257 = 681s)
|
SWAP_BOUNDS_MV (115s, itself) |
796s |
SWAP_BOUNDS_EW_DP (319s) |
SWAP_BOUNDS_NS_DP (361s, itself) |
795s |
SWAP_BOUNDS_EW_H1_DP (263s, itself) |
Itself (56s) |
795s |
CLASSIC at N216
Routines |
Total mean time |
EG_SWAP_BOUNDS_DP (133s) |
ATMOS_PHYSICS1, ATMOS_PHYSICS2, G_WAVE_5A, ... |
133 + ... |
SWAP_BOUNDS & SWAP_BOUNDS_DP (300 + 148 = 448s)
|
SWAP_BOUNDS_MV (133s, itself) |
581s |
SWAP_BOUNDS_EW_DP (212s) |
SWAP_BOUNDS_NS_DP (235s, itself) |
580s |
SWAP_BOUNDS_EW_H1_DP (163s, itself) |
Itself (48s) |
579s |
Summary of SWAP_BOUNDS* routines
More time is taken in these routines at N216,
probably because the heavy MPI communication means that
they don't scale as well as other routines.
Considerably more time is taken for the Backward-Euler runs, which
will be because of
- the extra fields to pass
- MPI barriers, meaning some PEs are having to wait for
other PEs delayed by imbalances in the code, probably
from UKCA code.
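The totals in the SWAP_BOUNDS* tables can be rebuilt from the first row of each table. The sketch below sums SWAP_BOUNDS, SWAP_BOUNDS_DP and SWAP_BOUNDS_MV for each run and then takes the Backward-Euler minus Classic difference; the data layout is an illustrative choice, not something taken from the profiler.

```python
# Times in seconds from the first row of each SWAP_BOUNDS* table above.
swap = {
    ("Backward-Euler", "N96"): {"SWAP_BOUNDS": 434, "SWAP_BOUNDS_DP": 171, "SWAP_BOUNDS_MV": 112},
    ("Classic", "N96"): {"SWAP_BOUNDS": 339, "SWAP_BOUNDS_DP": 118, "SWAP_BOUNDS_MV": 103},
    ("Backward-Euler", "N216"): {"SWAP_BOUNDS": 424, "SWAP_BOUNDS_DP": 257, "SWAP_BOUNDS_MV": 115},
    ("Classic", "N216"): {"SWAP_BOUNDS": 300, "SWAP_BOUNDS_DP": 148, "SWAP_BOUNDS_MV": 133},
}

# Total time per run across the three routines.
totals = {run: sum(parts.values()) for run, parts in swap.items()}

for res in ("N96", "N216"):
    be = totals[("Backward-Euler", res)]
    classic = totals[("Classic", res)]
    print(f"{res}: B-E {be}s, Classic {classic}s, difference {be - classic}s")
```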
Profiling for UKCA_MAIN1
Backward-Euler at N96
Routines |
Total mean time |
UKCA_MAIN* (1,248s) |
1,248s |
UKCA_AERO_CTL (885s) |
UKCA_ACTIVATE (143s) |
1,028s |
UKCA_AERO_STEP (835s) |
UKCA_ABDULRAZZAK_GHAN (132s) |
967s |
UKCA_COAGWITHNUCL (378s) |
UKCA_CONDEN (141s) |
UKCA_CHECK_MD_ND (70s, itself) |
UKCA_CALCNUCRATE (69s) |
UKCA_VOLUME_MODE (51s) |
Itself (126s) |
835s |
Itself (315s) |
UKCA_SOLVECOAGNUCL_V (63s, itself) |
UKCA_COND_COFF_V (92s, itself) |
Itself (49s) |
UKCA_BINAPARA (65s, itself) |
Itself (26s) |
806s |
Backward-Euler at N216
Routines |
Total mean time |
UKCA_MAIN* (1,163s) |
1,163s |
UKCA_AERO_CTL (813s) |
UKCA_ACTIVATE (106s) |
919s |
UKCA_AERO_STEP (767s) |
UKCA_ABDULRAZZAK_GHAN (97s) |
864s |
UKCA_COAGWITHNUCL (355s) |
UKCA_CONDEN (133s) |
UKCA_CHECK_MD_ND (63s, itself) |
UKCA_CALCNUCRATE (59s) |
UKCA_VOLUME_MODE (45s) |
Itself (92s) |
747s |
Itself (294s) |
UKCA_SOLVECOAGNUCL_V (61s, itself) |
UKCA_COND_COFF_V (83s, itself) |
Itself (50s) |
UKCA_BINAPARA (55s, itself) |
Itself (23s) |
721s |
Backward-Euler at N216 multiplied by 4/3+
Routines |
Total mean time |
UKCA_MAIN* (1,550s) |
1,550s |
UKCA_AERO_CTL (1,084s) |
UKCA_ACTIVATE (141s) |
1,225s |
UKCA_AERO_STEP (1,022s) |
UKCA_ABDULRAZZAK_GHAN (129s) |
1,151s |
UKCA_COAGWITHNUCL (473s) |
UKCA_CONDEN (177s) |
UKCA_CHECK_MD_ND (84s, itself) |
UKCA_CALCNUCRATE (79s) |
UKCA_VOLUME_MODE (60s) |
Itself (123s) |
996s |
Itself (392s) |
UKCA_SOLVECOAGNUCL_V (81s, itself) |
UKCA_COND_COFF_V (111s, itself) |
Itself (67s) |
UKCA_BINAPARA (73s, itself) |
Itself (31s) |
962s |
*UKCA_MAIN also calls STASH
+You might expect that if UKCA_MAIN1 were called
every timestep, it would be three times more expensive
for N96 (it's called every third step) and four times more
expensive for N216 (it's called every fourth step). For both
Backward-Euler runs, the rest of the code, namely ATM_STEP_4A,
takes about 3,000s, so if the
assumption above is correct, it should be OK to simply scale
the N216 times by 4/3 to compare with the N96 times.
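The 4/3 scaling can be sketched as below for the upper levels of the UKCA_MAIN1 profile. The names and layout are illustrative, and the code simply multiplies the rounded N216 times by 4/3, so individual values can differ from the scaled table by about a second.

```python
from fractions import Fraction

# N216 calls UKCA every 4th step, N96 every 3rd, hence the 4/3 factor.
SCALE = Fraction(4, 3)

# Total times in seconds from the two Backward-Euler UKCA_MAIN1 profiles.
n96 = {"UKCA_MAIN1": 1248, "UKCA_AERO_CTL": 885, "UKCA_ACTIVATE": 143,
       "UKCA_AERO_STEP": 835, "UKCA_ABDULRAZZAK_GHAN": 132}
n216 = {"UKCA_MAIN1": 1163, "UKCA_AERO_CTL": 813, "UKCA_ACTIVATE": 106,
        "UKCA_AERO_STEP": 767, "UKCA_ABDULRAZZAK_GHAN": 97}

# N216 times rescaled to the N96 call frequency.
scaled = {name: float(t * SCALE) for name, t in n216.items()}

for name in n96:
    ratio = scaled[name] / n96[name]
    print(f"{name}: N96 {n96[name]}s, scaled N216 {scaled[name]:.0f}s, ratio {ratio:.2f}")
```

A ratio near 1 means the routine costs about the same per call at both resolutions for these run lengths and PE counts; a ratio well above 1 means it has become relatively more expensive per call at N216.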
Summary of UKCA_MAIN1 profiles
The profiles show that, based on total time, there are two main
subroutines called from UKCA_MAIN1
- UKCA_AERO_CTL. This has about the same total time for both N96
and N216, so even though it's called less often relative to
the main code at N216, its time at N216 is not that much less.
- UKCA_ACTIVATE. The total time here does roughly look to scale
with how frequently it is called, so it is less for the N216 run.
- As the total time spent in UKCA_AERO_CTL, which doesn't
seem to scale with the number of calls, greatly
exceeds that in UKCA_ACTIVATE, the reduction in total time in UKCA_MAIN1
is only 85s.
Overall summary
I've profiled four runs, in which the time spent in most of
the routines is similar, to try to determine why
the slowdown factor for the Backward-Euler code compared
to Classic is greater at N216 than at N96.
The main points are
- The routines below UKCA_AERO_CTL, which account for most of
the total time in UKCA_MAIN1, aren't reduced by much
at N216 compared to N96, even though UKCA is called less
often. The total time in UKCA_MAIN1 is only 85s less
at N216, when we'd hope for a lot more.
- EG_CORRECT_TRACERS_UKCA doesn't seem to scale well and
is 79s more at N216 than at N96.
- The one success story is that UKCA_RADAER_BAND_AVERAGE
takes 113s less at N216 compared to N96.
- Unfortunately, the above
gain at N216 compared to N96 is lost in other parts of
ATMOS_PHYSICS1. I suspect that imbalances elsewhere,
such as in UKCA_MAIN1, are
causing some PEs to arrive earlier in ATMOS_PHYSICS1, where
they are then held by MPI barriers, consequently
increasing the time spent in these routines.
- The total time in the SWAP_BOUNDS routines at N216
compared to N96 for the Backward-Euler runs is about 80s more
(about 796s compared to 717s).
- The number of fields in both runs at N96 is similar, whereas
the Backward-Euler run at N216 has about twice as many fields as
the Classic run at N216, which explains the larger discrepancy in
total time in STASH at N216.