GA7 vs UKESM at UM11.7
Here I'm comparing a GA7 run, u-bw499, against a UKESM run, u-bw526.
Both were run with the following set-up:
- One month
- 24x21 MPI decomposition on 2 OpenMP threads, so 1008 cores or 28 nodes.
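As a quick check of the core count quoted above, the arithmetic can be sketched in a few lines of Python. The variable names are only illustrative, and the cores-per-node figure is derived from the numbers above on the assumption that every core on each node is used.

```python
# Figures taken from the run description above.
decomposition = (24, 21)   # MPI decomposition
omp_threads = 2            # OpenMP threads per MPI task
nodes = 28

mpi_tasks = decomposition[0] * decomposition[1]
cores = mpi_tasks * omp_threads
# Derived value; assumes the nodes are fully populated.
cores_per_node = cores // nodes

print(f"{mpi_tasks} MPI tasks, {cores} cores, {cores_per_node} cores per node")
```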
Note that I've turned off COSP and the timers in the GA7 job to be consistent
with the UKESM AMIP job.
See UKESM ticket #707 for more on how they were created.
The numbers are mean total times, where the total time is the time spent in
the routine and in all the routines called by that routine, unless it says
`itself', which means the time is for the routine itself only. Where the time
in a routine is small I've not included it.
Two times indicate that the routine is run on two threads: the first is
the time on thread 1 and the second is the time on thread 2.
On the UKESM profiles, I've shown in bold any times which are significantly
greater than GA7.
Profiling for top routines
GA7
Routines |
UM_SHELL (1177s) |
U_MODEL_4A (1175s) |
ATM_STEP_4A* (1086s) |
ATMOS_PHYSICS1 (194s) |
ATMOS_PHYSICS2 (114s) |
EG_SL_HELMHOLTZ (122s) |
TR_SET_PHYS_4A* (32s) |
EG_CORRECT_TRACERS_PRIESTLEY (37s) |
SL_TRACER1_4A (57s) |
EG_SL_MOISTURE (23s) |
EG_SL_FULL_WIND (46s) |
ATM_STEP_STASH (91s) |
ST_DIAG3 (33s) |
⇓ |
ATMOS_UKCA (225s) |
See profile for ATMOS_PHYSICS1 |
See profile for ATMOS_PHYSICS2 |
EG_PRECON_DP_DP (65s) |
EG_SL_WIND_U, EG_SL_WIND_V & EG_SL_WIND_W (8 + 8 + 12 = 28s) |
STASH (186s) |
See profiling for ATMOS_UKCA |
TRI_SOR_DP_DP* (65s) |
Itself (13s) |
EG_INTERPOLATION_ETA (69s) |
DEPARTURE_POINT_ETA (22s) |
See profile for STASH |
Itself (40s) |
EG_CUBIC_LAGRANGE (23s, itself) |
MONO_ENFORCE (6s, itself) |
Itself (27s) |
Itself (5s) |
|
|
*Calls SWAP_BOUNDS, like many other routines.
UKESM
Routines |
UM_SHELL (2907s) |
U_MODEL_4A (2899s) |
ATM_STEP_4A* (2832s) |
ATMOS_PHYSICS1 (242s) |
ATMOS_PHYSICS2 (202s) |
EG_SL_HELMHOLTZ (138s) |
TR_SET_PHYS_4A* (121s) |
EG_CORRECT_TRACERS_PRIESTLEY (138s) |
SL_TRACER1_4A (195s) |
EG_SL_MOISTURE (23s) |
EG_SL_FULL_WIND (49s) |
ATM_STEP_STASH (286s) |
ST_DIAG3 (143s) |
⇓ |
ATMOS_UKCA (1129s) |
See profile for ATMOS_PHYSICS1 |
See profile for ATMOS_PHYSICS2 |
EG_PRECON_DP_DP (70s) |
EG_SL_WIND_U, EG_SL_WIND_V & EG_SL_WIND_W (8 + 8 + 13 = 29s) |
STASH (1057s) |
See profiling for ATMOS_UKCA |
TRI_SOR_DP_DP* (69s) |
Itself (43s) |
EG_INTERPOLATION_ETA (149s) |
DEPARTURE_POINT_ETA (25s) |
See profile for STASH |
Itself (41s) |
EG_CUBIC_LAGRANGE (51s, itself) |
MONO_ENFORCE (16s, itself) |
Itself (64s) |
Itself (5s) |
|
|
*Calls SWAP_BOUNDS, like many other routines.
STASH
GA7
Routines |
STASH (186s) |
STWORK (185s) |
SPATIAL (61s) |
PP_HEAD (63s) |
⇓ |
STASH_GATHER_FIELD (73s) |
TEMPORAL (21s) |
STZONM* (1s) |
⇓ |
STEXTC (15s, itself) |
Itself (33s) |
Itself (23s) |
EXPPXI (50s, itself) |
GATHER_PACK_FIELD (11s, itself) |
STACCUM (14s, itself) |
GLOBAL_TO_LOCAL_SUBDOMAIN (13s) |
Itself (8s) |
|
|
* Calls GLOBAL_2D_SUMS (29s)
UKESM
Routines |
STASH (1057s) |
STWORK (1056s) |
SPATIAL (373s) |
PP_HEAD (300s) |
⇓ |
STASH_GATHER_FIELD (212s) |
TEMPORAL (96s) |
STZONM* (163s) |
⇓ |
STEXTC (50s, itself) |
Itself (116s) |
Itself (108s) |
EXPPXI (234s, itself) |
GATHER_PACK_FIELD (161s, itself) |
STACCUM (64s, itself) |
GLOBAL_TO_LOCAL_SUBDOMAIN (51s) |
Itself (32s) |
|
|
* Calls GLOBAL_2D_SUMS (197s)
ATMOS_PHYSICS1
GA7
ATMOS_PHYSICS1* (194s) |
RAD_CTL (66s) |
MICROPHYS_CTL (80s) |
NI_GWD_CTL (42s) |
LW_RAD (42s, 42s) |
SW_RAD (15s, 15s) |
LS_PPN (77s) |
G_WAVE_5A* (22s) |
GW_USSP* (19s) |
SET_AER (26s, 26s) |
SOCRATES_CALC (29s, 29s) |
LS_PPNC (71s) |
RADIANCE_CALC (29s, 29s) |
UKCA_RADAER_BAND_AVERAGE (22s, 23s, itself) |
SOLVE_BAND_K_EQV_SCL (21s, 21s) |
LSP_ICE (60s, 30s) |
MCICA_SAMPLE (19s, 19s) |
LSP_CAPTURE (12s, 6s) |
LSP_FALL (8s, 4s) |
MONOCHROMATIC_RADIANCE (16s, 16s) |
LSP_MOMENTS (29s, 15s, itself) |
MONOCHROMATIC_RADIANCE_TSEQ (14s, 14s) |
MCICA_COLUMN (13s, 13s) |
|
|
* Calls SWAP_BOUNDS
UKESM
ATMOS_PHYSICS1* (242s) |
RAD_CTL (101s) |
MICROPHYS_CTL (75s) |
NI_GWD_CTL (56s) |
LW_RAD (73s, 71s) |
SW_RAD (22s, 22s) |
LS_PPN (72s) |
G_WAVE_5A* (21s) |
GW_USSP* (20s) |
SET_AER (45s, 44s) |
SOCRATES_CALC (48s, 47s) |
LS_PPNC (66s) |
RADIANCE_CALC (48s, 47s) |
UKCA_RADAER_BAND_AVERAGE (30s, 29s, itself) |
SOLVE_BAND_K_EQV_SCL (35s, 34s) |
LSP_ICE (54s, 36s) |
MCICA_SAMPLE (30s, 29s) |
LSP_CAPTURE (11s, 7s) |
LSP_FALL (7s, 5s) |
MONOCHROMATIC_RADIANCE (26s, 25s) |
LSP_MOMENTS (26s, 17s, itself) |
MONOCHROMATIC_RADIANCE_TSEQ (22s, 21s) |
MCICA_COLUMN (22s, 21s) |
|
|
* Calls SWAP_BOUNDS
ATMOS_PHYSICS2
GA7
Routines |
ATMOS_PHYSICS2 (113s) |
NI_CONV_CTL (39s) |
NI_IMP_CTL (27s) |
NI_BL_CTL (11s) |
ATMOS_PHYSICS2_SWAP_IMP* (19s, 5s) |
GLUE_CONV_6A (29s, 28s) |
IMP_SOLVER* (13s) |
TR_MIX (4s, 3s) |
BDY_LAYR (11s) |
Itself (9s, 9s) |
MID_CONV_6A (9s, 9s) |
IMP_MIX (3s, 2s, itself) |
surf_couple_explicit (7s) |
* Calls SWAP_BOUNDS
UKESM
Routines |
ATMOS_PHYSICS2 (202s) |
NI_CONV_CTL (88s) |
NI_IMP_CTL (43s) |
NI_BL_CTL (22s) |
ATMOS_PHYSICS2_SWAP_IMP* (28s, 5s) |
GLUE_CONV_6A (58s, 57s) |
IMP_SOLVER* (20s) |
TR_MIX (11s, 10s) |
BDY_LAYR (22s) |
Itself (29s, 29s) |
MID_CONV_6A (14s, 14s) |
IMP_MIX (9s, 8s, itself) |
surf_couple_explicit (18s) |
* Calls SWAP_BOUNDS
UKCA_MAIN1
GA7
Routines |
ATMOS_UKCA (225s) |
UKCA_MAIN (138s) |
STASH* (186s) |
UKCA_AERO_CTL (86s) |
UKCA_ACTIVATE (31s) |
UKCA_CHEMISTRY_CTL_BE (3s) |
UKCA_AERO_STEP (82s, 81s) |
UKCA_ABDULRAZZAK_GHAN (28s) |
UKCA_COAGWITHNUCL (28s, 28s) |
UKCA_CONDEN (26s, 26s) |
UKCA_VOLUME_MODE (10s, 10s) |
Itself (33s) |
Itself (22s, 22s) |
UKCA_COND_COFF_V (21s, 21s, itself) |
*STASH is called from other routines as well
+ Also called from other routines, especially boundary layer routines
UKESM
Routines |
ATMOS_UKCA (1129s) |
UKCA_MAIN (466s) |
UKCA_PLEV_DIAGS (169s) |
⇓ |
UKCA_AERO_CTL (84s) |
UKCA_ACTIVATE (36s) |
UKCA_CHEMISTRY_CTL (167s) |
UKCA_FASTJX (114s) |
UKCA_NEW_EMISS_CTL (37s) |
STASH* (1057s) |
UKCA_AERO_STEP (78s, 78s) |
UKCA_ABDULRAZZAK_GHAN (34s) |
ASAD_CDRIVE (132s) |
UKCA_STRAT_PHOTOL (21s) |
FASTJX_PHOTOJ (114s) |
UKCA_ADD_EMISS (32s) |
UKCA_COAGWITHNUCL (28s, 28s) |
UKCA_CONDEN (24s, 24s) |
UKCA_VOLUME_MODE (10s, 10s) |
Itself (33s) |
ASAD_SPMJPDRIV (118s, 119s) |
⇓ |
INIJTAB (21s) |
FASTJX_OPMIE (65s, 66s) |
FLINT (24s, 24s, itself) |
Itself (16s) |
TRSRCE (15s, 14s, itself) |
TR_MIX+ (11s, 10s) |
Itself (21s, 21s) |
UKCA_COND_COFF_V (19s, 19s, itself) |
ASAD_SPIMPMJP (116s, 116s) |
⇓ |
SETTAB (20s) |
FASTJX_MIESCT (47s, 48s) |
Itself (18s, 18s) |
IMP_MIX (9s, 8s, itself) |
SPLINSLV2 (45s, 46s, itself) |
SPFULJAC (34s, 34s, itself) |
Itself (12s, 13s) |
ASAD_DIFFUN (18s, 18s) |
BLKSLV (47s, 48s) |
ASAD_PRLS (18s, 18s, itself) |
Itself (31s, 31s) |
MATINW (7s, 8s, itself) |
|
|
*STASH is called from other routines as well
+ Also called from other routines, especially boundary layer routines
SWAP_BOUNDS
GA7
Routines |
SWAP_BOUNDS_HUB_4D_TO_3D_DP (74s) |
⇓ |
SWAP_BOUNDS_MV (44s) |
SWAP_BOUNDS_MPI_RB_DP (25s, itself) |
SWAP_BOUNDS_HUB_DP (173s) |
SWAP_BOUNDS_DDT_NSEW_WA_DP (110s) |
SWAP_BOUNDS_DDT_AAO_DP (69s) |
Itself (42s) |
END_SWAP_BOUNDS_DDT_NSEW_WA_DP (67s) |
BEGIN_SWAP_BOUNDS_DDT_NSEW_WA_DP (44s) |
END_SWAP_BOUNDS_DDT_AAO_DP (69s) |
BEGIN_SWAP_BOUNDS_DDT_AAO_DP (24s) |
UKESM
Routines |
SWAP_BOUNDS_HUB_4D_TO_3D_DP (232s) |
⇓ |
SWAP_BOUNDS_MV (62s) |
SWAP_BOUNDS_MPI_RB_DP (27s, itself) |
SWAP_BOUNDS_HUB_DP (375s) |
SWAP_BOUNDS_DDT_NSEW_WA_DP (278s) |
SWAP_BOUNDS_DDT_AAO_DP (96s) |
Itself (60s) |
END_SWAP_BOUNDS_DDT_NSEW_WA_DP (150s) |
BEGIN_SWAP_BOUNDS_DDT_NSEW_WA_DP (128s) |
END_SWAP_BOUNDS_DDT_AAO_DP (94s) |
BEGIN_SWAP_BOUNDS_DDT_AAO_DP (26s) |
Reason why UKESM AMIP is slower than GA7
UKESM AMIP runs in 2907s and GA7 runs in 1177s. UKESM AMIP is therefore
1730s (147%) slower, taking 2.47 times as long as GA7.
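The slowdown figures follow directly from the two UM_SHELL times. A minimal Python sketch of the arithmetic is given here; the helper name `slowdown` is hypothetical and only used for illustration.

```python
def slowdown(slow_s, fast_s):
    """Return (extra seconds, percent slower, runtime ratio) for two runtimes."""
    extra = slow_s - fast_s
    return extra, 100.0 * extra / fast_s, slow_s / fast_s

# UM_SHELL times from the profiles above: UKESM AMIP versus GA7.
extra, percent, ratio = slowdown(2907, 1177)
print(f"{extra}s extra, {percent:.0f}% slower, {ratio:.2f} times the GA7 runtime")
```

Note that the percentage and the ratio describe the same gap: a run that is 147% slower takes 2.47 times as long.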
The profiling above suggests that the reasons why UKESM AMIP
is slower are
- UKESM has an enormous STASH: 1057s versus 186s for GA7.
- 871s extra
- Without this extra time UKESM AMIP would run in about
2036s. That is 859s (73%) slower than GA7, or 1.73 times the
GA7 runtime
- More fields to transport for UKESM
- More advection
- SL_TRACER1_4A: 195s versus 57s (138s more)
- EG_CORRECT_TRACERS_PRIESTLEY: 138s versus 37s (101s more)
- TR_SET_PHYS_4A: 121s versus 32s (89s more)
- More convective transport
- NI_CONV_CTL: 88s versus 39s (49s more)
A total of 328s more for the three advection routines, or 377s more
once NI_CONV_CTL is included.
- Extra chemistry (not found in GA7)
- UKCA_CHEMISTRY_CTL (167s)
- UKCA_FASTJX (114s)
- UKCA_NEW_EMISS_CTL (37s)
A total of 318s (27% of runtime for GA7)
- SWAP_BOUNDS is higher for UKESM. UKESM will need to do more
halo exchanges than GA7, because it has more fields. However,
I think most of the SWAP_BOUNDS time is probably absorbing
speed imbalances in the rest of the code, such as UKCA_MAIN,
because SWAP_BOUNDS contains MPI barriers.
- Whatever the cause, time attributed to SWAP_BOUNDS_HUB_DP:
375s versus 173s (202s more)
- Extra time for radiation. I'm not sure why, but probably something to
do with RADAER.
- LW_RAD: 73s & 71s versus 42s & 42s (about 30s more)
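The differences quoted in the list above can be tallied with a short Python sketch. The timings are copied from the profiles; the dictionary layout and the group names are only illustrative. A GA7 time of 0 is my shorthand for a routine that does not appear in the GA7 profile. Because the times are inclusive and the routines overlap (STASH and SWAP_BOUNDS are called from within other routines), the individual extras should not be expected to add up to the 1730s gap.

```python
GA7_TOTAL, UKESM_TOTAL = 1177, 2907  # UM_SHELL times in seconds

# (GA7 seconds, UKESM seconds), copied from the profiles above.
# A GA7 value of 0 marks a routine absent from the GA7 profile.
TIMES = {
    "STASH": (186, 1057),
    "SL_TRACER1_4A": (57, 195),
    "EG_CORRECT_TRACERS_PRIESTLEY": (37, 138),
    "TR_SET_PHYS_4A": (32, 121),
    "NI_CONV_CTL": (39, 88),
    "UKCA_CHEMISTRY_CTL": (0, 167),
    "UKCA_FASTJX": (0, 114),
    "UKCA_NEW_EMISS_CTL": (0, 37),
    "SWAP_BOUNDS_HUB_DP": (173, 375),
}

# Illustrative groupings matching the bullet points above.
ADVECTION = ("SL_TRACER1_4A", "EG_CORRECT_TRACERS_PRIESTLEY", "TR_SET_PHYS_4A")
CHEMISTRY = ("UKCA_CHEMISTRY_CTL", "UKCA_FASTJX", "UKCA_NEW_EMISS_CTL")

extra = {name: ukesm - ga7 for name, (ga7, ukesm) in TIMES.items()}
advection_extra = sum(extra[name] for name in ADVECTION)
chemistry_extra = sum(extra[name] for name in CHEMISTRY)
without_stash = UKESM_TOTAL - extra["STASH"]

# Largest contributors first.
for name, seconds in sorted(extra.items(), key=lambda item: -item[1]):
    print(f"{name:30s} {seconds:4d}s extra")
print(f"Advection routines: {advection_extra}s extra")
print(f"Chemistry routines: {chemistry_extra}s extra "
      f"({100 * chemistry_extra / GA7_TOTAL:.0f}% of the GA7 runtime)")
print(f"UKESM AMIP without the extra STASH time: {without_stash}s")
```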