Profiling month 1 vs month 2

It's been very noticeable that the first month of our prototype UKESM0.4 is running much quicker than all the following months, e.g. on 23 nodes the first month takes about 70.5 minutes (1.7 model years/day) and about 102 minutes the remaining months (1.2 model years/day).

Increasing the nodes of the ocean from 3 to 8 (u-af845) doesn't appear to speed the model up, which suggests the bottleneck for all months is still the atmosphere. In addition, Richard has also been turning on all the timers for the model components and his results suggest the time in the UKCA chemistry almost doubles after the first 20 days.

The profiling was done with u-af879.

Profiling the top routines

1st month

Routines
UM_SHELL (5,221s)
U_MODEL_4A (5,214s)
ATM_STEP_4A* (4,677s) OASIS3_ PUTA2O (210s) OASIS3_ GETO2A (14s)
ATMOS _PHYS- ICS1 (929s) ATMOS _PHYS- ICS2 (286s) EG_ SL_ HELM- HOLTZ (267s) TR_ SET_ PHYS _4A* (228s) EG_CORRECT _TRACERS _PRIESTLEY (296s) SL_ TRAC- ER1_ 4A (394s) EG_ SL_ MOI- STURE (56s) EG_SL_ FULL_WIND (138s)  ⇓  UKCA_MAIN1 (1,003s) ICE_ SHEET_ MASS (207s) OASIS3_ GET (9s)
Profile for ATMOS_ PHYSICS1 Profile for ATMOS_ PHYSICS2 and EG_SL_ HELMHOLTZ EG_SL_WIND_U, EG_SL_WIND_V & EG_SL_WIND_W (16 + 15 + 53 = 84s) STASH (289s) See profiling for UKCA_ MAIN1 GLOBAL_ 2D_ SUMS** (348s)
Itself (89s) EG_INTERPOLATION _ETA (254s) DEP- ARTURE_ POINT _ETA (95s) STWORK (289s)
EG_ CUBIC_ LAG- RANGE (78s, itself) MONO_ ENFORCE (30s, itself) Itself (120s) Itself (8s) PP_ HEAD (129s) EXP- PXI (102s, itself)
*should also link to SWAP_BOUNDS, like many other returns.
** GLOBAL_2D_SUMS is called by other routines as well

2nd month

Routines
UM_SHELL (7,000s)
U_MODEL_4A (6,990s)
ATM_STEP_4A* (6,006s) OASIS3_ PUTA2O (567s) OASIS3_ GETO2A (52s)
ATMOS _PHYS- ICS1 (914s) ATMOS _PHYS- ICS2 (283s) EG_ SL_ HELM- HOLTZ (292s) TR_ SET_ PHYS _4A* (202s) EG_CORRECT _TRACERS _PRIESTLEY (297s) SL_ TRAC- ER1_ 4A (386s) EG_ SL_ MOI- STURE (53s) EG_SL_ FULL_WIND (121s)  ⇓  UKCA_MAIN1 (1,624s) ICE_ SHEET_ MASS (563s) OASIS3_ GET (39s)
Profile for ATMOS_ PHYSICS1 Profile for ATMOS_ PHYSICS2 and EG_SL_ HELMHOLTZ EG_SL_WIND_U, EG_SL_WIND_V & EG_SL_WIND_W (15 + 15 + 41 = 71s) STASH (289s) See profiling for UKCA_ MAIN1 GLOBAL_ 2D_ SUMS** (724s)
Itself (89s) EG_INTERPOLATION _ETA (254s) DEP- ARTURE_ POINT _ETA (99s) STWORK (289s)
EG_ CUBIC_ LAG- RANGE (78s, itself) MONO_ ENFORCE (30s, itself) Itself (120s) Itself (8s) PP_ HEAD (129s) EXP- PXI (102s, itself)
*should also link to SWAP_BOUNDS, like many other returns.
** GLOBAL_2D_SUMS is called by other routines as well

Summary

The two routines showing a much greater increase in mean time are

  • UKCA_MAIN1
  • GLOBAL_2D_SUMS (called from OASIS3_PUTA2O -> ICE_SHEET_MASS). As OASIS3_PUT follows the call to UKCA_MAIN1, I suspect the extra time in here comes from a MPI barrier and a big imbalance in UKCA_MAIN1 meaning that some PEs have to wait a while in this routine for other PEs to catch up.

Profiling UKCA_MAIN1

1st month

Routines
UKCA_MAIN* (1,003s)
UKCA_AERO_CTL (146s) UKCA_ ACT- IVATE (41s) UKCA_CHEMISTRY_CTL (298s) UKCA_FASTJX (154s) UKCA_ NEW_ EMISS_ CTL (77s)
UKCA_AERO_STEP (142s) UKCA_ ABDUL- RAZZAK _GHAN (39s) ASAD_CDRIVE (266s) FASTJX_PHOTOJ (154s)
UKCA_ COAGWITH- NUCL (41s) UKCA_ CONDEN (38s) Itself (38s) ASAD_SPMJPDRIV (236s) FASTJX_OPMIE (93s) FLINT (28s, itself) Itself (22s)
Itself (32s) UKCA_ COND_ COFF _V (30s, itself) ASAD_SPIMPMJP (232s) FASTJX_ MIESCT (67s) Itself (26s)
SPLINSLV2 (109s, itself) SPFULJAC (68s, itself) Itself (21s) BLKSLV (67s)
Itself (45s)

2nd month

Routines
UKCA_MAIN* (1,624s)
UKCA_AERO_CTL (142s) UKCA_ ACT- IVATE (45s) UKCA_CHEMISTRY_CTL (921s) UKCA_FASTJX (154s) UKCA_ NEW_ EMISS_ CTL (77s)
UKCA_AERO_STEP (138s) UKCA_ ABDUL- RAZZAK _GHAN (44s) ASAD_CDRIVE (889s) FASTJX_PHOTOJ (154s)
UKCA_ COAGWITH- NUCL (41s) UKCA_ CONDEN (38s) Itself (43s) ASAD_SPMJPDRIV (860s) FASTJX_OPMIE (93s) FLINT (28s, itself) Itself (22s)
Itself (32s) UKCA_ COND_ COFF _V (30s, itself) ASAD_SPIMPMJP (854s) FASTJX_ MIESCT (67s) Itself (26s)
SPLINSLV2 (414s, itself) SPFULJAC (245s, itself) Itself (73s) BLKSLV (67s)
Itself (45s)

Summary

Remarkably clear signal, the 2nd month is slowed by the extra time in the calling tree of UKCA_CHEMISTRY_CTL -> ASAD_CDRIVE -> ASAD_SPMJPDRIV -> ASAD_SPIMPJP, and most of this can be attributed to the solvers SPLINSLV2 and SPFULJAC.

And if we look at the minimum and maximum time in these two routines, we can see how imbalanced they are

RoutineMinimum MeanMaximum
SPLINSLV2261.856 (PE 404) 414.231087.191 (PE 0)
SPFULJAC148.937 (PE 404) 244.75683.399 (PE 23)