Profiling for l_strip_ukca

In UM ticket 1686, I'm adding most of the code associated with the logical l_strip_ukca. Setting this to true should:

  • Remove most of the UKCA fields from the D1 array
  • Remove the transport, namely advection and convection, of the UKCA fields.

I've created a UKESM-hybrid N96 N48 ORCA1 job, u-bh774, where I've only turned on the profiling for Snr, and I've made sure that Snr is the slowest component even when l_strip_ukca is true. To do this I had to reduce the MPI tasks for Snr to about the minimum that memory will allow and reduce the timestep from 20 mins to 15 mins. The full settings are:

  • Run for one month
  • (ATM_PROCX,ATM_PROCY)=(24,18)
  • OMPTHR_ATM=2
  • IOS_NPROC=0
  • In app/um/rose-app.conf, steps_per_periodim=96
  • (ATM_PROCX_JNR,ATM_PROCY_JNR)=(32,20)
  • OMPTHR_ATM=3
  • IOS_NPROC=0
  • In app/um_jnr/rose-app.conf, steps_per_periodim=72
  • (NEMO_IPROC,NEMO_JPROC)=(12,9)
  • (CICE_BLKX,CICE_BLKY)=(30,37)
  • total nodes=82
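As a quick sanity check on these settings, here is a minimal Python sketch that derives the MPI task counts and timesteps from the values listed above. It assumes that steps_per_periodim counts steps per day (86,400s) and that each OpenMP thread occupies one core; the "snr"/"jnr" grouping and the helper names are mine, not part of the suite.

```python
# Settings copied from the list above. The snr/jnr grouping is assumed from
# the order in which the settings appear.
SECONDS_PER_DAY = 86_400  # assumption: steps_per_periodim is steps per day


def timestep_minutes(steps_per_period, period_seconds=SECONDS_PER_DAY):
    """Return the timestep in minutes for a given number of steps per period."""
    return period_seconds / steps_per_period / 60


def mpi_tasks(procx, procy):
    """Return the number of MPI tasks in a procx by procy decomposition."""
    return procx * procy


snr = {"procx": 24, "procy": 18, "omp_threads": 2, "steps_per_period": 96}
jnr = {"procx": 32, "procy": 20, "omp_threads": 3, "steps_per_period": 72}

for name, cfg in (("Snr", snr), ("Jnr", jnr)):
    tasks = mpi_tasks(cfg["procx"], cfg["procy"])
    cores = tasks * cfg["omp_threads"]  # assumption: one core per thread
    dt = timestep_minutes(cfg["steps_per_period"])
    print(f"{name}: {tasks} MPI tasks, {cores} cores, timestep {dt:.0f} min")

nemo_tasks = mpi_tasks(12, 9)
print(f"NEMO: {nemo_tasks} MPI tasks")
```

Under these assumptions the 96 steps per period for Snr correspond to the 15 minute timestep mentioned above, and the 72 steps for Jnr to 20 minutes.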

The times shown below are total times, i.e. the time in a given routine plus all the time in routines called by it, unless stated as 'itself', which is the time spent in the given routine alone.
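To make the difference between a total time and an 'itself' time concrete, here is a small Python sketch. The routine names and numbers are hypothetical and are not taken from the profiles; it also assumes that every callee of the routine is timed.

```python
def self_time(total_seconds, child_totals):
    """Time spent in a routine itself: its total minus the totals of its callees."""
    return total_seconds - sum(child_totals)


# Hypothetical routine PARENT with a total time of 100s, which calls
# CHILD_A (total 60s) and CHILD_B (total 30s).
parent_total = 100
child_totals = {"CHILD_A": 60, "CHILD_B": 30}

parent_itself = self_time(parent_total, child_totals.values())
print(f"PARENT: total {parent_total}s, itself {parent_itself}s")
```

In the tables not every callee of a routine is listed, so the listed children and the 'itself' time will not generally add up to the parent's total.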

Profiling for l_strip_ukca=false

Routines
UM_SHELL (4,534s)
U_MODEL_4A (4,528s)
ATM_STEP_4A* (4,265s) | OASIS3_GETO2A (130s) | OASIS3_GET_HYBRID (45s) | OASIS3_PUTA2O (5s) | OASIS3_PUT_HYBRID (7s)
ATMOS_PHYSICS1 (499s) | ATMOS_PHYSICS2 (329s) | EG_SL_HELMHOLTZ (267s) | TR_SET_PHYS_4A* (162s) | EG_CORRECT_TRACERS_PRIESTLEY (199s) | SL_TRACER1_4A (277s) | EG_SL_MOISTURE (34s) | EG_SL_FULL_WIND (88s) | ⇓ | UKCA_MAIN1 (1,506s) | OASIS3_GET (36s) | OASIS3_PUT (9s)
ATMOS_PHYSICS1 routines | ATMOS_PHYSICS2 routines | EG_SL_HELMHOLTZ routines | EG_SL_WIND_U, EG_SL_WIND_V & EG_SL_WIND_W (16 + 16 + 21 = 53s) | STASH (1,482s) | UKCA_MAIN1 routines
Itself (64s) | EG_INTERPOLATION_ETA_PMF (234s) | DEPARTURE_POINT_ETA (44s) | STWORK (1,481s)
EG_INTERPOLATION_ETA (295s) | Itself (8s) | PP_HEAD (370s) | EXPPXI (303s, itself)
EG_CUBIC_LAGRANGE (79s, itself) | MONO_ENFORCE (24s, itself) | Itself (104s)
* Should also link to SWAP_BOUNDS, like many other routines.
** GLOBAL_2D_SUMS is called by several routines and takes 498s in total.

Profiling for l_strip_ukca=true

Routines
UM_SHELL (2,028s)
U_MODEL_4A (2,022s)
ATM_STEP_4A* (1,894s) | OASIS3_GETO2A (21s) | OASIS3_GET_HYBRID (48s) | OASIS3_PUTA2O (5s) | OASIS3_PUT_HYBRID (6s)
ATMOS_PHYSICS1 (499s) | ATMOS_PHYSICS2 (178s) | EG_SL_HELMHOLTZ (247s) | TR_SET_PHYS_4A* (9s) | EG_CORRECT_TRACERS_PRIESTLEY (10s) | SL_TRACER1_4A (13s) | EG_SL_MOISTURE (32s) | EG_SL_FULL_WIND (65s) | ⇓ | OASIS3_GET (44s) | OASIS3_PUT (8s)
ATMOS_PHYSICS1 routines | ATMOS_PHYSICS2 routines | EG_SL_HELMHOLTZ routines | EG_SL_WIND_U, EG_SL_WIND_V & EG_SL_WIND_W (15 + 15 + 17 = 47s) | STASH (629s)
Itself (3s) | EG_INTERPOLATION_ETA_PMF (61s) | DEPARTURE_POINT_ETA (34s) | STWORK (628s)
EG_INTERPOLATION_ETA (60s) | Itself (8s) | PP_HEAD (159s) | EXPPXI (131s, itself)
EG_CUBIC_LAGRANGE (19s, itself) | MONO_ENFORCE (4s, itself) | Itself (26s)
* Should also link to SWAP_BOUNDS, like many other routines.
** GLOBAL_2D_SUMS is called by several routines and is

The times highlighted in bold are those that are much smaller than when running with l_strip_ukca=false. The runtime with l_strip_ukca=true is less than half the runtime with l_strip_ukca=false (2,028s against 4,534s for UM_SHELL), and the main reasons for this are:

  • UKCA_MAIN1 doesn't appear in the profiling when l_strip_ukca=true, because it doesn't get called. Setting l_strip_ukca=true saves about 2,506s (the difference in total time for UM_SHELL, 4,534s - 2,028s), and about 1,506s of this, the majority, is saved by not running UKCA_MAIN1.
  • There are a lot fewer UKCA fields when l_strip_ukca=true, and this saves about 853s in STASH.
  • When l_strip_ukca=true, there's no advection of UKCA tracer fields, and this saves about 606s in total across TR_SET_PHYS_4A (153s), EG_CORRECT_TRACERS_PRIESTLEY (189s) and SL_TRACER1_4A (264s).
  • When l_strip_ukca=true, there's no convective transport of UKCA tracer fields, and this saves about 151s in ATMOS_PHYSICS2. To see this explicitly, I've provided more detailed profiling of the routines called by ATMOS_PHYSICS2 below.
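The savings quoted in this list can be recomputed from the two main profiling tables. The Python sketch below is only a cross-check of that arithmetic: the numbers are the total times copied from the tables, UKCA_MAIN1 is counted as 0s when l_strip_ukca=true because it isn't called, and the dictionary and variable names are my own.

```python
# Total times in seconds, copied from the two profiling tables.
strip_false = {
    "UM_SHELL": 4534,
    "UKCA_MAIN1": 1506,
    "STASH": 1482,
    "ATMOS_PHYSICS2": 329,
    "TR_SET_PHYS_4A": 162,
    "EG_CORRECT_TRACERS_PRIESTLEY": 199,
    "SL_TRACER1_4A": 277,
}
strip_true = {
    "UM_SHELL": 2028,
    "UKCA_MAIN1": 0,  # not called when l_strip_ukca=true
    "STASH": 629,
    "ATMOS_PHYSICS2": 178,
    "TR_SET_PHYS_4A": 9,
    "EG_CORRECT_TRACERS_PRIESTLEY": 10,
    "SL_TRACER1_4A": 13,
}

# Saving per routine: time with l_strip_ukca=false minus time with it true.
savings = {name: strip_false[name] - strip_true[name] for name in strip_false}

tracer_routines = ("TR_SET_PHYS_4A", "EG_CORRECT_TRACERS_PRIESTLEY", "SL_TRACER1_4A")
tracer_saving = sum(savings[name] for name in tracer_routines)

for name, saved in savings.items():
    share = 100 * saved / savings["UM_SHELL"]
    print(f"{name:30s} saves {saved:5d}s ({share:5.1f}% of the UM_SHELL saving)")
print(f"Tracer advection routines together save {tracer_saving}s")
```

The per-routine savings shouldn't simply be added together and compared with the UM_SHELL saving, because some of these routines may be nested inside others in the call tree.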

Profiling ATMOS_PHYSICS2 when l_strip_ukca=false

ATMOS_PHYSICS2 (329s)
NI_CONV_CTL (145s) | NI_IMP_CTL (61s) | SWAP_BOUNDS routines
GLUE_CONV_6A (102s, 101s) | IMP_SOLVER (30s)
Itself (61s, 61s) | MID_CONV_6A (24s, 24s)
Itself (15s, 14s)

Profiling ATMOS_PHYSICS2 when l_strip_ukca=true

ATMOS_PHYSICS2 (174s)
NI_CONV_CTL (43s) | NI_IMP_CTL (38s) | SWAP_BOUNDS routines
GLUE_CONV_6A (32s, 31s) | IMP_SOLVER (18s)
Itself (12s, 12s) | MID_CONV_6A (11s, 10s)
Itself (3s, 3s)

The tables above show that with l_strip_ukca=true there is just a general reduction in the time across all the routines called from ATMOS_PHYSICS2.
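To put rough numbers on that general reduction, the Python sketch below computes the percentage reduction for each named routine in the two ATMOS_PHYSICS2 tables. Where a table gives a pair of times I've used the first figure, which is an assumption on my part; the helper name is also mine.

```python
# (time with l_strip_ukca=false, time with l_strip_ukca=true) in seconds,
# copied from the two ATMOS_PHYSICS2 tables. Where the table gives a pair of
# times, the first figure is used (an assumption).
atmos_physics2 = {
    "ATMOS_PHYSICS2": (329, 174),
    "NI_CONV_CTL": (145, 43),
    "GLUE_CONV_6A": (102, 32),
    "MID_CONV_6A": (24, 11),
    "NI_IMP_CTL": (61, 38),
    "IMP_SOLVER": (30, 18),
}


def percent_reduction(before, after):
    """Return the reduction from before to after as a percentage of before."""
    return 100 * (before - after) / before


reductions = {
    name: percent_reduction(before, after)
    for name, (before, after) in atmos_physics2.items()
}

for name, reduction in reductions.items():
    before, after = atmos_physics2[name]
    print(f"{name:15s} {before:4d}s -> {after:4d}s ({reduction:4.1f}% less)")
```

Every routine listed comes out with a positive reduction, which is consistent with the saving being spread across the routines rather than coming from a single one.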