Profiling Snr UM

All runs are

  • N216
  • 16x16 for Snr; 32x16 for Jnr (N96)
  • 5 days

I'm profiling Snr UM with

perl drHook.pl --dir=/data/cr/ukesm/mstringe/cakdx_cakdy --nRoutines=9999 --endPe=256 [--orderBy=total]

Top level

CLASSIC

Routines
UM_SHELL (1,209s)
U_MODEL_4A (1,204s)
ATM_STEP_4A* (1,142s)
ATMOS_PHYSICS1 (315s) | EG_CORRECT_TRACERS (61s) | ATMOS_PHYSICS2 (221s) | EG_SL_HELMHOLTZ (183s) | TR_SET_PHYS_4A* (17s) | EG_SISL_INIT (34s) | SL_TRACER1_4A (46s) | EG_SL_MOISTURE (38s) | EG_SL_FULL_WIND (78s) | EG_Q_TO_MIX (14s) | ATM_STEP_STASH (29s) | ⇓
See profile for ATMOS_PHYSICS1 and EG_CORRECT_TRACERS below | See profile for ATMOS_PHYSICS2 and EG_SL_HELMHOLTZ below | EG_SISL_INIT_UVW (32s) | EG_SL_WIND_U, EG_SL_WIND_V & EG_SL_WIND_W (24 + 24 + 26 = 74s) | EG_SWAP_BOUNDS_DP (64s) | STASH (34s)
Itself (19s) | EG_INTERPOLATION_ETA (103s) | DEPARTURE_POINT_ETA (54s) | See profile for SWAP_BOUNDS_DP below | STWORK (34s)
EG_CUBIC_LAGRANGE (47s, itself) | EG_VERT_WEIGHTS_ETA (13s, itself) | MONO_ENFORCE (7s, itself) | Itself (20s) | SPATIAL (11s) | PP_HEAD (10s) | EXPPXI (6s, itself)
*should also link to SWAP_BOUNDS_DP, like many other routines.

Snr at 3 August 2015

PEs for Jnr were 32x16, so I wouldn't expect it to hold up Snr significantly (dynamical core fields are only passed Snr -> Jnr on the hour in this run). So far the calls to UKCA_MAIN1 and EG_CORRECT_TRACERS_UKCA have been stripped out. Bold numbers show where times are much bigger and underlined numbers show where times are much smaller.

Routines
UM_SHELL (2,107s)
U_MODEL_4A (2,103s)
ATM_STEP_4A* (1,802s) | OASIS3_GET_SNR (127s)
ATMOS_PHYSICS1 (505s) | EG_CORRECT_TRACERS (33s) | ATMOS_PHYSICS2 (392s) | EG_SL_HELMHOLTZ (208s) | TR_SET_PHYS_4A* (85s) | EG_SISL_INIT (37s) | SL_TRACER1_4A (234s) | EG_SL_MOISTURE (37s) | EG_SL_FULL_WIND (81s) | EG_Q_TO_MIX (46s) | ATM_STEP_STASH (23s) | ⇓
See profile for ATMOS_PHYSICS1 and EG_CORRECT_TRACERS below | See profile for ATMOS_PHYSICS2 and EG_SL_HELMHOLTZ below | EG_SISL_INIT_UVW (35s) | EG_SL_WIND_U, EG_SL_WIND_V & EG_SL_WIND_W (23 + 23 + 26 = 72s) | EG_SWAP_BOUNDS_DP (118s) | STASH (26s)
Itself (39s) | EG_INTERPOLATION_ETA (224s) | DEPARTURE_POINT_ETA (58s) | See profile for SWAP_BOUNDS_DP below | STWORK (25s)
EG_CUBIC_LAGRANGE (126s, itself) | EG_VERT_WEIGHTS_ETA (13s, itself) | MONO_ENFORCE (24s, itself) | Itself (20s) | SPATIAL (9s) | PP_HEAD (6s) | EXPPXI (4s, itself)
*should also link to SWAP_BOUNDS_DP, like many other routines.

Summary

  • OASIS3_GET_SNR, 127s
  • ATMOS_PHYSICS1, 190s extra. UKCA_RADAER_BAND_AVERAGE is 75s and UKCA_RADAER_COMPUTE_AOD is 6s, so 80s extra we can't avoid. Still 110s extra.
  • SL_TRACER1_4A is 188s extra.
  • ATMOS_PHYSICS2 is 172s extra.
  • TR_SET_PHYS_4A is 68s extra

This totals 745s, but there are 899s extra (at U_MODEL_4A), so 154s extra is still unaccounted for somewhere.
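The bookkeeping behind these figures can be checked with a short sketch. The numbers are copied from the summary bullets and the two "Routines" tables above; the variable names are mine and purely illustrative.

```python
# Extra time in Snr (3 August 2015) relative to CLASSIC, in seconds,
# as listed in the summary bullets.
extras = {
    "OASIS3_GET_SNR": 127,
    "ATMOS_PHYSICS1": 190,
    "SL_TRACER1_4A": 188,
    "ATMOS_PHYSICS2": 172,
    "TR_SET_PHYS_4A": 68,
}

# U_MODEL_4A totals from the Snr and CLASSIC "Routines" tables.
u_model_snr = 2103
u_model_classic = 1204

accounted = sum(extras.values())
total_extra = u_model_snr - u_model_classic
unaccounted = total_extra - accounted

print(f"accounted for: {accounted}s")
print(f"total extra:   {total_extra}s")
print(f"unaccounted:   {unaccounted}s")
```

Adding a newly identified routine to `extras` shows straight away how much of the gap remains.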

Snr at 3 September 2015

In addition to the above, I've removed (for Snr) the places where TRACER_UKCA/TRACERS_UKCA is added to SUPER_ARRAY, SUPER_TRACER_PHYS1 and SUPER_TRACER_PHYS2.

Routines
UM_SHELL (1,794s)
U_MODEL_4A (1,791s)
ATM_STEP_4A* (1,478s) | OASIS3_GET_SNR (134s)
ATMOS_PHYSICS1 (507s) | EG_CORRECT_TRACERS (33s) | ATMOS_PHYSICS2 (390s) | EG_SL_HELMHOLTZ (205s) | TR_SET_PHYS_4A* (9s) | EG_SISL_INIT (37s) | SL_TRACER1_4A (26s) | EG_SL_MOISTURE (36s) | EG_SL_FULL_WIND (72s) | EG_Q_TO_MIX (46s) | ATM_STEP_STASH (23s) | ⇓
See profile for ATMOS_PHYSICS1 and EG_CORRECT_TRACERS below | See profile for ATMOS_PHYSICS2 and EG_SL_HELMHOLTZ below | EG_SISL_INIT_UVW (35s) | EG_SL_WIND_U, EG_SL_WIND_V & EG_SL_WIND_W (22 + 22 + 24 = 68s) | EG_SWAP_BOUNDS_DP (115s) | STASH (25s)
Itself (19s) | EG_INTERPOLATION_ETA (86s) | DEPARTURE_POINT_ETA (48s) | See profile for SWAP_BOUNDS_DP below | STWORK (25s)
EG_CUBIC_LAGRANGE (37s, itself) | EG_VERT_WEIGHTS_ETA (11s, itself) | MONO_ENFORCE (5s, itself) | Itself (20s) | SPATIAL (9s) | PP_HEAD (7s) | EXPPXI (4s, itself)
*should also link to SWAP_BOUNDS_DP, like many other routines.

Summary

  • Time in SL_TRACER1_4A and TR_SET_PHYS_4A has been greatly reduced, and is now smaller than for CLASSIC - should I worry about this?
  • I now think the extra time in ATMOS_PHYSICS1 is because of UKCA_RADAER_BAND_AVERAGE, as it's very imbalanced.
  • I'm still to remove the extra time spent in ATMOS_PHYSICS2, and I should look at EG_Q_TO_MIX.

Snr at 7 September 2015

In addition to the above, I've removed (for Snr) the places where UKCA_TRACERS is added to TOT_TRACER.

Routines
UM_SHELL (1,587s)
U_MODEL_4A (1,576s)
ATM_STEP_4A* (1,256s) | DUMPCTL (68s) | MEANCTL (37s) | OASIS3_GET_SNR (136s)
ATMOS_PHYSICS1 (507s) | EG_CORRECT_TRACERS (33s) | ATMOS_PHYSICS2 (193s) | EG_SL_HELMHOLTZ (186s) | TR_SET_PHYS_4A* (9s) | EG_SISL_INIT (34s) | SL_TRACER1_4A (26s) | EG_SL_MOISTURE (36s) | EG_SL_FULL_WIND (72s) | EG_Q_TO_MIX (44s) | ⇓ | STASH (25s) | UM_WRITDUMP (62s) | ACUMPS (42s)
See profile for ATMOS_PHYSICS1 and EG_CORRECT_TRACERS below | See profile for ATMOS_PHYSICS2 and EG_SL_HELMHOLTZ below | EG_SISL_INIT_UVW (33s) | EG_SL_WIND_U, EG_SL_WIND_V & EG_SL_WIND_W (22 + 22 + 24 = 68s) | EG_SWAP_BOUNDS_DP (112s) | STWORK (25s) | GENERAL_GATHER_FIELD (105s)
Itself (19s) | EG_INTERPOLATION_ETA (86s) | DEPARTURE_POINT_ETA (48s) | See profile for SWAP_BOUNDS_DP below | STASH_GATHER_FIELD (104s)
EG_CUBIC_LAGRANGE (37s, itself) | EG_VERT_WEIGHTS_ETA (11s, itself) | MONO_ENFORCE (5s, itself) | Itself (20s) | GATHER_FIELD (104s)
GATHER_FIELD_MPL (104s, itself)
*should also link to SWAP_BOUNDS_DP, like many other routines.

Summary

  • ATM_STEP_4A is now only 1,256s compared to the 1,142s for CLASSIC, so it's only 114s more. RADAER is a lot more than that. However, U_MODEL_4A is 1,576s compared to 1,204s for CLASSIC, so 372s more. 136s comes from OASIS3_GET_SNR, but this still leaves around 236s extra unaccounted for.
    • For CLASSIC the times in DUMPCTL and MEANCTL are 21s and 8s, so a total of 29s - whereas here we have 68s and 37s, so a total of 105s, which is 76s more. We can probably lose all this time by removing the stash requests for section 34 and 38 fields from Snr UM - particularly those with usage UPMEAN.
      • By removing the UPMEAN stash requests for section 34 and 38 fields, the time in DUMPCTL and MEANCTL is reduced to 36s and 9s respectively, a total of 45s - 16s more than CLASSIC.
    • INITIAL_4A is 66s here and 27s for CLASSIC, so 39s more
      • Most of this is the 46s in INITDUMP -> UM_READDUMP -> UM_READ_MULTI (16s for CLASSIC). This then calls SCATTER_FIELD_MPL and GENERAL_SCATTER_FIELD, which are called by other routines.
      • We want the space initialising, but we don't need the UKCA fields scattering - because we'll be overwriting them later. However, it looks like these routines just scatter the D1 array and are pretty much blind as to whether they're scattering UKCA fields or not.
  • Not really shown here is that the time in SWAP_BOUNDS is 149s for CLASSIC, whereas here it's 224s.
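As a rough cross-check of the bullets above, the sketch below itemises the U_MODEL_4A difference by the routines for which both a CLASSIC and a Snr time are quoted. It assumes these routines are all called directly from U_MODEL_4A and do not overlap, and that CLASSIC spends no time in OASIS3_GET_SNR (the notes treat the whole 136s as extra); both are my assumptions, not something the profile states.

```python
# Times in seconds for CLASSIC and for Snr at 7 September 2015.
# OASIS3_GET_SNR = 0 for CLASSIC is an assumption.
classic = {
    "ATM_STEP_4A": 1142,
    "DUMPCTL": 21,
    "MEANCTL": 8,
    "INITIAL_4A": 27,
    "OASIS3_GET_SNR": 0,
}
snr = {
    "ATM_STEP_4A": 1256,
    "DUMPCTL": 68,
    "MEANCTL": 37,
    "INITIAL_4A": 66,
    "OASIS3_GET_SNR": 136,
}

u_model_extra = 1576 - 1204  # U_MODEL_4A, Snr minus CLASSIC

# Per-routine extra time, then whatever is left over.
extra = {name: snr[name] - classic[name] for name in snr}
residual = u_model_extra - sum(extra.values())

for name, secs in extra.items():
    print(f"{name:15s} +{secs}s")
print(f"{'residual':15s} +{residual}s of {u_model_extra}s")
```

Under those assumptions most of the 372s is attributable to the listed routines, with ATM_STEP_4A and OASIS3_GET_SNR the largest contributors.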

Snr at 8 September 2015

Routines
UM_SHELL (1,532s)
U_MODEL_4A (1,525s)
ATM_STEP_4A* (1,260s) | DUMPCTL (36s) | MEANCTL (9s) | OASIS3_GET_SNR (141s)
ATMOS_PHYSICS1 (510s) | EG_CORRECT_TRACERS (33s) | ATMOS_PHYSICS2 (193s) | EG_SL_HELMHOLTZ (187s) | TR_SET_PHYS_4A* (9s) | EG_SISL_INIT (35s) | SL_TRACER1_4A (26s) | EG_SL_MOISTURE (36s) | EG_SL_FULL_WIND (72s) | EG_Q_TO_MIX (42s) | ⇓ | STASH (25s) | UM_WRITDUMP (36s) | ACUMPS (9s)
See profile for ATMOS_PHYSICS1 and EG_CORRECT_TRACERS below | See profile for ATMOS_PHYSICS2 and EG_SL_HELMHOLTZ below | EG_SISL_INIT_UVW (33s) | EG_SL_WIND_U, EG_SL_WIND_V & EG_SL_WIND_W (22 + 22 + 24 = 68s) | EG_SWAP_BOUNDS_DP (114s) | STWORK (24s) | GENERAL_GATHER_FIELD (45s)
Itself (19s) | EG_INTERPOLATION_ETA (86s) | DEPARTURE_POINT_ETA (48s) | See profile for SWAP_BOUNDS_DP below | STASH_GATHER_FIELD (44s)
EG_CUBIC_LAGRANGE (37s, itself) | EG_VERT_WEIGHTS_ETA (11s, itself) | MONO_ENFORCE (5s, itself) | Itself (20s) | GATHER_FIELD (45s)
GATHER_FIELD_MPL (45s, itself)
*should also link to SWAP_BOUNDS_DP, like many other routines.

Removing the section 34 and 38 fields with usage UPMEAN produced a run with 1,532s in UM_SHELL, so 27% slower than the CLASSIC run.

ATMOS_PHYSICS1 & EG_CORRECT_TRACERS

CLASSIC

ATMOS_PHYSICS1 (315s) | EG_CORRECT_TRACERS (61s)
RAD_CTL (86s) | MICROPHYS_CTL (62s) | NI_GWD_CTL (78s) | ⇓ | ⇓ | EG_MASS_CONSERVATION (48s)
LW_RAD (61s) | SW_RAD (20s) | LS_PPN (56s) | G_WAVE_5A (63s) | GW_USSP (14s) | GLOBAL_2D_SUMS (21s, itself) | Itself (36s)
RADIANCE_CALC (79s) | LS_PPNC (52s) | SWAP_BOUNDS (see table below)
SOLVE_BAND_K_EQV (60s) | LSP_ICE (40s) | Itself (13s)
MCICA_SAMPLE (47s) | SCALE_ABSORB (8s) | LSP_INIT (9s) | LSP_FALL (8s)
MONOCHROMATIC_RADIANCE (40s) | Itself (7s) | Itself (5s) | Itself (7s)
MONOCHROMATIC_RADIANCE_TSEQ (35s)
MCICA_COLUMN (35s)
TWO_COEFF (24s)
TRANS_SOURCE_COEFF (14s) | Itself (4s)
Itself (9s)
  

Snr UM

ATMOS_PHYSICS1 (505s) | EG_CORRECT_TRACERS (34s)
RAD_CTL (161s) | MICROPHYS_CTL (59s) | NI_GWD_CTL (201s) | ⇓ | ⇓ | EG_MASS_CONSERVATION (31s)
LW_RAD (116s) | SW_RAD (38s) | LS_PPN (54s) | G_WAVE_5A (171s) | GW_USSP (30s) | GLOBAL_2D_SUMS (28s, itself) | Itself (23s)
RADIANCE_CALC (151s) | LS_PPNC (51s) | SWAP_BOUNDS (see table below)
UKCA_RADAER_BAND_AVERAGE (76s, itself) | SOLVE_BAND_K_EQV (59s) | UKCA_RADAER_COMPUTE_AOD (6s, itself) | LSP_ICE (39s) | Itself (13s)
MCICA_SAMPLE (46s) | SCALE_ABSORB (8s) | LSP_INIT (9s) | LSP_FALL (8s)
MONOCHROMATIC_RADIANCE (40s) | Itself (6s) | Itself (5s) | Itself (7s)
MONOCHROMATIC_RADIANCE_TSEQ (35s)
MCICA_COLUMN (34s)
TWO_COEFF (24s)
TRANS_SOURCE_COEFF (14s) | Itself (4s)
Itself (9s)
  

The extra time in NI_GWD_CTL may just be a consequence of a barrier and the massive imbalance in UKCA_RADAER_BAND_AVERAGE, where the times vary from 17s to 133s (and the imbalance through the day is likely to be much higher).
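To illustrate how an imbalanced routine followed by a barrier pushes time into whatever comes next, here is a minimal sketch. Only the extremes (17s and 133s) come from the profile; the per-PE values in between are invented, and where the barrier actually sits in the UM is an assumption.

```python
from statistics import mean

# Hypothetical per-PE times (seconds) in UKCA_RADAER_BAND_AVERAGE.
# Only the minimum (17s) and maximum (133s) are from the profile;
# the other values are made up for illustration.
pe_times = [17, 25, 40, 60, 76, 95, 110, 133]

slowest = max(pe_times)
average = mean(pe_times)

# At a barrier every PE waits for the slowest one, so each PE idles
# for (slowest - its own time). A profiler charges that idle time to
# the routine containing the barrier, not to the imbalanced routine.
wait_times = [slowest - t for t in pe_times]

print(f"mean time in routine: {average:.1f}s")
print(f"slowest PE:           {slowest}s")
print(f"imbalance (max/mean): {slowest / average:.2f}")
print(f"mean wait at barrier: {mean(wait_times):.1f}s")
print(f"longest wait:         {max(wait_times)}s")
```

With real per-PE times in place of the invented list, the mean wait gives a rough idea of how much of the extra time downstream could be waiting rather than work.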

ATMOS_PHYSICS2

CLASSIC

ATMOS_PHYSICS2 (221s)
NI_CONV_CTL (95s) | NI_IMP_CTL (49s) | SWAP_BOUNDS, SWAP_BOUNDS_2D_MV & SWAP_BOUNDS_MV (see table below)
GLUE_CONV_5A (70s) | IMP_SOLVER (24s)
Itself (30s) | MID_CONV_5A (9s)
Itself (8s)

Snr UM

ATMOS_PHYSICS2 (392s)
NI_CONV_CTL (212s) | NI_IMP_CTL (73s) | SWAP_BOUNDS, SWAP_BOUNDS_2D_MV & SWAP_BOUNDS_MV (see table below)
GLUE_CONV_5A (166s) | IMP_SOLVER (40s)
Itself (104s) | MID_CONV_5A (38s)
Itself (18s)

All the parts of ATMOS_PHYSICS2 look bigger for my Snr UM.

Plume scavenging

Colin mentioned that this needed removing, although the total time in UKCA_SCAVENGING_MOD.UKCA_PLUME_SCAV is less than 1s.

Extra parts of code for full chemistry compared to CLASSIC

Extra part | Do I want to keep it | Flags controlling it
RADAER code | Yes | l_ukca_radaer
Extra storage space | Most, if not all | I think fields need to be present in the start dump and pass the test in TSTMSK, which uses many of the logicals in this table
UKCA_MAIN1 and code below | No | l_ukca
EG_CORRECT_TRACERS_UKCA | No | l_tracer, l_conserve_ukca_with_tr = .false.
Section 1.10 in NI_CONV_CTL | No | l_biomass, l_dust, l_ocff, l_soot, l_sulp_nh3, l_sulp_so2, l_use_cariolle, tr_ukca, tr_vars
Number of sections in SL_TRACER_4A | No |
UKCA_PLUME_SCAV | No, but it's small | l_tracer .AND. l_ukca .AND. l_ukca_plume_scav .AND. npnts > 0

For the CLASSIC job, gadga, most of the flags in &RUN_Aerosol are TRUE, including L_BIOMASS, L_OCFF, L_SOOT, L_SULPC_NH3 and L_SULPC_SO2.

Overall summary

Estimate of extra components of full chemistry compared to CLASSIC (1,209s in UM_SHELL).

Components | % extra
What can be removed from Snr
Chemistry scheme/UKCA_MAIN1 (+4,878s**) | +400%
Advection of UKCA fields/EG_CORRECT_TRACERS_UKCA (+238s*) | +31%
SL transport of UKCA fields/SL_TRACER1_4A & TR_SET_PHYS_4A (+256s) | +21%
Convective transport of UKCA fields/NI_CONV_CTL (+171s) | +14%
Meaning of UKCA fields for diagnostics (+60s) | +5%
What is difficult to remove from Snr (done through D1 and blind to section)
Scatter and gather of UKCA fields (+45s) | +4%
What can't be removed from Snr
RADAER/UKCA_RADAER_BAND_AVERAGE (+190s) | +16%
Receive/wait for coupling fields/OASIS3_GET_SNR (+140s) | +12%
* Estimate: assume from profiling for full chemistry that EG_CORRECT_TRACERS_UKCA is 1.58 times bigger than SL_TRACER1_4A, or 238s.
** From p95 in UKCA III that full chemistry is 3.08 times slower than my top Snr, so a full chemistry run is expected to take (2,107+238)*3.08=7,223s. Take off (2,107+238) to give time in UKCA_MAIN1 of 4,878s.
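The ** estimate and the "% extra" column can be reproduced with the sketch below, which takes each extra as a percentage of the 1,209s CLASSIC time in UM_SHELL. That basis is my reading of the table header, not something stated explicitly, and the dictionary keys are shortened labels of my own.

```python
classic_um_shell = 1209          # CLASSIC time in UM_SHELL (s)
snr_um_shell = 2107              # top Snr run, 3 August 2015 (s)
eg_correct_tracers_ukca = 238    # estimate from the * footnote (s)
full_chem_factor = 3.08          # from the ** footnote

# Footnote **: scale (Snr + advection) up to a full chemistry run,
# then subtract it again to leave the time in UKCA_MAIN1.
snr_with_advection = snr_um_shell + eg_correct_tracers_ukca
full_chem_total = snr_with_advection * full_chem_factor
ukca_main1 = full_chem_total - snr_with_advection

extras = {
    "UKCA_MAIN1": ukca_main1,
    "EG_CORRECT_TRACERS_UKCA": eg_correct_tracers_ukca,
    "SL_TRACER1_4A & TR_SET_PHYS_4A": 256,
    "NI_CONV_CTL": 171,
    "Meaning for diagnostics": 60,
    "Scatter and gather": 45,
    "RADAER": 190,
    "OASIS3_GET_SNR": 140,
}

# Percentage of the CLASSIC UM_SHELL time, rounded to whole percent.
percent = {name: round(100 * secs / classic_um_shell)
           for name, secs in extras.items()}

print(f"full chemistry estimate: {full_chem_total:.0f}s")
print(f"UKCA_MAIN1 estimate:     {ukca_main1:.0f}s")
for name, pct in percent.items():
    print(f"{name:32s} +{pct}%")
```

On this basis most rows match the quoted percentages. The EG_CORRECT_TRACERS_UKCA row comes out nearer +20% than the +31% quoted, so that row may be using a different figure.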

My final run is overall 26% slower than CLASSIC (but results not checked as cumf is giving a memory fault).

Changing PEs for Jnr

Jnr PEs | Total time for Snr in UM_SHELL (s) | Total time for Snr in OASIS3_GET_SNR (s)
48x28=1,344 | 1,506 | 107
32x16=512 | 1,532 | 141
16x24=384 | 1,740 | 334
16x16=256 | 2,413 | 1,016
8x24=192 | 3,075 | 1,686
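One way to read this table is to subtract the time in OASIS3_GET_SNR from UM_SHELL for each Jnr decomposition, which leaves the time Snr spends outside the coupling receive. The sketch below does that with the figures from the table; treating the difference as "Snr's own time" is my interpretation.

```python
# (Jnr decomposition, Snr UM_SHELL time, Snr OASIS3_GET_SNR time), in seconds.
runs = [
    ("48x28", 1506, 107),
    ("32x16", 1532, 141),
    ("16x24", 1740, 334),
    ("16x16", 2413, 1016),
    ("8x24", 3075, 1686),
]

own_times = {}
for layout, um_shell, oasis_get in runs:
    # Total Jnr PEs from the "NxM" layout string.
    nx, ny = (int(n) for n in layout.split("x"))
    own = um_shell - oasis_get
    own_times[layout] = own
    print(f"{layout:6s} ({nx * ny:5d} PEs): UM_SHELL {um_shell}s, "
          f"outside OASIS3_GET_SNR {own}s")
```

The time outside OASIS3_GET_SNR stays close to 1,400s in every run, which suggests the growth in UM_SHELL with fewer Jnr PEs is almost entirely Snr waiting for coupling fields.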