Profiling offline oxidants at N216

I want to determine why the code's performance drops off as the number of processor elements (PEs) increases, by finding which parts of the code scale well to more processors and which don't. To do this I'm profiling three runs (all facep): one at the peak and one on either side of it, with fewer PEs and with more PEs.

  • 32x32 (1,024 PEs), ran at 1.45 model years per day
  • 64x36 (2,304 PEs), ran at 1.93 model years per day (the current best)
  • 64x64 (4,096 PEs), ran at 1.46 model years per day, so very similar in speed to the first run, but with four times the number of PEs
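To put numbers on the drop-off, here is a minimal sketch that turns the three throughput figures above into a speedup and a parallel efficiency relative to the 32x32 run. The function name `scaling_table` and the choice of 32x32 as the baseline are my own assumptions for illustration; only the PE counts and model years per day come from the runs.

```python
# Throughput figures copied from the list of runs above:
# (decomposition, number of PEs, model years per day).
runs = [
    ("32x32", 1024, 1.45),
    ("64x36", 2304, 1.93),
    ("64x64", 4096, 1.46),
]

def scaling_table(runs):
    """Return (label, speedup, efficiency) relative to the first run."""
    _, base_pes, base_rate = runs[0]
    rows = []
    for label, pes, rate in runs:
        speedup = rate / base_rate   # how much faster than the baseline run
        ideal = pes / base_pes       # speedup expected under perfect scaling
        rows.append((label, speedup, speedup / ideal))
    return rows

for label, speedup, efficiency in scaling_table(runs):
    print(f"{label}: speedup {speedup:.2f}x, parallel efficiency {efficiency:.0%}")
```

On these figures the 64x36 run comes out at about 59% efficiency and the 64x64 run at about 25%, relative to 32x32.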

The top profile tree

These profiles show the total time in each routine, i.e. the time within the routine and its children, unless it specifically says 'itself', in which case it is only the time within the routine. A lot of simplifications have been made by removing most of the routines that take less than 40s.

32x32

Routines
UM_SHELL (1,596s)
U_MODEL_4A (1,573s)
ATM_STEP_4A* (1,008s) | UKCA_MAIN1 (319s) | DUMPCTL (62s) | MEANCTL (42s)
ATMOS_PHYSICS1 (264s) | ATMOS_PHYSICS2 (119s) | EG_SL_HELMHOLTZ (137s) | EG_CORRECT_TRACERS_UKCA (58s) | SL_TRACER1_4A (62s) | EG_SL_FULL_WIND (63s) | EG_SWAP_BOUNDS_DP (133s) | STASH (91s) | UM_WRITDUMP (62s) | ACUMPS (42s)
EG_SL_WIND_U, EG_SL_WIND_V & EG_SL_WIND_W (15 + 14 + 21 = 50s) | See profile for SWAP_BOUNDS_DP below | STWORK (91s) | GENERAL_GATHER_FIELD (104s)
Itself (35s) | EG_INTERPOLATION_ETA (70s) | DEPARTURE_POINT_ETA (53s) | STASH_GATHER_FIELD (103s)
Itself (26s, itself) | EG_CUBIC_LAGRANGE (27s, itself) | MONO_ENFORCE (5s, itself) | Itself (10s) | GATHER_FIELD (104s)
GATHER_FIELD_MPL (104s, itself)

64x36

Routines
UM_SHELL (1,111s)
U_MODEL_4A (1,072s)
ATM_STEP_4A* (569s) | UKCA_MAIN1 (179s) | DUMPCTL (115s) | MEANCTL (79s)
ATMOS_PHYSICS1 (126s) | ATMOS_PHYSICS2 (63s) | EG_SL_HELMHOLTZ (103s) | EG_CORRECT_TRACERS_UKCA (24s) | SL_TRACER1_4A (39s) | EG_SL_FULL_WIND (47s) | EG_SWAP_BOUNDS_DP (72s) | STASH (87s) | UM_WRITDUMP (115s) | ACUMPS (79s)
EG_SL_WIND_U, EG_SL_WIND_V & EG_SL_WIND_W (10 + 10 + 18 = 38s) | See profile for SWAP_BOUNDS_DP below | STWORK (87s) | GENERAL_GATHER_FIELD (194s)
Itself (13s) | EG_INTERPOLATION_ETA (40s) | DEPARTURE_POINT_ETA (42s) | STASH_GATHER_FIELD (193s)
Itself (26s, itself) | EG_CUBIC_LAGRANGE (13s, itself) | MONO_ENFORCE (3s, itself) | Itself (5s) | GATHER_FIELD (194s)
GATHER_FIELD_MPL (194s, itself)

64x64

Routines
UM_SHELL (1,482s)
U_MODEL_4A (1,207s)
ATM_STEP_4A* (478s) | UKCA_MAIN1 (131s) | DUMPCTL (232s) | MEANCTL (156s)
ATMOS_PHYSICS1 (82s) | ATMOS_PHYSICS2 (42s) | EG_SL_HELMHOLTZ (103s) | EG_CORRECT_TRACERS_UKCA (18s) | SL_TRACER1_4A (32s) | EG_SL_FULL_WIND (48s) | EG_SWAP_BOUNDS_DP (67s) | STASH (85s) | UM_WRITDUMP (232s) | ACUMPS (156s)
EG_SL_WIND_U, EG_SL_WIND_V & EG_SL_WIND_W (9 + 9 + 22 = 40s) | See profile for SWAP_BOUNDS_DP below | STWORK (85s) | GENERAL_GATHER_FIELD (388s)
Itself (8s) | EG_INTERPOLATION_ETA (28s) | DEPARTURE_POINT_ETA (50s) | STASH_GATHER_FIELD (386s)
Itself (15s, itself) | EG_CUBIC_LAGRANGE (7s, itself) | MONO_ENFORCE (2s, itself) | Itself (3s) | GATHER_FIELD (388s)
GATHER_FIELD_MPL (388s, itself)
* ATM_STEP_4A should also link to SWAP_BOUNDS_DP, like many other routines.
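A quick way to see which routines scale is to compare the 32x32 and 64x64 times directly. The sketch below does that for a handful of routines from the trees above; the selection of routines and the helper name `speedup` are my own choices, and the ideal figure of 4x simply reflects the fourfold increase in PEs.

```python
# Total time in seconds per routine, copied from the profile trees above,
# in the order (32x32, 64x36, 64x64).
times = {
    "ATM_STEP_4A": (1008, 569, 478),
    "UKCA_MAIN1": (319, 179, 131),
    "ATMOS_PHYSICS1": (264, 126, 82),
    "EG_SL_HELMHOLTZ": (137, 103, 103),
    "EG_SL_FULL_WIND": (63, 47, 48),
    "DUMPCTL": (62, 115, 232),
    "MEANCTL": (42, 79, 156),
}

def speedup(first_s, last_s):
    """Speedup going from the first run to the last; below 1 means slower."""
    return first_s / last_s

for name, (t_1024, _, t_4096) in times.items():
    s = speedup(t_1024, t_4096)
    verdict = "slower with more PEs" if s < 1 else "faster with more PEs"
    print(f"{name:16s} {s:5.2f}x (ideal 4.00x)  {verdict}")
```

ATMOS_PHYSICS1 gets closest to the ideal at a little over 3x, while DUMPCTL and MEANCTL both come out at roughly 0.27x, i.e. they take nearly four times as long on four times the PEs.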

Profiling for SWAP_* routines

32x32

Routines | Total mean time
EG_SWAP_BOUNDS_DP (133s) | ATMOS_PHYSICS1, ATMOS_PHYSICS2, G_WAVE_5A, ... | 133 + ...
SWAP_BOUNDS & SWAP_BOUNDS_DP (167 + 140 = 307s) | SWAP_BOUNDS_MV (31s, itself) | 338s
SWAP_BOUNDS_EW_DP (113s) | SWAP_BOUNDS_NS_DP (193s, itself) | 337s
SWAP_BOUNDS_EW_H1_DP (85s, itself) | Itself (29s) | 338s

64x36

Routines | Total mean time
EG_SWAP_BOUNDS_DP (72s) | ATMOS_PHYSICS1, ATMOS_PHYSICS2, G_WAVE_5A, ... | 72 + ...
SWAP_BOUNDS & SWAP_BOUNDS_DP (101 + 75 = 176s) | SWAP_BOUNDS_MV (18s, itself) | 194s
SWAP_BOUNDS_EW_DP (69s) | SWAP_BOUNDS_NS_DP (106s, itself) | 193s
SWAP_BOUNDS_EW_H1_DP (45s, itself) | Itself (23s) | 192s

64x64

Routines | Total mean time
EG_SWAP_BOUNDS_DP (67s) | ATMOS_PHYSICS1, ATMOS_PHYSICS2, G_WAVE_5A, ... | 67 + ...
SWAP_BOUNDS & SWAP_BOUNDS_DP (87 + 68 = 155s) | SWAP_BOUNDS_MV (12s, itself) | 167s
SWAP_BOUNDS_EW_DP (50s) | SWAP_BOUNDS_NS_DP (104s, itself) | 166s
SWAP_BOUNDS_EW_H1_DP (34s, itself) | Itself (16s) | 166s

Conclusions

This profiling shows that the time spent in almost all routines continues to decrease as PEs are added, except in the purple routines (DUMPCTL, MEANCTL and the routines below them). EG_SL_FULL_WIND and the routines below it are a slight exception, since their time levels off rather than decreasing. The time in the purple routines seems to scale roughly with the number of PEs. The purple routines start from two routines:

  • MEANCTL: To accumulate partial sums and create time-meaned data. Only runs if lmean set to true.
  • DUMPCTL: Controls the production and naming of output dump files. Also selectively adds dump files to the list of dumps for processing by the external dump server process. Only runs if ldump is set to true.
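To check the claim that these two routines scale roughly with the number of PEs, one option is to divide their time by the PE count and see whether the result stays about constant. This is only a sketch: the helper `ms_per_pe` is a name I made up, and the times are the DUMPCTL and MEANCTL totals from the profile trees.

```python
# PE counts for the 32x32, 64x36 and 64x64 runs.
pes = (1024, 2304, 4096)

# Total time in seconds for the two purple routines, in the same order.
purple = {
    "DUMPCTL": (62, 115, 232),
    "MEANCTL": (42, 79, 156),
}

def ms_per_pe(seconds, pes):
    """Time per PE in milliseconds; roughly constant if time grows linearly with PEs."""
    return [1000 * s / p for s, p in zip(seconds, pes)]

for name, seconds in purple.items():
    per_pe = ms_per_pe(seconds, pes)
    spread = max(per_pe) / min(per_pe)
    values = ", ".join(f"{v:.1f}" for v in per_pe)
    print(f"{name}: {values} ms per PE (max/min = {spread:.2f})")
```

DUMPCTL stays between about 50 and 61 ms per PE and MEANCTL between about 34 and 41 ms per PE, so the time per PE varies by only around 20% while the PE count changes by a factor of four.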

As far as I can gather, both lmean and ldump are only true when creating a dump file and this is only the case for these runs at the final timestep. I don't think we can do without these calls, but we could improve speed by running the code for longer before requesting a dump.
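As a rough estimate of how much a longer run between dumps might help, the sketch below assumes three things that I have not verified: that the DUMPCTL + MEANCTL cost is paid once per run regardless of run length, that the rest of the UM_SHELL time grows in proportion to the simulated length, and that throughput is proportional to simulated length divided by UM_SHELL time. The function `relative_throughput` and the factor `k` (how many times longer the run is) are hypothetical names for illustration.

```python
# (UM_SHELL total, DUMPCTL + MEANCTL) in seconds, from the profile trees above.
profiles = {
    "32x32": (1596, 62 + 42),
    "64x36": (1111, 115 + 79),
    "64x64": (1482, 232 + 156),
}

def relative_throughput(total_s, dump_s, k):
    """Throughput gain if the run simulates k times as long with a single dump.

    Assumes the dump cost is fixed and everything else scales linearly with k.
    """
    compute_s = total_s - dump_s
    return (k * total_s) / (k * compute_s + dump_s)

for label, (total_s, dump_s) in profiles.items():
    gains = ", ".join(
        f"k={k}: {relative_throughput(total_s, dump_s, k):.2f}x" for k in (1, 2, 4, 8)
    )
    print(f"{label}: {gains}")
```

Under these assumptions, a 64x64 run four times as long would be about 1.24x faster per model year, with an upper limit of about 1.35x as the dump cost becomes negligible; the gain for the 32x32 run would be much smaller because the dump cost is a smaller share of its total.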