Profiling offline oxidants at N216
I want to determine why the code's performance actually drops off
as the number of processor elements (PEs) increases, by determining
which parts of the code scale well to more processors and which
don't. To do this I'm profiling three runs (all facep): one at the
peak, and one on either side of it, with fewer PEs and with more
PEs respectively.
- 32x32 (1,024 PEs), ran at 1.45 model years per day
- 64x36 (2,304 PEs), ran at 1.93 model years per day (the current
best)
- 64x64 (4,096 PEs), ran at 1.46 model years per day, so very
similar in speed to the first run, but with four times the number of PEs
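To put numbers on the drop-off, here is a minimal sketch that converts the three throughput figures above into speedup and parallel efficiency relative to the 32x32 run. The function name `relative_scaling` is only illustrative, and using the 32x32 run as the baseline is my choice here rather than anything the profiler reports.

```python
# Throughput figures quoted above: (decomposition, PEs, model years per day).
runs = [
    ("32x32", 1024, 1.45),
    ("64x36", 2304, 1.93),
    ("64x64", 4096, 1.46),
]


def relative_scaling(runs):
    """Return (name, speedup, efficiency) for each run, relative to the first.

    speedup    = throughput / baseline throughput
    efficiency = speedup / (PEs / baseline PEs)
    """
    _, base_pes, base_rate = runs[0]
    results = []
    for name, pes, rate in runs:
        speedup = rate / base_rate
        efficiency = speedup / (pes / base_pes)
        results.append((name, speedup, efficiency))
    return results


for name, speedup, efficiency in relative_scaling(runs):
    print(f"{name}: speedup {speedup:.2f}x, parallel efficiency {efficiency:.0%}")
```

On these figures the 64x36 run is about 1.33 times faster than 32x32 on 2.25 times the PEs (roughly 59% efficiency), while the 64x64 run is barely faster on four times the PEs (roughly 25% efficiency).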
The top profile tree
These profiles show the total time in each routine, i.e. the time
within the routine and its children, unless it specifically says
`itself', in which case it is only the time within the routine. A lot
of simplifications have been made by removing most of the routines with
less than 40s.
32x32
Routines |
UM_SHELL (1,596s) |
U_MODEL_4A (1,573s) |
ATM_STEP_4A* (1,008s) |
UKCA_MAIN1 (319s) |
DUMPCTL (62s) |
MEANCTL (42s) |
ATMOS_PHYSICS1 (264s) |
ATMOS_PHYSICS2 (119s) |
EG_SL_HELMHOLTZ (137s) |
EG_CORRECT_TRACERS_UKCA (58s) |
SL_TRACER1_4A (62s) |
EG_SL_FULL_WIND (63s) |
EG_SWAP_BOUNDS_DP (133s) |
STASH (91s) |
UM_WRITDUMP (62s) |
ACUMPS (42s) |
EG_SL_WIND_U, EG_SL_WIND_V & EG_SL_WIND_W (15 + 14 + 21 = 50s) |
See profile for SWAP_BOUNDS_DP below |
STWORK (91s) |
GENERAL_GATHER_FIELD (104s) |
Itself (35s) |
EG_INTERPOLATION_ETA (70s) |
DEPARTURE_POINT_ETA (53s) |
STASH_GATHER_FIELD (103s) |
Itself (26s, itself) |
EG_CUBIC_LAGRANGE (27s, itself) |
MONO_ENFORCE (5s, itself) |
Itself (10s) |
GATHER_FIELD (104s) |
GATHER_FIELD_MPL (104s, itself) |
64x36
Routines |
UM_SHELL (1,111s) |
U_MODEL_4A (1,072s) |
ATM_STEP_4A* (569s) |
UKCA_MAIN1 (179s) |
DUMPCTL (115s) |
MEANCTL (79s) |
ATMOS_PHYSICS1 (126s) |
ATMOS_PHYSICS2 (63s) |
EG_SL_HELMHOLTZ (103s) |
EG_CORRECT_TRACERS_UKCA (24s) |
SL_TRACER1_4A (39s) |
EG_SL_FULL_WIND (47s) |
EG_SWAP_BOUNDS_DP (72s) |
STASH (87s) |
UM_WRITDUMP (115s) |
ACUMPS (79s) |
EG_SL_WIND_U, EG_SL_WIND_V & EG_SL_WIND_W (10 + 10 + 18 = 38s) |
See profile for SWAP_BOUNDS_DP below |
STWORK (87s) |
GENERAL_GATHER_FIELD (194s) |
Itself (13s) |
EG_INTERPOLATION_ETA (40s) |
DEPARTURE_POINT_ETA (42s) |
STASH_GATHER_FIELD (193s) |
Itself (26s, itself) |
EG_CUBIC_LAGRANGE (13s, itself) |
MONO_ENFORCE (3s, itself) |
Itself (5s) |
GATHER_FIELD (194s) |
GATHER_FIELD_MPL (194s, itself) |
64x64
Routines |
UM_SHELL (1,482s) |
U_MODEL_4A (1,207s) |
ATM_STEP_4A* (478s) |
UKCA_MAIN1 (131s) |
DUMPCTL (232s) |
MEANCTL (156s) |
ATMOS_PHYSICS1 (82s) |
ATMOS_PHYSICS2 (42s) |
EG_SL_HELMHOLTZ (103s) |
EG_CORRECT_TRACERS_UKCA (18s) |
SL_TRACER1_4A (32s) |
EG_SL_FULL_WIND (48s) |
EG_SWAP_BOUNDS_DP (67s) |
STASH (85s) |
UM_WRITDUMP (232s) |
ACUMPS (156s) |
EG_SL_WIND_U, EG_SL_WIND_V & EG_SL_WIND_W (9 + 9 + 22 = 40s) |
See profile for SWAP_BOUNDS_DP below |
STWORK (85s) |
GENERAL_GATHER_FIELD (388s) |
Itself (8s) |
EG_INTERPOLATION_ETA (28s) |
DEPARTURE_POINT_ETA (50s) |
STASH_GATHER_FIELD (386s) |
Itself (15s, itself) |
EG_CUBIC_LAGRANGE (7s, itself) |
MONO_ENFORCE (2s, itself) |
Itself (3s) |
GATHER_FIELD (388s) |
GATHER_FIELD_MPL (388s, itself) |
*ATM_STEP_4A should also link to SWAP_BOUNDS_DP, like many other
routines.
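One way to compare how individual routines in the trees above scale is to fit a straight line to log(time) against log(PEs) across the three runs. The sketch below does that for a hand-picked selection of routines. The selection and the helper name `scaling_exponent` are mine, and three points is very little data, so the slopes are only a rough guide. A slope of -1 would be ideal scaling, 0 means extra PEs make no difference, and +1 means the time grows in proportion to the number of PEs.

```python
import math

pes = [1024, 2304, 4096]  # 32x32, 64x36, 64x64

# Inclusive times in seconds, copied from the three profile trees above.
times = {
    "ATM_STEP_4A": [1008, 569, 478],
    "UKCA_MAIN1": [319, 179, 131],
    "EG_SL_HELMHOLTZ": [137, 103, 103],
    "DUMPCTL": [62, 115, 232],
    "MEANCTL": [42, 79, 156],
}


def scaling_exponent(pes, secs):
    """Least-squares slope of log(time) against log(PEs)."""
    xs = [math.log(p) for p in pes]
    ys = [math.log(t) for t in secs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


for routine, secs in times.items():
    print(f"{routine:16s} slope {scaling_exponent(pes, secs):+.2f}")
```

With the numbers above, ATM_STEP_4A and UKCA_MAIN1 come out with negative slopes (they speed up, though less than ideally), while DUMPCTL and MEANCTL come out a little under +1.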
Profiling for SWAP_* routines
32x32
Routines | Total mean time
EG_SWAP_BOUNDS_DP (133s) | ATMOS_PHYSICS1, ATMOS_PHYSICS2, G_WAVE_5A, ... | 133 + ...
SWAP_BOUNDS & SWAP_BOUNDS_DP (167 + 140 = 307s) | SWAP_BOUNDS_MV (31s, itself) | 338s
SWAP_BOUNDS_EW_DP (113s) | SWAP_BOUNDS_NS_DP (193s, itself) | 337s
SWAP_BOUNDS_EW_H1_DP (85s, itself) | Itself (29s) | 338s
64x36
Routines | Total mean time
EG_SWAP_BOUNDS_DP (72s) | ATMOS_PHYSICS1, ATMOS_PHYSICS2, G_WAVE_5A, ... | 72 + ...
SWAP_BOUNDS & SWAP_BOUNDS_DP (101 + 75 = 176s) | SWAP_BOUNDS_MV (18s, itself) | 194s
SWAP_BOUNDS_EW_DP (69s) | SWAP_BOUNDS_NS_DP (106s, itself) | 193s
SWAP_BOUNDS_EW_H1_DP (45s, itself) | Itself (23s) | 192s
64x64
Routines | Total mean time
EG_SWAP_BOUNDS_DP (67s) | ATMOS_PHYSICS1, ATMOS_PHYSICS2, G_WAVE_5A, ... | 67 + ...
SWAP_BOUNDS & SWAP_BOUNDS_DP (87 + 68 = 155s) | SWAP_BOUNDS_MV (12s, itself) | 167s
SWAP_BOUNDS_EW_DP (50s) | SWAP_BOUNDS_NS_DP (104s, itself) | 166s
SWAP_BOUNDS_EW_H1_DP (34s, itself) | Itself (16s) | 166s
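As a sanity check on the SWAP_* numbers, the sketch below re-adds each level of the breakdown to confirm that the three totals in each table agree, and works out how much of the halo-swap time sits in SWAP_BOUNDS_NS_DP. The short dictionary keys are my own labels, and reading the "Itself" entry as time inside SWAP_BOUNDS_EW_DP is my interpretation of the table layout.

```python
# Times in seconds copied from the SWAP_* tables above.
# Keys are shorthand: SB = SWAP_BOUNDS, SB_DP = SWAP_BOUNDS_DP,
# MV = SWAP_BOUNDS_MV, EW = SWAP_BOUNDS_EW_DP, NS = SWAP_BOUNDS_NS_DP,
# EW_H1 = SWAP_BOUNDS_EW_H1_DP, EW_itself = the "Itself" entry (assumed
# to be time spent inside SWAP_BOUNDS_EW_DP).
swap = {
    "32x32": dict(SB=167, SB_DP=140, MV=31, EW=113, NS=193, EW_H1=85, EW_itself=29),
    "64x36": dict(SB=101, SB_DP=75, MV=18, EW=69, NS=106, EW_H1=45, EW_itself=23),
    "64x64": dict(SB=87, SB_DP=68, MV=12, EW=50, NS=104, EW_H1=34, EW_itself=16),
}


def level_totals(t):
    """Sum each level of the breakdown; the three values should nearly agree."""
    top = t["SB"] + t["SB_DP"] + t["MV"]
    mid = t["EW"] + t["NS"] + t["MV"]
    leaf = t["EW_H1"] + t["EW_itself"] + t["NS"] + t["MV"]
    return top, mid, leaf


for name, t in swap.items():
    totals = level_totals(t)
    ns_share = t["NS"] / totals[0]
    print(f"{name}: level totals {totals}, NS_DP share {ns_share:.0%}")
```

The levels agree to within 2s in every run, so the small differences look like rounding. SWAP_BOUNDS_NS_DP accounts for more than half of the total in all three runs.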
Conclusions
This profiling shows that the time spent in almost all routines
continues to decrease as PEs are added, except in the purple
routines (a slight exception is EG_SL_FULL_WIND and the routines
below it). The time in the purple routines seems to scale roughly
with the number of PEs. The purple routines begin with two routines:
- MEANCTL: To accumulate partial sums and create time-meaned data.
Only runs if lmean set to true.
- DUMPCTL: Controls the production and naming of output dump files.
Also selectively adds dump files to the list of dumps
for processing by the external dump server process.
Only runs if ldump is set to true.
As far as I can gather, both lmean and ldump are only true
when creating a dump file, which for these runs happens only at
the final timestep. I don't think we can do without these
calls, but we could improve the overall speed by running the code for
longer before requesting a dump.
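To get a feel for how much a longer run between dumps might help, here is a rough sketch. It assumes the DUMPCTL + MEANCTL time is a fixed cost paid once per run and that everything else grows linearly with run length. Neither assumption has been tested here, and start-up costs outside U_MODEL_4A are lumped in with the part that scales. The function name `amortised_speedup` and the run-length factor `k` are illustrative.

```python
# (UM_SHELL total, DUMPCTL + MEANCTL) in seconds, from the profile trees above.
profiles = {
    "32x32": (1596, 62 + 42),
    "64x36": (1111, 115 + 79),
    "64x64": (1482, 232 + 156),
}


def amortised_speedup(total_s, dump_s, k):
    """Throughput gain if the run is k times longer but still dumps only once.

    Assumes dump_s is a fixed per-run cost and the remaining
    (total_s - dump_s) scales linearly with run length.
    """
    work_s = total_s - dump_s
    return (k * total_s) / (k * work_s + dump_s)


for name, (total_s, dump_s) in profiles.items():
    gains = ", ".join(
        f"k={k}: {amortised_speedup(total_s, dump_s, k):.2f}x" for k in (1, 2, 4, 8)
    )
    limit = total_s / (total_s - dump_s)
    print(f"{name}: {gains}, limit {limit:.2f}x")
```

Under these assumptions the 64x64 run has the most to gain: about 1.24 times the throughput for a run four times as long, with an upper limit of about 1.35 times if the dump cost were amortised away entirely.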