I've written two branches to add OpenMP to the solver (this is mostly the call over ASAD_CDRIVE in UKCA_CHEMISTRY_CTL) and to FASTJX; see UM tickets 3185 and 3450. Here, I want to profile how effective these have been.
I've run two jobs with Dr Hook
And both jobs have
All the times given in the tables below are mean total times, where the total time is the time spent in a routine plus all the routines it calls. Where it is just the time in a given routine, I've written `itself'. For a given routine, mean times will inevitably hide asymmetry across PEs, which is likely to flatter routines whose times vary significantly across PEs. For example, a routine which is very active in sunlight and very inactive in the dark may take 10s to complete in sunlight and 2s to complete in the dark. Hence, half its PEs will take 10s and the other half will take 2s, which is an average of 6s. However, a second routine following this with an MPI barrier may be very cheap, but the PEs in the dark will have to wait in this second routine, so this waiting time will be attributed to the second routine when it would probably be fairer to attribute it to the first.
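The sunlight/dark example can be put into numbers with a small stand-alone program. This is only an illustrative sketch: the four PEs, the array names and the assumption that every PE waits for the slowest one at the barrier are mine, not taken from the profiled jobs.

```fortran
PROGRAM mean_hides_imbalance
  ! Hypothetical illustration: 4 PEs, two in sunlight and two in the dark.
  IMPLICIT NONE
  INTEGER, PARAMETER :: n_pe = 4
  REAL :: t_first(n_pe)   ! time each PE spends working in the first routine (s)
  REAL :: t_wait(n_pe)    ! time each PE then waits at the barrier in the second routine (s)

  ! Sunlit PEs take 10s, dark PEs take 2s (the figures used in the text)
  t_first = [10.0, 10.0, 2.0, 2.0]

  ! Assumption: at the barrier every PE waits for the slowest PE
  t_wait = MAXVAL(t_first) - t_first

  WRITE(*,'(A,F5.1)') 'Mean time in first routine : ', SUM(t_first) / n_pe
  WRITE(*,'(A,F5.1)') 'Mean wait in second routine: ', SUM(t_wait) / n_pe
  WRITE(*,'(A,F5.1)') 'Wall-clock cost of the pair: ', MAXVAL(t_first)
END PROGRAM mean_hides_imbalance
```

With these inputs the first routine is reported at a mean of 6s and the second picks up a mean of 4s of pure waiting, even though the 10s wall-clock cost is set entirely by the first routine.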
Where I've given two times, they are the times for each thread.
GA7.1 + StratTrop without branches
GA7.1 + StratTrop with branches
Comparing the two runs
Routines | ||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
UKCA_MAIN* (1,249s) | ||||||||||||||||||||||||
UKCA_AERO_CTL (188s) | UKCA_ACTIVATE (73s) | UKCA_CHEMISTRY_CTL (431s) | UKCA_FASTJX (200s) | UKCA_NEW_EMISS_CTL (86s) | ||||||||||||||||||||
UKCA_AERO_STEP (178s) | UKCA_ABDULRAZZAK_GHAN (71s) | ASAD_CDRIVE (401s) | UKCA_STRAT_PHOTOL (21s) | FASTJX_PHOTOJ (199s) | UKCA_ADD_EMISS (67s) | |||||||||||||||||||
UKCA_COAGWITHNUCL (37s, 36s) | UKCA_CONDEN (59s) | UKCA_VOLUME_MODE (23s) | Itself (69s) | ASAD_SPMJPDRIV (368s) | ⇓ | INIJTAB (20s) | FASTJX_OPMIE (105s) | FLINT (46s, itself) | Itself (31s) | TRSRCE (37s, itself) | TR_MIX+ (21s) | |||||||||||||
Itself (28s, 28s) | UKCA_SOLVECOAGNUCL_V (8s, 8s, itself) | UKCA_COND_COFF_V (46s, itself) | Itself (13s) | Itself (8s) | ASAD_SPIMPMJP (362s) | ⇓ | SETTAB (20s) | FASTJX_MIESCT (65s) | Itself (40s) | IMP_MIX (17s, itself) | ||||||||||||||
SPLINSLV2 (163s, itself) | SPFULJAC (110s, itself) | Itself (36s) | ASAD_DIFFUN (49s) | Itself (4s) | BLKSLV (65s) | |||||||||||||||||||
ASAD_PRLS (49s, itself) | Itself (35s) | MATINW (15s, itself) | ||||||||||||||||||||||
Routines | ||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
UKCA_MAIN* (985s) | ||||||||||||||||||||||||
UKCA_AERO_CTL (186s) | UKCA_ACTIVATE (74s) | UKCA_CHEMISTRY_CTL (237s) | UKCA_FASTJX (133s) | UKCA_NEW_EMISS_CTL (83s) | ||||||||||||||||||||
UKCA_AERO_STEP (176s) | UKCA_ABDULRAZZAK_GHAN (72s) | ASAD_CDRIVE (203s, 203s) | UKCA_STRAT_PHOTOL (25s, 25s) | FASTJX_PHOTOJ (132s) | UKCA_ADD_EMISS (66s) | |||||||||||||||||||
UKCA_COAGWITHNUCL (35s, 37s) | UKCA_CONDEN (57s) | UKCA_VOLUME_MODE (23s) | Itself (70s) | ASAD_SPMJPDRIV (186s, 186s) | ⇓ | INIJTAB (13s, 11s) | FASTJX_OPMIE (76s, 76s) | FLINT (28s, 27s, itself) | Itself (19s) | TRSRCE (35s, itself) | TR_MIX+ (21s) | |||||||||||||
Itself (28s, 29s) | UKCA_SOLVECOAGNUCL_V (8s, 8s, itself) | UKCA_COND_COFF_V (46s, itself) | Itself (11s) | Itself (8s) | ASAD_SPIMPMJP (183s, 183s) | ⇓ | SETTAB (13s, 11s) | FASTJX_MIESCT (55s, 55s) | Itself (21s, 21s) | IMP_MIX (17s, itself) | ||||||||||||||
SPLINSLV2 (84s, 84s, itself) | SPFULJAC (55s, 55s, itself) | Itself (17s) | ASAD_DIFFUN (25s, 25s) | Itself (2s, 2s) | BLKSLV (55s, 55s) | |||||||||||||||||||
ASAD_PRLS (25s, 25s, itself) | Itself (36s, 36s) | MATINW (8s, 8s, itself) | ||||||||||||||||||||||
Comparing the two runs
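As a rough way of reading the two tables above side by side, the sketch below divides the "without branches" time by the "with branches" time for a few of the routines the branches touch. The times are copied from those tables; the program and its variable names are only illustrative, and treating 2 as the ideal speed-up assumes the runs used two OpenMP threads, which is what the pairs of per-thread times suggest.

```fortran
PROGRAM openmp_speedup
  ! Illustrative only: mean total times (s) copied from the two tables above.
  IMPLICIT NONE
  INTEGER, PARAMETER :: n = 5
  CHARACTER(LEN=18) :: routine(n)
  REAL :: t_before(n), t_after(n)
  INTEGER :: i

  routine = [CHARACTER(LEN=18) :: 'UKCA_MAIN', 'UKCA_CHEMISTRY_CTL', &
             'ASAD_CDRIVE', 'UKCA_FASTJX', 'FASTJX_PHOTOJ']
  t_before = [1249.0, 431.0, 401.0, 200.0, 199.0]   ! without branches
  t_after  = [ 985.0, 237.0, 203.0, 133.0, 132.0]   ! with branches

  ! Speed-up = time without branches / time with branches
  DO i = 1, n
    WRITE(*,'(A,2F8.0,F7.2)') routine(i), t_before(i), t_after(i), &
                              t_before(i) / t_after(i)
  END DO
END PROGRAM openmp_speedup
```

On these figures ASAD_CDRIVE comes out just under 2, FASTJX_PHOTOJ at roughly 1.5, and UKCA_MAIN as a whole at about 1.27, since much of UKCA_MAIN is unchanged by the two branches.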
Some general points
Routines | % of total time in UKCA_MAIN1, facei on IBM | % of total time in UKCA_MAIN1, u-an421 on Cray |
---|---|---|
UKCA_AERO_CTL | 17.5% | 15.1% |
UKCA_ACTIVATE | 3.09% | 5.84% |
UKCA_CHEMISTRY_CTL | 48.5% | 34.5% |
UKCA_FASTJX | 17.7% | 16.0% |
UKCA_EMISSION_CTL/UKCA_NEW_EMISS_CTL | 1.66% | 6.89% |
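The Cray column appears to follow directly from the "without branches" totals listed earlier (each routine's mean total time divided by the 1,249s for UKCA_MAIN). The short check below is a sketch of that arithmetic, not something taken from the original analysis; the program and variable names are my own.

```fortran
PROGRAM ukca_shares
  ! Illustrative check: share of UKCA_MAIN time per routine, using the
  ! "without branches" mean total times (s).
  IMPLICIT NONE
  INTEGER, PARAMETER :: n = 5
  REAL, PARAMETER :: t_main = 1249.0
  CHARACTER(LEN=18) :: routine(n)
  REAL :: t(n)
  INTEGER :: i

  routine = [CHARACTER(LEN=18) :: 'UKCA_AERO_CTL', 'UKCA_ACTIVATE', &
             'UKCA_CHEMISTRY_CTL', 'UKCA_FASTJX', 'UKCA_NEW_EMISS_CTL']
  t = [188.0, 73.0, 431.0, 200.0, 86.0]

  DO i = 1, n
    WRITE(*,'(A,F7.2,A)') routine(i), 100.0 * t(i) / t_main, '%'
  END DO
END PROGRAM ukca_shares
```

The resulting percentages agree with the Cray column to three significant figures.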
    DO k = 1, n_tracers
      ! Add emissions over all model layers except at surface
      DO ilev = 2, model_levels
        CALL trsrce(                                        &
          rows, row_length, 0, 0, 0, 0,                     &
          theta, q, qcl, qcf , exner_rho_levels, rho_r2,    &
          tracers(:,:,ilev,k), em_field(:,:,ilev,k), ilev,  &
          timestep, 1, 1, 0.0)
      END DO

...fairly easily, in ukca_add_emiss_mod.F90, and that should save 20s or so.
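One way such a loop might be threaded is with an OpenMP PARALLEL DO over the tracer index. The program below is a minimal, self-contained sketch and not the change actually made in the branch: `add_source` is a hypothetical stand-in for `trsrce`, the array sizes and timestep are invented, and it assumes the called routine only updates the 2-D slice it is passed and keeps no hidden shared state.

```fortran
PROGRAM omp_tracer_loop
  ! Sketch only: add_source is a hypothetical stand-in for trsrce, and the
  ! array sizes and timestep are invented for illustration.
  IMPLICIT NONE
  INTEGER, PARAMETER :: row_length = 8, rows = 6
  INTEGER, PARAMETER :: model_levels = 10, n_tracers = 4
  REAL, PARAMETER :: timestep = 1200.0
  REAL :: tracers(row_length, rows, model_levels, n_tracers)
  REAL :: em_field(row_length, rows, model_levels, n_tracers)
  INTEGER :: k, ilev

  tracers  = 0.0
  em_field = 1.0e-3

  ! Each (ilev, k) pair updates a distinct 2-D slice, so iterations are
  ! independent and the outer loop can be shared between threads.
!$OMP PARALLEL DO DEFAULT(NONE) PRIVATE(k, ilev) SHARED(tracers, em_field) &
!$OMP SCHEDULE(STATIC)
  DO k = 1, n_tracers
    ! Add emissions over all model layers except at surface
    DO ilev = 2, model_levels
      CALL add_source(tracers(:,:,ilev,k), em_field(:,:,ilev,k), timestep)
    END DO
  END DO
!$OMP END PARALLEL DO

  WRITE(*,*) 'Surface level untouched: ', ALL(tracers(:,:,1,:) == 0.0)
  WRITE(*,*) 'Increment on level 2   : ', tracers(1,1,2,1)

CONTAINS

  SUBROUTINE add_source(field, emission, dt)
    ! Hypothetical simplification: add emission * dt to one 2-D level
    REAL, INTENT(INOUT) :: field(:,:)
    REAL, INTENT(IN)    :: emission(:,:)
    REAL, INTENT(IN)    :: dt
    field = field + emission * dt
  END SUBROUTINE add_source

END PROGRAM omp_tracer_loop
```

Built with OpenMP enabled (for example `gfortran -fopenmp`), the tracer loop is split across threads; without the flag the directives are treated as comments and the loop runs serially with the same result. Parallelising over `k` rather than `ilev` keeps the work per thread coarse, though with few tracers a different split might balance better.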
I've re-run u-an424 with Maff's OpenMP branch, which is already on the trunk, and I've also added OpenMP into ukca_add_emiss_mod.F90 as suggested above (see 3501). Here's the profiling for this run.
Routines | ||
---|---|---|
UM_SHELL (3,676s) | ||
U_MODEL_4A (3,674s) | ||
ATM_STEP_4A (3,544s) | ||
ATMOS_PHYSICS1 (875s) | Everything else (1,769s) | UKCA_MAIN1 (900s) |
Routines | ||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
UKCA_MAIN* (900s) | ||||||||||||||||||||||||
UKCA_AERO_CTL (126s) | UKCA_ACTIVATE (74s) | UKCA_CHEMISTRY_CTL (237s) | UKCA_FASTJX (133s) | UKCA_NEW_EMISS_CTL (57s) | ||||||||||||||||||||
UKCA_AERO_STEP (120s) | UKCA_ABDULRAZZAK_GHAN (72s) | ASAD_CDRIVE (203s, 203s) | UKCA_STRAT_PHOTOL (25s, 25s) | FASTJX_PHOTOJ (132s) | UKCA_ADD_EMISS (37s) | |||||||||||||||||||
UKCA_COAGWITHNUCL (39s, 33s) | UKCA_CONDEN (31s, 26s) | UKCA_VOLUME_MODE (13s, 11s) | Itself (70s) | ASAD_SPMJPDRIV (186s, 186s) | ⇓ | INIJTAB (13s, 12s) | FASTJX_OPMIE (76s, 77s) | FLINT (28s, 27s, itself) | Itself (19s) | TRSRCE (16s, 16s, itself) | TR_MIX+ (15s, 14s) | |||||||||||||
Itself (31s, 26s) | UKCA_SOLVECOAGNUCL_V (9s, 7s, itself) | UKCA_COND_COFF_V (25s, 21s, itself) | Itself (6s, 5s) | Itself (4s, 4s) | ASAD_SPIMPMJP (183s, 183s) | ⇓ | SETTAB (13s, 12s) | FASTJX_MIESCT (55s, 55s) | Itself (21s, 22s) | IMP_MIX (12s, 11s, itself) | ||||||||||||||
SPLINSLV2 (84s, 84s, itself) | SPFULJAC (55s, 55s, itself) | Itself (17s, 17s) | ASAD_DIFFUN (25s, 25s) | Itself (2s, 2s) | BLKSLV (55s, 55s) | |||||||||||||||||||
ASAD_PRLS (25s, 25s, itself) | Itself (36s, 36s) | MATINW (8s, 8s, itself) | ||||||||||||||||||||||
Some general points
From the profiling above, it's clear that we've been able to add OpenMP into all the main parts of UKCA_MAIN1 except UKCA_ACTIVATE. In UM #3506 I outline how I tried to do this, where the summary is
In order to get my branches onto the trunk I need to upgrade my branches to UM10.9, which I've done. I'm now profiling two jobs
And both jobs have
GA7.1 + StratTrop without branches
GA7.1 + StratTrop with branches
This suggests the OpenMP branches have
Routines | ||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
UKCA_MAIN* (1,212s) | ||||||||||||||||||||||||
UKCA_AERO_CTL (106s) | UKCA_ACTIVATE (73s) | UKCA_CHEMISTRY_CTL (432s) | UKCA_FASTJX (201s) | UKCA_EMISS_CTL (82s) | ||||||||||||||||||||
UKCA_AERO_STEP (100s) | UKCA_ABDULRAZZAK_GHAN (71s) | ASAD_CDRIVE (401s) | UKCA_STRAT_PHOTOL (21s) | FASTJX_PHOTOJ (201s) | UKCA_ADD_EMISS (66s) | |||||||||||||||||||
UKCA_COAGWITHNUCL (35s, 30s) | UKCA_CONDEN (31s, 26s) | UKCA_VOLUME_MODE (13s, 11s) | Itself (70s) | ASAD_SPMJPDRIV (368s) | ⇓ | INIJTAB (21s) | FASTJX_OPMIE (106s) | FLINT (46s, itself) | Itself (31s) | TRSRCE (35s, itself) | TR_MIX+ (21s) | |||||||||||||
Itself (27s, 23s) | UKCA_SOLVECOAGNUCL_V (7s, 6s, itself) | UKCA_COND_COFF_V (25s, 21s, itself) | Itself (6s, 5s) | Itself (4s, 4s) | ASAD_SPIMPMJP (362s) | ⇓ | SETTAB (21s) | FASTJX_MIESCT (66s) | Itself (40s) | IMP_MIX (17s, itself) | ||||||||||||||
SPLINSLV2 (161s, itself) | SPFULJAC (110s, itself) | Itself (39s) | ASAD_DIFFUN (49s) | Itself (4s) | BLKSLV (66s) | |||||||||||||||||||
ASAD_PRLS (49s, itself) | Itself (36s) | MATINW (15s, itself) | ||||||||||||||||||||||
Routines | ||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
UKCA_MAIN* (925s) | ||||||||||||||||||||||||
UKCA_AERO_CTL (106s) | UKCA_ACTIVATE (73s) | UKCA_CHEMISTRY_CTL (238s) | UKCA_FASTJX (133s) | UKCA_EMISS_CTL (54s) | ||||||||||||||||||||
UKCA_AERO_STEP (100s) | UKCA_ABDULRAZZAK_GHAN (71s) | ASAD_CDRIVE (204s, 204s) | UKCA_STRAT_PHOTOL (25s, 25s) | FASTJX_PHOTOJ (133s) | UKCA_ADD_EMISS (37s) | |||||||||||||||||||
UKCA_COAGWITHNUCL (35s, 30s) | UKCA_CONDEN (31s, 26s) | UKCA_VOLUME_MODE (13s, 11s) | Itself (69s) | ASAD_SPMJPDRIV (187s, 187s) | ⇓ | INIJTAB (12s, 13s) | FASTJX_OPMIE (76s, 76s) | FLINT (28s, 28s, itself) | Itself (19s) | TRSRCE (16s, 16s, itself) | TR_MIX+ (15s, 14s) | |||||||||||||
Itself (27s, 23s) | UKCA_SOLVECOAGNUCL_V (7s, 6s, itself) | UKCA_COND_COFF_V (25s, 21s, itself) | Itself (6s, 5s) | Itself (4s, 4s) | ASAD_SPIMPMJP (184s, 184s) | ⇓ | SETTAB (12s, 13s) | FASTJX_MIESCT (55s, 55s) | Itself (21s, 21s) | IMP_MIX (12s, 11s, itself) | ||||||||||||||
SPLINSLV2 (82s, 82s, itself) | SPFULJAC (55s, 55s, itself) | Itself (20s, 20s) | ASAD_DIFFUN (25s, 25s) | Itself (2s, 2s) | BLKSLV (55s, 55s) | |||||||||||||||||||
ASAD_PRLS (25s, 25s, itself) | Itself (36s, 36s) | MATINW (8s, 8s, itself) | ||||||||||||||||||||||
This shows that