Adding OpenMP to the full chemistry

Introduction

I've written two branches to add OpenMP to the solver (mostly around the call to ASAD_CDRIVE in UKCA_CHEMISTRY_CTL) and to FASTJX; see UM tickets 3185 and 3450. Here, I want to profile how effective these have been.

I've run two jobs with Dr Hook:

u-an421: a copy of Mohit's GA7.1 + StratTrop UM10.7, u-ak990
u-an424: same as above, but with my two OpenMP branches in

And both jobs have

(ATM_PROCX, ATM_PROCY)=(18,24)
OMPTHR_ATM=2, i.e. two threads

All the times given in the tables below are mean total times, where the total time is the time spent in a routine plus all the routines it calls. Where it is just the time in a given routine, I've written `itself'.

For a given routine, mean times will inevitably hide asymmetry across PEs, which is likely to flatter routines whose times vary significantly across PEs. For example, a routine which is very active in sunlight and very inactive in the dark may take 10s to complete in sunlight and 2s to complete in the dark. Hence, half its PEs will take 10s and the other half will take 2s, which is an average of 6s. However, if it is followed by a second routine containing an MPI barrier, the PEs in the dark will have to wait in this second routine, however cheap it is. This waiting time will be attributed to the second routine when it would probably be fairer to attribute it to the first.
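The sunlight/dark example can be put into numbers with a small toy program. This is only a sketch: the PE count and the 0.1s cost of the second routine are made-up values, and the barrier is modelled simply as every PE waiting for the slowest one.

```fortran
program mean_time_asymmetry
  implicit none
  integer, parameter :: n_pe = 8          ! hypothetical number of PEs
  real, parameter :: cheap_work = 0.1     ! hypothetical real cost of routine 2 (s)
  real :: t_first(n_pe), t_second(n_pe)
  integer :: pe

  ! Routine 1: half the PEs are in sunlight (10s), half in the dark (2s)
  do pe = 1, n_pe
    if (pe <= n_pe/2) then
      t_first(pe) = 10.0
    else
      t_first(pe) = 2.0
    end if
  end do

  ! Routine 2 contains a barrier: each PE waits there until the slowest
  ! PE has left routine 1, so the waiting is charged to routine 2
  t_second = (maxval(t_first) - t_first) + cheap_work

  print '(a,f6.2)', 'mean time in routine 1 (s): ', sum(t_first)/real(n_pe)
  print '(a,f6.2)', 'mean time in routine 2 (s): ', sum(t_second)/real(n_pe)
end program mean_time_asymmetry
```

With these numbers the first routine reports a mean of 6s, while the nominally cheap second routine reports a mean of about 4s, nearly all of which is waiting caused by the first routine.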

Where I've given two times, they are the times for each thread.

Top level

GA7.1 + StratTrop without branches

Routines
UM_SHELL (4,182s)
U_MODEL_4A (4,180s)
ATM_STEP_4A (4,051s)
ATMOS_PHYSICS1 (872s)	Everything else (1,930s)	UKCA_MAIN1 (1,249s)

GA7.1 + StratTrop with branches

Routines
UM_SHELL (3,732s)
U_MODEL_4A (3,730s)
ATM_STEP_4A (3,606s)
ATMOS_PHYSICS1 (875s)	Everything else (1,746s)	UKCA_MAIN1 (985s)

Comparing the two runs

The run with the branches is 11% faster (450s quicker), although total times usually have a few % noise.
The saving in mean total time in UKCA_MAIN1 is only 264s (6% of total run time). However, the run times per PE in UKCA_MAIN1 will vary a lot, and it's likely that the asymmetry in time is absorbed by something in `Everything else', which has an MPI barrier. Running on two threads should halve the asymmetry.
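The arithmetic behind these two points can be checked directly from the top-level tables. The sketch below just copies the mean total times from those tables; the variable names are my own.

```fortran
program compare_runs
  implicit none
  ! Mean total times (s) copied from the two top-level tables above
  real, parameter :: shell_without = 4182.0, shell_with = 3732.0
  real, parameter :: ukca_without  = 1249.0, ukca_with  = 985.0
  real, parameter :: other_without = 1930.0, other_with = 1746.0
  real :: saved_total, saved_ukca, saved_other

  saved_total = shell_without - shell_with
  saved_ukca  = ukca_without  - ukca_with
  saved_other = other_without - other_with

  ! Each saving is also expressed as a percentage of the run without branches
  print '(a,f6.0,a,f5.1,a)', 'UM_SHELL:        ', saved_total, ' s (', &
        100.0*saved_total/shell_without, '% of run)'
  print '(a,f6.0,a,f5.1,a)', 'UKCA_MAIN1:      ', saved_ukca, ' s (', &
        100.0*saved_ukca/shell_without, '% of run)'
  print '(a,f6.0,a,f5.1,a)', 'Everything else: ', saved_other, ' s (', &
        100.0*saved_other/shell_without, '% of run)'
end program compare_runs
```

The 184s that disappears from `Everything else', which none of the branches touch directly, is what you would expect to see if waiting time caused by UKCA was being absorbed there.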

UKCA_MAIN1

GA7.1 + StratTrop without branches

Routines (each routine is indented under the routine that calls it)
UKCA_MAIN* (1,249s)
  UKCA_AERO_CTL (188s)
    UKCA_AERO_STEP (178s)
      UKCA_COAGWITHNUCL (37s, 36s)
        Itself (28s, 28s)
        UKCA_SOLVECOAGNUCL_V (8s, 8s, itself)
      UKCA_CONDEN (59s)
        UKCA_COND_COFF_V (46s, itself)
        Itself (13s)
      UKCA_VOLUME_MODE (23s)
        Itself (8s)
  UKCA_ACTIVATE (73s)
    UKCA_ABDULRAZZAK_GHAN (71s)
      Itself (69s)
  UKCA_CHEMISTRY_CTL (431s)
    ASAD_CDRIVE (401s)
      ASAD_SPMJPDRIV (368s)
        ASAD_SPIMPMJP (362s)
          SPLINSLV2 (163s, itself)
          SPFULJAC (110s, itself)
          Itself (36s)
          ASAD_DIFFUN (49s)
            ASAD_PRLS (49s, itself)
    UKCA_STRAT_PHOTOL (21s)
      INIJTAB (20s)
        SETTAB (20s)
          Itself (4s)
  UKCA_FASTJX (200s)
    FASTJX_PHOTOJ (199s)
      FASTJX_OPMIE (105s)
        FASTJX_MIESCT (65s)
          BLKSLV (65s)
            Itself (35s)
            MATINW (15s, itself)
        Itself (40s)
      FLINT (46s, itself)
      Itself (31s)
  UKCA_NEW_EMISS_CTL (86s)
    UKCA_ADD_EMISS (67s)
      TRSRCE (37s, itself)
      TR_MIX⁺ (21s)
        IMP_MIX (17s, itself)

*UKCA_MAIN also calls STASH, which is probably quite large, ~200s
⁺also called from other routines, especially boundary layer routines

GA7.1 + StratTrop with branches

Routines (each routine is indented under the routine that calls it)
UKCA_MAIN* (985s)
  UKCA_AERO_CTL (186s)
    UKCA_AERO_STEP (176s)
      UKCA_COAGWITHNUCL (35s, 37s)
        Itself (28s, 29s)
        UKCA_SOLVECOAGNUCL_V (8s, 8s, itself)
      UKCA_CONDEN (57s)
        UKCA_COND_COFF_V (46s, itself)
        Itself (11s)
      UKCA_VOLUME_MODE (23s)
        Itself (8s)
  UKCA_ACTIVATE (74s)
    UKCA_ABDULRAZZAK_GHAN (72s)
      Itself (70s)
  UKCA_CHEMISTRY_CTL (237s)
    ASAD_CDRIVE (203s, 203s)
      ASAD_SPMJPDRIV (186s, 186s)
        ASAD_SPIMPMJP (183s, 183s)
          SPLINSLV2 (84s, 84s, itself)
          SPFULJAC (55s, 55s, itself)
          Itself (17s)
          ASAD_DIFFUN (25s, 25s)
            ASAD_PRLS (25s, 25s, itself)
    UKCA_STRAT_PHOTOL (25s, 25s)
      INIJTAB (13s, 11s)
        SETTAB (13s, 11s)
          Itself (2s, 2s)
  UKCA_FASTJX (133s)
    FASTJX_PHOTOJ (132s)
      FASTJX_OPMIE (76s, 76s)
        FASTJX_MIESCT (55s, 55s)
          BLKSLV (55s, 55s)
            Itself (36s, 36s)
            MATINW (8s, 8s, itself)
        Itself (21s, 21s)
      FLINT (28s, 27s, itself)
      Itself (19s)
  UKCA_NEW_EMISS_CTL (83s)
    UKCA_ADD_EMISS (66s)
      TRSRCE (35s, itself)
      TR_MIX⁺ (21s)
        IMP_MIX (17s, itself)

*UKCA_MAIN also calls STASH, which is probably quite large, ~200s
⁺also called from other routines, especially boundary layer routines

Comparing the two runs

Adding the solver branch has dropped the total time in ASAD_CDRIVE from 401s to 203s, so it has been fairly successful at halving the time. The total time in UKCA_CHEMISTRY_CTL is reduced by about 45%.
The FASTJX branch is less successful as the time in FASTJX_OPMIE is only reduced from 105s to 76s. The total time in UKCA_FASTJX is reduced by about 33%.
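One rough way to quantify "fairly successful" versus "less successful" is to turn the two-thread speed-ups into an implied parallel fraction using Amdahl's law, S = 1/((1-p) + p/n). This is a sketch only: it uses the mean total times quoted above, assumes n = 2 threads, and ignores the asymmetry across PEs, so the fractions are indicative at best.

```fortran
program amdahl_estimate
  implicit none
  integer, parameter :: n_threads = 2     ! OMPTHR_ATM in both jobs

  ! Times (s) without and with the branches, taken from the tables above
  call report('ASAD_CDRIVE ', 401.0, 203.0)
  call report('FASTJX_OPMIE', 105.0,  76.0)
  call report('UKCA_FASTJX ', 200.0, 133.0)

contains

  subroutine report(name, t_before, t_after)
    character(len=*), intent(in) :: name
    real, intent(in) :: t_before, t_after
    real :: speedup, par_fraction

    speedup = t_before/t_after
    ! Amdahl's law rearranged for p:  p = (1 - 1/S) * n/(n - 1)
    par_fraction = (1.0 - 1.0/speedup)*real(n_threads)/real(n_threads - 1)
    print '(a,a,f5.2,a,f5.2)', name, '  speed-up = ', speedup, &
          '  implied parallel fraction = ', par_fraction
  end subroutine report

end program amdahl_estimate
```

On this estimate nearly all of ASAD_CDRIVE is running in parallel, whereas only a little over half of FASTJX_OPMIE is.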

Some general points

None of these jobs have Matt Glover's branch to add OpenMP across the whole of the aerosol chemistry, including UKCA_ACTIVATE.
The profiling above suggests that the solver branch has halved the time in INIJTAB. This is misleading: INIJTAB is only called by one thread, but I've allowed this to be either thread 0 or thread 1, depending on which gets there first, which is why the time in this routine is shared between the two threads. It is likely that the slower thread will arrive shortly after the faster thread and be held at the OpenMP CRITICAL region in subroutine UKCA_STRAT_PHOTOL until the faster thread has finished. Hence, it's unlikely we'll get any thread concurrency here.
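The INIJTAB behaviour can be illustrated with a stand-alone sketch of a "first thread to arrive does the set-up" pattern. This is not the code in UKCA_STRAT_PHOTOL: the shared flag table_ready, the array table and the name init_table are all hypothetical, and I'm assuming the guard is a simple shared logical inside a named CRITICAL region.

```fortran
program critical_init_sketch
  implicit none
  logical :: table_ready        ! hypothetical shared "already done" flag
  real :: table(100)            ! hypothetical stand-in for the photolysis table
  integer :: i

  table_ready = .false.

  !$OMP PARALLEL DEFAULT(NONE) SHARED(table_ready, table) PRIVATE(i)
  ! Only one thread at a time may be inside the CRITICAL region. Whichever
  ! thread arrives first does the set-up; the other is held at the CRITICAL
  ! until it has finished, then finds the flag set and skips the work.
  !$OMP CRITICAL (init_table)
  if (.not. table_ready) then
    do i = 1, size(table)
      table(i) = real(i)        ! stand-in for the expensive set-up
    end do
    table_ready = .true.
  end if
  !$OMP END CRITICAL (init_table)
  !$OMP END PARALLEL

  print *, 'table initialised: ', table_ready, '  last entry = ', table(100)
end program critical_init_sketch
```

In a pattern like this the set-up cost is paid once per call, by one thread or the other, so averaging per thread over many calls makes it look as if the time has been split between them.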

It's interesting to compare the % breakdown of total time in UKCA_MAIN1 for an old UM run on the IBM and for GA7.1 + StratTrop without my branches, u-an421, which is shown in the table below.

% of total time in UKCA_MAIN1
Routines	facei on IBM	u-an421 on Cray
UKCA_AERO_CTL	17.5%	15.1%
UKCA_ACTIVATE	3.09%	5.84%
UKCA_CHEMISTRY_CTL	48.5%	34.5%
UKCA_FASTJX	17.7%	16.0%
UKCA_EMISSION_CTL/UKCA_NEW_EMISS_CTL	1.66%	6.89%

This is far from a clean comparison, so we should be careful not to read too much into it, but it does look like the proportion of time in UKCA_CHEMISTRY_CTL has dropped significantly, while it has increased significantly in the emissions routine (UKCA_EMISSION_CTL/UKCA_NEW_EMISS_CTL), since moving to the Cray. The latter does seem consistent with a more general perception that I/O is more expensive on the Cray than on the IBM. It's also surprising that the proportion of time in UKCA_AERO_CTL isn't significantly less for the Cray run, as this has OpenMP, which wasn't in the IBM run.

An initial look suggests we could probably put OpenMP around

DO k = 1, n_tracers
  !   Add emissions over all model layers except at surface
  DO ilev = 2, model_levels
      CALL trsrce(                                                   &
        rows, row_length, 0, 0, 0, 0,                                &
        theta, q, qcl, qcf , exner_rho_levels, rho_r2,               &
        tracers(:,:,ilev,k), em_field(:,:,ilev,k), ilev,             &
        timestep, 1, 1, 0.0)
  END DO
...

fairly easily, in ukca_add_emiss_mod.F90, and that should save 20s or so.
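As a sketch of what that could look like, here is a small stand-alone program with the same loop structure. It is only an illustration: add_source is a hypothetical stand-in for trsrce with a much simpler argument list, the array sizes are arbitrary, and it assumes that each call only updates its own (ilev, k) slice and that the real routine is safe to call from several threads at once.

```fortran
program add_emiss_omp_sketch
  implicit none
  ! Arbitrary small sizes, for illustration only
  integer, parameter :: row_length = 4, rows = 3
  integer, parameter :: model_levels = 5, n_tracers = 6
  real, parameter :: timestep = 1200.0
  real :: tracers(row_length, rows, model_levels, n_tracers)
  real :: em_field(row_length, rows, model_levels, n_tracers)
  integer :: k, ilev

  tracers  = 0.0
  em_field = 1.0e-3

  ! Every (ilev, k) pair works on its own 2D slice, so the iterations of the
  ! k loop are independent and can be shared out between the threads
  !$OMP PARALLEL DO DEFAULT(NONE) SCHEDULE(STATIC) &
  !$OMP PRIVATE(k, ilev) SHARED(tracers, em_field)
  do k = 1, n_tracers
    ! Add emissions over all model layers except at surface
    do ilev = 2, model_levels
      call add_source(tracers(:,:,ilev,k), em_field(:,:,ilev,k), timestep)
    end do
  end do
  !$OMP END PARALLEL DO

  print *, 'sum of tracers after one step = ', sum(tracers)

contains

  ! Hypothetical stand-in for trsrce: adds the emission increment to one level
  subroutine add_source(tr, em, dt)
    real, intent(inout) :: tr(:,:)
    real, intent(in)    :: em(:,:)
    real, intent(in)    :: dt
    tr = tr + em*dt
  end subroutine add_source

end program add_emiss_omp_sketch
```

Putting the directive on the tracer loop rather than the level loop keeps the parallel region large and gives each thread whole tracers to work on.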

Adding in further OpenMP branches

I've re-run u-an424 with Maff's OpenMP branch, which is already on the trunk, and I've also added OpenMP into ukca_add_emiss_mod.F90 as suggested above (see 3501). Here's the profiling for this run.

Top level

Routines
UM_SHELL (3,676s)
U_MODEL_4A (3,674s)
ATM_STEP_4A (3,544s)
ATMOS_PHYSICS1 (875s)	Everything else (1,769s)	UKCA_MAIN1 (900s)

UKCA_MAIN1

Routines (each routine is indented under the routine that calls it)
UKCA_MAIN* (900s)
  UKCA_AERO_CTL (126s)
    UKCA_AERO_STEP (120s)
      UKCA_COAGWITHNUCL (39s, 33s)
        Itself (31s, 26s)
        UKCA_SOLVECOAGNUCL_V (9s, 7s, itself)
      UKCA_CONDEN (31s, 26s)
        UKCA_COND_COFF_V (25s, 21s, itself)
        Itself (6s, 5s)
      UKCA_VOLUME_MODE (13s, 11s)
        Itself (4s, 4s)
  UKCA_ACTIVATE (74s)
    UKCA_ABDULRAZZAK_GHAN (72s)
      Itself (70s)
  UKCA_CHEMISTRY_CTL (237s)
    ASAD_CDRIVE (203s, 203s)
      ASAD_SPMJPDRIV (186s, 186s)
        ASAD_SPIMPMJP (183s, 183s)
          SPLINSLV2 (84s, 84s, itself)
          SPFULJAC (55s, 55s, itself)
          Itself (17s, 17s)
          ASAD_DIFFUN (25s, 25s)
            ASAD_PRLS (25s, 25s, itself)
    UKCA_STRAT_PHOTOL (25s, 25s)
      INIJTAB (13s, 12s)
        SETTAB (13s, 12s)
          Itself (2s, 2s)
  UKCA_FASTJX (133s)
    FASTJX_PHOTOJ (132s)
      FASTJX_OPMIE (76s, 77s)
        FASTJX_MIESCT (55s, 55s)
          BLKSLV (55s, 55s)
            Itself (36s, 36s)
            MATINW (8s, 8s, itself)
        Itself (21s, 22s)
      FLINT (28s, 27s, itself)
      Itself (19s)
  UKCA_NEW_EMISS_CTL (57s)
    UKCA_ADD_EMISS (37s)
      TRSRCE (16s, 16s, itself)
      TR_MIX⁺ (15s, 14s)
        IMP_MIX (12s, 11s, itself)

*UKCA_MAIN also calls STASH, which is probably quite large, ~200s
⁺also called from other routines, especially boundary layer routines

Some general points

This run is about 12% faster than the run without OpenMP branches (u-an421): 3,676s instead of 4,182s.
The `Everything else' time here is actually larger than in the run above with two branches, which suggests that I was maybe a bit lucky with the speed of that run, and that the two branches above probably don't save as much as 11%.
Adding Maff's branch has saved about 61s on UKCA_AERO_CTL, plus any time gained from reducing asymmetry in UKCA.
Adding my OpenMP branch for emissions code has saved about 26s (0.7% of total).

Attempting to add OpenMP into UKCA_ACTIVATE code

From the profiling above, it's clear that we've been able to add OpenMP into all the main parts of UKCA_MAIN1 except UKCA_ACTIVATE. In UM #3506 I outline how I tried to do this; in summary:

Most of total time for UKCA_ACTIVATE is spent in UKCA_ABDULRAZZAK_GHAN. The call to UKCA_ABDULRAZZAK_GHAN is not within a DO loop, and so it doesn't look easy to add OpenMP into UKCA_ACTIVATE. The easier option is likely to be adding it into UKCA_ABDULRAZZAK_GHAN.
By profiling UKCA_ABDULRAZZAK_GHAN, I found that 67.9s out of 70.7s, for one month of simulation in a particular run, is spent in one of the DO loops in this routine.
I don't think we can add OpenMP to this DO loop, as is, and maintain results. However, I could move an inner DO loop to the outside and add OpenMP to this and retain results.
Unfortunately, this only saved about 10s - when I would have hoped for about 30s - and I don't think this merits adding in its current state.
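For readers unfamiliar with the idea, here is a generic sketch of moving an inner loop to the outside so that OpenMP can be applied without changing results. It is not the loop from UKCA_ABDULRAZZAK_GHAN, which isn't reproduced here and may look quite different: the arrays, sizes and update formula are all made up, and I'm assuming an outer loop that carries a dependence from one step to the next and an inner loop over independent points.

```fortran
program loop_interchange_sketch
  implicit none
  integer, parameter :: n_points = 1000, n_steps = 20   ! hypothetical sizes
  real :: a(n_points), b(n_points)
  integer :: i, j

  a = 1.0
  b = 1.0

  ! Original order: the outer j loop carries a dependence (each step uses
  ! the value from the previous step), so it cannot be threaded as it stands
  do j = 1, n_steps
    do i = 1, n_points
      a(i) = 0.5*a(i) + real(i)/real(j)
    end do
  end do

  ! Interchanged order: the independent loop over points is now outermost
  ! and is threaded; for each point the j steps still run in the same order
  !$OMP PARALLEL DO DEFAULT(NONE) PRIVATE(i, j) SHARED(b)
  do i = 1, n_points
    do j = 1, n_steps
      b(i) = 0.5*b(i) + real(i)/real(j)
    end do
  end do
  !$OMP END PARALLEL DO

  print *, 'largest difference between the two versions: ', maxval(abs(a - b))
end program loop_interchange_sketch
```

Because each point sees exactly the same sequence of operations in both versions, one would expect the results to agree, although compiler optimisation can still affect the last bit, so this is worth checking on the target platform.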

Moving to UM10.9

In order to get my branches onto the trunk I need to upgrade my branches to UM10.9, which I've done. I'm now profiling two jobs

u-ar709: GA7 + StratTrop. I've upgraded Mohit's GA7 + StratTrop UM10.8 to UM10.9.
u-ar728: same as above but including my solver branch, fastjx branch and add emissions branch.

And both jobs have

(ATM_PROCX, ATM_PROCY)=(18,24)
OMPTHR_ATM=2, i.e. two threads

Top level

GA7.1 + StratTrop without branches

Routines
UM_SHELL (3,578s)
U_MODEL_4A (3,576s)
ATM_STEP_4A (3,336s)
ATMOS_PHYSICS1 (450s)	Everything else (1,674s)	UKCA_MAIN1 (1,212s)

GA7.1 + StratTrop with branches

Routines
UM_SHELL (3,121s)
U_MODEL_4A (3,118s)
ATM_STEP_4A (2,872s)
ATMOS_PHYSICS1 (442s)	Everything else (1,505s)	UKCA_MAIN1 (925s)

This suggests the OpenMP branches have

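Saved about 12.8% on the total time (457s: 3,121s instead of 3,578s). Of this, 287s (8.0% of total time) is saved in the mean total time in UKCA_MAIN1, and most of the remainder shows up in `Everything else' (169s), which is consistent with reducing the asymmetry between the slowest and fastest PE.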

UKCA_MAIN1

GA7 + StratTrop without branches

Routines (each routine is indented under the routine that calls it)
UKCA_MAIN* (1,212s)
  UKCA_AERO_CTL (106s)
    UKCA_AERO_STEP (100s)
      UKCA_COAGWITHNUCL (35s, 30s)
        Itself (27s, 23s)
        UKCA_SOLVECOAGNUCL_V (7s, 6s, itself)
      UKCA_CONDEN (31s, 26s)
        UKCA_COND_COFF_V (25s, 21s, itself)
        Itself (6s, 5s)
      UKCA_VOLUME_MODE (13s, 11s)
        Itself (4s, 4s)
  UKCA_ACTIVATE (73s)
    UKCA_ABDULRAZZAK_GHAN (71s)
      Itself (70s)
  UKCA_CHEMISTRY_CTL (432s)
    ASAD_CDRIVE (401s)
      ASAD_SPMJPDRIV (368s)
        ASAD_SPIMPMJP (362s)
          SPLINSLV2 (161s, itself)
          SPFULJAC (110s, itself)
          Itself (39s)
          ASAD_DIFFUN (49s)
            ASAD_PRLS (49s, itself)
    UKCA_STRAT_PHOTOL (21s)
      INIJTAB (21s)
        SETTAB (21s)
          Itself (4s)
  UKCA_FASTJX (201s)
    FASTJX_PHOTOJ (201s)
      FASTJX_OPMIE (106s)
        FASTJX_MIESCT (66s)
          BLKSLV (66s)
            Itself (36s)
            MATINW (15s, itself)
        Itself (40s)
      FLINT (46s, itself)
      Itself (31s)
  UKCA_EMISS_CTL (82s)
    UKCA_ADD_EMISS (66s)
      TRSRCE (35s, itself)
      TR_MIX⁺ (21s)
        IMP_MIX (17s, itself)

*UKCA_MAIN also calls STASH, which is probably quite large, ~200s
⁺also called from other routines, especially boundary layer routines

GA7 + StratTrop with branches

Routines (each routine is indented under the routine that calls it)
UKCA_MAIN* (925s)
  UKCA_AERO_CTL (106s)
    UKCA_AERO_STEP (100s)
      UKCA_COAGWITHNUCL (35s, 30s)
        Itself (27s, 23s)
        UKCA_SOLVECOAGNUCL_V (7s, 6s, itself)
      UKCA_CONDEN (31s, 26s)
        UKCA_COND_COFF_V (25s, 21s, itself)
        Itself (6s, 5s)
      UKCA_VOLUME_MODE (13s, 11s)
        Itself (4s, 4s)
  UKCA_ACTIVATE (73s)
    UKCA_ABDULRAZZAK_GHAN (71s)
      Itself (69s)
  UKCA_CHEMISTRY_CTL (238s)
    ASAD_CDRIVE (204s, 204s)
      ASAD_SPMJPDRIV (187s, 187s)
        ASAD_SPIMPMJP (184s, 184s)
          SPLINSLV2 (82s, 82s, itself)
          SPFULJAC (55s, 55s, itself)
          Itself (20s, 20s)
          ASAD_DIFFUN (25s, 25s)
            ASAD_PRLS (25s, 25s, itself)
    UKCA_STRAT_PHOTOL (25s, 25s)
      INIJTAB (12s, 13s)
        SETTAB (12s, 13s)
          Itself (2s, 2s)
  UKCA_FASTJX (133s)
    FASTJX_PHOTOJ (133s)
      FASTJX_OPMIE (76s, 76s)
        FASTJX_MIESCT (55s, 55s)
          BLKSLV (55s, 55s)
            Itself (36s, 36s)
            MATINW (8s, 8s, itself)
        Itself (21s, 21s)
      FLINT (28s, 28s, itself)
      Itself (19s)
  UKCA_EMISS_CTL (54s)
    UKCA_ADD_EMISS (37s)
      TRSRCE (16s, 16s, itself)
      TR_MIX⁺ (15s, 14s)
        IMP_MIX (12s, 11s, itself)

*UKCA_MAIN also calls STASH, which is probably quite large, ~200s
⁺also called from other routines, especially boundary layer routines

This shows that

The solver branch (UKCA_CHEMISTRY_CTL) has saved about 194s, which is a 45% saving on the total time in UKCA_CHEMISTRY_CTL and a 5.4% saving on the total runtime.
The FASTJX branch has saved about 68s, which is a 34% saving on the total time in FASTJX_PHOTOJ and 1.9% saving on the total runtime.
The emission branch (UKCA_ADD_EMISS) has saved about 28s, which is a saving of 42% on the total time in UKCA_ADD_EMISS and 0.8% saving on the total runtime.
