Profiling spin-up

Unless marked as "itself", the times below are total times: the time spent in each routine plus all the routines it spawns/calls. Times marked "itself" include only the time spent in that routine. A lot of the smaller routines are left out to keep the diagrams simple, so the totals lower down a table will not add up exactly to the totals higher up, but they should be close. Let me know if I've missed any significant routines.
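If it helps, here's a toy illustration of the two measures (made-up routine names, nothing to do with NEMO or the profiler's own output): a parent's total includes everything it calls, which is why child totals re-count time that is already counted higher up.

    ! Toy illustration of "total" vs "itself" time: the parent's total
    ! includes everything it calls, so child totals re-count time that
    ! already appears inside the parent's total.
    program total_vs_itself
       implicit none
       real :: t0, t1, t_child, t_total

       call cpu_time(t0)
       call child(t_child)        ! time spent in routines the parent calls
       call spin(0.5)             ! work done in the parent "itself"
       call cpu_time(t1)

       t_total = t1 - t0
       print '(a,f5.2,a)', 'parent total : ', t_total,           ' s'
       print '(a,f5.2,a)', 'parent itself: ', t_total - t_child, ' s'

    contains

       subroutine child(t)        ! a callee whose total is measured separately
          real, intent(out) :: t
          real :: c0, c1
          call cpu_time(c0)
          call spin(1.0)
          call cpu_time(c1)
          t = c1 - c0
       end subroutine child

       subroutine spin(secs)      ! burn roughly SECS seconds of CPU
          real, intent(in) :: secs
          real :: s0, s1
          call cpu_time(s0)
          do
             call cpu_time(s1)
             if (s1 - s0 >= secs) exit
          end do
       end subroutine spin

    end program total_vs_itself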

I've run u-ag763 for one month on 24x24 Broadwell nodes (20,736 cores), which is close to as fast as we can get for an ORCA1 run. This explains why the MPI communication, largely showing up as calls to LBC_LNK, is so significant. For the calls to LBC_LNK I've sometimes indicated the grid type, such as T, U or V, because it gives a clue as to what is being passed; for example, I'm guessing two fields of grid type T are probably temperature and salinity. I haven't annotated more of them, because I don't have the space and I'm frequently not sure what is being passed.
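To make the LBC_LNK entries concrete, below is a minimal sketch of the kind of halo exchange it performs. This is toy code of mine, not NEMO's actual mpp_lnk_2d: the real routine also handles the north fold, 3D fields and the grid-type/sign arguments, which is where the T/U/V labels in the diagrams come from.

    ! A minimal sketch of the sort of halo exchange LBC_LNK performs
    ! (an illustration only, not NEMO's mpp_lnk code).  Each MPI rank
    ! owns two interior columns and swaps one halo column with each of
    ! its east/west neighbours.
    program halo_sketch
       use mpi
       implicit none
       integer, parameter :: nj = 8                 ! points per column
       real(8) :: field(nj, 0:3)                    ! columns 0 and 3 are halos
       integer :: ierr, rank, nprocs, east, west

       call MPI_Init(ierr)
       call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
       call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
       east = mod(rank + 1, nprocs)                 ! periodic, like the global ocean
       west = mod(rank - 1 + nprocs, nprocs)

       field = real(rank, 8)                        ! fill interior with my rank id

       ! eastern interior column -> east neighbour; fill my west halo
       call MPI_Sendrecv(field(:, 2), nj, MPI_REAL8, east, 0, &
                         field(:, 0), nj, MPI_REAL8, west, 0, &
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
       ! western interior column -> west neighbour; fill my east halo
       call MPI_Sendrecv(field(:, 1), nj, MPI_REAL8, west, 1, &
                         field(:, 3), nj, MPI_REAL8, east, 1, &
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)

       if (rank == 0) print *, 'west halo now holds rank', field(1, 0)
       call MPI_Finalize(ierr)
    end program halo_sketch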

Routines:

NEMOGCM (403s)
  NEMO_INIT (44s)
    ZDF_BFR_INIT** (0s, itself)
    ICB_INIT (9s)
      ICB_RST_READ (8s; itself 7s)
    TRC_INIT (14s)
      TRC_RST_READ (13s; itself 9s)
    ISTATE_INIT (10s; itself 6s)
  STP (378s)
    TRC_STP (139s; see the TRC_STP routines below for more detail)
    SBC (128s; see the SBC routines below for more detail)
    IOM_INIT* (15s)
    DYN_SPG (25s)
      DYN_SPG_FLT (25s) → LBC_LNK for U & V
        SOL_PCG (24s; itself 21s) → LBC_LNK
    STP_CTL (13s, itself)
    TRA_ADV (9s)
      TRA_ADV_TVD (8s) → LBC_LNK for U, V & W
        NONOSC (5s; itself 2s) → LBC_LNK for T (*2), U & V
    DOM_VVL_SF_SWP (6s; itself 5s)
    ZDF_TKE* (6s)
  LBC_LNK (126s; all halo exchanges combined)
    MPP_LNK_2D (70s)
    MPP_LNK_3D (56s)
* Profiling these routines further causes conflicts with other branches, and their contributions are relatively small, so I've not profiled them further.
** Before all the scientists were kicked off XCF, the time in ZDF_BFR_INIT was about 18s for a one-month run, but it has shrunk massively now we're only running on XCE (probably because the research disk is now only being accessed by one HPC, so there is less load on the I/O).

TRC_STP routines

This is the passive-tracer part of the model.

Routines:

TRC_STP (139s)
  TRC_TRP (105s)
    TRC_ADV (69s) → LBC_LNK for T (*2)
      TRA_ADV_MUSCL (54s; itself 34s) → LBC_LNK for U (*2) & V (*2)
    TRC_LDF (17s)
      TRA_LDF_ISO (19s, itself)
    TRC_BBL (10s)
      BBL (10s)
        EOS_RAB/RAB_2D (10s) → LBC_LNK for T (*2)
    TRC_NXT (9s) → LBC_LNK for T
  TRC_SMS (14s)
    TRC_SMS_MEDUSA (14s)
      TRC_BIO_MEDUSA (13s; itself 10s) → LBC_LNK for T (*2)

All of the LBC_LNK calls marked above feed into the LBC_LNK/MPP_LNK totals in the top-level profile.

SBC routines

These are the routines which update the forcing data and open boundaries and provide the surface boundary conditions (including sea ice).

Routines:

SBC (128s) → LBC_LNK for T
  SBC_BLK_CORE (59s)
    FLD_READ (43s*)
      FLD_INIT (1s**)
      FLD_CLOPN (1s**, itself)
      FLD_GET (41s, itself) → LBC_LNK for Z (*2)
    BLK_OCE_CORE (17s) → LBC_LNK for T (*2), U & V
  SBC_ICE_CICE (47s)
    CICE_RUN (into CICE) (27s; itself 19s)
    CICE_SBC_IN (11s)
      NEMO2CICE (11s; itself 5s)
    CICE_SBC_OUT (7s)
      CICE2NEMO (5s; itself 3s)
  ICB_STP (12s)
    ICB_LBC_MPP (10s, itself)
  SBC_FWB (6s, itself)

All of the LBC_LNK calls marked above feed into the LBC_LNK/MPP_LNK totals in the top-level profile.
* FLD_READ is clearly called from other routines as well, and so not all this time can be attributed to calls from SBC_BLK_CORE.
** Before all the scientists were kicked off XCF, the times in FLD_INIT and FLD_CLOPN were about 43s and 31s respectively for a one-month run, but they have shrunk massively now we're only running on XCE (probably because the research disk is now only being accessed by one HPC, so there is less load on the I/O).

Running for 6 months and Maff's changes

I want to test Maff's speed-up changes, and the comparison will be more reliable if the numbers are bigger, so I've rerun for six months. The routines he's changed are:

  • BBL (OPA_SRC/TRA/trabbl.F90)
  • FLD_READ (OPA_SRC/SBC/fldread.F90)
  • SOL_PCG (OPA_SRC/SOL/solpcg.F90)
  • TRA_ADV (OPA_SRC/TRA/traadv.F90)
  • TRA_ADV_MUSCL (OPA_SRC/TRA/traadv_muscl.F90)
  • TRA_LDF_ISO (OPA_SRC/TRA/traldf_iso.F90)
  • TRC_BBL (TOP_SRC/TRP/trcbbl.F90)
  • TRC_BIO_MEDUSA (TOP_SRC/MEDUSA/trcbio_medusa.F90)
  • TRC_LDF (TOP_SRC/TRP/trcldf.F90)

Below I've marked these routines with [M].

I've used u-ag763 for the run without Maff's changes and u-ah983 for the run with them.

Top level without Maff's changes

Routines:

NEMOGCM (2,268s)
  NEMO_INIT (50s)
    ZDF_BFR_INIT** (0s, itself)
    ICB_INIT (8s)
      ICB_RST_READ (8s; itself 7s)
    TRC_INIT (19s)
      TRC_RST_READ (19s; itself 12s)
    ISTATE_INIT (5s; itself 4s)
  STP (2,218s)
    TRC_STP (836s; see the TRC_STP routines below for more detail)
    SBC (784s; see the SBC routines below for more detail)
    IOM_INIT* (27s)
    DYN_SPG (185s)
      DYN_SPG_FLT (185s) → LBC_LNK for U & V
        SOL_PCG [M] (179s; itself 159s) → LBC_LNK
    STP_CTL (70s, itself)
    TRA_ADV [M] (55s)
      TRA_ADV_TVD (50s) → LBC_LNK for U, V & W
        NONOSC (28s; itself 14s) → LBC_LNK for T (*2), U & V
    DOM_VVL_SF_SWP (36s; itself 29s)
    ZDF_TKE* (33s)
  LBC_LNK (797s; all halo exchanges combined)
    MPP_LNK_2D (441s)
    MPP_LNK_3D (356s)
* Profiling these routines further causes conflicts with other branches, and their contributions are relatively small, so I've not profiled them further.
** Before all the scientists were kicked off XCF, the time in ZDF_BFR_INIT was about 18s for a one-month run, but it has shrunk massively now we're only running on XCE (probably because the research disk is now only being accessed by one HPC, so there is less load on the I/O).

Top level with Maff's changes

Routines:

NEMOGCM (2,044s)
  NEMO_INIT (55s)
    ZDF_BFR_INIT** (0s, itself)
    ICB_INIT (8s)
      ICB_RST_READ (8s; itself 7s)
    TRC_INIT (25s)
      TRC_RST_READ (24s; itself 17s)
    ISTATE_INIT (8s; itself (s))
  STP (1,989s)
    TRC_STP (748s; see the TRC_STP routines below for more detail)
    SBC (763s; see the SBC routines below for more detail)
    IOM_INIT* (s)
    DYN_SPG (111s)
      DYN_SPG_FLT (111s) → LBC_LNK for U & V
        SOL_PCG [M] (106s; itself 93s) → LBC_LNK
    STP_CTL (56s, itself)
    TRA_ADV [M] (53s)
      TRA_ADV_TVD (47s) → LBC_LNK for U, V & W
        NONOSC (27s; itself 14s) → LBC_LNK for T (*2), U & V
    DOM_VVL_SF_SWP (36s; itself 29s)
    ZDF_TKE* (s)
  LBC_LNK (750s; all halo exchanges combined)
    MPP_LNK_2D (425s)
    MPP_LNK_3D (326s)
* Profiling these routines further causes conflicts with other branches, and their contributions are relatively small, so I've not profiled them further.
** Before all the scientists were kicked off XCF, the time in ZDF_BFR_INIT was about 18s for a one-month run, but it has shrunk massively now we're only running on XCE (probably because the research disk is now only being accessed by one HPC, so there is less load on the I/O).

TRC_STP routines without Maff's changes

This is the passive-tracer part of the model.

Routines:

TRC_STP (836s)
  TRC_TRP (669s)
    TRC_ADV (420s) → LBC_LNK for T (*2)
      TRA_ADV_MUSCL [M] (323s; itself 201s) → LBC_LNK for U (*2) & V (*2)
    TRC_LDF [M] (104s)
      TRA_LDF_ISO [M] (117s, itself)
    TRC_BBL [M] (60s)
      BBL [M] (59s)
        EOS_RAB/RAB_2D (58s) → LBC_LNK for T (*2)
    TRC_NXT (55s) → LBC_LNK for T
  TRC_SMS (84s)
    TRC_SMS_MEDUSA (83s)
      TRC_BIO_MEDUSA [M] (76s; itself 58s) → LBC_LNK for T (*2)

All of the LBC_LNK calls marked above feed into the LBC_LNK/MPP_LNK totals in the top-level profile.

TRC_STP routines with Maff's changes

This is the passive-tracer part of the model.

Routines:

TRC_STP (748s)
  TRC_TRP (582s)
    TRC_ADV (380s) → LBC_LNK for T (*2)
      TRA_ADV_MUSCL [M] (289s; itself 179s) → LBC_LNK for U (*2) & V (*2)
    TRC_LDF [M] (59s)
      TRA_LDF_ISO [M] (67s, itself)
    TRC_BBL [M] (62s)
      BBL [M] (61s)
        EOS_RAB/RAB_2D (61s) → LBC_LNK for T (*2)
    TRC_NXT (55s) → LBC_LNK for T
  TRC_SMS (84s)
    TRC_SMS_MEDUSA (83s)
      TRC_BIO_MEDUSA [M] (76s; itself 60s) → LBC_LNK for T (*2)

All of the LBC_LNK calls marked above feed into the LBC_LNK/MPP_LNK totals in the top-level profile.

SBC routines without Maff's changes

These are the routines which update the forcing data and open boundaries and provide the surface boundary conditions (including sea ice).

Routines:

SBC (783s) → LBC_LNK for T
  SBC_BLK_CORE (354s)
    FLD_READ [M] (254s*)
      FLD_INIT (1s**)
      FLD_CLOPN (7s**, itself)
      FLD_GET (247s, itself) → LBC_LNK for Z (*2)
    BLK_OCE_CORE (101s) → LBC_LNK for T (*2), U & V
  SBC_ICE_CICE (303s)
    CICE_RUN (into CICE) (183s; itself 108s)
    CICE_SBC_IN (73s)
      NEMO2CICE (73s; itself 33s)
    CICE_SBC_OUT (47s)
      CICE2NEMO (36s; itself 19s)
  ICB_STP (69s)
    ICB_LBC_MPP (57s, itself)
  SBC_FWB (33s, itself)

All of the LBC_LNK calls marked above feed into the LBC_LNK/MPP_LNK totals in the top-level profile.
* FLD_READ is clearly called from other routines as well, and so not all this time can be attributed to calls from SBC_BLK_CORE.
** Before all the scientists were kicked off XCF, the times in FLD_INIT and FLD_CLOPN were about 43s and 31s respectively for a one-month run, but they have shrunk massively now we're only running on XCE (probably because the research disk is now only being accessed by one HPC, so there is less load on the I/O).

SBC routines with Maff's changes

These are the routines which update the forcing data and open boundaries and provide the surface boundary conditions (including sea ice).

Routines:

SBC (763s) → LBC_LNK for T
  SBC_BLK_CORE (351s)
    FLD_READ [M] (251s*)
      FLD_INIT (1s**)
      FLD_CLOPN (5s**, itself)
      FLD_GET (246s, itself) → LBC_LNK for Z (*2)
    BLK_OCE_CORE (102s) → LBC_LNK for T (*2), U & V
  SBC_ICE_CICE (283s)
    CICE_RUN (into CICE) (163s; itself 108s)
    CICE_SBC_IN (70s)
      NEMO2CICE (50s; itself 31s)
    CICE_SBC_OUT (50s)
      CICE2NEMO (38s; itself 20s)
  ICB_STP (67s)
    ICB_LBC_MPP (57s, itself)
  SBC_FWB (36s, itself)

All of the LBC_LNK calls marked above feed into the LBC_LNK/MPP_LNK totals in the top-level profile.
* FLD_READ is clearly called from other routines as well, and so not all this time can be attributed to calls from SBC_BLK_CORE.
** Before all the scientists were kicked off XCF, the times in FLD_INIT and FLD_CLOPN were about 43s and 31s respectively for a one-month run, but they have shrunk massively now we're only running on XCE (probably because the research disk is now only being accessed by one HPC, so there is less load on the I/O).

Conclusions

Comparing these two runs shows that Maff's optimisations give big speed-ups in

  • TRA_LDF_ISO,
  • SOL_PCG and
  • TRA_ADV_MUSCL

and the other routines don't show a large speed-up here. Timings on the Cray do generally seem to vary by around 10%, though, so I'd need to make several comparisons to be sure.

Overall, it looks like Maff's optimisations are saving about 10% on the run.
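As a quick sanity check on those numbers, here's a throwaway calculation using the 6-month times read off the tables above: TRA_LDF_ISO and SOL_PCG come out roughly 40% faster, TRA_ADV_MUSCL about 10% faster, and the whole run about 10% faster.

    ! Quick arithmetic check of the speed-ups quoted above, using the
    ! 6-month times from the tables (seconds, u-ag763 vs u-ah983).
    program speedup_check
       implicit none
       character(len=13), parameter :: name(4) = &
          [ 'TRA_LDF_ISO  ', 'SOL_PCG      ', 'TRA_ADV_MUSCL', 'NEMOGCM total' ]
       real, parameter :: before(4) = [ 117., 179., 323., 2268. ]
       real, parameter :: after(4)  = [  67., 106., 289., 2044. ]
       integer :: i
       do i = 1, 4
          print '(a,a,f5.1,a)', name(i), ' : ', &
                100. * (before(i) - after(i)) / before(i), ' % faster'
       end do
    end program speedup_check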