Updating the profiling of spin-up

Profiling of MEDUSA was done at revision 6509 of MEDUSA branch. However, we'd like to know how these timings have changed since the following has happened

  • removal of NEMO CPP key key_debug_medusa
  • removal of MPI calls should never really have been added, see change at revision 6259

and a general updating of code - MEDUSA branch is now at revision 7611.

In addition to this, we'd like to get some of Maff's optimisations into the GO6 package branch, so Richard has created a version of the GO6 package branch, which includes most of them. Profiling this is both interesting and we'll hopefully convince the owners of the GO6 package branch to add them in.

The runs I've made are

  • u-ag763: old profiling at revision 6509 of MEDUSA
  • u-aj369: revision 7611 of MEDUSA with the GO6 package branch at revision 7573
  • u-aj369: revision 7611 of MEDUSA with Richard's version of the GO6 package branch which has many of Maff's optimisations in

All the runs are

  • ORCA1
  • 6 months
  • (NEMO_IPROC,NEMO_JPROC)=(24,24)
  • OCN_PPN=24
  • Run on Broadwell XCE/F

Unless stated as `itself', the times below are the total times, which is the time spent in each routine and all the routines which it spawns/calls. Times specified as itself, include only the time spent in that routine. A lot of the smaller routines are not included to make the diagrams simpler, so the total times lower down the table will not add to the total times higher up, but the totals should be closish. For the calls to LBC_LNK I've sometimes indicated the grid-type, such as T, U or V, because it gives a clue as to what is being passed. For example, I'm guessing two fields of grid type T are probably temperature and salinity. I haven't included more here, because I don't have the space and I'm frequently not sure what is being passed.

Top level

Revision 6509 of MEDUSA

Routines
NEMOGCM (2,268s)
NEMO_INIT (50s) STP (2,218s)
ICB_ INIT (8s) TRC_ INIT (19s) TRC_STP (836s) SBC (784s) IOM_ INIT* (27s) DYN_SPG (185s) STP_ CTL (70s, itself) TRA_ADV (55s) DOM _VVL _SF_ SWP (36s) ZDF_ TKE* (33s)
ICB_ RST_ READ (8s) TRC_ RST_ READ (19s) See TRC_STP routines below for more detail See SBC routines below for more detail DYN_SPG_ FLT (185s) TRA_ADV_TVD (50s) Itself (29s)
Itself (7s) Itself (12s) SOL_PCG (179s)  ⇓  for U & V NONOSC (28s)  ⇓  for U, V & W
Itself (159s)  ⇓   ⇓  for T (*2), U & V Itself (14s)
 ⇓   ⇓ 
LBC_LNK (797s)
MPP_LNK_2D (441s) MPP_LNK_3D (356s)
* Profiling further causes branch conflicts with other branches, and these are relatively small contributions, so I've not profiled further.

Revision 7611 of MEDUSA with GO6 package branch at revision 7573

Routines
NEMOGCM (2,135s)
NEMO_INIT (18s) STP (2,121s)
ICB_ INIT (1s) TRC_ INIT (2s) ISTATE_ INIT (4s) TRC_STP (967s) SBC (537s) IOM_ INIT* (86s) DYN_SPG (106s) STP_ CTL (62s, itself) TRA_ADV (56s) DOM _VVL _SF_ SWP (9s) DOM _VVL _SF_ NXT (6s) ZDF_ TKE* (32s)
ICB_ RST_ READ (1s) TRC_ RST_ READ (0s) Itself (3s) See TRC_STP routines below for more detail See SBC routines below for more detail DYN_SPG_ FLT (106s) TRA_ADV_TVD (50s) DOM_ VVL_ INTERPOL (16s)
Itself (1s) Itself (0s) SOL_PCG (94s)  ⇓  for U & V NONOSC (27s)  ⇓  for U, V & W Itself (8s)  ⇓ 
Itself (83s)  ⇓   ⇓  for T (*2), U & V Itself (15s)
 ⇓   ⇓ 
LBC_LNK (632s)
MPP_LNK_2D (373s) MPP_LNK_3D (259s)
* Profiling further causes branch conflicts with other branches, and these are relatively small contributions, so I've not profiled further.

Revision 7611 of MEDUSA with Richard and Maff's optimised GO6 package branch

Routines
NEMOGCM (2,085s)
NEMO_INIT (34s) STP (2,071s)
ICB_ INIT (7s) TRC_ INIT (3s) ISTATE_ INIT (8s) TRC_STP (911s) SBC (538s) IOM_ INIT* (98s) DYN_SPG (109s) STP_ CTL (64s, itself) TRA_ADV (50s) DOM _VVL _SF_ SWP (9s) DOM _VVL _SF_ NXT (6s) ZDF_ TKE* (33s)
ICB_ RST_ READ (3s) TRC_ RST_ READ (0s) Itself (5s) See TRC_STP routines below for more detail See SBC routines below for more detail DYN_SPG_ FLT (109s) TRA_ADV_TVD (44s) DOM_ VVL_ INTERPOL (16s)
Itself (0s) Itself (0s) SOL_PCG (95s)  ⇓  for U & V NONOSC (23s)  ⇓  for U, V & W Itself (8s)  ⇓ 
Itself (84s)  ⇓   ⇓  for T (*2), U & V Itself (10s)
 ⇓   ⇓ 
LBC_LNK (647s)
MPP_LNK_2D (375s) MPP_LNK_3D (272s)
* Profiling further causes branch conflicts with other branches, and these are relatively small contributions, so I've not profiled further.

Summary

For the change between revision 6509 and 7611.

  • Time in NEMO_INIT has dropped alot between revision 6509 and 7611 of MEDUSA. This is probably because the netCDF chunking in /data/d02/jpalmier/ATM-FORCING_eORCA1/30y_from_coupled now looks OK.
  • Time in TRC_STP has gone up by more than 100s.
  • Time in SBC has dropped about 250s.
  • IOM_INIT has increased a lot, but so has probably the diagnostics
  • Time in SOL_PCG has dropped significantly. This has a convergence criteria, so I think it must be convergencing much quicker for later revision of MEDUSA code.
  • Time in LBC_LNK has dropped by around 165s. Presumably because of removing the extra calls that shouldn't really have been added.

For the optimisation branch, there's no obvious improvement here. This includes SOL_PCG which showed good improvement before.

TRC_STP routines

This is the passive tracer part of the model

Revision 6509 of MEDUSA

Routines
TRC_STP (836s)
TRC_TRP (669s) TRC_SMS (84s)
TRC_ADV (420s) TRC_LDF (104s) TRC_BBL (60s) TRC_SBC (0s) TRC_NXT (55s) TRC_SMS_MEDUSA (83s)
TRA_ADV_MUSCL (323s)  ⇓  for T (*2) TRA_LDF_ISO (117s, itself) BBL (59s)  ⇓  for T  ⇓  for T TRC_BIO_MEDUSA (76s)
Itself (201s)  ⇓  for U (*2) & V (*2) EOS_RAB/ RAB_2D (58s) Itself (58s)  ⇓  for T (*2)
 ⇓  for T (*2)
LBC_LNK

Revision 7611 of MEDUSA with GO6 package branch at revision 7573

Routines
TRC_STP (967s)
TRC_TRP (676s) TRC_SMS (184s)
TRC_ADV (336s) TRC_LDF (128s) TRC_BBL (2s) TRC_SBC (130s) TRC_NXT (53s) TRC_SMS_MEDUSA (182s)
TRA_ADV_MUSCL (331s)  ⇓  for T (*2) TRA_LDF_ISO (142s, itself) BBL (2s)  ⇓  for T  ⇓  for T TRC_BIO_MEDUSA (174s)
Itself (221s)  ⇓  for U (*2) & V (*2) EOS_RAB/ RAB_2D/ RAB_3D (6s) Itself (120s)  ⇓  for T (*2)
 ⇓  for T (*2)
LBC_LNK

Revision 7611 of MEDUSA with Richard and Maff's optimised GO6 package branch

Routines
TRC_STP (911s)
TRC_TRP (619s) TRC_SMS (184s)
TRC_ADV (329s) TRC_LDF (78s) TRC_BBL (2s) TRC_SBC (131s) TRC_NXT (53s) TRC_SMS_MEDUSA (183s)
TRA_ADV_MUSCL (323s)  ⇓  for T (*2) TRA_LDF_ISO (85s, itself) BBL (1s)  ⇓  for T  ⇓  for T TRC_BIO_MEDUSA (174s)
Itself (197s)  ⇓  for U (*2) & V (*2) EOS_RAB/ RAB_2D (6s) Itself (120s)  ⇓  for T (*2)
 ⇓  for T (*2)
LBC_LNK

Summary

For the change between revision 6509 and 7611.

  • Time in TRC_BIO_MEDUSA more than doubled. It's not entirely clear why. Additions do include
    • A lot of extra diagnostics, which includes many extra calls to WRK_ALLOC, WRK_DEALLOC and IOM_PUT.
    • Extra calls to LBC_LNK for FGCO2 and DMS_SURF2D.
    • Extra calls to OXY_SATO (worth 13s extra)
  • Time in TRC_SBC becomes around 130s when previously it was zero, while TRC_BBL is reduced to close to zero. I've looked at these routines and most of the total time for these routines can be attributed to LBC_LNK, which also provides an MPI barrier. I think imbalance in runtime is being attributed to TRC_BBL for revision 6509 of MEDUSA and this has moved to TRC_SBC for revision 7611 (probably because of an extra LBC_LNK call from TRC_SBC, introduced by GO6 package branch).
  • Time in TRC_ADV is significantly less than before.
  • Time in TRC_LDF is increased

For the optimisation branch, the only significant improvement seems to be for TRA_LDF_ISO, which is almost halved. Previously, we had good improvement for TRA_ADV_MUSCL and this has improved, but not by a lot.

SBC routines

These are the routines which update data, open boundaries and provide surface BCs (including sea-ice).

Revision 6509 of MEDUSA

Routines
SBC (783s)
SBC_BLK_CORE (354s) SBC_ICE_CICE (303s) ICB_STP (69s) SBC_ FWB (33s, itself)  ⇓  for T
FLD_READ (254s*) BLK_OCE_ CORE (101s) CICE_RUN: into CICE (~183s) CICE_SBC_IN (73s) CICE_SBC _OUT (47s) ICB_LBC_MPP (57s, itself)
FLD_GET (247s, itself)  ⇓  for T (*2), U & V NEMO2CICE (73s) CICE2NEMO (36s)
 ⇓  for Z (*2) Itself (108s)  ⇓  Itself (33s)  ⇓  Itself (19s)
LBC_LNK
* FLD_READ is clearly called from other routines as well, and so not all this time can be attributed to calls from SBC_BLK_CORE.

Revision 7611 of MEDUSA with GO6 package branch at revision 7573

Routines
SBC (537s)
SBC_BLK_CORE (158s) SBC_ICE_CICE (319s) ICB_STP (23s) SBC_ FWB (12s, itself)  ⇓  for T
FLD_READ (127s*) BLK_OCE_ CORE (35s) CICE_RUN: into CICE (~188s) CICE_SBC_IN (77s) CICE_SBC _OUT (55s) ICB_LBC_MPP (8s, itself)
FLD_GET (119s, itself)  ⇓  for T (*2), U & V NEMO2CICE (75s) CICE2NEMO (42s)
 ⇓  for Z (*2) Itself (66s)  ⇓  Itself (35s)  ⇓  Itself (22s)
LBC_LNK
* FLD_READ is clearly called from other routines as well, and so not all this time can be attributed to calls from SBC_BLK_CORE.

Revision 6509 of MEDUSA with Richard and Maff's optimised GO6 package branch

Routines
SBC (538s)
SBC_BLK_CORE 137(s) SBC_ICE_CICE (328s) ICB_STP (25s) SBC_ FWB (20s, itself)  ⇓  for T
FLD_READ (110s*) BLK_OCE_ CORE (32s) CICE_RUN: into CICE (~193s) CICE_SBC_IN (81s) CICE_SBC _OUT (54s) ICB_LBC_MPP (8s, itself)
FLD_GET (102s, itself)  ⇓  for T (*2), U & V NEMO2CICE (80s) CICE2NEMO (41s)
 ⇓  for Z (*2) Itself (56s)  ⇓  Itself (38s)  ⇓  Itself (23s)
LBC_LNK
* FLD_READ is clearly called from other routines as well, and so not all this time can be attributed to calls from SBC_BLK_CORE.

Summary

For the change between revision 6509 and 7611.

  • Time in FLD_GET about halved
  • BLK_OCE_CORE is reduced to about a third
  • Time in ICB_STP more than halved
  • Time in SBC_FWB also significantly reduced.

For the optimisation branch, FLD_READ is quicker but probably not large enough to be outside the Cray noise on timings to be sure.

Overall summary

For the change between revision 6509 and 7611.

  • Time in NEMO_INIT and reduction in FLD_GET suggest that using probably chunked files has helped a lot.
  • Time in SOL_PCG has dropped significantly. This has a convergence criteria, so I think it must be convergencing much quicker for later revision of MEDUSA code.
  • Time in LBC_LNK has dropped by around 165s. Presumably because of removing the extra calls that shouldn't really have been added.
  • I think the time attributed to RAB_2D at revsion 6509 of MEDUSA and to TRC_SBC at revision 7611 of MEDUSA is mostly from model imbalance, and these routines have unfairly picked up this time because of an MPI barrier.
  • Time in TRC_BIO_MEDUSA more than doubled. It's not entirely clear why. Additions do include
    • A lot of extra diagnostics, which includes many extra calls to WRK_ALLOC, WRK_DEALLOC and IOM_PUT.
    • Extra calls to LBC_LNK for FGCO2 and DMS_SURF2D.
    • Extra calls to OXY_SATO (worth 13s extra)