Unless marked `itself', the times below are total times: the time spent in a routine plus all the routines it calls or spawns. Times marked `itself' include only the time spent in that routine. Many of the smaller routines are left out to keep the diagrams simple, so the totals lower down a table will not add up exactly to the totals higher up, but they should be close. Let me know if I've missed any significant routines.
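As a rough illustration of how total times relate to the routines shown beneath them, the sketch below takes the TRC_STP figures from the passive tracer table on this page and works out how much of the parent's total is not covered by the children listed in the diagram. The dictionary layout and the function name `unaccounted` are made up for this example; the remainder is simply the routine's own time plus anything omitted from the diagram, not a figure reported by the timer.

```python
# Total (inclusive) times in seconds, copied from the TRC_STP table on this page.
total = {"TRC_STP": 139, "TRC_TRP": 105, "TRC_SMS": 14}

# Children of TRC_STP that appear in the diagram (smaller routines are omitted).
children = {"TRC_STP": ["TRC_TRP", "TRC_SMS"]}


def unaccounted(routine):
    """Total time of a routine minus the totals of the children listed for it.

    The result covers the routine's own time plus any routines left out of
    the diagram, so it is only an upper bound on the `itself' time.
    """
    return total[routine] - sum(total[c] for c in children.get(routine, []))


remainder = unaccounted("TRC_STP")
print(f"TRC_STP: {remainder}s not covered by the listed children")
```

A remainder that is small compared with the parent's total suggests the diagram has captured the significant routines.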
I've run u-ag763 for one month on 24*24 Broadwell nodes (20,736 cores), which is close to as fast as we can get for an ORCA1 run. This explains why MPI communication, largely indicated by calls to LBC_LNK, is so significant. For the calls to LBC_LNK I've sometimes indicated the grid type, such as T, U or V, because it gives a clue as to what is being passed. For example, I'm guessing that two fields of grid type T are probably temperature and salinity. I haven't included more detail here because I don't have the space and I'm frequently not sure what is being passed.
Routines | |||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
NEMOGCM (403s) | |||||||||||||||
NEMO_INIT (44s) | STP (378s) | ||||||||||||||
ZDF_ BFR_ INIT** (0s, itself) | ICB_ INIT (9s) | TRC_ INIT (14s) | ISTATE_ INIT (10s) | TRC_STP (139s) | SBC (128s) | IOM_ INIT* (15s) | DYN_SPG (25s) | STP_ CTL (13s, itself) | TRA_ADV (9s) | DOM _VVL _SF_ SWP (6s) | ZDF_ TKE* (6s) | ||||
ICB_ RST_ READ (8s) | TRC_ RST_ READ (13s) | Itself (6s) | See TRC_STP routines below for more detail | See SBC routines below for more detail | DYN_ SPG_ FLT (25s) | TRA_ADV_TVD (8s) | Itself (5s) | ||||||||
Itself (7s) | Itself (9s) | SOL_PCG (24s) | ⇓ for U & V | NONOSC (5s) | ⇓ for U, V & W | ||||||||||
Itself (21s) | ⇓ | ⇓ for T (*2), U & V | Itself (2s) | ||||||||||||
⇓ | ⇓ | ||||||||||||||
LBC_LNK (126s) | |||||||||||||||
MPP_LNK_2D (70s) | MPP_LNK_3D (56s) |
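To put a number on how significant the MPI communication is in this run, the sketch below computes the share of the NEMOGCM total that is spent in LBC_LNK, using the figures from the table above. It assumes the LBC_LNK time is a total over all callers and that 24*24 is a node count; the variable names are only for illustration.

```python
# Times in seconds, copied from the table above (one-month run of u-ag763).
nemogcm_total = 403
lbc_lnk = 126
mpp_lnk_2d, mpp_lnk_3d = 70, 56

# The two MPP_LNK routines should account for the whole of LBC_LNK.
assert mpp_lnk_2d + mpp_lnk_3d == lbc_lnk

# Fraction of the whole run spent in halo exchanges via LBC_LNK.
lbc_share = lbc_lnk / nemogcm_total
print(f"LBC_LNK share of NEMOGCM: {lbc_share:.0%}")

# Assumption: 24*24 is the number of nodes, so this gives cores per node.
nodes = 24 * 24
cores = 20736
cores_per_node = cores // nodes
print(f"{nodes} nodes, {cores_per_node} cores per node")
```

On these figures, roughly a third of the run time is spent in LBC_LNK.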
This is the passive tracer part of the model
Routines | |||||||
---|---|---|---|---|---|---|---|
TRC_STP (139s) | |||||||
TRC_TRP (105s) | TRC_SMS (14s) | ||||||
TRC_ADV (69s) | TRC_LDF (17s) | TRC_BBL (10s) | TRC_NXT (9s) | TRC_SMS_MEDUSA (14s) | |||
TRA_ADV_MUSCL (54s) | ⇓ for T (*2) | TRA_LDF_ISO (19s, itself) | BBL (10s) | ⇓ for T | TRC_BIO_MEDUSA (13s) | ||
Itself (34s) | ⇓ for U (*2) & V (*2) | EOS_RAB/ RAB_2D (10s) | Itself (10s) | ⇓ for T (*2) | |||
⇓ for T (*2) | |||||||
LBC_LNK |
These are the routines which update data, open boundaries and provide surface BCs (including sea-ice).
Routines | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
SBC (128s) | ||||||||||||
SBC_BLK_CORE (59s) | SBC_ICE_CICE (47s) | ICB_STP (12s) | SBC_ FWB (6s, itself) | ⇓ for T | ||||||||
FLD_READ (43s*) | BLK_OCE_ CORE (17s) | CICE_RUN: into CICE (27s) | CICE_SBC_IN (11s) | CICE_SBC _OUT (7s) | ICB_LBC_MPP (10s, itself) | |||||||
⇓ | FLD_ INIT (1s**) | ⇓ | ⇓ for T (*2), U & V | NEMO2CICE (11s) | CICE2NEMO (5s) | |||||||
FLD_GET (41s, itself) | FLD_CLOPN (1s**, itself) | ⇓ | Itself (5s) | ⇓ | Itself (3s) | |||||||
⇓ for Z (*2) | Itself (19s) | |||||||||||
LBC_LNK |
I want to test Maff's speed-up changes, and the comparison will be more reliable if the numbers are bigger. The routines he's changed are
Below I've highlighted these routines with bold font and yellow background.
I've used u-ag763 for the run without Maff's changes and u-ah983 for the run with Maff's changes.
Routines | |||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
NEMOGCM (2,268s) | |||||||||||||||
NEMO_INIT (50s) | STP (2,218s) | ||||||||||||||
ZDF_ BFR_ INIT** (0s, itself) | ICB_ INIT (8s) | TRC_ INIT (19s) | ISTATE_ INIT (5s) | TRC_STP (836s) | SBC (784s) | IOM_ INIT* (27s) | DYN_SPG (185s) | STP_ CTL (70s, itself) | TRA_ADV (55s) | DOM _VVL _SF_ SWP (36s) | ZDF_ TKE* (33s) | ||||
ICB_ RST_ READ (8s) | TRC_ RST_ READ (19s) | Itself (4s) | See TRC_STP routines below for more detail | See SBC routines below for more detail | DYN_SPG_ FLT (185s) | TRA_ADV_TVD (50s) | Itself (29s) | ||||||||
Itself (7s) | Itself (12s) | SOL_PCG (179s) | ⇓ for U & V | NONOSC (28s) | ⇓ for U, V & W | ||||||||||
Itself (159s) | ⇓ | ⇓ for T (*2), U & V | Itself (14s) | ||||||||||||
⇓ | ⇓ | ||||||||||||||
LBC_LNK (797s) | |||||||||||||||
MPP_LNK_2D (441s) | MPP_LNK_3D (356s) |
Routines | |||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
NEMOGCM (2,044s) | |||||||||||||||
NEMO_INIT (55s) | STP (1,989s) | ||||||||||||||
ZDF_ BFR_ INIT** (0s, itself) | ICB_ INIT (8s) | TRC_ INIT (25s) | ISTATE_ INIT (8s) | TRC_STP (748s) | SBC (763s) | IOM_ INIT* (s) | DYN_SPG (111s) | STP_ CTL (56s, itself) | TRA_ADV (53s) | DOM _VVL _SF_ SWP (36s) | ZDF_ TKE* (s) | ||||
ICB_ RST_ READ (8s) | TRC_ RST_ READ (24s) | Itself (s) | See TRC_STP routines below for more detail | See SBC routines below for more detail | DYN_SPG_ FLT (111s) | TRA_ADV_TVD (47s) | Itself (29s) | ||||||||
Itself (7s) | Itself (17s) | SOL_PCG (106s) | ⇓ for U & V | NONOSC (27s) | ⇓ for U, V & W | ||||||||||
Itself (93s) | ⇓ | ⇓ for T (*2), U & V | Itself (14s) | ||||||||||||
⇓ | ⇓ | ||||||||||||||
LBC_LNK (750s) | |||||||||||||||
MPP_LNK_2D (425s) | MPP_LNK_3D (326s) |
This is the passive tracer part of the model for the run without Maff's changes (u-ag763).
Routines | |||||||
---|---|---|---|---|---|---|---|
TRC_STP (836s) | |||||||
TRC_TRP (669s) | TRC_SMS (84s) | ||||||
TRC_ADV (420s) | TRC_LDF (104s) | TRC_BBL (60s) | TRC_NXT (55s) | TRC_SMS_MEDUSA (83s) | |||
TRA_ADV_MUSCL (323s) | ⇓ for T (*2) | TRA_LDF_ISO (117s, itself) | BBL (59s) | ⇓ for T | TRC_BIO_MEDUSA (76s) | ||
Itself (201s) | ⇓ for U (*2) & V (*2) | EOS_RAB/ RAB_2D (58s) | Itself (58s) | ⇓ for T (*2) | |||
⇓ for T (*2) | |||||||
LBC_LNK |
This is the passive tracer part of the model for the run with Maff's changes (u-ah983).
Routines | |||||||
---|---|---|---|---|---|---|---|
TRC_STP (748s) | |||||||
TRC_TRP (582s) | TRC_SMS (84s) | ||||||
TRC_ADV (380s) | TRC_LDF (59s) | TRC_BBL (62s) | TRC_NXT (55s) | TRC_SMS_MEDUSA (83s) | |||
TRA_ADV_MUSCL (289s) | ⇓ for T (*2) | TRA_LDF_ISO (67s, itself) | BBL (61s) | ⇓ for T | TRC_BIO_MEDUSA (76s) | ||
Itself (179s) | ⇓ for U (*2) & V (*2) | EOS_RAB/ RAB_2D (61s) | Itself (60s) | ⇓ for T (*2) | |||
⇓ for T (*2) | |||||||
LBC_LNK |
These are the routines which update data, open boundaries and provide surface BCs (including sea-ice), for the run without Maff's changes (u-ag763).
Routines | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
SBC (783s) | ||||||||||||
SBC_BLK_CORE (354s) | SBC_ICE_CICE (303s) | ICB_STP (69s) | SBC_ FWB (33s, itself) | ⇓ for T | ||||||||
FLD_READ (254s*) | BLK_OCE_ CORE (101s) | CICE_RUN: into CICE (183s) | CICE_SBC_IN (73s) | CICE_SBC _OUT (47s) | ICB_LBC_MPP (57s, itself) | |||||||
⇓ | FLD_ INIT (1s**) | ⇓ | ⇓ for T (*2), U & V | NEMO2CICE (73s) | CICE2NEMO (36s) | |||||||
FLD_GET (247s, itself) | FLD_CLOPN (7s**, itself) | ⇓ | Itself (33s) | ⇓ | Itself (19s) | |||||||
⇓ for Z (*2) | Itself (108s) | |||||||||||
LBC_LNK |
These are the routines which update data, open boundaries and provide surface BCs (including sea-ice), for the run with Maff's changes (u-ah983).
Routines | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
SBC (763s) | ||||||||||||
SBC_BLK_CORE (351s) | SBC_ICE_CICE (283s) | ICB_STP (67s) | SBC_ FWB (36s, itself) | ⇓ for T | ||||||||
FLD_READ (251s*) | BLK_OCE_ CORE (102s) | CICE_RUN: into CICE (163s) | CICE_SBC_IN (70s) | CICE_SBC _OUT (50s) | ICB_LBC_MPP (57s, itself) | |||||||
⇓ | FLD_ INIT (1s**) | ⇓ | ⇓ for T (*2), U & V | NEMO2CICE (50s) | CICE2NEMO (38s) | |||||||
FLD_GET (246s, itself) | FLD_CLOPN (5s**, itself) | ⇓ | Itself (31s) | ⇓ | Itself (20s) | |||||||
⇓ for Z (*2) | Itself (108s) | |||||||||||
LBC_LNK |
Comparison of these two runs shows that Maff's optimisations are making big speed-ups to
and that the other routines don't show a large speed-up here. Timings on the Cray do seem to vary by around 10% in general, so I'd need to make several comparisons to be sure.
Overall, it looks like Maff's optimisations are saving about 10% on the run.
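The percentage savings can be worked out directly from the tables. The sketch below does this for a handful of routines whose totals appear in both runs; the selection of routines and the names `base`, `opt` and `saving_percent` are choices made for this example, not a list of the routines that were changed. Given the run-to-run variation of around 10% mentioned above, savings of a few percent should not be read as significant.

```python
# Total times in seconds, copied from the tables on this page.
# "base" is the run without the changes (u-ag763),
# "opt" is the run with the changes (u-ah983).
base = {
    "NEMOGCM": 2268, "STP": 2218, "TRC_STP": 836, "SBC": 784,
    "DYN_SPG": 185, "SOL_PCG": 179, "TRC_LDF": 104,
    "TRA_LDF_ISO": 117, "LBC_LNK": 797,
}
opt = {
    "NEMOGCM": 2044, "STP": 1989, "TRC_STP": 748, "SBC": 763,
    "DYN_SPG": 111, "SOL_PCG": 106, "TRC_LDF": 59,
    "TRA_LDF_ISO": 67, "LBC_LNK": 750,
}


def saving_percent(routine):
    """Percentage of the baseline time saved in the optimised run."""
    return 100.0 * (base[routine] - opt[routine]) / base[routine]


for routine in base:
    print(
        f"{routine:12s} {base[routine]:5d}s -> {opt[routine]:5d}s "
        f"{saving_percent(routine):5.1f}% saved"
    )
```

The NEMOGCM line gives the overall saving for the run; the per-routine lines show where that saving comes from.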