Speed tests for MEDUSA + GC3.1

We may need to spin up with a coupled model. The speed tests below are with u-ak109, with settings

  • UM10.6
  • OMPTHR_ATM=2
  • IOS_NPROC=6
  • u-ai403, u-ak770 & u-ak838 use a 10 day dump for NEMO, but u-ak734 uses a one month dump. Currently only the one month dump produces all the data we need.

Before Christmas, with a 3 month cycle, I got 6.63 model years/day (see table from UKESM core presentation). I was hoping to get something similar for this run. My notes on this are on p130 of UKESM General III, and the job suite was u-ai403.

u-ak734 is a copy of u-aj659.

Description (job suite) | ATM_PROCX*ATM_PROCY | NEMO_IPROC*NEMO_JPROC | CICE_BLKX*CICE_BLKY | Total nodes | Cycle length (months) | Times for first 4 cycles | Average time for 1 cycle | Speed (model yrs/day)
GC3.1 + MEDUSA* (u-ak109) | 32*22 | 12*12 | 30*28 | 44 | 1 | 27:35 (1,655s), 30:34 (1,834s), 28:45 (1,725s) & 29:36 (1,733s) | 1,736.75s (28:57) | 4.15
GC3.1 + MEDUSA* (u-ak109) | 48*24 | 12*12 | 30*28 | 69 | 1 | 30:18 (1,818s), 30:46 (1,846s), 28:59 (1,739s) & 30:01 (1,801s) | 1,801s (30:01) | 4.00
GC3.1 + MEDUSA* (u-ak109) | 48*24 | 12*12 | 30*28 | 69 | 1 | 30:01 (1,801s), 29:27 (1,767s), 29:51 (1,791s) & 29:40 (1,780s) | 1,784.75s (29:45) | 4.03
GC3.1, older setup (u-ai403) | 48*24 | 9*8 | 40*42 | 67 | 1 | 18:35 (1,115s), 21:24 (1,284s), 19:02 (1,142s) & 19:11 (1,151s) | 1,173s (19:33) | 6.14
GC3.1, newer setup* (u-ak734) | 48*24 | 9*8 | 40*42 | 67 | 1 | 22:03 (1,323s), 22:20 (1,340s), 22:19 (1,339s) & 22:32 (1,352s) | 1,338.5s (22:19) | 5.38
GC3.1, newer setup (u-ak734) | 48*24 | 9*8 | 40*42 | 67 | 1 | 22:31 (1,351s), 23:39 (1,419s), 23:11 (1,391s) & 22:47 (1,367s) | 1,382s (23:02) | 5.21
GC3.1, newer setup (u-ak734) | 48*24 | 12*12 | 30*28 | 69 | 1 | 21:31 (1,291s), 22:05 (1,325s), 22:13 (1,333s) & 22:13 (1,333s) | 1,320.5s (22:01) | 5.45
GC3.1, newer setup* (u-ak734) | 48*24 | 9*8 | 40*42 | 67 | 3 | 59:50 (3,590s), 1:00:53 (3,653s), 59:50 (3,590s) & 59:44 (3,584s) | 3,604.25s (1:00:04) | 6.00
GC3.1, older setup (u-ai403) | 48*24 | 9*8 | 40*42 | 67 | 3 | 51:56 (3,116s), 53:51 (3,211s), 52:08 (3,128s) & 52:51 (3,171s) | 3,156.5s (52:37) | 6.84
*These runs had ios_thread_0_calls_mpi, ios_use_async_dump and ios_use_async_stash set to .true., and they should be .false.
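
For reference, the Speed column is just the cycle length converted to model years divided by the average cycle time converted to days. A minimal sketch of the conversion (plain Python, not part of any suite; the two examples reproduce the 4.15 and 6.84 figures above to rounding):

    # model years/day = (cycle length in years) / (average cycle time in days)
    def model_years_per_day(cycle_months, cycle_seconds):
        return (cycle_months / 12.0) / (cycle_seconds / 86400.0)

    print(model_years_per_day(1, 1736.75))  # ~4.15 (first u-ak109 row above)
    print(model_years_per_day(3, 3156.5))   # ~6.84 (u-ai403 on a 3 month cycle)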

Why has our GC3.1 configuration lost so much speed?

Comparing our older and newer setups for GC3.1, we appear to have lost about 0.75 model years/day. Why?

All the configurations below are

  • GC3.1
  • (ATM_PROCX,ATM_PROCY)=(48,24)
  • (NEMO_IPROC,NEMO_JPROC)=(9,8)
  • (CICE_BLKX,CICE_BLKY)=(40,42)
  • Total nodes = 67 (except u-ak838, which has one extra node for XIOS)
  • Run for four one month cycles
Job suite | Description | Times for first 4 months | Average time for 1 month | Speed (model yrs/day)
u-ai403 | Older setup | 18:35 (1,115s), 21:24 (1,284s), 19:02 (1,142s) & 19:11 (1,151s) | 1,173s (19:33) | 6.14
u-ai403+ | Older setup | 18:11 (1,091s), 18:56 (1,136s), 19:22 (1,162s) & 18:09 (1,089s) | 1,119.5s (18:40) | 6.43
u-ak734* | Newer setup | 22:03 (1,323s), 22:20 (1,340s), 22:19 (1,339s) & 22:32 (1,352s) | 1,338.5s (22:19) | 5.38
u-ak770 | Older setup but newer revision of GO6 package branch** | 20:57 (1,257s), 19:58 (1,198s), 21:17 (1,277s) & 20:33 (1,233s) | 1,241.25s (20:41) | 5.85
u-ak770+ | Older setup but newer revision of GO6 package branch** | 20:30 (1,230s), 20:52 (1,252s), 21:12 (1,272s) & 20:59 (1,259s) | 1,253.25s (20:53) | 5.75
u-ak838+ | Older setup but newer revision of GO6 package branch** and detached XIOS | 18:54 (1,134s), 19:07 (1,147s), 20:15 (1,215s) & 19:29 (1,169s) | 1,166.25s (19:26) | 6.17
*These runs had ios_thread_0_calls_mpi, ios_use_async_dump and ios_use_async_stash set to .true., and they should be .false.
**Moving from revision 7197 of GO6 package branch to 7651
+With NEMO timing on

Activating the timing routines is done via the following path (a namelist sketch is given after this list):

  • ->nemo_cice
    • ->namelist
      • ->NEMO namelist
        • ->Miscellaneous Namelists
          • ->Control Prints & Benchmarks (namctl)
            • ->nn_timing: Activate timing routines
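
In the NEMO namelist itself this corresponds to something like the following (a sketch of the namctl block only, with other namctl entries left at their defaults):

    &namctl
        nn_timing = 1    ! 1 = activate the NEMO timing routines, 0 = off
    /
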
Timing section | u-ai403 (older GO6 package branch) | u-ak770 (new GO6 package branch) | u-ak838 (new GO6 package branch & detached XIOS)
dia_wri | 2 | 148 | 145
Average CPU | 981 | 1,162 | 1,023

It looks like we've added more than 2 minutes to the ocean run with the extra diagnostics, and a significant amount of this is probably added at the end of the run. I've tried detaching XIOS again, but it isn't showing much improvement, if any.

Can we speed up GC3.1 + MEDUSA?

I can understand why the extra diagnostics have slowed down our GC3.1 configuration, but our GC3.1 + MEDUSA is still looking very slow. Do we need to add further nodes to the ocean?

All the configurations below are

  • u-ak109
  • (ATM_PROCX,ATM_PROCY)=(48,24)
  • OMPTHR_ATM=2
  • IOS_NPROC=6
  • Atmosphere nodes=65
  • Run for four one month cycles
NEMO_IPROC*NEMO_JPROC | CICE_BLKX*CICE_BLKY | Total nodes | Times for first 4 months | Average time for 1 month | Speed (model yrs/day)
12*9 | 30*37 | 68 | 35:10 (2,110s), 33:01 (1,987s), 34:42 (2,082s) & 34:20 (2,060s) | 2,059.75s (34:20) | 3.50
12*12 | 30*28 | 69 | 30:18 (1,818s), 30:46 (1,846s), 28:59 (1,739s) & 30:01 (1,801s) | 1,801s (30:01) | 4.00
18*12 | 20*28+ | 71 | 27:07 (1,627s), 27:12 (1,632s), 28:05 (1,685s) & 26:58 (1,618s) | 1,640.5s (27:21) | 4.39
18*16 | 20*21+ | 73 | 33:52 (2,032s), 33:28 (2,008s), 33:50 (2,057s) & 32:42 (1,962s) | 2,014.75s (33:35) | 3.57
+With NEMO timing on
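
As a sanity check, the node totals in these tables are consistent with 36-core nodes; that core count is an assumption about the XCS rather than something recorded in the suites. A rough sketch of the accounting (hypothetical helper functions, plain Python):

    import math

    CORES_PER_NODE = 36  # assumed core count per XCS node

    def atm_nodes(procx, procy, omp_threads, ios_nproc):
        # UM atmosphere: compute plus IO server MPI tasks, each with omp_threads threads
        threads = (procx * procy + ios_nproc) * omp_threads
        return math.ceil(threads / CORES_PER_NODE)

    def ocean_nodes(iproc, jproc):
        # NEMO-CICE decomposition: one core per MPI task
        return math.ceil(iproc * jproc / CORES_PER_NODE)

    print(atm_nodes(48, 24, 2, 6))   # 65 atmosphere nodes, as quoted above
    print(65 + ocean_nodes(12, 9))   # 68 nodes in total for the 12*9 row
    print(65 + ocean_nodes(18, 12))  # 71 nodes in total for the 18*12 row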

What speed is UKESM-CN?

Before Christmas, I got 5.12 model years/day out of UKESM0.4-CN for a 3 month cycle, and you wouldn't expect GC3.1 + MEDUSA to be slower than UKESM0.5-CN.

All the configurations below are

  • (ATM_PROCX,ATM_PROCY)=(48,24)
  • OMPTHR_ATM=2
  • IOS_NPROC=6
  • Run for four one month cycles
Description | NEMO_IPROC*NEMO_JPROC | CICE_BLKX*CICE_BLKY | Total nodes | Times for first 4 months | Average time for 1 month | Speed (model yrs/day)
UKESM0.4-CN, IOS_NPROC=0 (u-ai432) | 12*9 | 30*37 | 67 | 25:06 (1,506s), 24:54 (1,494s), 25:34 (1,534s) & 25:24 (1,524s) | 1,514.5s (25:15) | 4.75
UKESM0.4-CN, IOS_NPROC=6 (u-ai432) | 12*9 | 30*37 | 68 | 24:51 (1,491s), 24:32 (1,472s), 25:15 (1,515s) & 24:59 (1,462s) | 1,485s (24:45) | 4.85
UKESM0.5-CN (u-aj599) | 12*9 | 30*37 | 68 | 33:59 (2,039s), 34:56 (2,096s), 34:59 (2,099s) & 34:37 (2,077s) | 2,077.75s (34:38) | 3.47

I've done some profiling of UKESM0.4-CN versus UKESM0.5-CN and this suggests the extra time in UKESM0.5-CN comes from OASIS3_GETO2A, which points at the ocean.

Looking at the NEMO timers for MEDUSA + GC3.1

Elapsed time in PE 0 in seconds.

Section | Time for 12*9 (3 nodes) | Time for 18*16 (8 nodes) | (Time for 12*9)/(Time for 18*16)
sbc_cpl_rcv | 42 | 326 | 0.129
tra_adv_muscl | 281 | 105 | 2.68
tra_ldf_iso | 256 | 86 | 2.98
trc_sbc | 158 | 84 | 1.83
dia_wri | 123 | 65 | 1.89
sbc_ice_cice | 78 | 49 | 1.59
trc_sms | 82 | 49 | 1.67

The ratio of 8 nodes to 3 nodes is 2.67, so timing regions which scale with cores should show roughly this number in the last column. Both TRA_ADV_MUSCL and TRA_LDF_ISO have ratios close to this, while the other routines do not scale as well. The time in the `sbc_cpl_rcv' region is much greater for the 8 node job, which suggests it's waiting around a lot.
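
A quick check of those ratios against the ideal 8/3 speed-up, using the PE 0 times from the table above (plain Python; the ice and biology routines are omitted for brevity):

    # Ideal speed-up going from 3 ocean nodes to 8 is 8/3 ~ 2.67; regions dominated
    # by computation should approach it, regions dominated by waiting or I/O will not.
    times = {                      # PE 0 seconds: (12*9 on 3 nodes, 18*16 on 8 nodes)
        "sbc_cpl_rcv":   (42, 326),
        "tra_adv_muscl": (281, 105),
        "tra_ldf_iso":   (256, 86),
        "dia_wri":       (123, 65),
    }
    for name, (t3, t8) in times.items():
        print(f"{name:14s} ratio {t3 / t8:5.2f}  (ideal {8 / 3:.2f})")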

OASIS timers

I've also looked at the OASIS timers which show the total wall time to be

Component | 3 nodes | 8 nodes
toyatm | 1,735s | 1,102s
toyocn | 2,005s | 1,854s

which shows that the wall time in the ocean is much greater than in the atmosphere for both jobs. I think this extra time spent waiting for the ocean is the ocean writing its restart dumps and diagnostics, and throwing extra cores at this will probably slow the process down (because there is potentially more gathering).

Updating to UM10.7

I've copied u-ak109 to u-al312 and updated this new job to UM10.7. I've taken the best configuration for u-ak109 from above

  • (ATM_PROCX,ATM_PROCY)=(48,24)
  • OMPTHR_ATM=2
  • IOS_NPROC=6
  • (NEMO_IPROC,NEMO_JPROC)=(12,9)
  • (CICE_BLKX,CICE_BLKY)=(30,37)
  • Total nodes=68

The first four months took 36:12 (2,172s), 36:56 (2,216s), 34:46 (2,086s) and 36:38 (2,198s), an average of 2,168s (36:08), which is a speed of 3.32 model years/day.

Moving to 3 month cycle

I'm copying u-al312 to u-al873. I've only been able to get the `split_freq' addition to iodef.xml to work with detached XIOS.
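
For reference, both the one_file/multiple_file switch and split_freq live in iodef.xml; a minimal sketch of the relevant attributes (the file id and frequencies here are illustrative, not copied from the suite):

    <file_definition type="multiple_file">   <!-- or type="one_file" -->
      <file id="ocean_diagnostics" output_freq="1mo" split_freq="1mo">
        <!-- field definitions as in the full iodef.xml -->
      </file>
    </file_definition>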

All the configurations below are

  • (ATM_PROCX,ATM_PROCY)=(48,24)
  • OMPTHR_ATM=2
  • IOS_NPROC=6
  • (NEMO_IPROC,NEMO_JPROC)=(12,9)
  • (CICE_BLKX,CICE_BLKY)=(30,37)
  • XIOS_NPROC=6
  • Total nodes=69
  • Run for three three month cycles
Description | Times for 3 month cycles | Average time for one three month cycle | Speed (model yrs/day)
u-al873, "one_file", full iodef.xml, XIOS on one month cycle | 1:20:19 (4,819s), 1:22:10 (4,930s) & 1:20:30 (4,830s) | 4,860s (1:21:00) | 4.44
u-al873, "multiple_file", full iodef.xml, XIOS on one month cycle | 1:26:50 (5,210s), 1:25:52 (5,152s) & 1:23:34 (5,014s) | 5,125s (1:25:25) | 4.21
u-am012, "one_file", full iodef.xml, XIOS on ten day cycle | 1:24:49 (5,089s), 1:25:53 (5,153s) & 1:25:42 (5,142s) | 5,128s (1:25:28) | 4.21
u-am012, "multiple_file", full iodef.xml, XIOS on ten day cycle | 1:24:21 (5,061s), 1:24:20 (5,060s) & 1:24:06 (5,046s) | 5,056s (1:24:16) | 4.27
u-am015, "multiple_file", removed groupMEDUSA_cmip6 from iodef.xml, XIOS on one month cycle | 1:21:37 (4,897s), 1:21:58 (4,918s) & 1:21:08 (4,868s) | 4,894s (1:21:34) | 4.41

Conclusions

  • The comparisons between one_file and multiple_file, and between XIOS on a 10 day cycle and a 1 month cycle, are far from conclusive (I think the first run for u-al873 was done on a Thursday evening when the XCS was probably quiet). Given this:
    • Previous testing has suggested that multiple_file is quicker, which is what we expect, so I'll stick with this.
    • If there's little difference between the 10 day and one month XIOS cycle, we might as well stick with what we're given - currently this is the one month cycle.
  • Very roughly, groupMEDUSA_cmip6 seems to add about 2.5 minutes to the run

Benchmark against GC3.1 N96/ORCA1

I've created my latest GC3.1 N96/ORCA1 job, u-am151, and I'm running it with

  • (ATM_PROCX,ATM_PROCY)=(48,24)
  • OMPTHR_ATM=2
  • IOS_NPROC=6
  • XIOS_NPROC=6
  • Run for three three month cycles
  • "multiple_file"
  • full iodef.xml
  • XIOS on one month cycle

to get an idea of what MEDUSA is adding to the 3 month cycle.

NEMO_IPROC*NEMO_JPROC | CICE_BLKX*CICE_BLKY | Total nodes | Times for 3 month cycles | Average time for one three month cycle | Speed (model yrs/day)
9*8 | 40*42 | 68 | 1:04:32 (3,872s), 59:43 (3,583s) & 58:08 (3,488s) | 3,648s (1:00:48) | 5.92
12*9 | 30*37 | 69 | 55:20 (3,320s), 56:48 (3,408s) & 56:31 (3,391s) | 3,373s (56:13) | 6.40
12*12 | 30*28 | 70 | 57:26 (3,446s), 58:40 (3,520s) & 1:00:14 (3,614s) | 3,527s (58:47) | 6.12

I think that the 12*12 run being significantly slower than the 12*9 run was probably down to heavy load on XCS at the time (some of my urgent jobs were queuing), but it probably does suggest that 12*9 is enough ocean cores to achieve optimum speed.

Adding more PEs to ocean for MEDUSA + GC3.1

These runs are all done with u-am015

  • (ATM_PROCX,ATM_PROCY)=(48,24)
  • OMPTHR_ATM=2
  • IOS_NPROC=6
  • XIOS_NPROC=6
  • One month cycle for XIOS
  • Using "multiple_file"
  • Removed groupMEDUSA_cmip6
  • Run for three three month cycles
NEMO_IPROC*NEMO_JPROC | CICE_BLKX*CICE_BLKY | Total nodes | Times for 3 month cycles | Average time for one three month cycle | Speed (model yrs/day)
12*9 | 30*37 | 69 | 1:21:37 (4,897s), 1:21:58 (4,918s) & 1:21:08 (4,868s) | 4,894s (1:21:34) | 4.41
12*12 | 30*28 | 70 | 1:03:33 (3,813s), 1:04:45 (3,885s) & 1:03:21 (3,801s) | 3,833s (1:03:53) | 5.64
18*12 | 20*28 | 72 | 53:56 (3,236s), 56:17 (3,377s) & 55:20 (3,320s) | 3,311s (55:11) | 6.52
18*16 | 20*21 | 74 | 56:27 (3,387s), 56:28 (3,388s) & 56:33 (3,393s) | 3,389s (56:29) | 6.37

Like the 12*12 run above, I think the 18*16 run was made when the load on XCS was heavy.

6 month cycle

I'm copying u-am015 to u-am198 and I'll try running a 6 month cycle.

  • (ATM_PROCX,ATM_PROCY)=(48,24)
  • OMPTHR_ATM=2
  • IOS_NPROC=6
  • XIOS_NPROC=6
  • One month cycle for XIOS
  • Using "multiple_file"
  • Removed groupMEDUSA_cmip6
  • Run for three six month cycles
NEMO_IPROC*NEMO_JPROC | CICE_BLKX*CICE_BLKY | Total nodes | Times for 6 month cycles | Average time for one six month cycle | Speed (model yrs/day)
12*12 | 30*28 | 70 | 2:05:58 (7,558s), 2:08:15 (7,695s) & 2:06:39 (7,599s) | 7,617s (2:06:57) | 5.67
18*12 | 20*28 | 72 | 1:48:22 (6,502s), 1:52:01 (6,721s) & 1:51:54 (6,714s) | 6,646s (1:50:46) | 6.50

The speed of this configuration on a 3 month cycle was 5.64 model years/day, so this doesn't look a lot faster.

3 OpenMP threads

I'm copying u-am015 to u-am197 and I'll try running with 3 OpenMP threads

  • (ATM_PROCX,ATM_PROCY)=(48,24)
  • OMPTHR_ATM=3
  • IOS_NPROC=6
  • XIOS_NPROC=6
  • One month cycle for XIOS
  • Using "multiple_file"
  • Removed groupMEDUSA_cmip6
  • Run for three three month cycles

Meanwhile, Maff thinks his OpenMP in aerosol chemistry branch is almost finished and good enough for speed tests. I'm copying u-am197 to u-am274 and adding in Maff's branch.

Aerosol OpenMP | NEMO_IPROC*NEMO_JPROC | CICE_BLKX*CICE_BLKY | Total nodes | Times for 3 month cycles | Average time for one three month cycle | Speed (model yrs/day)
No | 18*12 | 20*28 | 103 | 52:02 (3,122s), 53:36 (3,216s) & 53:19 (3,199s) | 3,179s (52:59) | 6.79
No | 18*16 | 20*21 | 105 | 53:28 (3,208s), 54:01 (3,241s) & 52:29 (3,149s) | 3,199s (53:19) | 6.75
Yes | 18*18 | 20*19 | 107 | 52:27 (3,147s), 50:34 (3,034s) & 51:13 (3,073s) | 3,085s (51:25) | 7.00

Summary

The main points

  • With the extra diagnostics, we should definitely use detached XIOS
  • Using a 3 month cycle means that most of the I/O can be done concurrently
  • The 6 month cycle gives some speed-up, but probably not enough to justify the risk of losing a whole 6 months in the event of a crash.
  • Once XIOS is detached and we run 3 month cycles, most of the I/O is done concurrently, so it doesn't make much difference whether we use one_file or multiple_file, use a 10 day or 1 month cycle for XIOS, or strip diagnostics out of XIOS (which only saves a few minutes at the end).
  • Cutting groupMEDUSA_cmip6 seemed to save a few minutes.

Cleaning up the creation of the job

I've taken a number of steps and they haven't all been in a straight line, so I'm starting again from the MEDUSA + GC3.1 job I had a while ago, u-al312, and copying it to u-am254. The configuration is

  • (ATM_PROCX,ATM_PROCY)=(48,24)
  • OMPTHR_ATM=2
  • IOS_NPROC=6
  • Total ATMOS nodes=65
  • (NEMO_IPROC,NEMO_JPROC)=(18,12)
  • (CICE_BLKX,CICE_BLKY)=(20,28)
  • XIOS_NPROC=6
  • Total OCEAN nodes=7
  • Total nodes=72
  • iodef.xml with "one_file", a one month cycle, and left complete (I've not cut anything)
Length of cycle | Times for three cycles | Average time for one cycle | Speed (model yrs/day)
One month | 20:03 (1,203s), 22:00 (1,320s) & 20:14 (1,214s) | 1,246s (20:46) | 5.78
Three months | 56:52 (3,412s), 56:50 (3,410s) & 56:39 (3,399s) | 3,407s (56:47) | 6.34
Six months | Crashed with problem writing qtrCFC11 | |

Adding in Maff's OpenMP branch

In u-am276, I added the new land-sea mask to u-am254. To this I've added Maff's OpenMP branch to create u-am300, and then removed groupMEDUSA_cmip6 to create u-am354.

  • (ATM_PROCX,ATM_PROCY)=(48,24)
  • OMPTHR_ATM=2
  • IOS_NPROC=6
  • XIOS_NPROC=6
  • One month cycle for XIOS
  • Using "one_file"
  • Full iodef.xml or removed groupMEDUSA_cmip6.
  • Run for three three month cycles
iodef.xml | NEMO_IPROC*NEMO_JPROC | CICE_BLKX*CICE_BLKY | Total nodes | Times for 3 month cycles | Average time for one three month cycle | Speed (model yrs/day)
Full | 18*12 | 20*28 | 72 | 52:43 (3,163s), 56:37 (3,397s) & 56:25 (3,385s) | 3,315s (55:15) | 6.52
Full | 18*14 | 20*24 | 73 | 52:05 (3,125s), 52:39 (3,159s) & 52:52 (3,172s) | 3,152s (52:32) | 6.85
No MEDUSA CMIP6 | 18*14 | 20*24 | 73 | 53:01 (3,181s), 53:46 (3,226s) & 56:42 (3,402s) | 3,270s (54:30) | 6.61
Full | 18*16 | 20*21 | 74 | 52:38 (3,158s), 52:35 (3,155s) & 55:34 (3,334s) | 3,216s (53:36) | 6.72
No MEDUSA CMIP6 | 18*16 | 20*21 | 74 | 54:48 (3,288s), 54:04 (3,244s) & 54:06 (3,246s) | 3,259s (54:19) | 6.63

Chosen configuration and options for speeding this up

I've chosen the last configuration above on the basis that it looks like we can't get any extra speed by adding more ocean cores, and given the imbalance between resources for the atmosphere and the ocean, it's probably best to have one more ocean node than I think we need.

  • (ATM_PROCX,ATM_PROCY)=(48,24)
  • OMPTHR_ATM=2
  • IOS_NPROC=6
  • Total ATMOS nodes=65
  • (NEMO_IPROC,NEMO_JPROC)=(18,16)
  • (CICE_BLKX,CICE_BLKY)=(20,21)
  • XIOS_NPROC=6
  • Total OCEAN nodes=9
  • Total nodes=74
  • One month cycle for XIOS
  • Using "one_file"
  • Removed groupMEDUSA_cmip6 from iodef.xml
  • Run for three three month cycles

Other things we could do to speed this up are:

  • Use 3 OpenMP threads in the atmosphere, gaining ~0.5 model years/day
  • Use "multiple_file", which should save a bit, although I've not been able to notice it
  • Cut groupMEDUSA_cmip6, which gained about 2 minutes or ~0.25 model years/day; this is what I've done. Culling other groups might gain a similar amount
  • Add Maff's extra compiler options, which should gain about 10-15%, so ~0.7 model years/day - but we know there's a risk attached to this as they change results

Coming back to 3 OpenMP threads

I think I ought to be able to get more than the extra 0.5 model years/day that I got when I tried this above (did I really have enough ocean nodes when I did that test?). The job u-am375 is a copy of u-am354 with the 3rd OpenMP thread added.

  • (ATM_PROCX,ATM_PROCY)=(48,24)
  • OMPTHR_ATM=3
  • IOS_NPROC=6
  • XIOS_NPROC=6
  • One month cycle for XIOS
  • Using "one_file"
  • Removed groupMEDUSA_cmip6 from iodef.xml
  • Run for three three month cycles
NEMO_IPROC*NEMO_JPROC | CICE_BLKX*CICE_BLKY | Total nodes | Times for 3 month cycles | Average time for one three month cycle | Speed (model yrs/day)
18*16 | 20*21++ | 106 | 47:39 (2,859s), 52:02 (3,122s) & 47:25 (2,845s) | 2,942s (49:02) | 7.34
18*18 | 20*19+ | 107 | 48:53 (2,933s), 47:11 (2,831s) & 47:52 (2,872s) | 2,879s (47:59) | 7.50
20*18 | 18*19 | 108 | 48:46 (2,926s), 49:02 (2,942s) & 48:48 (2,928s) | 2,932s (48:52) | 7.37
24*18 | 15*19 | 110 | 47:38 (2,858s), 48:53 (2,933s) & 52:34 (3,154s) | 2,982s (49:42) | 7.24
+The queued times were 56:15 (Friday evening), 1:06 & 0:34.
++The queued times were 44:26, 1:11 & 0:39 (all Monday morning)

I can probably gain about 0.65 model years/day by using ~33 extra nodes, but I think it does make it more likely that we'll be stuck in long queues.

Archiving two dumps in one 3 month cycle

We need to archive both the start-of-December and the start-of-January dumps each year. This means that we'll need to create an ocean dump every month instead of every 3 months, and we need to be able to archive two dumps, one of which is not at the end of the run. I also need to know how much this slows the run down.