Speed tests for MEDUSA + GC3.1

We may need to spin up with a coupled model. The speed tests below are with u-ak109, with settings

  • UM10.6
  • OMPTHR_ATM=2
  • IOS_NPROC=6
  • u-ai403, u-ak770 & u-ak838 use a 10 day dump for NEMO, but u-ak734 uses a one month dump. Currently only the one month dump produces all the data we need.

Before Christmas, with a 3 month cycle, I got 6.63 model years/day (see table from UKESM core presentation). I was hoping to get something similar for this run. My notes on this are on p130 of UKESM General III, and the job suite was u-ai403.

u-ak734 is a copy of u-aj659.

Description (job suite) | ATM_PROCX*ATM_PROCY | NEMO_IPROC*NEMO_JPROC | CICE_BLKX*CICE_BLKY | Total nodes | Cycle length (months) | Times for first 4 cycles | Average time for 1 cycle | Speed (model yrs/day)
GC3.1 + MEDUSA* (u-ak109) | 32*22 | 12*12 | 30*28 | 44 | 1 | 27:35 (1,655s), 30:34 (1,834s), 28:45 (1,725s) & 29:36 (1,733s) | 1,736.75s (28:57) | 4.15
GC3.1 + MEDUSA* (u-ak109) | 48*24 | 12*12 | 30*28 | 69 | 1 | 30:18 (1,818s), 30:46 (1,846s), 28:59 (1,739s) & 30:01 (1,801s) | 1,801s (30:01) | 4.00
GC3.1 + MEDUSA* (u-ak109) | 48*24 | 12*12 | 30*28 | 69 | 1 | 30:01 (1,801s), 29:27 (1,767s), 29:51 (1,791s) & 29:40 (1,780s) | 1,784.75s (29:45) | 4.03
GC3.1, older setup (u-ai403) | 48*24 | 9*8 | 40*42 | 67 | 1 | 18:35 (1,115s), 21:24 (1,284s), 19:02 (1,142s) & 19:11 (1,151s) | 1,173s (19:33) | 6.14
GC3.1, newer setup* (u-ak734) | 48*24 | 9*8 | 40*42 | 67 | 1 | 22:03 (1,323s), 22:20 (1,340s), 22:19 (1,339s) & 22:32 (1,352s) | 1,338.5s (22:19) | 5.38
GC3.1, newer setup (u-ak734) | 48*24 | 9*8 | 40*42 | 67 | 1 | 22:31 (1,351s), 23:39 (1,419s), 23:11 (1,391s) & 22:47 (1,367s) | 1,382s (23:02) | 5.21
GC3.1, newer setup (u-ak734) | 48*24 | 12*12 | 30*28 | 69 | 1 | 21:31 (1,291s), 22:05 (1,325s), 22:13 (1,333s) & 22:13 (1,333s) | 1,320.5s (22:01) | 5.45
GC3.1, newer setup* (u-ak734) | 48*24 | 9*8 | 40*42 | 67 | 3 | 59:50 (3,590s), 1:00:53 (3,653s), 59:50 (3,590s) & 59:44 (3,584s) | 3,604.25s (1:00:04) | 6.00
GC3.1, older setup (u-ai403) | 48*24 | 9*8 | 40*42 | 67 | 3 | 51:56 (3,116s), 53:51 (3,211s), 52:08 (3,128s) & 52:51 (3,171s) | 3,156.5s (52:37) | 6.84
*These runs had ios_thread_0_calls_mpi, ios_use_async_dump and ios_use_async_stash set to .true., and they should be .false.
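
For reference, the Speed column is just the cycle length converted to model years divided by the average cycle time converted to days. A minimal sketch of the conversion (plain Python, not part of any suite; the two examples reproduce the 4.15 and 6.84 figures above to rounding):

    # model years/day = (cycle length in years) / (average cycle time in days)
    def model_years_per_day(cycle_months, cycle_seconds):
        return (cycle_months / 12.0) / (cycle_seconds / 86400.0)

    print(model_years_per_day(1, 1736.75))  # ~4.15 (first u-ak109 row above)
    print(model_years_per_day(3, 3156.5))   # ~6.84 (u-ai403 on a 3 month cycle)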

Why has our GC3.1 configuration lost so much speed?

Comparing our older and newer setups for GC3.1, we appear to have lost about 0.75 model years/day. Why?

All the configurations below are

  • GC3.1
  • (ATM_PROCX,ATM_PROCY)=(48,24)
  • (NEMO_IPROC,NEMO_JPROC)=(9,8)
  • (CICE_BLKX,CICE_BLKY)=(40,42)
  • Total nodes = 67 (except u-ak838, which has one extra node for XIOS)
  • Run for four one month cycles
Job suite | Description | Times for first 4 months | Average time for 1 month | Speed (model yrs/day)
u-ai403 | Older setup | 18:35 (1,115s), 21:24 (1,284s), 19:02 (1,142s) & 19:11 (1,151s) | 1,173s (19:33) | 6.14
u-ai403+ | Older setup | 18:11 (1,091s), 18:56 (1,136s), 19:22 (1,162s) & 18:09 (1,089s) | 1,119.5s (18:40) | 6.43
u-ak734* | Newer setup | 22:03 (1,323s), 22:20 (1,340s), 22:19 (1,339s) & 22:32 (1,352s) | 1,338.5s (22:19) | 5.38
u-ak770 | Older setup but newer revision of GO6 package branch** | 20:57 (1,257s), 19:58 (1,198s), 21:17 (1,277s) & 20:33 (1,233s) | 1,241.25s (20:41) | 5.85
u-ak770+ | Older setup but newer revision of GO6 package branch** | 20:30 (1,230s), 20:52 (1,252s), 21:12 (1,272s) & 20:59 (1,259s) | 1,253.25s (20:53) | 5.75
u-ak838+ | Older setup but newer revision of GO6 package branch** and detached XIOS | 18:54 (1,134s), 19:07 (1,147s), 20:15 (1,215s) & 19:29 (1,169s) | 1,166.25s (19:26) | 6.17
*These runs had ios_thread_0_calls_mpi, ios_use_async_dump and ios_use_async_stash set to .true., and they should be .false.
**Moving from revision 7197 of GO6 package branch to 7651
+With NEMO timing on

Activating the timing routines is done via the following path (a namelist sketch is given after this list):

  • ->nemo_cice
    • ->namelist
      • ->NEMO namelist
        • ->Miscellaneous Namelists
          • ->Control Prints & Benchmarks (namctl)
            • ->nn_timing: Activate timing routines
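
In the NEMO namelist itself this corresponds to something like the following (a sketch of the namctl block only, with other namctl entries left at their defaults):

    &namctl
        nn_timing = 1    ! 1 = activate the NEMO timing routines, 0 = off
    /
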
Timing section | u-ai403 (older GO6 package branch) | u-ak770 (new GO6 package branch) | u-ak838 (new GO6 package branch & detached XIOS)
dia_wri | 2 | 148 | 145
Average CPU | 981 | 1,162 | 1,023

It looks like we've added more than 2 minutes to the ocean run with the extra diagnostics, and a significant amount of this is probably added at the end of the run. I've tried detaching XIOS again, but it isn't showing much improvement, if any.

Can we speed up GC3.1 + MEDUSA?

I can understand why the extra diagnostics have slowed down our GC3.1 configuration, but our GC3.1 + MEDUSA is still looking very slow. Do we need to add further nodes to the ocean?

All the configurations below are

  • u-ak109
  • (ATM_PROCX,ATM_PROCY)=(48,24)
  • OMPTHR_ATM=2
  • IOS_NPROC=6
  • Atmosphere nodes=65
  • Run for four one month cycles
NEMO_IPROC*NEMO_JPROC | CICE_BLKX*CICE_BLKY | Total nodes | Times for first 4 months | Average time for 1 month | Speed (model yrs/day)
12*9 | 30*37 | 68 | 35:10 (2,110s), 33:01 (1,987s), 34:42 (2,082s) & 34:20 (2,060s) | 2,059.75s (34:20) | 3.50
12*12 | 30*28 | 69 | 30:18 (1,818s), 30:46 (1,846s), 28:59 (1,739s) & 30:01 (1,801s) | 1,801s (30:01) | 4.00
18*12 | 20*28+ | 71 | 27:07 (1,627s), 27:12 (1,632s), 28:05 (1,685s) & 26:58 (1,618s) | 1,640.5s (27:21) | 4.39
18*16 | 20*21+ | 73 | 33:52 (2,032s), 33:28 (2,008s), 33:50 (2,057s) & 32:42 (1,962s) | 2,014.75s (33:35) | 3.57
+With NEMO timing on
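
As a sanity check, the node totals in these tables are consistent with 36-core nodes; that core count is an assumption about the XCS rather than something recorded in the suites. A rough sketch of the accounting (hypothetical helper functions, plain Python):

    import math

    CORES_PER_NODE = 36  # assumed core count per XCS node

    def atm_nodes(procx, procy, omp_threads, ios_nproc):
        # UM atmosphere: compute plus IO server MPI tasks, each with omp_threads threads
        threads = (procx * procy + ios_nproc) * omp_threads
        return math.ceil(threads / CORES_PER_NODE)

    def ocean_nodes(iproc, jproc):
        # NEMO-CICE decomposition: one core per MPI task
        return math.ceil(iproc * jproc / CORES_PER_NODE)

    print(atm_nodes(48, 24, 2, 6))   # 65 atmosphere nodes, as quoted above
    print(65 + ocean_nodes(12, 9))   # 68 nodes in total for the 12*9 row
    print(65 + ocean_nodes(18, 12))  # 71 nodes in total for the 18*12 row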

What speed is UKESM-CN?

Before Christmas, I got 5.12 model years/day out of UKESM0.4-CN for a 3 month cycle, and you wouldn't expect GC3.1 + MEDUSA to be slower than UKESM0.5-CN.

All the configurations below are

  • (ATM_PROCX,ATM_PROCY)=(48,24)
  • OMPTHR_ATM=2
  • IOS_NPROC=6
  • Run for four one month cycles
Description | NEMO_IPROC*NEMO_JPROC | CICE_BLKX*CICE_BLKY | Total nodes | Times for first 4 months | Average time for 1 month | Speed (model yrs/day)
UKESM0.4-CN, IOS_NPROC=0 (u-ai432) | 12*9 | 30*37 | 67 | 25:06 (1,506s), 24:54 (1,494s), 25:34 (1,534s) & 25:24 (1,524s) | 1,514.5s (25:15) | 4.75
UKESM0.4-CN, IOS_NPROC=6 (u-ai432) | 12*9 | 30*37 | 68 | 24:51 (1,491s), 24:32 (1,472s), 25:15 (1,515s) & 24:59 (1,462s) | 1,485s (24:45) | 4.85
UKESM0.5-CN (u-aj599) | 12*9 | 30*37 | 68 | 33:59 (2,039s), 34:56 (2,096s), 34:59 (2,099s) & 34:37 (2,077s) | 2,077.75s (34:38) | 3.47

I've done some profiling of UKESM0.4-CN versus UKESM0.5-CN and this suggests the extra time in UKESM0.5-CN comes from OASIS3_GETO2A, which points at the ocean.

Looking at the NEMO timers for MEDUSA + GC3.1

Elapsed time in PE 0 in seconds.

Section | Time for 12*9 (3 nodes) | Time for 18*16 (8 nodes) | (Time for 12*9)/(Time for 18*16)
sbc_cpl_rcv | 42 | 326 | 0.129
tra_adv_muscl | 281 | 105 | 2.68
tra_ldf_iso | 256 | 86 | 2.98
trc_sbc | 158 | 84 | 1.83
dia_wri | 123 | 65 | 1.89
sbc_ice_cice | 78 | 49 | 1.59
trc_sms | 82 | 49 | 1.67

The ratio of 8 nodes to 3 nodes is 2.67, so timing regions which scale with cores should show roughly this number in the last column. Both TRA_ADV_MUSCL and TRA_LDF_ISO have ratios close to this, while the other routines do not scale as well. The time in the `sbc_cpl_rcv' region is much greater for the 8 node job, which suggests it's waiting around a lot.
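
A quick check of those ratios against the ideal 8/3 speed-up, using the PE 0 times from the table above (plain Python; the ice and biology routines are omitted for brevity):

    # Ideal speed-up going from 3 ocean nodes to 8 is 8/3 ~ 2.67; regions dominated
    # by computation should approach it, regions dominated by waiting or I/O will not.
    times = {                      # PE 0 seconds: (12*9 on 3 nodes, 18*16 on 8 nodes)
        "sbc_cpl_rcv":   (42, 326),
        "tra_adv_muscl": (281, 105),
        "tra_ldf_iso":   (256, 86),
        "dia_wri":       (123, 65),
    }
    for name, (t3, t8) in times.items():
        print(f"{name:14s} ratio {t3 / t8:5.2f}  (ideal {8 / 3:.2f})")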

OASIS timers

I've also looked at the OASIS timers which show the total wall time to be

Component | 3 nodes | 8 nodes
toyatm | 1,735s | 1,102s
toyocn | 2,005s | 1,854s

which shows that the wall time in the ocean is much greater than in the atmosphere for both jobs. I think this extra time spent waiting for the ocean is the ocean writing its restart dumps and diagnostics, and throwing extra cores at this will probably slow the process down (because there is potentially more gathering).

Updating to UM10.7

I've copied u-ak109 to u-al312 and updated this new job to UM10.7. I've taken the best configuration for u-ak109 from above

  • (ATM_PROCX,ATM_PROCY)=(48,24)
  • OMPTHR_ATM=2
  • IOS_NPROC=6
  • (NEMO_IPROC,NEMO_JPROC)=(12,9)
  • (CICE_BLKX,CICE_BLKY)=(30,37)
  • Total nodes=68

The first four months took 36:12 (2,172s), 36:56 (2,216s), 34:46 (2,086s) and 36:38 (2,198s), an average of 2,168s (36:08), which is a speed of 3.32 model years/day.

Moving to 3 month cycle

I'm copying u-al312 to u-al873. I've only been able to get the `split_freq' addition to iodef.xml to work with detached XIOS.
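
For reference, both the one_file/multiple_file switch and split_freq live in iodef.xml; a minimal sketch of the relevant attributes (the file id and frequencies here are illustrative, not copied from the suite):

    <file_definition type="multiple_file">   <!-- or type="one_file" -->
      <file id="ocean_diagnostics" output_freq="1mo" split_freq="1mo">
        <!-- field definitions as in the full iodef.xml -->
      </file>
    </file_definition>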

All the configurations below are

  • (ATM_PROCX,ATM_PROCY)=(48,24)
  • OMPTHR_ATM=2
  • IOS_NPROC=6
  • (NEMO_IPROC,NEMO_JPROC)=(12,9)
  • (CICE_BLKX,CICE_BLKY)=(30,37)
  • XIOS_NPROC=6
  • Total nodes=69
  • Run for three three month cycles
Description | Times for 3 month cycles | Average time for one three month cycle | Speed (model yrs/day)
u-al873, "one_file", full iodef.xml, XIOS on one month cycle | 1:20:19 (4,819s), 1:22:10 (4,930s) & 1:20:30 (4,830s) | 4,860s (1:21:00) | 4.44
u-al873, "multiple_file", full iodef.xml, XIOS on one month cycle | 1:26:50 (5,210s), 1:25:52 (5,152s) & 1:23:34 (5,014s) | 5,125s (1:25:25) | 4.21
u-am012, "one_file", full iodef.xml, XIOS on ten day cycle | 1:24:49 (5,089s), 1:25:53 (5,153s) & 1:25:42 (5,142s) | 5,128s (1:25:28) | 4.21
u-am012, "multiple_file", full iodef.xml, XIOS on ten day cycle | 1:24:21 (5,061s), 1:24:20 (5,060s) & 1:24:06 (5,046s) | 5,056s (1:24:16) | 4.27
u-am015, "multiple_file", removed groupMEDUSA_cmip6 from iodef.xml, XIOS on one month cycle | 1:21:37 (4,897s), 1:21:58 (4,918s) & 1:21:08 (4,868s) | 4,894s (1:21:34) | 4.41

Conclusions

  • The comparisons between one_file and multiple_file, and between XIOS on a 10 day cycle and a 1 month cycle, are far from conclusive (I think the first run for u-al873 was done on a Thursday evening when the XCS was probably quiet). Given this:
    • Previous testing has suggested that multiple_file is quicker, which is what we expect, so I'll stick with this.
    • If there's little difference between the 10 day and one month XIOS cycle, we might as well stick with what we're given - currently this is the one month cycle.
  • Very roughly, groupMEDUSA_cmip6 seems to add about 2.5 minutes to the run

Benchmark against GC3.1 N96/ORCA1

I've created my latest GC3.1 N96/ORCA1 job, u-am151, and I'm running it with

  • (ATM_PROCX,ATM_PROCY)=(48,24)
  • OMPTHR_ATM=2
  • IOS_NPROC=6
  • XIOS_NPROC=6
  • Run for three three month cycles
  • "multiple_file"
  • full iodef.xml
  • XIOS on one month cycle

to get an idea of what MEDUSA is adding to the 3 month cycle.

NEMO_IPROC*NEMO_JPROC | CICE_BLKX*CICE_BLKY | Total nodes | Times for 3 month cycles | Average time for one three month cycle | Speed (model yrs/day)
9*8 | 40*42 | 68 | 1:04:32 (3,872s), 59:43 (3,583s) & 58:08 (3,488s) | 3,648s (1:00:48) | 5.92
12*9 | 30*37 | 69 | 55:20 (3,320s), 56:48 (3,408s) & 56:31 (3,391s) | 3,373s (56:13) | 6.40
12*12 | 30*28 | 70 | 57:26 (3,446s), 58:40 (3,520s) & 1:00:14 (3,614s) | 3,527s (58:47) | 6.12

I think that the 12*12 run being significantly slower than the 12*9 run was probably down to heavy load on XCS at the time (some of my urgent jobs were queuing), but it probably does suggest that 12*9 is enough ocean cores to achieve optimum speed.

Adding more PEs to ocean for MEDUSA + GC3.1

These runs are all done with u-am015

  • (ATM_PROCX,ATM_PROCY)=(48,24)
  • OMPTHR_ATM=2
  • IOS_NPROC=6
  • XIOS_NPROC=6
  • One month cycle for XIOS
  • Using "multiple_file"
  • Removed groupMEDUSA_cmip6
  • Run for three three month cycles
NEMO_IPROC*NEMO_JPROC | CICE_BLKX*CICE_BLKY | Total nodes | Times for 3 month cycles | Average time for one three month cycle | Speed (model yrs/day)
12*9 | 30*37 | 69 | 1:21:37 (4,897s), 1:21:58 (4,918s) & 1:21:08 (4,868s) | 4,894s (1:21:34) | 4.41
12*12 | 30*28 | 70 | 1:03:33 (3,813s), 1:04:45 (3,885s) & 1:03:21 (3,801s) | 3,833s (1:03:53) | 5.64
18*12 | 20*28 | 72 | 53:56 (3,236s), 56:17 (3,377s) & 55:20 (3,320s) | 3,311s (55:11) | 6.52
18*16 | 20*21 | 74 | 56:27 (3,387s), 56:28 (3,388s) & 56:33 (3,393s) | 3,389s (56:29) | 6.37

Like the 12*12 run above, I think the 18*16 run was made when the load on XCS was heavy.

6 month cycle

I'm copying u-am015 to u-am198 and I'll try running a 6 month cycle.

  • (ATM_PROCX,ATM_PROCY)=(48,24)
  • OMPTHR_ATM=2
  • IOS_NPROC=6
  • XIOS_NPROC=6
  • One month cycle for XIOS
  • Using "multiple_file"
  • Removed groupMEDUSA_cmip6
  • Run for three six month cycles
NEMO_IPROC*NEMO_JPROC | CICE_BLKX*CICE_BLKY | Total nodes | Times for 6 month cycles | Average time for one six month cycle | Speed (model yrs/day)
12*12 | 30*28 | 70 | 2:05:58 (7,558s), 2:08:15 (7,695s) & 2:06:39 (7,599s) | 7,617s (2:06:57) | 5.67
18*12 | 20*28 | 72 | 1:48:22 (6,502s), 1:52:01 (6,721s) & 1:51:54 (6,714s) | 6,646s (1:50:46) | 6.50

The speed of this configuration on a 3 month cycle was 5.64 model years/day, so this doesn't look a lot faster.

3 OpenMP threads

I'm copying u-am015 to u-am197 and I'll try running with 3 OpenMP threads

  • (ATM_PROCX,ATM_PROCY)=(48,24)
  • OMPTHR_ATM=3
  • IOS_NPROC=6
  • XIOS_NPROC=6
  • One month cycle for XIOS
  • Using "multiple_file"
  • Removed groupMEDUSA_cmip6
  • Run for three three month cycles

Meanwhile, Maff thinks his OpenMP in aerosol chemistry branch is almost finished and good enough for speed tests. I'm copying u-am197 to u-am274 and adding in Maff's branch.

Aerosol OpenMP | NEMO_IPROC*NEMO_JPROC | CICE_BLKX*CICE_BLKY | Total nodes | Times for 3 month cycles | Average time for one three month cycle | Speed (model yrs/day)
No | 18*12 | 20*28 | 103 | 52:02 (3,122s), 53:36 (3,216s) & 53:19 (3,199s) | 3,179s (52:59) | 6.79
No | 18*16 | 20*21 | 105 | 53:28 (3,208s), 54:01 (3,241s) & 52:29 (3,149s) | 3,199s (53:19) | 6.75
Yes | 18*18 | 20*19 | 107 | 52:27 (3,147s), 50:34 (3,034s) & 51:13 (3,073s) | 3,085s (51:25) | 7.00

Summary

The main points

  • With the extra diagnostics, we should definitely use detached XIOS
  • Using a 3 month cycle means that most of the I/O can be done concurrently
  • The 6 month cycle gives some speed-up, but probably not enough to justify the risk of losing a whole 6 months in the event of a crash.
  • Once XIOS is detached and we run 3 month cycles, most of the I/O is done concurrently, so it doesn't make much difference whether we use one_file or multiple_file, use a 10 day or 1 month cycle for XIOS, or strip diagnostics out of XIOS (which only saves a few minutes at the end).
  • Cutting groupMEDUSA_cmip6 seemed to save a few minutes.

Cleaning up the creation of the job

I've taken a number of steps and they haven't all been in a straight line, so I'm starting again from the MEDUSA + GC3.1 job I had a while ago, u-al312, and copying it to u-am254. The configuration is

  • (ATM_PROCX,ATM_PROCY)=(48,24)
  • OMPTHR_ATM=2
  • IOS_NPROC=6
  • Total ATMOS nodes=65
  • (NEMO_IPROC,NEMO_JPROC)=(18,12)
  • (CICE_BLKX,CICE_BLKY)=(20,28)
  • XIOS_NPROC=6
  • Total OCEAN nodes=7
  • Total nodes=72
  • iodef.xml with "one_file", a one month cycle, and left complete (I've not cut anything)
Length of cycle | Times for three cycles | Average time for one cycle | Speed (model yrs/day)
One month | 20:03 (1,203s), 22:00 (1,320s) & 20:14 (1,214s) | 1,246s (20:46) | 5.78
Three months | 56:52 (3,412s), 56:50 (3,410s) & 56:39 (3,399s) | 3,407s (56:47) | 6.34
Six months | Crashed with problem writing qtrCFC11 | |

Adding in Maff's OpenMP branch

In u-am276, I added the new land-sea mask to u-am254. To this I've added Maff's OpenMP branch to create u-am300, and then removed groupMEDUSA_cmip6 to create u-am354.

  • (ATM_PROCX,ATM_PROCY)=(48,24)
  • OMPTHR_ATM=2
  • IOS_NPROC=6
  • XIOS_NPROC=6
  • One month cycle for XIOS
  • Using "one_file"
  • Full iodef.xml or removed groupMEDUSA_cmip6.
  • Run for three three month cycles
iodef.xml | NEMO_IPROC*NEMO_JPROC | CICE_BLKX*CICE_BLKY | Total nodes | Times for 3 month cycles | Average time for one three month cycle | Speed (model yrs/day)
Full | 18*12 | 20*28 | 72 | 52:43 (3,163s), 56:37 (3,397s) & 56:25 (3,385s) | 3,315s (55:15) | 6.52
Full | 18*14 | 20*24 | 73 | 52:05 (3,125s), 52:39 (3,159s) & 52:52 (3,172s) | 3,152s (52:32) | 6.85
No MEDUSA CMIP6 | 18*14 | 20*24 | 73 | 53:01 (3,181s), 53:46 (3,226s) & 56:42 (3,402s) | 3,270s (54:30) | 6.61
Full | 18*16 | 20*21 | 74 | 52:38 (3,158s), 52:35 (3,155s) & 55:34 (3,334s) | 3,216s (53:36) | 6.72
No MEDUSA CMIP6 | 18*16 | 20*21 | 74 | 54:48 (3,288s), 54:04 (3,244s) & 54:06 (3,246s) | 3,259s (54:19) | 6.63

Chosen configuration and options for speeding this up

I've chosen the last configuration above on the basis that it looks like we can't get any extra speed by adding more ocean cores, and given the imbalance between resources for the atmosphere and the ocean, it's probably best to have one more ocean node than I think we need.

  • (ATM_PROCX,ATM_PROCY)=(48,24)
  • OMPTHR_ATM=2
  • IOS_NPROC=6
  • Total ATMOS nodes=65
  • (NEMO_IPROC,NEMO_JPROC)=(18,16)
  • (CICE_BLKX,CICE_BLKY)=(20,21)
  • XIOS_NPROC=6
  • Total OCEAN nodes=9
  • Total nodes=74
  • One month cycle for XIOS
  • Using "one_file"
  • Removed groupMEDUSA_cmip6 from iodef.xml
  • Run for three three month cycles

Other things we could do to speed this up are:

  • Use 3 OpenMP threads in the atmosphere, gaining ~0.5 model years/day
  • Use "multiple_file", which should save a bit, although I've not been able to notice it
  • Cut groupMEDUSA_cmip6, which gained about 2 minutes or ~0.25 model years/day; this is what I've done. Culling other groups might gain a similar amount
  • Add Maff's extra compiler options, which should gain about 10-15%, so ~0.7 model years/day - but we know there's a risk attached to this as they change results

Coming back to 3 OpenMP threads

I think I ought to be able to get more than the extra 0.5 model years/day that I got when I tried this above (did I really have enough ocean nodes when I did that test?). The job u-am375 is a copy of u-am354 with the 3rd OpenMP thread added.

  • (ATM_PROCX,ATM_PROCY)=(48,24)
  • OMPTHR_ATM=3
  • IOS_NPROC=6
  • XIOS_NPROC=6
  • One month cycle for XIOS
  • Using "one_file"
  • Removed groupMEDUSA_cmip6 from iodef.xml
  • Run for three three month cycles
NEMO_IPROC*NEMO_JPROC | CICE_BLKX*CICE_BLKY | Total nodes | Times for 3 month cycles | Average time for one three month cycle | Speed (model yrs/day)
18*16 | 20*21++ | 106 | 47:39 (2,859s), 52:02 (3,122s) & 47:25 (2,845s) | 2,942s (49:02) | 7.34
18*18 | 20*19+ | 107 | 48:53 (2,933s), 47:11 (2,831s) & 47:52 (2,872s) | 2,879s (47:59) | 7.50
20*18 | 18*19 | 108 | 48:46 (2,926s), 49:02 (2,942s) & 48:48 (2,928s) | 2,932s (48:52) | 7.37
24*18 | 15*19 | 110 | 47:38 (2,858s), 48:53 (2,933s) & 52:34 (3,154s) | 2,982s (49:42) | 7.24
+The queued times were 56:15 (Friday evening), 1:06 & 0:34.
++The queued times were 44:26, 1:11 & 0:39 (all Monday morning)

I can probably gain about 0.65 model years/day by using ~33 extra nodes, but I think it does make it more likely that we'll be stuck in long queues.

Archiving two dumps in one 3 month cycle

We need to archive both the start-of-December and the start-of-January dumps each year. This means that we'll need to create an ocean dump every month instead of every 3 months, and we need to be able to archive two dumps, one of which is not at the end of the run. I also need to know how much this slows the run down.