We may need to spin up with a coupled model. The speed tests below use u-ak109, with settings
Before Christmas, with a 3 month cycle, I got 6.63 model years/day (see table from UKESM core presentation). I was hoping to get something similar for this run. My notes on this are on p130 of UKESM General III, and the job suite was u-ai403.
u-ak734 is a copy of u-aj659.
Description (job suite) | ATM_PROCX * ATM_PROCY | NEMO_IPROC * NEMO_JPROC (CICE_BLKX * CICE_BLKY) | Total nodes | Cycle length (months) | Times for first 4 cycles | Average time for 1 cycle | Speed (model yrs/day) |
---|---|---|---|---|---|---|---|
GC3.1 + MEDUSA* (u-ak109) | 32*22 | 12*12 (30*28) | 44 | 1 | 27:35 (1,655s), 30:34 (1,834s), 28:45 (1,725s) & 29:36 (1,733s) | 1,736.75s (28:57) | 4.15 |
GC3.1 + MEDUSA* (u-ak109) | 48*24 | 12*12 (30*28) | 69 | 1 | 30:18 (1,818s), 30:46 (1,846s), 28:59 (1,739s) & 30:01 (1,801s) | 1,801s (30:01) | 4.00 |
GC3.1 + MEDUSA* (u-ak109) | 48*24 | 12*12 (30*28) | 69 | 1 | 30:01 (1,801s), 29:27 (1,767s), 29:51 (1,791s) & 29:40 (1,780s) | 1,784.75s (29:45) | 4.03 |
GC3.1, older setup (u-ai403) | 48*24 | 9*8 (40*42) | 67 | 1 | 18:35 (1,115s), 21:24 (1,284s), 19:02 (1,142s) & 19:11 (1,151s) | 1,173s (19:33) | 6.14 |
GC3.1, newer setup* (u-ak734) | 48*24 | 9*8 (40*42) | 67 | 1 | 22:03 (1,323s), 22:20 (1,340s), 22:19 (1,339s) & 22:32 (1,352s) | 1,338.5s (22:19) | 5.38 |
GC3.1, newer setup (u-ak734) | 48*24 | 9*8 (40*42) | 67 | 1 | 22:31 (1,351s), 23:39 (1,419s), 23:11 (1,391s) & 22:47 (1,367s) | 1,382s (23:02) | 5.21 |
GC3.1, newer setup (u-ak734) | 48*24 | 12*12 (30*28) | 69 | 1 | 21:31 (1,291s), 22:05 (1,325s), 22:13 (1,333s) & 22:13 (1,333s) | 1,320.5s (22:01) | 5.45 |
GC3.1, newer setup* (u-ak734) | 48*24 | 9*8 (40*42) | 67 | 3 | 59:50 (3,590s), 1:00:53 (3,653s), 59:50 (3,590s) & 59:44 (3,584s) | 3,604.25s (1:00:04) | 6.00 |
GC3.1, older setup (u-ai403) | 48*24 | 9*8 (40*42) | 67 | 3 | 51:56 (3,116s), 53:51 (3,211s), 52:08 (3,128s) & 52:51 (3,171s) | 3,156.5s (52:37) | 6.84 |
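For reference, the speed column is just the average cycle time converted to a throughput: each simulated month is 1/12 of a model year, so model years/day = (cycle length in months / 12) × 86,400 / (average cycle time in seconds). A minimal Python sketch of the conversion, checked against the first row of the table above:

```python
def model_years_per_day(avg_cycle_seconds, cycle_months):
    """Convert an average cycle wallclock time into a model years/day figure."""
    years_per_cycle = cycle_months / 12.0         # model years completed per cycle
    cycles_per_day = 86400.0 / avg_cycle_seconds  # cycles completed per wallclock day
    return years_per_cycle * cycles_per_day

# First row above: 1 month cycles averaging 1,736.75s -> ~4.15 model yrs/day
print(round(model_years_per_day(1736.75, 1), 2))
```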
Comparing our older and newer setups for GC3.1, we appear to have lost about 0.75 model years/day. Why?
All the configurations below are
Job suite | Description | Times for first 4 months | Average time for 1 month | Speed (model yrs/day) |
---|---|---|---|---|
u-ai403 | Older setup | 18:35 (1,115s), 21:24 (1,284s), 19:02 (1,142s) & 19:11 (1,151s) | 1,173s (19:33) | 6.14 |
u-ai403+ | Older setup | 18:11 (1,091s), 18:56 (1,136s), 19:22 (1,162s) & 18:09 (1,089s) | 1,119.5s (18:40) | 6.43 |
u-ak734* | Newer setup | 22:03 (1,323s), 22:20 (1,340s), 22:19 (1,339s) & 22:32 (1,352s) | 1,338.5s (22:19) | 5.38 |
u-ak770 | Older setup but newer revision of GO6 package branch** | 20:57 (1,257s), 19:58 (1,198s), 21:17 (1,277s) & 20:33 (1,233s) | 1,241.25s (20:41) | 5.80 |
u-ak770+ | Older setup but newer revision of GO6 package branch** | 20:30 (1,230s), 20:52 (1,252s), 21:12 (1,272s) & 20:59 (1,259s) | 1,253.25s (20:53) | 5.75 |
u-ak838+ | Older setup but newer revision of GO6 package branch** and detached XIOS | 18:54 (1,134s), 19:07 (1,147s), 20:15 (1,215s) & 19:29 (1,169s) | 1,166.25s (19:26) | 6.17 |
Activating the timing routines is done by
Timing section (times in seconds) | u-ai403 (older GO6 package branch) | u-ak770 (new GO6 package branch) | u-ak838 (new GO6 package branch & detached XIOS) |
---|---|---|---|
dia_wri | 2 | 148 | 145 |
Average CPU | 981 | 1,162 | 1,023 |
It looks like we've added more than 2 minutes to the ocean run with the extra diagnostics, and a significant amount of this is probably added at the end of the run. I've tried detaching XIOS again, but it isn't showing much improvement, if any.
I can understand why the extra diagnostics have slowed down our GC3.1 configuration, but our GC3.1 + MEDUSA is still looking very slow. Do we need to add further nodes to the ocean?
All the configurations below are
NEMO_IPROC * NEMO_JPROC (CICE_BLKX * CICE_BLKY) | Total nodes | Times for first 4 months | Average time for 1 month | Speed (model yrs/day) |
---|---|---|---|---|
12*9 (30*37) | 68 | 35:10 (2,110s), 33:01 (1,987s), 34:42 (2,082s) & 34:20 (2,060s) | 2,059.75s (34:20) | 3.50 |
12*12 (30*28) | 69 | 30:18 (1,818s), 30:46 (1,846s), 28:59 (1,739s) & 30:01 (1,801s) | 1,801s (30:01) | 4.00 |
18*12 (20*28)+ | 71 | 27:07 (1,627s), 27:12 (1,632s), 28:05 (1,685s) & 26:58 (1,618s) | 1,640.5s (27:21) | 4.39 |
18*16 (20*21)+ | 73 | 33:52 (2,032s), 33:28 (2,008s), 33:50 (2,057s) & 32:42 (1,962s) | 2,014.75s (33:35) | 3.57 |
Before Christmas, I got 5.12 model years/day out of UKESM0.4-CN for a 3 month cycle, and you wouldn't expect GC3.1 + MEDUSA to be slower than UKESM0.5-CN.
All the configurations below are
Description | NEMO_IPROC * NEMO_JPROC (CICE_BLKX * CICE_BLKY) | Total nodes | Times for first 4 months | Average time for 1 month | Speed (model yrs/day) |
---|---|---|---|---|---|
UKESM0.4-CN, IOS_NPROC=0 (u-ai432) | 12*9 (30*37) | 67 | 25:06 (1,506s), 24:54 (1,494s), 25:34 (1,534s) & 25:24 (1,524s) | 1,514.5s (25:15) | 4.75 |
UKESM0.4-CN, IOS_NPROC=6 (u-ai432) | 12*9 (30*37) | 68 | 24:51 (1,491s), 24:32 (1,472s), 25:15 (1,515s) & 24:59 (1,462s) | 1,485s (24:45) | 4.85 |
UKESM0.5-CN (u-aj599) | 12*9 (30*37) | 68 | 33:59 (2,039s), 34:56 (2,096s), 34:59 (2,099s) & 34:37 (2,077s) | 2,077.75s (34:38) | 3.47 |
I've done some profiling of UKESM0.4-CN versus UKESM0.5-CN and this suggests the extra time in UKESM0.5-CN comes from OASIS3_GETO2A, which points at the ocean.
Elapsed time in PE 0 in seconds.
Section | NEMO_IPROC * NEMO_JPROC = 12*9 (3 nodes) | NEMO_IPROC * NEMO_JPROC = 18*16 (8 nodes) | (Time for 12*9)/(Time for 18*16) |
---|---|---|---|
sbc_cpl_rcv | 42 | 326 | 0.129 |
tra_adv_muscl | 281 | 105 | 2.68 |
tra_ldf_iso | 256 | 86 | 2.98 |
trc_sbc | 158 | 84 | 1.83 |
dia_wri | 123 | 65 | 1.89 |
sbc_ice_cice | 78 | 49 | 1.59 |
trc_sms | 82 | 49 | 1.67 |
The ratio of 8 nodes to 3 nodes is 2.67, so timing regions which scale with cores should have this number in the last column. Both TRA_ADV_MUSCL and TRA_LDF_ISO have ratios similar to this, while the other routines are not as good. The time in the `sbc_cpl_rcv` region is much greater for the 8 node job, which suggests it's spending a lot of time waiting.
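As a quick sanity check, the ratios in the last column can be recomputed from the two time columns and compared against the ideal 8/3 ≈ 2.67 (this assumes both jobs use the same number of cores per node). A small sketch using a few rows from the table above:

```python
# Elapsed times in PE 0 (seconds) for the 3 node and 8 node ocean decompositions,
# copied from the table above.
times = {
    "sbc_cpl_rcv":   (42, 326),
    "tra_adv_muscl": (281, 105),
    "tra_ldf_iso":   (256, 86),
    "dia_wri":       (123, 65),
}

ideal = 8 / 3  # perfect scaling with the extra ocean cores, ~2.67

for section, (t_3_nodes, t_8_nodes) in times.items():
    ratio = t_3_nodes / t_8_nodes
    # Ratios well below ~2.67 mean the section isn't benefiting from the extra cores;
    # a ratio below 1 (sbc_cpl_rcv) means it actually takes longer on more cores.
    print(f"{section:14s} {ratio:5.2f}  (ideal {ideal:.2f})")
```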
I've also looked at the OASIS timers which show the total wall time to be
Component | 3 nodes | 8 nodes |
---|---|---|
toyatm | 1,735s | 1,102s |
toyocn | 2,005s | 1,854s |
which shows that the wall time in the ocean is much greater than in the atmosphere for both jobs. I think this extra time waiting for the ocean is the ocean writing its restart dumps and diagnostics, and throwing extra cores at this will probably slow the process down (because there is potentially more gathering).
I've copied u-ak109 to u-al312 and updated this new job to UM10.7. I've taken the best configuration for u-ak109 from above
And the first four months have taken 36:12 (2,172s), 36:56 (2,216s), 34:46 (2,086s) and 36:38 (2,198s), which is an average of 2,168s (36:08). This is a speed of 3.32 model years/day.
I'm copying u-al312 to u-al873. I've only been able to get the `split_freq` addition to iodef.xml to work with detached XIOS.
All the configurations below are
Description | Times for 3 month cycles | Average time for one three month cycle | Speed (model yrs/day) |
---|---|---|---|
u-al873, "one_file", full iodef.xml, XIOS on one month cycle | 1:20:19 (4,819s), 1:22:10 (4,930s) & 1:20:30 (4,830s) | 4,860s (1:21:00) | 4.44 |
u-al873, "multiple_file", full iodef.xml, XIOS on one month cycle | 1:26:50 (5,210s), 1:25:52 (5,152s) & 1:23:34 (5,014s) | 5,125s (1:25:25) | 4.21 |
u-am012, "one_file", full iodef.xml, XIOS on ten day cycle | 1:24:49 (5,089s), 1:25:53 (5,153s) & 1:25:42 (5,142s) | 5,128s (1:25:28) | 4.21 |
u-am012, "multiple_file", full iodef.xml, XIOS on ten day cycle | 1:24:21 (5,061s), 1:24:20 (5,060s) & 1:24:06 (5,046s) | 5,056s (1:24:16) | 4.27 |
u-am015, "multiple_file", removed groupMEDUSA_cmip6 from iodef.xml, XIOS on one month cycle | 1:21:37 (4,897s), 1:21:58 (4,918s) & 1:21:08 (4,868s) | 4,894s (1:21:34) | 4.41 |
Conclusions
I've created my latest GC3.1 N96/ORCA1, u-am151, and I'm running with
to get an idea of what MEDUSA is adding to the 3 month cycle
NEMO_IPROC*NEMO_JPROC (CICE_BLKX*CICE_BLKY) | Total nodes | Times for 3 month cycles | Average time for one three month cycle | Speed (model yrs/day) |
---|---|---|---|---|
9*8 (40*42) | 68 | 1:04:32 (3,872s), 59:43 (3,583s) & 58:08 (3,488s) | 3,648s (1:00:48) | 5.92 |
12*9 (30*37) | 69 | 55:20 (3,320s), 56:48 (3,408s) & 56:31 (3,391s) | 3,373s (56:13) | 6.40 |
12*12 (30*28) | 70 | 57:26 (3,446s), 58:40 (3,520s) & 1:00:14 (3,614s) | 3,527s (58:47) | 6.12 |
I think that the 12*12 run being significantly slower than the 12*9 run was probably down to heavy load on XCS at the time (some of my urgent jobs were queuing), but it probably does suggest that 12*9 is enough ocean cores to achieve optimum speed.
These runs are all done with u-am015
NEMO_IPROC*NEMO_JPROC (CICE_BLKX*CICE_BLKY) | Total nodes | Times for 3 month cycles | Average time for one three month cycle | Speed (model yrs/day) |
---|---|---|---|---|
12*9 (30*37) | 69 | 1:21:37 (4,897s), 1:21:58 (4,918s) & 1:21:08 (4,868s) | 4,894s (1:21:34) | 4.41 |
12*12 (30*28) | 70 | 1:03:33 (3,813s), 1:04:45 (3,885s) & 1:03:21 (3,801s) | 3,833s (1:03:53) | 5.64 |
18*12 (20*28) | 72 | 53:56 (3,236s), 56:17 (3,377s) & 55:20 (3,320s) | 3,311s (55:11) | 6.52 |
18*16 (20*21) | 74 | 56:27 (3,387s), 56:28 (3,388s) & 56:33 (3,393s) | 3,389s (56:29) | 6.37 |
Like the 12*12 run above, I think the 18*16 run was made when the load on XCS was heavy.
I'm copying u-am015 to u-am198 and I'll try running a 6 month cycle.
NEMO_IPROC*NEMO_JPROC (CICE_BLKX*CICE_BLKY) | Total nodes | Times for 6 month cycles | Average time for one six month cycle | Speed (model yrs/day) |
---|---|---|---|---|
12*12 (30*28) | 70 | 2:05:58 (7,558s), 2:08:15 (7,695s) & 2:06:39 (7,599s) | 7,617s (2:06:57) | 5.67 |
18*12 (20*28) | 72 | 1:48:22 (6,502s), 1:52:01 (6,721s) & 1:51:54 (6,714s) | 6,646s (1:50:46) | 6.50 |
The speed of this configuration on a 3 month cycle was 5.64 model years/day, so this doesn't look a lot faster.
I'm copying u-am015 to u-am197 and I'll try running with 3 OpenMP threads.
While I've been doing this, Maff thinks his branch adding OpenMP to the aerosol chemistry is almost finished and good enough for speed tests. I'm copying u-am197 to u-am274 and adding in Maff's branch.
Aerosol OpenMP | NEMO_IPROC*NEMO_JPROC (CICE_BLKX*CICE_BLKY) | Total nodes | Times for 3 month cycles | Average time for one three month cycle | Speed (model yrs/day) |
---|---|---|---|---|---|
No | 18*12 (20*28) | 103 | 52:02 (3,122s), 53:36 (3,216s) & 53:19 (3,199s) | 3,179s (52:59) | 6.79 |
No | 18*16 (20*21) | 105 | 53:28 (3,208s), 54:01 (3,241s) & 52:29 (3,149s) | 3,199s (53:19) | 6.75 |
Yes | 18*18 (20*19) | 107 | 52:27 (3,147s), 50:34 (3,034s) & 51:13 (3,073s) | 3,085s (51:25) | 7.00 |
The main points
I've taken a number of steps and they're not all in a straight line, so I'm starting again at the MEDUSA + GC3.1 job I had a while ago, u-al312, and copying this to u-am254. The configuration is
Length of cycle | Times for three cycles | Average time for one cycle | Speed (model yrs/day) |
---|---|---|---|
One month | 20:03 (1,203s), 22:00 (1,320s) & 20:14 (1,214s) | 1,246s (20:46) | 5.78 |
Three months | 56:52 (3,412s), 56:50 (3,410s) & 56:39 (3,399s) | 3,407s (56:47) | 6.34 |
Six months | Crashed with a problem writing qtrCFC11 | | |
In u-am276, I added the new land-sea mask to u-am254. To this I've added in Maff's OpenMP branch for u-am300 and then removed groupMEDUSA_cmip6 for u-am354.
iodef.xml | NEMO_IPROC*NEMO_JPROC (CICE_BLKX*CICE_BLKY) | Total nodes | Times for 3 month cycles | Average time for one three month cycle | Speed (model yrs/day) |
---|---|---|---|---|---|
Full | 18*12 (20*28) | 72 | 52:43 (3,163s), 56:37 (3,397s) & 56:25 (3,385s) | 3,315s (55:15) | 6.52 |
Full | 18*14 (20*24) | 73 | 52:05 (3,125s), 52:39 (3,159s) & 52:52 (3,172s) | 3,152s (52:32) | 6.85 |
No MEDUSA CMIP6 | 18*14 (20*24) | 73 | 53:01 (3,181s), 53:46 (3,226s) & 56:42 (3,402s) | 3,270s (54:30) | 6.61 |
Full | 18*16 (20*21) | 74 | 52:38 (3,158s), 52:35 (3,155s) & 55:34 (3,334s) | 3,216s (53:36) | 6.72 |
No MEDUSA CMIP6 | 18*16 (20*21) | 74 | 54:48 (3,288s), 54:04 (3,244s) & 54:06 (3,246s) | 3,259s (54:19) | 6.63 |
I've chosen the last configuration above on the basis that it looks like we can't get any extra speed by adding more ocean cores. And given the imbalance between resources for atmosphere and ocean, it's probably best to have one more ocean node than I think we need.
Other things we could do to speed this up are:
I think I ought to be able to get more than the extra 0.5 model years/day that I got when I tried this above (did I really have enough ocean nodes when I did that test?). The job u-am375 is a copy of u-am354 and I've added in the 3rd OpenMP thread.
NEMO_IPROC*NEMO_JPROC (CICE_BLKX*CICE_BLKY) | Total nodes | Times for 3 month cycles | Average time for one three month cycle | Speed (model yrs/day) |
---|---|---|---|---|
18*16 (20*21)++ | 106 | 47:39 (2,859s), 52:02 (3,122s) & 47:25 (2,845s) | 2,942s (49:02) | 7.34 |
18*18 (20*19)+ | 107 | 48:53 (2,933s), 47:11 (2,831s) & 47:52 (2,872s) | 2,879s (47:59) | 7.50 |
20*18 (18*19) | 108 | 48:46 (2,926s), 49:02 (2,942s) & 48:48 (2,928s) | 2,932s (48:52) | 7.37 |
24*18 (15*19) | 110 | 47:38 (2,858s), 48:53 (2,933s) & 52:34 (3,154s) | 2,982s (49:42) | 7.24 |
I can probably get an extra 0.65 model years/day or so by using ~33 extra nodes, but I think it does make it more likely that the job will be stuck in long queues.
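To put the ~33 extra nodes into context, here is a rough cost comparison (a sketch only, ignoring queue times): node-days of machine time per simulated year for the chosen two-thread 18*16 configuration and the fastest three-thread run above.

```python
def node_days_per_model_year(nodes, model_years_per_day):
    """Rough cost of a configuration: node-days of machine time per simulated year."""
    return nodes / model_years_per_day

# Numbers taken from the tables above:
#   2 OpenMP threads, 18*16, no MEDUSA CMIP6 diagnostics: 74 nodes at 6.63 model yrs/day
#   3 OpenMP threads, 18*18:                              107 nodes at 7.50 model yrs/day
for label, nodes, speed in [("2 threads, 18*16", 74, 6.63),
                            ("3 threads, 18*18", 107, 7.50)]:
    print(f"{label}: {node_days_per_model_year(nodes, speed):.1f} node-days per model year")
```

So the extra speed costs roughly 30% more node-time per simulated year, on top of the larger job being more exposed to long queues.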
We need to archive both the start of December and the start of January each year. This means that we'll need to create an ocean dump every month instead of every 3 months, and we need to be able to archive two dumps - one of which is not at the end of the run. I also need to know how much this is slowing down the run.