Speed tests for UKESM

All the jobs below use u-ar017, which is a copy of Yongming's UKESM0.9 UM10.7 suite, u-aq399. I'm going to run this close to our top performance, which means:

  • ATMOS is (48,28) on 2 threads (OMPTHR_ATM=2). That's 48*28 = 1,344 MPI tasks, each with 2 OpenMP threads, i.e. 2,688 cores, which is 74 full Broadwell nodes plus 24 of the 36 cores on a 75th node.
  • OCEAN is (18,12), i.e. 216 cores or 6 Broadwell nodes. This is a bit more than we need, because these tests are mostly about optimising options for ATMOS.
  • XIOS_PROC=6 on one node.
  • A total of 81 2/3 nodes, plus anything extra for the I/O server, i.e. an extra 2 * IOS_NPROC cores (see the sketch after this list for the node arithmetic).
  • Where I've used the I/O server (IOS_NPROC > 0), I've set io_external_control & ios_thread_0_calls_mpi to true. I tried all the ideal options suggested by Mirek and documented on Adding IO server, but I got an IOS_QUERYBUFFER error with the message `Time out waiting for protocol'. This seems to be a problem with regional STASH fields, which I'll leave CRUM to try to fix.
  • Run for three two-month cycles.
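To make the node counts above easier to check, here's a minimal Python sketch of the arithmetic, assuming 36 cores per Broadwell node and that each I/O server task accounts for 2 cores. The helper and its argument names are just illustrative, not part of the suite:

```python
import math

CORES_PER_NODE = 36  # Broadwell node on this machine

def nodes_needed(atm_ew=48, atm_ns=28, atm_threads=2,
                 ocn_ew=18, ocn_ns=12, ios_nproc=0):
    """Rough node count for the ATMOS/OCEAN/XIOS/IOS layout described above."""
    atm_cores = atm_ew * atm_ns * atm_threads   # 1,344 MPI tasks x 2 threads = 2,688 cores
    ocn_cores = ocn_ew * ocn_ns                 # 216 cores = 6 nodes
    ios_cores = 2 * ios_nproc                   # extra cores for the I/O server
    xios_nodes = 1                              # XIOS_PROC=6 kept on its own node
    cores = atm_cores + ocn_cores + ios_cores
    return math.ceil(cores / CORES_PER_NODE) + xios_nodes

print(nodes_needed())              # benchmark layout: 82 nodes
print(nodes_needed(ios_nproc=24))  # with the I/O server: 83 nodes
```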

It's known that model performance varies a lot, by around 10%, mostly depending on HPC load (e.g. Harry found his morning runs were, on average, about 8% faster than his post-lunch runs). These runs were done at all sorts of different times, so we should be careful not to read too much into small differences between them.

| Description | IOS_NPROC (total nodes) | Timings for the three 2-month cycles | Average for one 2-month cycle | Speed (model years/day) |
|---|---|---|---|---|
| Benchmark | 0 (82) | 1:18:09 (4,689s), 1:24:21 (4,957s) & 1:19:07 (4,747s) | 4,798s (1:19:58) | 3.00 |
| Add I/O server | 24 (83) | 1:14:05 (4,445s), 1:17:18 (4,638s) & 1:17:19 (4,639s) | 4,574s (1:16:14) | 3.15 |
| As above & add environment variables to [coupled]* ** | 24 (83) | 1:13:56 (4,436s), 1:16:49 (4,609s) & 1:16:18 (4,578s) | 4,541s (1:15:41) | 3.17 |
| As above & adding my 3 OpenMP branches | 24 (83) | 1:10:18 (4,218s), 1:12:49 (4,369s) & 1:12:52 (4,372s) | 4,320s (1:12:00) | 3.33 |
| As above & remove climate meaning | 24 (83) | 0:59:39 (3,579s), 1:00:54 (3,654s) & 1:08:50 (4,130s) | 3,788s (1:03:08) | 3.80 |
| As above, but UM dump frequency 10 days -> 30 days | 24 (83) | 58:04 (3,484s), 56:45 (3,405s) & 56:51 (3,411s) | 3,433s (57:13) | 4.19 |
| As above, but using hugepages in compilation+ | 24 (83) | 55:34 (3,334s), 55:40 (3,340s) & 55:59 (3,359s) | 3,344s (55:44) | 4.31 |
| As above & add environment variables to [coupled]** | 24 (83) | 57:46 (3,466s), 57:51 (3,471s) & 59:06 (3,546s) | 3,494s (58:14) | 4.12 |
| As above but removing environment variables | 24 (83) | 59:45 (3,585s), 58:41 (3,521s) & 56:32 (3,392s) | 3,499s (58:19) | 4.12 |
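For reference, the Speed column is consistent with a straightforward conversion of the average wall-clock time for a 2-month cycle into model years per day; a minimal sketch of that conversion (my own helper, not something in the suite):

```python
def model_years_per_day(avg_cycle_seconds, cycle_months=2):
    """Convert average wall-clock time per cycle into model years per wall-clock day."""
    cycles_per_day = 86400 / avg_cycle_seconds
    return cycles_per_day * cycle_months / 12

print(round(model_years_per_day(4798), 2))  # benchmark: 3.0
print(round(model_years_per_day(3344), 2))  # hugepages run: 4.31
```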
* Add MPICH_GNI_MAX_EAGER_MSG_SIZE=65536, MPICH_GNI_MAX_VSHORT_MSG_SIZE=8192 & MPICH_GNI_ROUTING_MODE=ADAPTIVE_2, as recommended by Mirek
+ [recon] won't currently work when compiled with hugepages (there's a problem reading in one of the ice files), so I've separated the [recon] and [coupled] stages. This fails at UM10.7 but works at UM10.9, so I'm not worrying about it for now.
** Months after doing these tests, I realised that I'd added the MPICH environment variables into [coupled_nrun] rather than into [coupled], so this time I've added them in the right place. Initially it looked as though these options had slowed the run. However, the HPC definitely seems to have periods when it's slower than others, so I did another benchmark afterwards, which suggests the MPICH arguments are slightly faster.
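Given how easy it is to put the MPICH variables in the wrong section, a cheap sanity check is to print them from inside the job so the job output shows whether they actually reached the environment. This is just an illustrative snippet, not something in the suite:

```python
import os

# Print the MPICH tuning variables so the job output shows whether they were exported.
for var in ("MPICH_GNI_MAX_EAGER_MSG_SIZE",
            "MPICH_GNI_MAX_VSHORT_MSG_SIZE",
            "MPICH_GNI_ROUTING_MODE"):
    print(f"{var} = {os.environ.get(var, '<not set>')}")
```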