MEDUSA spin-up speed tests 2

Testing the move to XCS and Maff's optimisations

All the runs below use

  • ORCA1
  • (NEMO_IPROC,NEMO_JPROC)=(24,24)
  • (CICE_BLKX,CICE_BLKY)=(15,14)
  • OCN_PPN=24
  • XIOS_NPROC=8
  • Total Broadwell nodes=25
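
For reference, the 25-node total is consistent with simple packing arithmetic; the sketch below assumes the XIOS servers get a node of their own (some decompositions further down use fewer nodes than this formula gives, presumably because land-only PEs are eliminated, so treat it as an upper bound).

```python
import math

def broadwell_nodes(iproc, jproc, ppn=24, xios_nproc=8):
    """Estimate node count: NEMO PEs packed at ppn per node,
    plus (by assumption) one extra node for the XIOS servers."""
    nemo_pes = iproc * jproc                  # 24*24 = 576 ocean PEs
    nemo_nodes = math.ceil(nemo_pes / ppn)    # 576/24 = 24 nodes
    xios_nodes = math.ceil(xios_nproc / ppn)  # 8 servers -> 1 node
    return nemo_nodes + xios_nodes            # 24 + 1 = 25 nodes

print(broadwell_nodes(24, 24))  # 25, matching the total above
```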
| Job id | Description | Times for first 4 years | Average time for 1 year | Speed (model yrs/day) |
|---|---|---|---|---|
| Julien's u-aj946 | Our reference spin-up suite | 1:06:50 (4,010s), 1:05:47 (3,947s), 1:04:28 (3,868s) & 1:03:12 (3,792s) | 3,904.25s (1:05:04) | 22.1 |
| u-ak454* | Ported to XCS | 1:00:58 (3,658s), 1:01:35 (3,695s), 0:58:44 (3,524s) & 0:59:48 (3,588s) | 3,616.25s (1:00:16) | 23.9 |
| u-ak491** | XCS, using Richard's version of the GO6 package branch, which includes Maff's pointer->allocatable optimisations | 1:00:15 (3,615s), 0:56:27 (3,387s), 0:58:06 (3,486s) & 0:56:21 (3,381s) | 3,467.25s (0:57:47) | 24.9 |
| u-ak506*** | XCS, Richard's GO6 package branch & Maff's compiler optimisations | 0:50:31 (3,031s), 0:51:12 (3,072s), 0:53:08 (3,188s) & 0:53:04 (3,184s) | 3,118.75s (0:51:59) | 27.7 |
*For the fields tested, results are the same on XCS (u-ak454) as on XCE/F (u-aj946).
**For the fields tested, results for u-ak491 are the same as for u-ak454. This wasn't necessarily expected, since Richard's branch is taken from revision 7573 of the GO6 package branch while u-ak454 uses revision 7704.
***For the fields tested, the results with Maff's extra compiler optimisations (u-ak506) are the same as for u-ak491.
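
The speed column follows directly from the average time per model year: there are 86,400 seconds in a day, so u-ak506's 3,118.75s per model year gives 86,400/3,118.75 ≈ 27.7 model yrs/day. A minimal check:

```python
# Speed (model yrs/day) from average wall-clock seconds per model year.
SECONDS_PER_DAY = 86_400

def model_years_per_day(avg_seconds_per_year):
    return SECONDS_PER_DAY / avg_seconds_per_year

for suite, avg in [("u-aj946", 3904.25), ("u-ak454", 3616.25),
                   ("u-ak491", 3467.25), ("u-ak506", 3118.75)]:
    print(f"{suite}: {model_years_per_day(avg):.1f} model yrs/day")
# u-aj946: 22.1, u-ak454: 23.9, u-ak491: 24.9, u-ak506: 27.7
```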

The reason all fields aren't compared is that there doesn't seem to be an easy way of comparing every field in two netCDF files (unlike for the UM, where we have um-cumf). Instead, I create the full restart files at the end of each run and then use ncdiff to difference them. Rather than checking that every difference is zero, I've generally just compared SN and TN from the *_restart.nc files and a couple of fields from the *_restart_trc.nc files, e.g. TNCHN and TNDiC. A sketch of an equivalent spot check is below.
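
For anyone repeating this, the same spot check can be done directly in Python rather than via ncdiff; a minimal sketch, assuming netCDF4 and numpy are available, and using illustrative file names (NEMO restarts are assumed to store the fields in lower case as sn and tn):

```python
# A minimal sketch of the same spot check in Python; file and variable
# names are illustrative, not taken from the suites above.
import numpy as np
from netCDF4 import Dataset

def fields_match(file_a, file_b, varnames):
    """Return True if the named variables are bitwise identical."""
    with Dataset(file_a) as a, Dataset(file_b) as b:
        for name in varnames:
            if not np.array_equal(np.asarray(a[name][:]),
                                  np.asarray(b[name][:])):
                print(f"{name} differs")
                return False
    return True

# e.g. compare salinity and temperature between two runs' restarts:
print(fields_match("restart_a.nc", "restart_b.nc", ["sn", "tn"]))
```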

Can we increase speed with more cores?

Some unnecessary MPI communication has been removed since I last looked at the maximum speed of MEDUSA, so I'd expect to be able to increase the PE count further before the speed plateaus.

All of these speed tests use u-ak506, with Richard's optimised version of the GO6 package branch and Maff's compiler options.

| NEMO_IPROC*NEMO_JPROC (CICE_BLKX*CICE_BLKY) | Total nodes | Times | Average time for 1 year | Speed (model yrs/day) |
|---|---|---|---|---|
| 24*24 (15*14) | 25 | 0:50:31 (3,031s), 0:51:12 (3,072s), 0:53:08 (3,188s) & 0:53:04 (3,184s) | 3,118.75s (0:51:59) | 27.7 |
| 24*26 (15*13) | 27 | 0:50:44 (3,044s) | – | 28.4 |
| 24*28* (15*12) | 29 | 0:47:27 (2,847s) | – | 30.3 |
| 24*28* (15*12) | 27 | 0:48:21 (2,901s), 0:48:18 (2,898s), 0:47:48 (2,851s) & 0:47:22 (2,842s) | 2,873s (0:47:53) | 30.1 |
| 30*24 (12*14) | 31 | 0:48:09 (2,902s) | – | 29.8 |
| 30*24 (12*14) | 31 | 0:45:13 (2,713s), 0:45:42 (2,742s), 0:45:06 (2,706s) & 0:45:33 (2,733s) | 2,723.5s (0:45:24) | 31.7 |
| 30*28 (12*12) | 36 | 0:43:32 (2,612s), 0:46:26 (2,786s), 0:44:15 (2,655s) & 0:43:17 (2,597s) | 2,662.5s (0:44:23) | 32.5 |
| 40*24 (9*14) | 41 | 0:45:07 (2,707s), 0:49:47 (2,987s), 0:43:19 (2,599s) & 0:46:20 (2,780s) | 2,768.25s (0:46:08) | 31.2 |
*This is the PE decomposition which Maff has recommended, although he was using a configuration with the extra MPI communication.
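
To make the scaling trade-off explicit before the recommendation below, here's a quick (illustrative) throughput-per-node calculation from the 4-year rows of the table:

```python
# Throughput per node for the 4-year runs above (illustrative only).
runs = [  # (decomposition, nodes, average seconds per model year)
    ("24*24", 25, 3118.75),
    ("24*28", 27, 2873.0),
    ("30*24", 31, 2723.5),
    ("30*28", 36, 2662.5),
    ("40*24", 41, 2768.25),
]
for decomp, nodes, avg in runs:
    speed = 86_400 / avg                     # model yrs/day
    print(f"{decomp}: {speed:.1f} yrs/day, {speed / nodes:.2f} yrs/day/node")
# Speed peaks at 30*28 (32.5 yrs/day) and actually falls at 40*24,
# so adding nodes beyond 30*28 buys nothing.
```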

Recommended configuration

Based on the table above, I'd recommend the u-ak506 configuration with (NEMO_IPROC,NEMO_JPROC)=(30,28), except that we should use the latest version of the GO6 package branch rather than Richard's version of it, and push to get Maff's optimisations from Richard's branch into the GO6 package branch.