MEDUSA spin-up speed tests 1

I'm using u-ad409.

Under-populating nodes for ORCA1

It's known that the limiting factor when running NEMO is memory bandwidth. This means it can be quicker to run NEMO by under-populating nodes: although some cores are left idle (effectively wasted), the memory bandwidth available per `active' core is greater. I'm running some tests to measure the impact of this.

The smallest number divisible by both 32 (cores on Haswell nodes) and 36 (cores on Broadwell nodes) is 288. This number is also divisible by 24 and many other numbers, and is near the high end of the core counts we think ORCA1 can usefully run on, so it seems a good choice for the number of cores to use

  • NEMO_IPROC=24 * NEMO_JPROC=12
  • CICE_BLKX=15, CICE_BLKY=28
  • xcfl01
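For reference, 24 × 12 = 288 NEMO MPI ranks. The CICE block sizes appear to follow from the NEMO decomposition: assuming the (e)ORCA1 CICE grid is roughly 360 × 332 points (I haven't double-checked the exact extents), CICE_BLKX is 360/NEMO_IPROC rounded up (360/24 = 15) and CICE_BLKY is 332/NEMO_JPROC rounded up (332/12 = 27.7 → 28), which matches the values above and in the tables below.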

The variable OCN_PPN needs to be added to rose-suite.conf to run these tests; e.g. add OCN_PPN=24 to use 24 cores per node, as in the sketch below.
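For the 24*12 test the relevant rose-suite.conf entries look something like the sketch below (the variable names are the ones used in this suite; the grouping and any other entries are from memory, so treat this as illustrative rather than a copy of u-ad409):

    # rose-suite.conf (sketch; these sit with the other jinja2 suite
    # variables, typically under [jinja2:suite.rc])
    NEMO_IPROC=24   # NEMO domain decomposition, i direction
    NEMO_JPROC=12   # NEMO domain decomposition, j direction
    CICE_BLKX=15    # CICE block size in x
    CICE_BLKY=28    # CICE block size in y
    XIOS_NPROC=8    # cores for the XIOS IO servers
    OCN_PPN=24      # new variable: ocean cores used per node (under-population)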

Cores used per Broadwell node | XIOS_NPROC | Total nodes | Times for first 6 months | Average time for 1 month | Speed (model yrs/day)
36 | 0 | 8 | 17:10 (1,030s), 18:15 (1,095s), 19:36 (1,176s), 20:58 (1,258s), 19:52 (1,192s) & 20:20 (1,220s) | 1,162s | 6.20
36 | 8 | 9 | 11:26 (686s), 11:23 (683s), 11:16 (676s), 11:28 (688s), 10:59 (659s) & 11:15 (675s) | 678s | 10.6
32 | 8 | 10 | 10:22 (622s), 11:07 (667s), 10:47 (647s), 10:40 (640s), 10:38 (638s) & 10:36 (636s) | 642s | 11.2
24 | 8 | 13 | 9:51 (591s), 9:32 (572s), 9:46 (586s), 9:45 (585s), 10:02 (602s) & 9:46 (586s) | 587s | 12.3
18 | 8 | 17 | 9:31 (571s), 9:42 (582s), 9:36 (576s), 9:38 (578s), 9:40 (580s) & 10:07 (607s) | 582s | 12.4
12 | 8 | 25 | 9:54 (594s), 9:58 (598s), 9:43 (583s), 9:34 (574s), 9:27 (567s) & 9:17 (557s) | 579s | 12.4
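For reference, the speed column is (as far as I can tell) just the month-average converted into model years per day: speed = 86,400 s/day ÷ (12 × average seconds per model month). For example, 86,400 / (12 × 678) ≈ 10.6 model yrs/day for the fully populated XIOS_NPROC=8 row.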

Performance doesn't seem to improve much by under-populating below 24 cores per 36-core Broadwell node. The 24-cores-per-node run looks to be about 16% faster than the fully populated run while using 44% more nodes (both runs have XIOS_NPROC=8).

Increasing usage of cores

I'll use 24 cores per Broadwell node to try to avoid the memory bandwidth limitation, and I'll stick with XIOS_NPROC=8.
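To read the "Total nodes (cores used, cores taken)" column below: cores used = NEMO_IPROC × NEMO_JPROC + XIOS_NPROC; total nodes = NEMO cores ÷ 24 per node, plus one node for XIOS; cores taken = total nodes × 36 (the full Broadwell node size). For example, for the 24*24 decomposition: 576 + 8 = 584 cores used, 576/24 + 1 = 25 nodes, and 25 × 36 = 900 cores taken.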

NEMO_IPROC * NEMO_JPROC (CICE_BLKX * CICE_BLKY) | Total nodes (cores used, cores taken) | Times for first 6 months | Average time for 1 month | Speed (model yrs/day)
12*16 (30*21) | 9 (200, 324) | 12:35 (755s), 12:24 (744s), 12:43 (763s), 12:12 (732s), 12:19 (739s) & 12:05 (725s) | 743s | 9.69
24*12 (15*28) | 13 (296, 468) | 9:51 (591s), 9:32 (572s), 9:46 (586s), 9:45 (585s), 10:02 (602s) & 9:46 (586s) | 587s | 12.3
24*16 (15*21) | 17 (392, 612) | 8:21 (501s), 8:15 (495s), 8:26 (506s), 8:11 (491s), 8:02 (482s) & 8:17 (497s) | 495s | 14.5
24*24 (15*14) | 25 (584, 900) | 6:54 (414s), 6:37 (397s), 6:51 (411s), 6:58 (418s), 7:11 (431s) & 6:50 (410s) | 414s | 17.4
30*20 (12*17) | 26 (608, 936) | Fatal error in PMPI_Isend: Invalid rank, error stack | - | -
30*24 (12*14) | 31 (728, 1,116) | 6:28 (388s), 6:40 (400s), 6:55 (415s), 6:42 (402s), 6:46 (406s) & 6:49 (409s) | 403s | 17.9
36*24 (10*14) | 37 (872, 1,332) | 6:33 (393s), 6:39 (399s), 6:49 (409s), 7:04 (424s), 7:02 (422s) & 7:29 (449s) | 416s | 17.3

Increasing cores/nodes for XIOS

I've taken the configuration

  • NEMO_IPROC=24 * NEMO_JPROC=24 (CICE_BLKX=15 * CICE_BLKY=14)
  • OCN_PPN=24
  • This uses 24 nodes for ocean, and extra nodes are needed for XIOS
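The one_file/multiple_file switch compared in the two tables below is the type attribute on the file_definition element in the XIOS XML (iodef.xml / the included file_def files). A minimal sketch, with the other attributes elided since I'm quoting from memory:

    <!-- sketch of the XIOS file_definition switch -->
    <file_definition type="multiple_file" ...>
      <!-- type="one_file": the XIOS servers write a single combined file per stream -->
      <!-- type="multiple_file": each XIOS server writes its own file, which then
           needs rebuilding into one file afterwards -->
      ...
    </file_definition>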

file_definition type="one_file"

XIOS_NPROC | Nodes for XIOS | Times for first 6 months | Average time for 1 month | Speed (model yrs/day)
8 | 1 | 6:54 (414s), 6:37 (397s), 6:51 (411s), 6:58 (418s), 7:11 (431s) & 6:50 (410s) | 414s | 17.4
16 | 1 | 7:05 (425s), 7:24 (444s), 7:08 (428s), 7:16 (436s), 7:22 (442s) & 7:03 (423s) | 433s | 16.6
16 | 2 | 7:56 (476s), 7:28 (448s), 8:11 (491s), 8:12 (492s), 8:46 (526s) & 8:00 (480s) | 486s | 14.8

file_definition type="multiple_file"

XIOS_NPROC | Nodes for XIOS | Times for first 6 months | Average time for 1 month | Speed (model yrs/day)
8 | 1 | 6:49 (409s), 6:50 (410s), 6:52 (412s), 6:36 (396s), 6:54 (414s) & 6:48 (408s) | 408s | 17.6
16 | 1 | 6:49 (409s), 6:48 (408s), 6:42 (402s), 6:31 (391s), 6:44 (404s) & 6:41 (401s) | 403s | 17.9
18 | 1 | 6:04 (364s), 6:01 (361s), 6:05 (365s), 6:19 (379s), 6:03 (363s) & 5:59 (359s) | 365s | 19.7
18 (2nd try) | 1 | 6:28 (388s), 6:13 (373s), 6:30 (390s), 6:26 (386s), 6:33 (393s) & 6:30 (390s) | 387s | 18.6
19 | 1 | 6:29 (389s), 6:33 (393s), 6:13 (373s), 7:03 (423s), 6:31 (391s) & 6:10 (370s) | 390s | 18.5
20 | 1 | 6:23 (383s), 6:08 (368s), 6:18 (378s), 6:15 (375s), 6:03 (363s) & 6:07 (367s) | 372s | 19.4
20 (2nd try) | 1 | 6:42 (402s), 6:37 (397s), 6:55 (415s), 6:29 (389s), 7:04 (424s) & 6:50 (410s) | 406s | 17.7
22 | 1 | 6:17 (377s), 6:21 (381s), 6:01 (361s), 6:25 (385s), 6:19 (379s) & 6:15 (375s) | 376s | 19.1
24 | 1 | 6:25 (385s), 6:53 (413s), 6:51 (411s), 7:10 (430s), 6:32 (392s) & 6:32 (392s) | 404s | 17.8
16 | 2 | 6:24 (384s), 6:30 (390s), 6:27 (387s), 6:41 (401s), 6:46 (406s) & 6:40 (400s) | 395s | 18.2
24 | 2 | 6:42 (402s), 6:42 (402s), 6:35 (395s), 6:39 (399s), 6:33 (393s) & 6:36 (396s) | 398s | 18.1
24 | 3 | 6:55 (415s), 6:39 (399s), 7:03 (423s), 6:45 (405s), 6:49 (409s) & 6:49 (409s) | 410s | 17.6

As expected, type="multiple_file" looks to be faster than type="one_file". Any gains from using two nodes for XIOS look to be small, if there are any at all, so I'll stick with one node for now.

It's less clear what the best value for XIOS_NPROC is, but 18 is the leading contender based on the numbers above.

Changing the buffer_size for XIOS

For this

  • NEMO_IPROC=24 * NEMO_JPROC=24 (CICE_BLKX=15 * CICE_BLKY=14)
  • OCN_PPN=24
  • type="multiple_file"
  • XIOS_NPROC=18
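The buffer_size here is the XIOS transfer buffer size in bytes. If I remember the layout right, it's set in the variable_definition block of iodef.xml; a sketch for the 30×10⁶ case (the group and variable names are from memory, so check against the suite's own iodef.xml):

    <!-- iodef.xml sketch (XIOS 1.0-style variable block) -->
    <variable_definition>
      <variable_group id="buffer">
        <variable id="buffer_size" type="integer">30000000</variable>
      </variable_group>
    </variable_definition>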

buffer_size (×10⁶) | Ratio in first month | Times for first 6 months | Average time for 1 month | Speed (model yrs/day)
15 | 4.9-10% | 6:37 (397s), 6:33 (393s), 6:32 (392s), 6:35 (395s), 6:50 (410s) & 6:55 (415s) | 400s | 18.0
25 | 13-24% | 6:34 (394s), 6:18 (378s), 6:32 (392s), 6:28 (388s), 6:38 (398s) & 6:35 (395s) | 391s | 18.4
30 | 0.0024-1.9% | 6:28 (388s), 6:13 (373s), 6:30 (390s), 6:26 (386s), 6:33 (393s) & 6:30 (390s) | 387s | 18.6
30 | 1.4-7.6% | 6:28 (388s), 6:34 (394s), 6:18 (378s), 6:19 (379s), 6:16 (376s) & 11:53 (713s) | 438s | 16.4
32 | 1.4-10% | 6:13 (373s), 6:33 (393s), 6:20 (380s), 6:13 (373s), 6:26 (386s) & 6:36 (396s) | 384s | 18.8
35 | 3.6-8.4% | 6:48 (408s), 6:17 (377s), 6:29 (389s), 6:17 (377s), 6:40 (400s) & 6:52 (412s) | 394s | 18.3
40 | 2.7-4.9% | 6:22 (382s), 6:14 (374s), 6:24 (384s), 6:19 (379s), 6:41 (401s) & 6:41 (401s) | 387s | 18.6
60 | 1.9-17% | 6:28 (388s), 8:07 (487s), 6:47 (407s), 7:07 (427s), 6:27 (387s) & 6:30 (390s) | 414s | 17.4

I'm struggling to see any coherent pattern here, but 30×10⁶ looks to be around the right value.

PE remapping

I've created job u-ad938 as a copy of u-ad409, and added in PE remapping. For this

  • NEMO_IPROC=24 * NEMO_JPROC=24 (CICE_BLKX=15 * CICE_BLKY=14)
  • OCN_PPN=24
  • type="multiple_file"
  • XIOS_NPROC=18
  • buffer_size=30×10⁶
  • MPICH_CPUMASK_DISPLAY=0, MPICH_RANK_REORDER_METHOD=3

I have more detailed notes on this on p104 of my `Ocean' book.
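For reference, on the Cray, MPICH_RANK_REORDER_METHOD=3 makes MPICH take a custom rank placement from a file called MPICH_RANK_ORDER in the run directory; the file is a comma-separated list of ranks in the order they should be packed onto the nodes (ranges are also allowed). A sketch of the settings, with an illustrative rather than actual ordering:

    # set in the ocean task's environment (sketch)
    export MPICH_RANK_REORDER_METHOD=3   # 3 = custom order read from ./MPICH_RANK_ORDER
    export MPICH_CPUMASK_DISPLAY=0       # set to 1 to print the CPU binding masks

    # MPICH_RANK_ORDER file: comma-separated ranks in placement order, e.g.
    # 0,1,2,...,23 on the first node, 24,25,...,47 on the second, and so on
    # (this example just reproduces the default SMP ordering)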

Type of PE remapping | Times for first 6 months | Average time for 1 month | Speed (model yrs/day)
No remapping | 7:14 (434s), 7:23 (443s), 7:32 (452s), 7:17 (437s), 7:27 (447s) & 7:14 (434s) | 441s | 16.3
Row major (the same ordering as before, i.e. before an MPICH_RANK_ORDER file was used) | 7:15 (435s), 7:47 (467s), 7:23 (443s), 7:15 (435s), 7:34 (454s) & 7:39 (459s) | 449s | 16.0
Column major | 7:45 (465s), 7:56 (476s), 7:51 (471s), 7:42 (462s), 7:33 (453s) & 7:53 (473s) | 467s | 15.4
Minimising neighbours on other nodes | 7:50 (470s), 7:12 (432s), 7:44 (464s), 7:19 (439s), 7:29 (449s) & 7:27 (447s) | 450s | 16.0

I've been rather puzzled by why all the times above look slow, so I ran u-ad409 again; it averaged 470s per month, so it looks like the HPCs are just slow at the moment.

From the numbers above, it doesn't seem that PE remapping helps for this run.

Adding in Matt Glover's optimisation suggestions

Matt sent me an e-mail on 12 July 2016 with some optimisation suggestions. To test these, I'm going to run u-ad938 with

  • NEMO_IPROC=24 * NEMO_JPROC=24 (CICE_BLKX=15 * CICE_BLKY=14)
  • OCN_PPN=24
  • type="multiple_file"
  • XIOS_NPROC=18
  • buffer_size=30×10⁶

Change | Times for first 6 months | Average time for 1 month | Speed (model yrs/day)
Added -hipa3 to fcflags_nemo_overrides | 7:57 (477s), 8:30 (510s), 8:19 (499s), 8:09 (489s), 8:05 (485s) & 8:11 (491s) | 8:12 (492s) | 14.6
Removing -hipa3 | 8:36 (516s), 9:20 (560s), 8:41 (521s), 8:47 (527s), 8:17 (497s) & 8:19 (499s) | 8:40 (520s) | 13.8
Adding -hipa3 and Matt Glover's MEDUSA branch | 11:01 (661s), exceeded wall time in 2nd job | 11:01 (661s) | 10.9

It looks like -hipa3 might save a bit of time, although it's unclear why the speed of this run seems to have dropped again (the previous tests were done about a month ago). Maybe Matt's branch speeds the code up for a short time, but there's clearly a problem for longer runs. I think we'll need to wait for Matt to return from leave before progressing further with this.
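For reference, -hipa3 is the Cray Fortran inter-procedural analysis flag at level 3 (the same as -h ipa3). In this suite it's passed in through fcflags_nemo_overrides; a sketch of the sort of entry used, assuming it sits in the env section of the fcm_make_ocean app (the exact location is from memory):

    # app/fcm_make_ocean/rose-app.conf (sketch)
    [env]
    fcflags_nemo_overrides=-hipa3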

Returning to Maff's optimisations

Maff and the rest of us have returned to this work (it's now 28 September 2016). There was a problem with one of Maff's changes in his working branch which caused the code to hang, but he's fixed that.

With the same configuration as above, I've run some more tests. Maff's branch is /home/h02/frmf/public_working_copies/dev_r5518_NOC_MEDUSA_Stable.

Change | Times for first 6 months | Average time for 1 month | Speed (model yrs/day)
u-ad409, new benchmark test (performance of the HPC does seem to vary fairly significantly over time) | 7:13 (433s), 6:53 (413s), 7:18 (438s), 7:05 (425s), 7:20 (440s) & 7:42 (462s) | 7:15 (435s) | 16.6
u-ad938, adding -hipa3 and Maff's MEDUSA branch | 7:12 (432s), 7:13 (433s), 7:10 (430s), 7:18 (438s), 8:35 (515s) & 7:48 (468s) | 7:33 (453s) | 15.9
u-ad938, -hipa3, Maff's MEDUSA branch & MED_ATM_FORCING=/data/d02/frmf/ATM-FORCING_eORCA1/SPIN-UP_GA7_forcings | 7:11 (431s), 6:46 (406s), 6:19 (379s), 6:14 (374s), 6:38 (398s) & 6:56 (416s) | 6:41 (401s) | 18.0