MEDUSA spin-up speed tests 1

I'm using u-ad409.

Under-populating nodes for ORCA1

It's known that the limiting factor when running NEMO is memory bandwidth. This means it can be quicker to run NEMO by under-populating nodes: although some cores are left idle (effectively wasted), the memory bandwidth available per `active' core is greater. I'm running some tests to measure the impact of this.

The smallest number divisible by both 32 (cores on Haswell nodes) and 36 (cores on Broadwell nodes) is 288. This number is also divisible by 24 and many other numbers, and is near the high end of the core counts we think ORCA1 can usefully run on, so it seems a good choice for the number of cores to use

  • NEMO_IPROC=24 * NEMO_JPROC=12
  • CICE_BLKX=15, CICE_BLKY=28
  • xcfl01
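For reference, 24 × 12 = 288 NEMO MPI ranks. The CICE block sizes appear to follow from the NEMO decomposition: assuming the (e)ORCA1 CICE grid is roughly 360 × 332 points (I haven't double-checked the exact extents), CICE_BLKX is 360/NEMO_IPROC rounded up (360/24 = 15) and CICE_BLKY is 332/NEMO_JPROC rounded up (332/12 = 27.7 → 28), which matches the values above and in the tables below.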

The variable OCN_PPN needs to be added to rose-suite.conf to run these tests; e.g. add OCN_PPN=24 to use 24 cores per node, as in the sketch below.
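For the 24*12 test the relevant rose-suite.conf entries look something like the sketch below (the variable names are the ones used in this suite; the grouping and any other entries are from memory, so treat this as illustrative rather than a copy of u-ad409):

    # rose-suite.conf (sketch; these sit with the other jinja2 suite
    # variables, typically under [jinja2:suite.rc])
    NEMO_IPROC=24   # NEMO domain decomposition, i direction
    NEMO_JPROC=12   # NEMO domain decomposition, j direction
    CICE_BLKX=15    # CICE block size in x
    CICE_BLKY=28    # CICE block size in y
    XIOS_NPROC=8    # cores for the XIOS IO servers
    OCN_PPN=24      # new variable: ocean cores used per node (under-population)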

Cores used per Broadwell node | XIOS_NPROC | Total nodes | Times for first 6 months | Average time for 1 month | Speed (model yrs/day)
36 | 0 | 8 | 17:10 (1,030s), 18:15 (1,095s), 19:36 (1,176s), 20:58 (1,258s), 19:52 (1,192s) & 20:20 (1,220s) | 1,162s | 6.20
36 | 8 | 9 | 11:26 (686s), 11:23 (683s), 11:16 (676s), 11:28 (688s), 10:59 (659s) & 11:15 (675s) | 678s | 10.6
32 | 8 | 10 | 10:22 (622s), 11:07 (667s), 10:47 (647s), 10:40 (640s), 10:38 (638s) & 10:36 (636s) | 642s | 11.2
24 | 8 | 13 | 9:51 (591s), 9:32 (572s), 9:46 (586s), 9:45 (585s), 10:02 (602s) & 9:46 (586s) | 587s | 12.3
18 | 8 | 17 | 9:31 (571s), 9:42 (582s), 9:36 (576s), 9:38 (578s), 9:40 (580s) & 10:07 (607s) | 582s | 12.4
12 | 8 | 25 | 9:54 (594s), 9:58 (598s), 9:43 (583s), 9:34 (574s), 9:27 (567s) & 9:17 (557s) | 579s | 12.4
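For reference, the speed column is (as far as I can tell) just the month-average converted into model years per day: speed = 86,400 s/day ÷ (12 × average seconds per model month). For example, 86,400 / (12 × 678) ≈ 10.6 model yrs/day for the fully populated XIOS_NPROC=8 row.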

Performance doesn't seem to improve much by under-populating below 24 cores per 36-core Broadwell node. The 24-cores-per-node run looks to be about 16% faster than the fully populated run while using 44% more nodes (both runs have XIOS_NPROC=8).

Increasing usage of cores

I'll use 24 cores per Broadwell node to try to avoid the memory bandwidth limitation, and I'll stick with XIOS_NPROC=8.
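To read the "Total nodes (cores used, cores taken)" column below: cores used = NEMO_IPROC × NEMO_JPROC + XIOS_NPROC; total nodes = NEMO cores ÷ 24 per node, plus one node for XIOS; cores taken = total nodes × 36 (the full Broadwell node size). For example, for the 24*24 decomposition: 576 + 8 = 584 cores used, 576/24 + 1 = 25 nodes, and 25 × 36 = 900 cores taken.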

NEMO_IPROC * NEMO_JPROC (CICE_BLKX * CICE_BLKY) | Total nodes (cores used, cores taken) | Times for first 6 months | Average time for 1 month | Speed (model yrs/day)
12*16 (30*21) | 9 (200, 324) | 12:35 (755s), 12:24 (744s), 12:43 (763s), 12:12 (732s), 12:19 (739s) & 12:05 (725s) | 743s | 9.69
24*12 (15*28) | 13 (296, 468) | 9:51 (591s), 9:32 (572s), 9:46 (586s), 9:45 (585s), 10:02 (602s) & 9:46 (586s) | 587s | 12.3
24*16 (15*21) | 17 (392, 612) | 8:21 (501s), 8:15 (495s), 8:26 (506s), 8:11 (491s), 8:02 (482s) & 8:17 (497s) | 495s | 14.5
24*24 (15*14) | 25 (584, 900) | 6:54 (414s), 6:37 (397s), 6:51 (411s), 6:58 (418s), 7:11 (431s) & 6:50 (410s) | 414s | 17.4
30*20 (12*17) | 26 (608, 936) | Fatal error in PMPI_Isend: Invalid rank, error stack | - | -
30*24 (12*14) | 31 (728, 1,116) | 6:28 (388s), 6:40 (400s), 6:55 (415s), 6:42 (402s), 6:46 (406s) & 6:49 (409s) | 403s | 17.9
36*24 (10*14) | 37 (872, 1,332) | 6:33 (393s), 6:39 (399s), 6:49 (409s), 7:04 (424s), 7:02 (422s) & 7:29 (449s) | 416s | 17.3

Increasing cores/nodes for XIOS

I've taken the configuration

  • NEMO_IPROC=24 * NEMO_JPROC=24 (CICE_BLKX=15 * CICE_BLKY=14)
  • OCN_PPN=24
  • This uses 24 nodes for ocean, and extra nodes are needed for XIOS
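The one_file/multiple_file switch compared in the two tables below is the type attribute on the file_definition element in the XIOS XML (iodef.xml / the included file_def files). A minimal sketch, with the other attributes elided since I'm quoting from memory:

    <!-- sketch of the XIOS file_definition switch -->
    <file_definition type="multiple_file" ...>
      <!-- type="one_file": the XIOS servers write a single combined file per stream -->
      <!-- type="multiple_file": each XIOS server writes its own file, which then
           needs rebuilding into one file afterwards -->
      ...
    </file_definition>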

file_definition type="one_file"

XIOS_NPROC | Nodes for XIOS | Times for first 6 months | Average time for 1 month | Speed (model yrs/day)
8 | 1 | 6:54 (414s), 6:37 (397s), 6:51 (411s), 6:58 (418s), 7:11 (431s) & 6:50 (410s) | 414s | 17.4
16 | 1 | 7:05 (425s), 7:24 (444s), 7:08 (428s), 7:16 (436s), 7:22 (442s) & 7:03 (423s) | 433s | 16.6
16 | 2 | 7:56 (476s), 7:28 (448s), 8:11 (491s), 8:12 (492s), 8:46 (526s) & 8:00 (480s) | 486s | 14.8

file_definition type="multiple_file"

XIOS_NPROC | Nodes for XIOS | Times for first 6 months | Average time for 1 month | Speed (model yrs/day)
8 | 1 | 6:49 (409s), 6:50 (410s), 6:52 (412s), 6:36 (396s), 6:54 (414s) & 6:48 (408s) | 408s | 17.6
16 | 1 | 6:49 (409s), 6:48 (408s), 6:42 (402s), 6:31 (391s), 6:44 (404s) & 6:41 (401s) | 403s | 17.9
18 | 1 | 6:04 (364s), 6:01 (361s), 6:05 (365s), 6:19 (379s), 6:03 (363s) & 5:59 (359s) | 365s | 19.7
18 (2nd try) | 1 | 6:28 (388s), 6:13 (373s), 6:30 (390s), 6:26 (386s), 6:33 (393s) & 6:30 (390s) | 387s | 18.6
19 | 1 | 6:29 (389s), 6:33 (393s), 6:13 (373s), 7:03 (423s), 6:31 (391s) & 6:10 (370s) | 390s | 18.5
20 | 1 | 6:23 (383s), 6:08 (368s), 6:18 (378s), 6:15 (375s), 6:03 (363s) & 6:07 (367s) | 372s | 19.4
20 (2nd try) | 1 | 6:42 (402s), 6:37 (397s), 6:55 (415s), 6:29 (389s), 7:04 (424s) & 6:50 (410s) | 406s | 17.7
22 | 1 | 6:17 (377s), 6:21 (381s), 6:01 (361s), 6:25 (385s), 6:19 (379s) & 6:15 (375s) | 376s | 19.1
24 | 1 | 6:25 (385s), 6:53 (413s), 6:51 (411s), 7:10 (430s), 6:32 (392s) & 6:32 (392s) | 404s | 17.8
16 | 2 | 6:24 (384s), 6:30 (390s), 6:27 (387s), 6:41 (401s), 6:46 (406s) & 6:40 (400s) | 395s | 18.2
24 | 2 | 6:42 (402s), 6:42 (402s), 6:35 (395s), 6:39 (399s), 6:33 (393s) & 6:36 (396s) | 398s | 18.1
24 | 3 | 6:55 (415s), 6:39 (399s), 7:03 (423s), 6:45 (405s), 6:49 (409s) & 6:49 (409s) | 410s | 17.6

As expected, type="multiple_file" looks to be faster than type="one_file". Any gains from using two nodes for XIOS look to be small, if there are any at all, so I'll stick with one node for now.

It's less clear what the best value for XIOS_NPROC is, but 18 is the leading contender based on the numbers above.

Changing the buffer_size for XIOS

For this

  • NEMO_IPROC=24 * NEMO_JPROC=24 (CICE_BLKX=15 * CICE_BLKY=14)
  • OCN_PPN=24
  • type="multiple_file"
  • XIOS_NPROC=18
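The buffer_size here is the XIOS transfer buffer size in bytes. If I remember the layout right, it's set in the variable_definition block of iodef.xml; a sketch for the 30×10⁶ case (the group and variable names are from memory, so check against the suite's own iodef.xml):

    <!-- iodef.xml sketch (XIOS 1.0-style variable block) -->
    <variable_definition>
      <variable_group id="buffer">
        <variable id="buffer_size" type="integer">30000000</variable>
      </variable_group>
    </variable_definition>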

buffer_size (×10⁶) | Ratio in first month | Times for first 6 months | Average time for 1 month | Speed (model yrs/day)
15 | 4.9-10% | 6:37 (397s), 6:33 (393s), 6:32 (392s), 6:35 (395s), 6:50 (410s) & 6:55 (415s) | 400s | 18.0
25 | 13-24% | 6:34 (394s), 6:18 (378s), 6:32 (392s), 6:28 (388s), 6:38 (398s) & 6:35 (395s) | 391s | 18.4
30 | 0.0024-1.9% | 6:28 (388s), 6:13 (373s), 6:30 (390s), 6:26 (386s), 6:33 (393s) & 6:30 (390s) | 387s | 18.6
30 | 1.4-7.6% | 6:28 (388s), 6:34 (394s), 6:18 (378s), 6:19 (379s), 6:16 (376s) & 11:53 (713s) | 438s | 16.4
32 | 1.4-10% | 6:13 (373s), 6:33 (393s), 6:20 (380s), 6:13 (373s), 6:26 (386s) & 6:36 (396s) | 384s | 18.8
35 | 3.6-8.4% | 6:48 (408s), 6:17 (377s), 6:29 (389s), 6:17 (377s), 6:40 (400s) & 6:52 (412s) | 394s | 18.3
40 | 2.7-4.9% | 6:22 (382s), 6:14 (374s), 6:24 (384s), 6:19 (379s), 6:41 (401s) & 6:41 (401s) | 387s | 18.6
60 | 1.9-17% | 6:28 (388s), 8:07 (487s), 6:47 (407s), 7:07 (427s), 6:27 (387s) & 6:30 (390s) | 414s | 17.4

I'm struggling to see any coherent pattern here, but 30×10⁶ looks to be around the right value.

PE remapping

I've created job u-ad938 as a copy of u-ad409, and added in PE remapping. For this

  • NEMO_IPROC=24 * NEMO_JPROC=24 (CICE_BLKX=15 * CICE_BLKY=14)
  • OCN_PPN=24
  • type="multiple_file"
  • XIOS_NPROC=18
  • buffer_size=30×10⁶
  • MPICH_CPUMASK_DISPLAY=0, MPICH_RANK_REORDER_METHOD=3

I have more detailed notes on this on p104 of my `Ocean' book.
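For reference, on the Cray, MPICH_RANK_REORDER_METHOD=3 makes MPICH take a custom rank placement from a file called MPICH_RANK_ORDER in the run directory; the file is a comma-separated list of ranks in the order they should be packed onto the nodes (ranges are also allowed). A sketch of the settings, with an illustrative rather than actual ordering:

    # set in the ocean task's environment (sketch)
    export MPICH_RANK_REORDER_METHOD=3   # 3 = custom order read from ./MPICH_RANK_ORDER
    export MPICH_CPUMASK_DISPLAY=0       # set to 1 to print the CPU binding masks

    # MPICH_RANK_ORDER file: comma-separated ranks in placement order, e.g.
    # 0,1,2,...,23 on the first node, 24,25,...,47 on the second, and so on
    # (this example just reproduces the default SMP ordering)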

Type of PE remapping | Times for first 6 months | Average time for 1 month | Speed (model yrs/day)
No remapping | 7:14 (434s), 7:23 (443s), 7:32 (452s), 7:17 (437s), 7:27 (447s) & 7:14 (434s) | 441s | 16.3
Row major (the same ordering as before, i.e. before an MPICH_RANK_ORDER file was used) | 7:15 (435s), 7:47 (467s), 7:23 (443s), 7:15 (435s), 7:34 (454s) & 7:39 (459s) | 449s | 16.0
Column major | 7:45 (465s), 7:56 (476s), 7:51 (471s), 7:42 (462s), 7:33 (453s) & 7:53 (473s) | 467s | 15.4
Minimising neighbours on other nodes | 7:50 (470s), 7:12 (432s), 7:44 (464s), 7:19 (439s), 7:29 (449s) & 7:27 (447s) | 450s | 16.0

I've been rather puzzled by why all the times above look slow, so I ran u-ad409 again; it averaged 470s per month, so it looks like the HPCs are just slow at the moment.

From the numbers above, it doesn't seem that PE remapping helps for this run.

Adding in Matt Glover's optimisation suggestions

Matt sent me an e-mail on 12 July 2016 with some optimisation suggestions. To test these, I'm going to run u-ad938 with

  • NEMO_IPROC=24 * NEMO_JPROC=24 (CICE_BLKX=15 * CICE_BLKY=14)
  • OCN_PPN=24
  • type="multiple_file"
  • XIOS_NPROC=18
  • buffer_size=30×10⁶

Change | Times for first 6 months | Average time for 1 month | Speed (model yrs/day)
Added -hipa3 to fcflags_nemo_overrides | 7:57 (477s), 8:30 (510s), 8:19 (499s), 8:09 (489s), 8:05 (485s) & 8:11 (491s) | 8:12 (492s) | 14.6
Removing -hipa3 | 8:36 (516s), 9:20 (560s), 8:41 (521s), 8:47 (527s), 8:17 (497s) & 8:19 (499s) | 8:40 (520s) | 13.8
Adding -hipa3 and Matt Glover's MEDUSA branch | 11:01 (661s), exceeded wall time in 2nd job | 11:01 (661s) | 10.9

It looks like -hipa3 might save a bit of time, although it's unclear why the speed of this run seems to have dropped again (the previous tests were done about a month ago). Maybe Matt's branch speeds the code up for a short time, but there's clearly a problem for longer runs. I think we'll need to wait for Matt to return from leave before progressing further with this.
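For reference, -hipa3 is the Cray Fortran inter-procedural analysis flag at level 3 (the same as -h ipa3). In this suite it's passed in through fcflags_nemo_overrides; a sketch of the sort of entry used, assuming it sits in the env section of the fcm_make_ocean app (the exact location is from memory):

    # app/fcm_make_ocean/rose-app.conf (sketch)
    [env]
    fcflags_nemo_overrides=-hipa3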

Returning to Maff's optimisations

Maff and the rest of us have returned to this work (it's now 28 September 2016). There was a problem with one of Maff's changes in his working branch which caused the code to hang, but he's fixed that.

With the same configuration as above, I've run some more tests. Maff's branch is /home/h02/frmf/public_working_copies/dev_r5518_NOC_MEDUSA_Stable.

Change | Times for first 6 months | Average time for 1 month | Speed (model yrs/day)
u-ad409, new benchmark test (performance of the HPC does seem to vary fairly significantly over time) | 7:13 (433s), 6:53 (413s), 7:18 (438s), 7:05 (425s), 7:20 (440s) & 7:42 (462s) | 7:15 (435s) | 16.6
u-ad938, adding -hipa3 and Maff's MEDUSA branch | 7:12 (432s), 7:13 (433s), 7:10 (430s), 7:18 (438s), 8:35 (515s) & 7:48 (468s) | 7:33 (453s) | 15.9
u-ad938, -hipa3, Maff's MEDUSA branch & MED_ATM_FORCING=/data/d02/frmf/ATM-FORCING_eORCA1/SPIN-UP_GA7_forcings | 7:11 (431s), 6:46 (406s), 6:19 (379s), 6:14 (374s), 6:38 (398s) & 6:56 (416s) | 6:41 (401s) | 18.0