I'm using u-ad409.
It's known that the limiting factor on running NEMO is memory bandwidth. This means it can be quicker to run NEMO by under-populating nodes: although cores are left idle (effectively wasted), the memory bandwidth available per `active' core is greater. I'm running some tests to measure the impact of this.
The smallest number which is divisible by both 32 (the cores on a Haswell node) and 36 (the cores on a Broadwell node) is 288. This number is also divisible by 24 and many other numbers, and is near the high end of what we think we can run ORCA1 at. Hence it seems to be a good choice for the number of cores to use.
The variable OCN_PPN needs adding to rose-suite.conf to do these tests, e.g. add OCN_PPN=24 if you want to use 24 cores per node.
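As a sanity check on the arithmetic, here's a quick sketch (not part of the suite). It assumes the 8 XIOS servers take one extra node to themselves, as in the runs below; the node counts it prints match the `Total nodes' column of the table.

```python
# Quick check of the core/node arithmetic above. Assumes the XIOS servers
# (XIOS_NPROC=8) sit on one additional node, as in the runs below.
from math import gcd, ceil

ncores = 32 * 36 // gcd(32, 36)   # lcm(32, 36) = 288, the NEMO core count used
print(ncores)

for ocn_ppn in (36, 32, 24, 18, 12):          # OCN_PPN values tested below
    nemo_nodes = ceil(ncores / ocn_ppn)
    print(ocn_ppn, nemo_nodes + 1)            # 36->9, 32->10, 24->13, 18->17, 12->25
```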
Cores used per Broadwell node | XIOS_NPROC | Total nodes | Times for first 6 months | Average time for 1 month | Speed (model yrs/day) |
---|---|---|---|---|---|
36 | 0 | 8 | 17:10 (1,030s), 18:15 (1,095s), 19:36 (1,176s), 20:58 (1,258s), 19:52 (1,192s) & 20:20 (1,220s) | 1,162s | 6.20 |
36 | 8 | 9 | 11:26 (686s), 11:23 (683s), 11:16 (676s), 11:28 (688s), 10:59 (659s) & 11:15 (675s) | 678s | 10.6 |
32 | 8 | 10 | 10:22 (622s), 11:07 (667s), 10:47 (647s), 10:40 (640s), 10:38 (638s) & 10:36 (636s) | 642s | 11.2 |
24 | 8 | 13 | 9:51 (591s), 9:32 (572s), 9:46 (586s), 9:45 (585s), 10:02 (602s) & 9:46 (586s) | 587s | 12.3 |
18 | 8 | 17 | 9:31 (571s), 9:42 (582s), 9:36 (576s), 9:38 (578s), 9:40 (580s) & 10:07 (607s) | 582s | 12.4 |
12 | 8 | 25 | 9:54 (594s), 9:58 (598s), 9:43 (583s), 9:34 (574s), 9:27 (567s) & 9:17 (557s) | 579s | 12.4 |
The performance doesn't seem to improve much by under-populating below 24 cores per 36-core Broadwell node; the 24-cores-per-node run looks to be about 16% faster than the fully populated run while using 44% more nodes (both runs have XIOS_NPROC=8).
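For reference, the `Speed (model yrs/day)' column and the percentages just quoted come out of the following conversion (a sketch, assuming each cycle is one model month):

```python
# How the speed column is derived: seconds of wall clock per model year
# (12 x the average month time) divided into the 86,400 seconds in a day.
def yrs_per_day(avg_month_seconds):
    return 86400.0 / (12 * avg_month_seconds)

print(round(yrs_per_day(678), 1), round(yrs_per_day(587), 1))  # 10.6 and 12.3

speedup = yrs_per_day(587) / yrs_per_day(678) - 1   # ~16% faster at 24 cores/node
extra   = 13 / 9 - 1                                # ~44% more nodes (13 vs 9)
print(f"{speedup:.0%} faster for {extra:.0%} more nodes")
```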
I'll use 24 cores per Broadwell node to try to avoid the memory bandwidth limitations, and I'll stick with XIOS_NPROC=8. Next I've varied the NEMO and CICE decompositions:
NEMO_IPROC * NEMO_JPROC (CICE_BLKX * CICE_BLKY) | Total nodes (cores used, cores taken) | Times for first 6 months | Average time for 1 month | Speed (model yrs/day) |
---|---|---|---|---|
12*16 (30*21) | 9 (200, 324) | 12:35 (755s), 12:24 (744s), 12:43 (763s), 12:12 (732s), 12:19 (739s) & 12:05 (725s) | 743s | 9.69 |
24*12 (15*28) | 13 (296, 468) | 9:51 (591s), 9:32 (572s), 9:46 (586s), 9:45 (585s), 10:02 (602s) & 9:46 (586s) | 587s | 12.3 |
24*16 (15*21) | 17 (392, 612) | 8:21 (501s), 8:15 (495s), 8:26 (506s), 8:11 (491s), 8:02 (482s) & 8:17 (497s) | 495s | 14.5 |
24*24 (15*14) | 25 (584, 900) | 6:54 (414s), 6:37 (397s), 6:51 (411s), 6:58 (418s), 7:11 (431s) & 6:50 (410s) | 414s | 17.4 |
30*20 (12*17) | 26 (608, 936) | Fatal error in PMPI_Isend: Invalid rank, error stack | | |
30*24 (12*14) | 31 (728, 1,116) | 6:28 (388s), 6:40 (400s), 6:55 (415s), 6:42 (402s), 6:46 (406s) & 6:49 (409s) | 403s | 17.9 |
36*24 (10*14) | 37 (872, 1,332) | 6:33 (393s), 6:39 (399s), 6:49 (409s), 7:04 (424s), 7:02 (422s) & 7:29 (449s) | 416s | 17.3 |
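The `Total nodes (cores used, cores taken)' column is just the following bookkeeping (a sketch assuming OCN_PPN=24, XIOS_NPROC=8 on a node of its own, and 36-core Broadwell nodes):

```python
# Reproduce the "Total nodes (cores used, cores taken)" column.
from math import ceil

def resources(iproc, jproc, ocn_ppn=24, xios_nproc=8, cores_per_node=36):
    nemo_cores = iproc * jproc
    nodes = ceil(nemo_cores / ocn_ppn) + 1    # +1 node reserved for XIOS
    cores_used = nemo_cores + xios_nproc      # MPI ranks actually doing work
    cores_taken = nodes * cores_per_node      # cores the job occupies, idle ones included
    return nodes, cores_used, cores_taken

print(resources(24, 24))   # (25, 584, 900), matching the 24*24 (15*14) row
print(resources(30, 24))   # (31, 728, 1116)
```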
I've taken the 24*24 (15*14) configuration forward and run some tests varying XIOS_NPROC and the number of nodes used for XIOS.
XIOS_NPROC | Nodes for XIOS | Times for first 6 months | Average time for 1 month | Speed (model yrs/day) |
---|---|---|---|---|
8 | 1 | 6:54 (414s), 6:37 (397s), 6:51 (411s), 6:58 (418s), 7:11 (431s) & 6:50 (410s) | 414s | 17.4 |
16 | 1 | 7:05 (425s), 7:24 (444s), 7:08 (428s), 7:16 (436s), 7:22 (442s) & 7:03 (423s) | 433s | 16.6 |
16 | 2 | 7:56 (476s), 7:28 (448s), 8:11 (491s), 8:12 (492s), 8:46 (526s) & 8:00 (480s) | 486s | 14.8 |
XIOS_NPROC | Nodes for XIOS | Times for first 6 months | Average time for 1 month | Speed (model yrs/day) |
---|---|---|---|---|
8 | 1 | 6:49 (409s), 6:50 (410s), 6:52 (412s), 6:36 (396s), 6:54 (414s) & 6:48 (408s) | 408s | 17.6 |
16 | 1 | 6:49 (409s), 6:48 (408s), 6:42 (402s), 6:31 (391s), 6:44 (404s) & 6:41 (401s) | 403s | 17.9 |
18 | 1 | 6:04 (364s), 6:01 (361s), 6:05 (365s), 6:19 (379s), 6:03 (363s) & 5:59 (359s) | 365s | 19.7 |
18 (2nd try) | 1 | 6:28 (388s), 6:13 (373s), 6:30 (390s), 6:26 (386s), 6:33 (393s) & 6:30 (390s) | 387s | 18.6 |
19 | 1 | 6:29 (389s), 6:33 (393s), 6:13 (373s), 7:03 (423s), 6:31 (391s) & 6:10 (370s) | 390s | 18.5 |
20 | 1 | 6:23 (383s), 6:08 (368s), 6:18 (378s), 6:15 (375s), 6:03 (363s) & 6:07 (367s) | 372s | 19.4 |
20 (2nd try) | 1 | 6:42 (402s), 6:37 (397s), 6:55 (415s), 6:29 (389s), 7:04 (424s) & 6:50 (410s) | 406s | 17.7 |
22 | 1 | 6:17 (377s), 6:21 (381s), 6:01 (361s), 6:25 (385s), 6:19 (379s) & 6:15 (375s) | 376s | 19.1 |
24 | 1 | 6:25 (385s), 6:53 (413s), 6:51 (411s), 7:10 (430s), 6:32 (392s) & 6:32 (392s) | 404s | 17.8 |
16 | 2 | 6:24 (384s), 6:30 (390s), 6:27 (387s), 6:41 (401s), 6:46 (406s) & 6:40 (400s) | 395s | 18.2 |
24 | 2 | 6:42 (402s), 6:42 (402s), 6:35 (395s), 6:39 (399s), 6:33 (393s) & 6:36 (396s) | 398s | 18.1 |
24 | 3 | 6:55 (415s), 6:39 (399s), 7:03 (423s), 6:45 (405s), 6:49 (409s) & 6:49 (409s) | 410s | 17.6 |
As expected, type="multiple_file" looks to be faster than type="one_file". Any gains from using two nodes for XIOS look to be small, so I'll stick with one node for now.
It's less clear what the best value for XIOS_NPROC is, but 18 is the leading contender based on the numbers above.
Next I've run some tests varying buffer_size:
buffer_size (×10^6) | Ratio in first month | Times for first 6 months | Average time for 1 month | Speed (model yrs/day) |
---|---|---|---|---|
15 | 4.9-10% | 6:37 (397s), 6:33 (393s), 6:32 (392s), 6:35 (395s), 6:50 (410s) & 6:55 (415s) | 400s | 18.0 |
25 | 13-24% | 6:34 (394s), 6:18 (378s), 6:32 (392s), 6:28 (388s), 6:38 (398s) & 6:35 (395s) | 391s | 18.4 |
30 | 0.0024-1.9% | 6:28 (388s), 6:13 (373s), 6:30 (390s), 6:26 (386s), 6:33 (393s) & 6:30 (390s) | 387s | 18.6 |
30 | 1.4-7.6% | 6:28 (388s), 6:34 (394s), 6:18 (378s), 6:19 (379s), 6:16 (379s) & 11:53 (713s) | 438s | 16.4 |
32 | 1.4-10% | 6:13 (373s), 6:33 (393s), 6:20 (380s), 6:13 (373s), 6:26 (386s) & 6:36 (396s) | 384s | 18.8 |
35 | 3.6-8.4% | 6:48 (408s), 6:17 (377s), 6:29 (389s), 6:17 (377s), 6:40 (400s) & 6:52 (412s) | 394s | 18.3 |
40 | 2.7-4.9% | 6:22 (382s), 6:14 (374s), 6:24 (384s), 6:19 (379s), 6:41 (401s) & 6:41 (401s) | 387s | 18.6 |
60 | 1.9-17% | 6:28 (388s), 8:07 (487s), 6:47 (407s), 7:07 (427s), 6:27 (387s) & 6:30 (390s) | 414s | 17.4 |
I'm struggling to see any coherent pattern here, but 30×10^6 looks to be around the right value.
I've created job u-ad938 as a copy of u-ad409, and added PE remapping. I have more detailed notes on this on p104 of my `Ocean' book.
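As an illustration of what the remapping involves, here's a rough sketch of how a `column major' MPICH_RANK_ORDER file could be generated. It assumes Cray's MPICH_RANK_REORDER_METHOD=3 convention (the file lists ranks in the order they are packed onto nodes), that NEMO's ranks are numbered row-major across the IPROC x JPROC decomposition, and that the XIOS server ranks follow the NEMO ranks; none of that detail is recorded in these notes, so treat it as a sketch rather than what u-ad938 actually does.

```python
# Rough sketch of generating a "column major" MPICH_RANK_ORDER file.
# Assumptions (not recorded in these notes): MPICH_RANK_REORDER_METHOD=3 reads
# ranks from this file in placement order, NEMO ranks are numbered row-major
# (rank = j*iproc + i), and the XIOS server ranks follow the NEMO ranks.
def column_major_order(iproc, jproc, xios_nproc=8):
    order = [j * iproc + i for i in range(iproc) for j in range(jproc)]
    nemo = iproc * jproc
    order += list(range(nemo, nemo + xios_nproc))   # leave XIOS ranks at the end
    return order

with open("MPICH_RANK_ORDER", "w") as f:
    f.write(",".join(str(r) for r in column_major_order(24, 24)) + "\n")
```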
Type of PE remapping | Times for first 6 months | Average time for 1 month | Speed (model yrs/day) |
---|---|---|---|
No remapping | 7:14 (434s), 7:23 (443s), 7:32 (452s), 7:17 (437s), 7:27 (447s) & 7:14 (434s) | 441s | 16.3 |
Row major (i.e. the same ordering as before introducing the MPICH_RANK_ORDER file) | 7:15 (435s), 7:47 (467s), 7:23 (443s), 7:15 (435s), 7:34 (454s) & 7:39 (459s) | 449s | 16.0 |
Column major | 7:45 (465s), 7:56 (476s), 7:51 (471s), 7:42 (462s), 7:33 (453s) & 7:53 (473s) | 467s | 15.4 |
Minimising neighbours on other nodes | 7:50 (470s), 7:12 (432s), 7:44 (464s), 7:19 (439s), 7:29 (449s) & 7:27 (447s) | 450s | 16.0 |
I've been rather puzzled by why all the times above look slow, so I ran u-ad409 again and it ran at an average of 470s - hence it looks like the HPCs are just slow at the moment.
From the numbers above, it doesn't seem that PE remapping helps for this run.
Matt sent me an e-mail on 12 July 2016 with some optimisation suggestions. To test these, I'm going to run u-ad938 with the following changes:
Change | Times for first 6 months | Average time for 1 month | Speed (model yrs/day) |
---|---|---|---|
Added -hipa3 to fcflags_nemo_overrides | 7:57 (477s), 8:30 (510s), 8:19 (499s), 8:09 (489s), 8:05 (485s) & 8:11 (491s) | 8:12 (492s) | 14.6 |
Removing -hipa3 | 8:36 (516s), 9:20 (560s), 8:41 (521s), 8:47 (527s), 8:17 (497s) & 8:19 (499s) | 8:40 (520s) | 13.8 |
Adding -hipa3 and Matt Glover's MEDUSA branch | 11:01 (661s), exceeded wall time in 2nd job | 11:01 (661s) | 10.9 |
It looks like -hipa3 might save a bit of time, although it's unclear why the speed of this run seems to have dropped again (the previous tests were done about a month ago). Maybe Matt's branch speeds the code up for a short time, but there's clearly a problem for longer runs. I think we'll need to wait for Matt to return from leave before progressing further with this.
Maff and the rest of us have returned to this work (it's now 28 September 2016). There was a problem with one of Maff's changes in his working branch which caused the code to hang, but he's fixed that.
With the same configuration as above, I've run some more tests. Maff's branch is /home/h02/frmf/public_working_copies/dev_r5518_NOC_MEDUSA_Stable.
Change | Times for first 6 months | Average time for 1 month | Speed (model yrs/day) |
---|---|---|---|
u-ad409, new benchmark test (performance of HPC does seem to vary fairly significantly over time) | 7:13 (433s), 6:53 (413s), 7:18 (438s), 7:05 (425s), 7:20 (440s) & 7:42 (462s) | 7:15 (435s) | 16.6 |
u-ad938, Adding -hipa3 and Maff's MEDUSA branch | 7:12 (432s), 7:13 (433s), 7:10 (430s), 7:18 (438s), 8:35 (515s) & 7:48 (468s) | 7:33 (453s) | 15.9 |
u-ad938, -hipa3, Maff's MEDUSA branch & MED_ATM_FORCING=/data/d02/frmf/ATM-FORCING_eORCA1/SPIN-UP_GA7_forcings | 7:11 (431s), 6:46 (406s), 6:19 (379s), 6:14 (374s), 6:38 (398s) & 6:56 (416s) | 6:41 (401s) | 18.0 |