I'm using u-ac283 for these tests; all runs are on Broadwell.
NMPPE*NMPPN (cores, nodes) | Model run length | Total time | Speed (model years per day) |
---|---|---|---|
24 * 24 (576 cores, 16 nodes) | 10 days | 13:11 (791s) | 3.03 |
30 * 24 (720 cores, 20 nodes) | 10 days | 11:57 (717s) | 3.35 |
42 * 24 (1,008 cores, 28 nodes) | 10 days | 11:25 (685s) | 3.50 |
36 * 28 (1,008 cores, 28 nodes) | 10 days | 11:15 (675s) | 3.56 |
48 * 24 (1,152 cores, 32 nodes) | 10 days | 9:56 (596s) | 4.03 |

Cores per node | NMPPE*NMPPN (tasks, nodes) | Model run length | Total time | Speed (model years per day) |
---|---|---|---|---|
36 | 48 * 24 (1,152 tasks, 32 nodes) | 10 days | 9:56 (596s) | 4.03 |
32 | 48 * 24 (1,152 tasks, 36 nodes) | 10 days | 9:29 (569s) | 4.22 |
24 | 48 * 24 (1,152 tasks, 48 nodes) | 10 days | 9:19 (559s) | 4.29 |
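
The speed column in the tables above can be reproduced from the run length and the wall-clock time. The sketch below assumes a 360-day model year, which is an assumption on my part but matches the tabulated figures; the function name is just illustrative.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_MODEL_YEAR = 360  # assumption: 360-day model calendar


def model_years_per_day(run_length_days, wallclock_seconds):
    """Model years simulated per wall-clock day for one timed run."""
    runs_per_day = SECONDS_PER_DAY / wallclock_seconds
    return runs_per_day * run_length_days / DAYS_PER_MODEL_YEAR


# Wall-clock times (seconds) for the 10-day runs at 36, 32 and 24 cores per node
for seconds in (596, 569, 559):
    print(seconds, round(model_years_per_day(10, seconds), 2))
```

This only converts the measured wall-clock time; it says nothing about queue time or run-to-run variability.
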
We really need more than three measurements, but, together with some measurements on p35 of UKESM General II, these suggest that underpopulating to 32 cores per node is fairly beneficial, whereas dropping to 24 cores per node isn't significantly faster.
With 32 cores per node, we can run the most tasks possible at N96 (on one thread), which is 48*28.
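
One way to put numbers on the underpopulation trade-off is to compare each run with the fully populated (36 cores per node) baseline, looking at both the speedup and the relative cost in node-seconds. This is only a sketch using the three measurements above; the function name and the choice of node-seconds as the cost measure are my own assumptions.

```python
# (cores per node, nodes, wall-clock seconds), copied from the underpopulation table
RUNS = [(36, 32, 596), (32, 36, 569), (24, 48, 559)]


def compare_to_baseline(runs):
    """Return (cores per node, speedup, relative node-seconds) against the first run."""
    _, base_nodes, base_seconds = runs[0]
    rows = []
    for cores_per_node, nodes, seconds in runs:
        speedup = base_seconds / seconds
        relative_cost = (nodes * seconds) / (base_nodes * base_seconds)
        rows.append((cores_per_node, round(speedup, 3), round(relative_cost, 3)))
    return rows


for row in compare_to_baseline(RUNS):
    print(row)
```

With these three data points, going to 32 cores per node gives roughly a 5% speedup for about 7% more node-seconds, while 24 cores per node gives roughly 7% speedup for about 41% more node-seconds.
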
Cores per node | NMPPE*NMPPN (tasks, nodes) | Model run length | Total time | Speed (model years per day) |
---|---|---|---|---|
32 | 48 * 28 (1,344 tasks, 42 nodes) | 10 days | NaNs in error term in BiCGstab | n/a |

Cores per node | Threads | NMPPE*NMPPN (tasks) | Nodes | Model run length | Total time | Speed (model years per day) |
---|---|---|---|---|---|---|
32 | 1 | 48 * 24 (1,152 tasks) | 36 | 10 days | 9:29 (569s) | 4.22 |
36 | 2 | 36 * 18 (648 tasks) | 36 | 10 days | 9:13 (553s) | 4.35 |
36 | 2 | 30 * 24 (720 tasks) | 40 | 10 days | 8:57 (537s) | 4.47 |
36 | 2 | 36 * 24 (864 tasks) | 48 | 10 days | 8:01 (481s) | 4.99 |
32 | 2 | 48 * 24 (1,152 tasks) | 72 | 10 days | 7:26 (446s) | 5.38 |
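
The node counts in the table above follow from the decomposition, the thread count and the number of cores used per node. A minimal sketch, assuming each thread occupies one core (the function name is hypothetical):

```python
import math


def nodes_required(nmppe, nmppn, threads, cores_per_node):
    """Nodes needed for an nmppe x nmppn decomposition, one core per thread."""
    cores_needed = nmppe * nmppn * threads
    return math.ceil(cores_needed / cores_per_node)


# (NMPPE, NMPPN, threads, cores per node) for each row of the table
configs = [(48, 24, 1, 32), (36, 18, 2, 36), (30, 24, 2, 36),
           (36, 24, 2, 36), (48, 24, 2, 32)]
for cfg in configs:
    print(cfg, nodes_required(*cfg))
```
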
My timing tests suggest that the priorities for increasing speed are generally in the following order,
so when you start to max out on one criterion, switch to the next one.
I need to find a setup that runs at a reasonable speed but still uses resources fairly efficiently. I've copied GA7.1 UM10.7, u-al613, to u-am967 for these tests.
Nodes (ATM_PROCX*ATM_PROCY) | Threads | Time for one month | Core hours per year | Speed (model years/day) |
---|---|---|---|---|
6 (18*12) | 1 | 1:14:51 (4,491s) | 3,240 | 1.60 |
10 (20*18) | 1 | 49:02 (2,942s) | 3,527 | 2.45 |
10 (20*16)* | 1 | 59:16 (3,556s) | 4,277 | 2.02 |
10 (18*10) | 2 | 57:27 (3,447s) | 4,134 | 2.09 |
14 (16*28)* | 1 | 43:49 (2,629s) | 4,415 | 2.74 |
14 (18*28) | 1 | 46:05 (2,765s) | 4,652 | 2.60 |
14 (18*14) | 2 | 45:45 (2,745s) | 4,617 | 2.62 |
28 (18*28) | 2 | 30:28 (1,828s) | 6,140 | 3.94 |
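
The last two columns can be approximately reconstructed from the node count and the time for one model month. The sketch below assumes every node is charged as 36 cores (even when underpopulated) and that a model year is 12 times the one-month timing; both are assumptions, though they reproduce the tabulated core hours to within about 1%.

```python
CORES_PER_NODE = 36        # assumption: full Broadwell node charged
MONTHS_PER_YEAR = 12
SECONDS_PER_DAY = 86_400


def core_hours_per_year(nodes, seconds_per_month):
    """Core hours to run one model year, scaled up from a one-month timing."""
    seconds_per_year = seconds_per_month * MONTHS_PER_YEAR
    return nodes * CORES_PER_NODE * seconds_per_year / 3600


def model_years_per_day(seconds_per_month):
    """Model years per wall-clock day, scaled up from a one-month timing."""
    return SECONDS_PER_DAY / (seconds_per_month * MONTHS_PER_YEAR)


# (nodes, seconds for one month) for the smallest and largest setups
for nodes, seconds in [(6, 4491), (28, 1828)]:
    print(nodes, round(core_hours_per_year(nodes, seconds)),
          round(model_years_per_day(seconds), 2))
```

The small differences from the table are consistent with the tabulated values having been computed from rounded times.
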