Timing tests for offline oxidants

N96: testing the number of tasks

I'm using u-ac283 for these tests, all of which were run on Broadwell.

NMPPE*NMPPN (cores, nodes)        Model run length  Total time    Speed (model years per day)
24 * 24 (576 cores, 16 nodes)     10 days           13:11 (791s)  3.03
30 * 24 (720 cores, 20 nodes)     10 days           11:57 (717s)  3.35
42 * 24 (1,008 cores, 28 nodes)   10 days           11:25 (685s)  3.50
36 * 28 (1,008 cores, 28 nodes)   10 days           11:15 (675s)  3.56
48 * 24 (1,152 cores, 32 nodes)   10 days            9:56 (596s)  4.03
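
The speed column can be reproduced from the run length and the wall-clock time. This is a minimal sketch, assuming a 360-day model calendar (an assumption on my part, but it is the one that matches the numbers in the table); the function name is only for illustration.

```python
SECONDS_PER_DAY = 86_400
MODEL_DAYS_PER_YEAR = 360  # assumption: 360-day calendar, which matches the table


def model_years_per_day(run_length_days, wallclock_seconds):
    """Model years simulated per wall-clock day."""
    model_years = run_length_days / MODEL_DAYS_PER_YEAR
    wallclock_days = wallclock_seconds / SECONDS_PER_DAY
    return model_years / wallclock_days


# 24 * 24 tasks: 10 model days in 791s
print(round(model_years_per_day(10, 791), 2))  # 3.03
# 48 * 24 tasks: 10 model days in 596s
print(round(model_years_per_day(10, 596), 2))  # 4.03
```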

N96: underpopulation

Cores per node  NMPPE*NMPPN (tasks, nodes)        Model run length  Total time   Speed (model years per day)
36              48 * 24 (1,152 tasks, 32 nodes)   10 days           9:56 (596s)  4.03
32              48 * 24 (1,152 tasks, 36 nodes)   10 days           9:29 (569s)  4.22
24              48 * 24 (1,152 tasks, 48 nodes)   10 days           9:19 (559s)  4.29

We really need more than three measurements, but - along with some measurements on p35 of UKESM General II - it looks like underpopulating to 32 cores per node is fairly beneficial, while dropping further to 24 cores per node isn't significantly faster.
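
To put numbers on that trade-off, here is a small sketch comparing the change in wall-clock time with the change in node count between successive rows of the underpopulation table. The tuples are copied from the table; the helper name is just illustrative.

```python
# (cores per node, nodes, wall-clock seconds) from the underpopulation table
runs = [(36, 32, 596), (32, 36, 569), (24, 48, 559)]


def percent_change(old, new):
    """Percentage change from old to new (negative means a decrease)."""
    return 100.0 * (new - old) / old


for (cpn_a, nodes_a, t_a), (cpn_b, nodes_b, t_b) in zip(runs, runs[1:]):
    print(f"{cpn_a} -> {cpn_b} cores per node: "
          f"time {percent_change(t_a, t_b):+.1f}%, "
          f"nodes {percent_change(nodes_a, nodes_b):+.1f}%")
```

Going from 36 to 32 cores per node cuts the time by about 4.5% for 12.5% more nodes, whereas going from 32 to 24 only gains about 1.8% for a third more nodes.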

With 32 cores per node, we can run the most tasks possible at N96 (on one thread), which is 48*28.

Cores per node  NMPPE*NMPPN (tasks, nodes)        Model run length  Total time / Speed (model years per day)
32              48 * 28 (1,344 tasks, 42 nodes)   10 days           Failed: NaNs in error term in BiCGstab

N96: varying threads and underpopulating

Cores per node  Threads  NMPPE*NMPPN (tasks)     Nodes  Model run length  Total time   Speed (model years per day)
32              1        48 * 24 (1,152 tasks)   36     10 days           9:29 (569s)  4.22
36              2        36 * 18 (648 tasks)     36     10 days           9:13 (553s)  4.35
36              2        30 * 24 (720 tasks)     40     10 days           8:57 (537s)  4.47
36              2        36 * 24 (864 tasks)     48     10 days           8:01 (481s)  4.99
32              2        48 * 24 (1,152 tasks)   72     10 days           7:26 (446s)  5.38
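
The node counts in this table follow from the task decomposition, the thread count and the cores used per node. A rough sketch, assuming each thread occupies one core and that "cores per node" means the cores actually used on each node (the function name is hypothetical):

```python
import math


def nodes_needed(nmppe, nmppn, threads, cores_per_node):
    """Nodes required when each MPI task uses `threads` cores."""
    cores = nmppe * nmppn * threads
    return math.ceil(cores / cores_per_node)


# (NMPPE, NMPPN, threads, cores per node) for each row of the table
rows = [
    (48, 24, 1, 32),
    (36, 18, 2, 36),
    (30, 24, 2, 36),
    (36, 24, 2, 36),
    (48, 24, 2, 32),
]
for nmppe, nmppn, threads, cpn in rows:
    print(nmppe, nmppn, threads, cpn, nodes_needed(nmppe, nmppn, threads, cpn))
```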

Priorities for speed-up

My timing tests suggest that the priorities for increasing speed are generally in the following order:

  1. more tasks
  2. more threads
  3. underpopulate

So when you start to max out on one criterion, switch to the next one.

Finding an efficient setup

I need to find a setup that runs at a reasonable speed but still uses resources fairly efficiently. I've copied GA7.1 UM10.7, u-al613, to u-am967 for these tests.

Nodes (ATM_PROCX*ATM_PROCY)  Threads  Time for one month  Core hours per year  Speed (model years/day)
6 (18*12)                    1        1:14:51 (4,491s)    3,240                1.60
10 (20*18)                   1        49:02 (2,942s)      3,527                2.45
10 (20*16)*                  1        59:16 (3,556s)      4,277                2.02
10 (18*10)                   2        57:27 (3,447s)      4,134                2.09
14 (16*28)*                  1        43:49 (2,629s)      4,415                2.74
14 (18*28)                   1        46:05 (2,765s)      4,652                2.60
14 (18*14)                   2        45:45 (2,745s)      4,617                2.62
28 (18*28)                   2        30:28 (1,828s)      6,140                3.94
* These runs had ATM_PPN=32
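
The core hours column appears to be the node count times 36 cores times 24 hours, divided by the (rounded) speed. The sketch below assumes that whole 36-core nodes are charged even when ATM_PPN=32, which is what the starred rows are consistent with; the constant and function names are mine.

```python
CORES_PER_NODE = 36  # assumption: the full node is charged, even when ATM_PPN=32
HOURS_PER_DAY = 24


def core_hours_per_model_year(nodes, model_years_per_day):
    """Core hours consumed per model year at a given speed."""
    return nodes * CORES_PER_NODE * HOURS_PER_DAY / model_years_per_day


# 6 nodes at 1.60 model years/day
print(round(core_hours_per_model_year(6, 1.60)))   # 3240
# 10 nodes with ATM_PPN=32 at 2.02 model years/day
print(round(core_hours_per_model_year(10, 2.02)))  # 4277
# 28 nodes at 3.94 model years/day
print(round(core_hours_per_model_year(28, 3.94)))  # 6140
```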

  • (ATM_PROCX,ATM_PROCY)=(20,18)
  • OMPTHR_ATM=1
  • IOS_NPROC=0 (I don't think the I/O server will be beneficial at these speeds, but I should test when I've got more time)
  • Total nodes=10
  • Three one-month cycles, which took 50:04 (3,004s), 49:10 (2,950s) and 50:08 (3,008s)
  • This is an average time of 2,987s (49:47)
  • A speed of 2.41 model years/day
  • And 3,585 core hours/model year