Timing tests for offline oxidants

N96: testing the number of tasks

I'm using u-ac283 for these tests. All runs are on Broadwell.

NMPPE*NMPPN (cores, nodes)	Model run length	Total time	Speed (model years per day)
24 * 24 (576 cores, 16 nodes)	10 days	13:11 (791s)	3.03
30 * 24 (720 cores, 20 nodes)	10 days	11:57 (717s)	3.35
42 * 24 (1,008 cores, 28 nodes)	10 days	11:25 (685s)	3.50
36 * 28 (1,008 cores, 28 nodes)	10 days	11:15 (675s)	3.56
48 * 24 (1,152 cores, 32 nodes)	10 days	9:56 (596s)	4.03
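
The speed column appears to follow directly from the wall-clock time. A minimal sketch of that arithmetic, assuming a 360-day model calendar (an assumption on my part, though it reproduces the values in the table above); the function and variable names are illustrative only:

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_MODEL_YEAR = 360  # assumption: 360-day model calendar


def model_years_per_day(run_days: float, wallclock_seconds: float) -> float:
    """Model years simulated per wall-clock day."""
    model_years = run_days / DAYS_PER_MODEL_YEAR
    wallclock_days = wallclock_seconds / SECONDS_PER_DAY
    return model_years / wallclock_days


# Wall-clock seconds from the table above, keyed by NMPPE*NMPPN.
timings = {"24*24": 791, "30*24": 717, "42*24": 685, "36*28": 675, "48*24": 596}
speeds = {k: round(model_years_per_day(10, s), 2) for k, s in timings.items()}
for decomposition, speed in speeds.items():
    print(f"{decomposition}: {speed:.2f} model years/day")
```

For a 10-day run this reduces to 2400 divided by the wall-clock time in seconds.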

N96: underpopulation

Cores per node	NMPPE*NMPPN (tasks, nodes)	Model run length	Total time	Speed (model years per day)
36	48 * 24 (1,152 tasks, 32 nodes)	10 days	9:56 (596s)	4.03
32	48 * 24 (1,152 tasks, 36 nodes)	10 days	9:29 (569s)	4.22
24	48 * 24 (1,152 tasks, 48 nodes)	10 days	9:19 (559s)	4.29

We really need more than three measurements, but, together with some measurements on p35 of UKESM General II, these suggest that underpopulating to 32 cores per node is fairly beneficial, while dropping to 24 cores per node isn't significantly faster.

With 32 cores per node, we can run the most tasks possible at N96 (on one thread), which is 48*28.

Cores per node	NMPPE*NMPPN (tasks, nodes)	Model run length	Total time	Speed (model years per day)
32	48 * 28 (1,344 tasks, 42 nodes)	10 days	Failed (NaNs in error term in BiCGstab)	n/a

N96: varying threads and underpopulating

Cores per node	Threads	NMPPE*NMPPN (tasks)	Nodes	Model run length	Total time	Speed (model years per day)
32	1	48 * 24 (1,152 tasks)	36	10 days	9:29 (569s)	4.22
36	2	36 * 18 (648 tasks)	36	10 days	9:13 (553s)	4.35
36	2	30 * 24 (720 tasks)	40	10 days	8:57 (537s)	4.47
36	2	36 * 24 (864 tasks)	48	10 days	8:01 (481s)	4.99
32	2	48 * 24 (1,152 tasks)	72	10 days	7:26 (446s)	5.38
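
The node counts in the table above are consistent with each task occupying one core per thread, and each node hosting only the stated number of cores. A small sketch of that bookkeeping, with a hypothetical helper name and the row values copied from the table:

```python
import math


def nodes_needed(tasks: int, threads: int, cores_per_node: int) -> int:
    """Nodes required when each task uses `threads` cores and only
    `cores_per_node` cores are occupied on every node."""
    return math.ceil(tasks * threads / cores_per_node)


# (cores per node, threads, NMPPE, NMPPN) for each row of the table above.
rows = [
    (32, 1, 48, 24),
    (36, 2, 36, 18),
    (36, 2, 30, 24),
    (36, 2, 36, 24),
    (32, 2, 48, 24),
]
node_counts = [nodes_needed(e * n, t, c) for c, t, e, n in rows]
print(node_counts)  # [36, 36, 40, 48, 72]
```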

Priorities for speed-up

My timing tests suggest that the priorities for increasing speed are generally, in order:

more tasks
more threads
underpopulate

So when you start to max out on one criterion, switch to the next one.

Finding an efficient setup

I need to find a setup which runs at a reasonable speed, but still uses resources fairly efficiently. I've copied GA7.1 UM10.7, u-al613, to u-am967 for these tests.

Nodes (ATM_PROCX*ATM_PROCY)	Threads	Time for one month	Core hours per year	Speed (model years/day)
6 (18*12)	1	1:14:51 (4,491s)	3,240	1.60
10 (20*18)	1	49:02 (2,942s)	3,527	2.45
10 (20*16)^	1	59:16 (3,556s)	4,277	2.02
10 (18*10)	2	57:27 (3,447s)	4,134	2.09
14 (16*28)^	1	43:49 (2,629s)	4,415	2.74
14 (18*28)	1	46:05 (2,765s)	4,652	2.60
14 (18*14)	2	45:45 (2,745s)	4,617	2.62
28 (18*28)	2	30:28 (1,828s)	6,140	3.94

^ This had ATM_PPN=32
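
The core-hours column can be reproduced from the node count and the (rounded) speed. A rough sketch, assuming that whole 36-core nodes are charged even for the runs with ATM_PPN=32; that assumption is mine, but it matches the tabulated values, and the names below are illustrative:

```python
CORES_PER_NODE = 36  # assumption: whole 36-core Broadwell nodes are charged


def core_hours_per_year(nodes: int, speed_years_per_day: float) -> float:
    """Core hours charged to simulate one model year."""
    wallclock_hours_per_model_year = 24 / speed_years_per_day
    return nodes * CORES_PER_NODE * wallclock_hours_per_model_year


# (nodes, speed in model years/day) pairs taken from the table above.
examples = [(6, 1.60), (10, 2.45), (14, 2.74), (28, 3.94)]
costs = [round(core_hours_per_year(n, s)) for n, s in examples]
print(costs)  # [3240, 3527, 4415, 6140]
```

Going from 6 to 28 nodes raises the speed by a factor of about 2.5 while roughly doubling the cost per model year, which is the trade-off this table is probing.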