Timings for full chemistry

N96

These timings are from job u-ae884, run on broadwell nodes with safe optimisation.

NMPPE*NMPPN (total tasks, nodes) | Model run length | Total time     | Speed (model years per day)
---------------------------------+------------------+----------------+----------------------------
24 * 24 (576 tasks, 16 nodes)    | 10 days          | 23:39 (1,419s) | 1.69
36 * 20 (720 tasks, 20 nodes)    | 10 days          | 24:10 (1,450s) | 1.66
36 * 24 (864 tasks, 24 nodes)*   | 10 days          | 20:31 (1,231s) | 1.95
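
The speed column can be reproduced from the wall-clock times with a short sketch like the one below. It assumes a 360-day model calendar, which is not stated above but is what makes the quoted values come out; the function and constant names are purely illustrative.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_MODEL_YEAR = 360  # assumption: 360-day model calendar


def model_years_per_day(wallclock_s, model_days):
    """Model years simulated per wall-clock day for a run of `model_days`."""
    model_days_per_day = model_days * SECONDS_PER_DAY / wallclock_s
    return model_days_per_day / DAYS_PER_MODEL_YEAR


# Wall-clock seconds for the three 10-day runs in the table above
runs = [("24 * 24", 1419), ("36 * 20", 1450), ("36 * 24", 1231)]
for decomposition, seconds in runs:
    speed = model_years_per_day(seconds, model_days=10)
    print(f"{decomposition}: {speed:.2f} model years/day")
```

Rounded to two decimal places this gives 1.69, 1.66 and 1.95, matching the table.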

Trying to understand why full chemistry is slower than the runtimes we had before

My previous fullChemN96 estimates were based on mi-ah651 at UM10.2, which was run on the haswell nodes.

Main features            | NMPPE*NMPPN (total tasks, nodes)         | Model run length | Total time     | Speed (model years per day)
-------------------------+------------------------------------------+------------------+----------------+----------------------------
u-ae884, broadwell, safe | 24 * 24 (576 tasks, 16 broadwell nodes)  | 10 days          | 23:39 (1,419s) | 1.69
u-ae884, haswell, safe   | 24 * 24 (576 tasks, 18 haswell nodes)    | 10 days          | 22:45 (1,365s) | 1.76
u-ae884, haswell, high   | 24 * 24 (576 tasks, 18 haswell nodes)    | 10 days          | 20:56 (1,256s) | 1.91
mi-ah651, haswell, safe  | 24 * 24 (576 tasks, 18 haswell nodes)    | 10 days          | 20:34 (1,234s) | 1.94
mi-ah651, haswell, high  | 24 * 24 (576 tasks, 18 haswell nodes)    | 10 days          | 17:20 (1,040s) | 2.31

Clearly my previous full chemistry run at UM10.2 with optimisation `high' is about 31% quicker than my current full chemistry run at UM10.4 with optimisation `safe' (1,040s against 1,365s on the haswell nodes).

Looking at the profiling, it seems that UM10.4 is slower than my previous UM10.2 runs for three reasons:

  • COSP_MAIN is called in UM10.4 and not UM10.2 (it's called from ATMOS_PHYSICS1 and looks to be worth about 100s - adding about 6% to runtime).
  • Extra STASH from UM10.4 job.
  • Difference in optimisation. Compared to `safe':
    • going to `high' optimisation increased the speed of UM10.4 by about 9%
    • going to `high' optimisation increased the speed of UM10.2 by about 19% (this is most of the 31% difference).
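
The percentages quoted here can be checked against the haswell timings in the table with a small sketch. The helper name and the dictionary layout are illustrative, not part of any job configuration.

```python
def speedup_percent(slow_s, fast_s):
    """Percentage increase in speed when run time drops from slow_s to fast_s."""
    return (slow_s / fast_s - 1.0) * 100.0


# Haswell wall-clock times (seconds) for 10 model days, from the table
times = {
    ("UM10.4", "safe"): 1365,  # u-ae884
    ("UM10.4", "high"): 1256,  # u-ae884
    ("UM10.2", "safe"): 1234,  # mi-ah651
    ("UM10.2", "high"): 1040,  # mi-ah651
}

um104_gain = speedup_percent(times[("UM10.4", "safe")], times[("UM10.4", "high")])
um102_gain = speedup_percent(times[("UM10.2", "safe")], times[("UM10.2", "high")])
overall = speedup_percent(times[("UM10.4", "safe")], times[("UM10.2", "high")])

print(f"UM10.4 safe -> high: {um104_gain:.0f}%")
print(f"UM10.2 safe -> high: {um102_gain:.0f}%")
print(f"UM10.4 safe -> UM10.2 high: {overall:.0f}%")
```

To the nearest percent this gives 9%, 19% and 31% respectively.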

Finding an efficient setup

I need to find a setup that runs at a reasonable speed but still uses resources fairly efficiently. I've copied the GA7.1 + StratTrop UM10.7 job, u-ak990, to u-am972 for these tests.

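Nodes (ATM_PROCX*ATM_PROCY) | Threads | Time for one month | Core hours/model year | Speed (model years/day)
----------------------------+---------+--------------------+-----------------------+------------------------
8 (18*16)                   | 1       | 1:55:28 (6,928s)   | 6,646                 | 1.04
12 (18*24)*                 | 1       | 1:23:19 (4,999s)   | 7,200                 | 1.44
12 (18*24)                  | 1       | 1:27:11 (5,231s)   | 7,513                 | 1.38
18 (36*18)                  | 1       | 1:11:35 (4,295s)   | 9,257                 | 1.68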
* This run didn't have the Luke optimisation branch to the chemistry solver.
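
The last two columns of the table can be reproduced with the sketch below. It assumes 36 cores per node and twelve one-month cycles per model year; neither is stated explicitly, but both are consistent with the quoted values. The core hours appear to have been computed from the speed after rounding it to two decimal places, so the sketch does the same. All names are illustrative.

```python
CORES_PER_NODE = 36    # assumption: 36 cores per node
MONTHS_PER_YEAR = 12   # assumption: twelve one-month cycles per model year
SECONDS_PER_DAY = 86_400


def speed_from_month(month_s):
    """Model years per wall-clock day, rounded to 2 dp as in the table."""
    return round(SECONDS_PER_DAY / (month_s * MONTHS_PER_YEAR), 2)


def core_hours_per_model_year(nodes, speed):
    """Core hours used per model year at `speed` model years/day."""
    return 24 * CORES_PER_NODE * nodes / speed


# (nodes, wall-clock seconds for one model month), in table order
setups = [(8, 6928), (12, 4999), (12, 5231), (18, 4295)]
results = []
for nodes, month_s in setups:
    speed = speed_from_month(month_s)
    cost = round(core_hours_per_model_year(nodes, speed))
    results.append((nodes, speed, cost))
    print(f"{nodes:2d} nodes: {speed:.2f} model years/day, {cost:,} core hours/model year")
```

Going from 8 to 18 nodes raises the speed from 1.04 to 1.68 model years/day while the cost per model year rises from 6,646 to 9,257 core hours, which is the trade-off discussed next.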

It looks like we can drop the speed and get more efficiency, but do we really want to run at less than about 1.4 model years/day? I suspect not, so I'll stick with 12 nodes for now and the following settings.

  • (ATM_PROCX,ATM_PROCY)=(24,18)
  • OMPTHR_ATM=1
  • IOS_NPROC=0 (I don't think the I/O server will be beneficial at these speeds, but I should test when I've got more time)
  • Total nodes=12
  • Three one-month cycles: the first has no timing (PBS epilogue failed); the other two took 1:21:27 (4,887s) and 1:18:02 (4,682s)
  • This is an average of 4,785s (1:19:45)
  • A speed of 1.50 model years/day
  • This is 24 * 36 * 12 / 1.50 = 6,912 core hours/model year
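
The arithmetic in the last few bullets can be retraced from the two recorded cycle times, as a rough sketch. It assumes twelve one-month cycles per model year and 36 cores per node (the 36 in the formula above); the helper names are illustrative. The mean is rounded half up to match the 4,785s quoted above, since Python's built-in `round` would give 4,784 for 4,784.5.

```python
def to_seconds(hms):
    """Convert an 'h:mm:ss' string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds):
    """Convert whole seconds to an 'h:mm:ss' string."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


cycle_times = ["1:21:27", "1:18:02"]  # the two cycles with recorded timings
mean_s = int(sum(to_seconds(t) for t in cycle_times) / len(cycle_times) + 0.5)

nodes, cores_per_node = 12, 36         # cores_per_node is an assumption
speed = round(86_400 / (mean_s * 12), 2)  # 12 one-month cycles per model year
core_hours = 24 * cores_per_node * nodes / speed

print(f"mean cycle time: {mean_s}s ({to_hms(mean_s)})")
print(f"speed: {speed:.2f} model years/day")
print(f"cost: {core_hours:,.0f} core hours/model year")
```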