Attempting to speed up Till's N96 eORCA1 GC3 suite

Till's original configuration

Till's configuration is as follows (see his e-mail from 9/3/16):

  • Suite is u-ac043
  • Running on Broadwell
  • ATMOS (offOxN96): 24*24*1 (16 nodes)
    • steps_per_periodim=72, so this is a 20 minute timestep
  • Ocean: 15*12*1 (5 nodes)
    • CICE_BLKX=24; (15*24=360=CICE_COL)
    • CICE_BLKY=28; (12*28=336>CICE_ROW=330)
  • XIOS (detached): 8 (1 node)
  • Total: 22 nodes or 792 cores
  • It takes 13 minutes (780s) to run 10 days (an average over several 10 day periods)
  • I've calculated this as 3.08 model years/day
  • This is 172 node hours for one model year.

I've run this job twice for 10 days, and it took 13:34 (814s) and 13:11 (791s). I'm running each timing test twice for every change, because the variation in run time for exactly the same configuration is large, and a single timing test can be very misleading.
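The throughput figures quoted for Till's configuration can be reproduced with a small sketch like the one below. It assumes a 360-day model calendar (this is my assumption, but it is what turns 780s per 10 days into 3.08 model years/day); the function names are only illustrative.

```python
SECONDS_PER_DAY = 86400
DAYS_PER_MODEL_YEAR = 360  # assumption: 360-day model calendar


def model_years_per_day(seconds_per_10_days):
    """Model years simulated per wall-clock day, given the time for 10 model days."""
    model_days_per_day = 10 * SECONDS_PER_DAY / seconds_per_10_days
    return model_days_per_day / DAYS_PER_MODEL_YEAR


def node_hours_per_model_year(seconds_per_10_days, nodes):
    """Node-hours needed to complete one model year on the given number of nodes."""
    hours_per_model_year = seconds_per_10_days * (DAYS_PER_MODEL_YEAR / 10) / 3600
    return nodes * hours_per_model_year


print(round(model_years_per_day(780), 2))         # 3.08
print(round(node_hours_per_model_year(780, 22)))  # 172
```

The same two functions can be applied to any of the 10-day timings recorded on this page, for example 814s and 791s for the two repeat runs.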

Analytical estimate

  • modePred.py predicts a speed of 3.221 model years/day for offOxN96 with 576 cores (16 broadwell nodes).
  • Assuming a 10% slowdown for coupling, this suggests 2.90 model years/day. Hence the measured 3.08 model years/day is somewhat faster than predicted.
  • I don't have nemoCiceOrca1 for Cray, so let's assume it is about 50% slower than IBM. Hence:
    • For 2 nodes: IBM is 21.2 model years/day - assume Cray is 21.2/1.5 = 14.1 model years/day
    • For 1 node: IBM is 11.5 model years/day - assume Cray is 11.5/1.5 = 7.67 model years/day.
  • 1 node should easily be enough to match the atmosphere at around 3 model years/day.
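The analytical estimate above is just two scalings, sketched here for convenience. The 10% coupling penalty and the factor of 1.5 between IBM and Cray are the assumptions stated in the list, not measurements, and the function names are made up for this sketch.

```python
def coupled_atmos_speed(standalone_speed, coupling_slowdown=0.10):
    """Atmosphere speed after applying an assumed fractional slowdown for coupling."""
    return standalone_speed * (1 - coupling_slowdown)


def cray_ocean_speed(ibm_speed, slowdown_factor=1.5):
    """Ocean speed on the Cray, assuming it is slowdown_factor times slower than IBM."""
    return ibm_speed / slowdown_factor


print(round(coupled_atmos_speed(3.221), 2))  # 2.9
print(round(cray_ocean_speed(21.2), 1))      # 14.1 (2 nodes)
print(round(cray_ocean_speed(11.5), 2))      # 7.67 (1 node)
```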

Modifications

I've copied the job to u-ac407, and I'll start making modifications

  1. Add in Paul Selwood's chunking branch, branches/dev/paulselwood/vn10.3_aero@17775 (subset of Mark Richardson's which didn't make it into UM10.4)
    • First 10 days have taken: 14:21 (861s) and 12:43 (763s).
  2. Attach XIOS (set XIOS_NPROC=0)
    • First 10 days have taken: 13:30 (810s) and 14:01 (841s).
  3. Reduce ocean nodes to 2
    • 8x9 for ocean
    • CICE_BLKX=45; (8*45=360=CICE_COL)
    • CICE_BLKY=37; (9*37=333>CICE_ROW=330)
    • First 10 days have taken: 12:57 (777s) and 14:10 (850s)
  4. Reduce ocean nodes to 1
    • 9x4 for ocean
    • CICE_BLKX=40; (9*40=360=CICE_COL)
    • CICE_BLKY=83; (4*83=332>CICE_ROW=330)
    • First 10 days have taken: 13:04 (784s) and 14:26 (866s)
  5. Move to 30 min timestep, set steps_per_periodim=48 (Maybe only for UKESM-lr)
    • First 10 days have taken:
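The CICE block sizes used for each ocean decomposition above are consistent with taking the smallest block that covers the grid in each direction. A minimal sketch of that rule, assuming CICE_COL=360 and CICE_ROW=330 as quoted (the rule itself is my inference from the numbers, and the function name is illustrative):

```python
def cice_block_sizes(nproc_x, nproc_y, cice_col=360, cice_row=330):
    """Smallest (CICE_BLKX, CICE_BLKY) such that the blocks cover the whole grid."""
    blk_x = -(-cice_col // nproc_x)  # ceiling division
    blk_y = -(-cice_row // nproc_y)
    return blk_x, blk_y


print(cice_block_sizes(15, 12))  # (24, 28)
print(cice_block_sizes(8, 9))    # (45, 37)
print(cice_block_sizes(9, 4))    # (40, 83)
```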

Moving to haswell cores

Conveniently, the 576 cores that Till has used on broadwell divide not only by 36 (16 broadwell nodes) but also by the 32 cores of a haswell node (18 nodes). If we assume that the ocean will still be sufficiently fast on 32 haswell cores not to slow things down (and it shouldn't by much), we can do a direct comparison.

I've created a new job for this, u-ac435, which is a copy of the job above. The configuration is:

  • ATMOS (offOxN96): 24*24 (18 haswell nodes)
    • steps_per_periodim=72, so this is a 20 minute timestep
  • Ocean: 8*4 (1 node)
    • CICE_BLKX=45; (8*45=360=CICE_COL)
    • CICE_BLKY=83; (4*83=332>CICE_ROW=330)
  • XIOS attached
  • Total: 19 nodes or 608 haswell cores
  • First 10 days have taken: 12:21 (741s) and 14:43 (883s). An average time of 812s (13:32).
  • The total time for a 2 month run is 4,750s, which is an average of 792s (13:12) per 10 days.

This suggests (calculations based on the 2 month average are shown in parentheses):

  • Speed of 2.96 model years/day (3.03 model years/day)
  • One model year completed in 8.11 hours (7.92 hours)
  • One model year needs 19 * 8.11 = 154 node-hours (150 node-hours)
  • One model year needs 32 * 154 = 4,928 core-hours (4,815 core-hours)

For ARCHER

  • ATMOS: 576 cores is 24 ARCHER nodes
  • OCEAN: one node might be enough, let's be safe and assume we need 2 nodes or 48 cores.
  • This is a total of 624 cores
  • If this were a Cray XC40 we would expect the same speed as above, so 624 * 8.11 = 5,060 core-hours (624 * 7.92 = 4,942 core-hours)
  • The latest information is that we need to multiply this by 2 for ARCHER, so 2 * 5,060 = 10,120 core-hours (9,884 core-hours)
  • Hence one year needs 10,120 * 15 = 151,800 AUs or 0.152 MAUs (0.148 MAUs)
  • Incredibly, in Colin's e-mail on 9/3/16 he says his estimate was 0.15 MAUs - almost exactly what we have here.
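The ARCHER costing above is a single chain of multiplications, sketched below. The factor of 2 for ARCHER and the 15 AUs per core-hour are the values quoted in the list; the function name and argument names are only illustrative.

```python
def archer_maus_per_model_year(cores, hours_per_model_year,
                               archer_slowdown=2.0, aus_per_core_hour=15):
    """Estimated ARCHER cost of one model year, in millions of AUs (MAUs)."""
    core_hours = cores * hours_per_model_year * archer_slowdown
    return core_hours * aus_per_core_hour / 1e6


print(round(archer_maus_per_model_year(624, 8.11), 3))  # 0.152
print(round(archer_maus_per_model_year(624, 7.92), 3))  # 0.148
```

The intermediate values differ very slightly from the hand calculation because nothing is rounded until the end, but the MAU figures agree to three decimal places.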

Running at about 2 model years/day

Running at about 3 model years/day might be considered a bit fast for optimum resource efficiency. Let's try running at about 2 model years/day. It's probably more accurate to scale the numbers above than to use my analytical prediction: we get 2.96 model years/day for 576 atmosphere cores, so we probably need about (2/3)*576=384 cores for 2 model years/day.

  • 384 cores is 12 haswell nodes
  • 24*16 seems a reasonable PE decomposition
  • First 10 days have taken: 16:14 (974s)
  • Two months have taken a total of 1:31:12 (5,472s), which is an average of 15:12 (912s) for 10 days.
  • 912s for 10 days is 2.63 model years/day (a lot faster than 2 model years/day)
  • One model year completed in 9.13 hours
  • One model year needs 13 * 9.13 = 119 node-hours
  • One model year needs 32 * 119 = 3,796 core-hours
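The core-count estimate for this experiment assumed that speed scales linearly with the number of atmosphere cores. The sketch below repeats that estimate and compares the linear prediction with the measured 2.63 model years/day; the function names are made up here, and all the numbers come from the lists above.

```python
def cores_for_target_speed(ref_cores, ref_speed, target_speed):
    """Cores needed for target_speed, assuming speed scales linearly with cores."""
    return ref_cores * target_speed / ref_speed


def linear_speed(ref_cores, ref_speed, cores):
    """Speed predicted for a core count under the same linear-scaling assumption."""
    return ref_speed * cores / ref_cores


estimate = cores_for_target_speed(576, 2.96, 2.0)  # about 389 cores
nodes = round(estimate / 32)                       # nearest whole number of haswell nodes
predicted = linear_speed(576, 2.96, nodes * 32)    # linear prediction for 384 cores
observed = 2.63                                    # measured on 12 atmosphere nodes

print(nodes, round(predicted, 2), round(observed / predicted, 2))  # 12 1.97 1.33
```

The measured speed is about a third higher than the linear prediction, which is consistent with the atmosphere scaling less than linearly between 384 and 576 cores.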

For ARCHER

  • ATMOS: 384 cores is 16 ARCHER nodes
  • OCEAN: one node is probably enough, so another 24 cores (I was pessimistic before, so I'll be optimistic here)
  • This is a total of 408 cores
  • If this were a Cray XC40 we would expect the same speed as above, so 408 * 9.13 = 3,725 core-hours
  • The latest information is that we need to multiply this by 2 for ARCHER, so 2 * 3,725 = 7,450 core-hours
  • Hence one year needs 7,450 * 15 = 111,750 AUs or 0.11 MAUs

Analytical estimate of UKESM-lr

Take what I think will be a typical setup

  • ATMOS: 512 cores looks a reasonable choice, and we know from running fullChemN96 with 30 min timestep that this runs at about 2.51 model years/day. Assume a 10% slow down for coupling and we have 2.26 model years/day.
  • OCEAN: we expect MEDUSA to slow the ORCA1 code by about 3.5 times, but 2 nodes (64 cores) should still be enough to achieve 2.51 model years/day
  • This is a total of 576 cores for 2.26 model years/day
  • One model year will need 10.62 hours
  • One model year will need 10.62 * 576 = 6,117 core-hours

For ARCHER

  • ATMOS: 512 doesn't divide by 24, so let's take 504 cores (21 nodes) and assume that a speed of 2.26 on XC40 architecture becomes 2.26 * 504/512 = 2.22 model years/day
  • OCEAN: two nodes, which is 48 cores, should still be enough
  • This is a total of 552 cores for 2.22 model years/day on XC40 architecture
  • One model year needs 10.81 hours
  • One model year needs 10.81 * 552 = 5,967 core-hours
  • The latest information is that we need to multiply this by 2 for XC30 architecture (ARCHER), so 2 * 5,967 = 11,934 core-hours
  • Hence one year needs 11,934 * 15 = 179,010 AUs or 0.18 MAUs

Creating something similar to UKESM-lr

Can I create something similar to UKESM-lr, so that I can time it? Yongming has u-ab330, which has

  • StratTrop
  • Dust

but is missing

  • JULES land surface (adding Nitrogen and Carbon cycle)
  • TRIFFID (carbon cycle)
  • MEDUSA

Can I swap the ORCA025 ocean in here for Till's ORCA1?