Attempting to speed up Till's N96 eORCA1 GC3 suite
Till's original configuration
Till's configuration is (see his e-mail from 9/3/16)
- Suite is u-ac043
- Running on Broadwell
- ATMOS (offOxN96): 24*24*1 (16 nodes)
- steps_per_periodim=72, so this is a 20 minute timestep
- Ocean: 15*12*1 (5 nodes)
- CICE_BLKX=24; (15*24=360=CICE_COL)
- CICE_BLKY=28; (12*28=336>CICE_ROW=330)
- XIOS (detached): 8 (1 node)
- Total: 22 nodes or 792 cores
- It takes 13 minutes (780s) to run 10 days (an average over several
10 day periods)
- I've calculated this as 3.08 model years/day
- This is 172 node hours for one model year.
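The conversion from a 10-day wall-clock time to model years/day and node-hours can be sketched as below. This is a minimal sketch, not part of the suite: the function name `throughput` is made up for illustration, and the 360-day model year is an assumption inferred from the figures above (780s per 10 days only gives 3.08 model years/day with a 360-day calendar).

```python
SECONDS_PER_DAY = 86400
DAYS_PER_MODEL_YEAR = 360  # assumption: 360-day calendar, inferred from the quoted figures


def throughput(seconds_per_chunk, chunk_days=10, nodes=1):
    """Return (model years per wall-clock day, hours per model year,
    node-hours per model year) for a measured chunk time."""
    seconds_per_year = seconds_per_chunk * DAYS_PER_MODEL_YEAR / chunk_days
    hours_per_year = seconds_per_year / 3600
    years_per_day = SECONDS_PER_DAY / seconds_per_year
    return years_per_day, hours_per_year, nodes * hours_per_year


if __name__ == "__main__":
    # Till's configuration: 780s per 10 model days on 22 nodes
    ypd, hours, node_hours = throughput(780, nodes=22)
    print(f"{ypd:.2f} model years/day, {hours:.2f} h/model year, "
          f"{node_hours:.0f} node-hours/model year")
```

With 780s and 22 nodes this reproduces the 3.08 model years/day and 172 node-hours quoted above.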
I've run this job for 10 days and it's taken 13:34 (814s) and
13:11 (791s). I'm running timing tests twice for each change,
because the variation in run time for exactly the same configuration is
large, and one timing test can be very misleading.
Analytical estimate
- modePred.py predicts a speed of 3.221 model years/day for offOxN96
with 576 cores (16 broadwell nodes).
- Assuming a 10% slowdown for coupling, this suggests 2.90 model
years/day. Hence, 3.08 model years/day is quite a bit faster than
this.
- I don't have nemoCiceOrca1 for Cray - let's assume it is about 50%
slower than IBM. Hence:
- For 2 nodes: IBM is 21.2 model years/day - assume Cray is
21.2/1.5 = 14.1 model years/day
- For 1 node: IBM is 11.5 model years/day - assume Cray is
11.5/1.5 = 7.67 model years/day.
One node should easily be enough to match the atmosphere at around 3
model years/day.
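The arithmetic behind this estimate is small enough to write down. The sketch below only restates the assumptions made above (a 10% coupling slowdown, and "50% slower" taken to mean dividing the IBM speed by 1.5, as in the bullets); the function names are made up for illustration and have nothing to do with modePred.py itself.

```python
def coupled_speed(standalone_speed, coupling_slowdown=0.10):
    """Atmosphere speed after the assumed coupling penalty."""
    return standalone_speed * (1 - coupling_slowdown)


def cray_from_ibm(ibm_speed, slowdown_factor=1.5):
    """Ocean speed on Cray, assuming it is 'about 50% slower' than IBM,
    interpreted as IBM speed divided by 1.5."""
    return ibm_speed / slowdown_factor


if __name__ == "__main__":
    atmos = coupled_speed(3.221)      # predicted offOxN96 speed on 576 cores
    ocean_2_nodes = cray_from_ibm(21.2)
    ocean_1_node = cray_from_ibm(11.5)
    print(f"atmosphere: {atmos:.2f} model years/day")
    print(f"ocean, 2 nodes: {ocean_2_nodes:.1f} model years/day")
    print(f"ocean, 1 node: {ocean_1_node:.2f} model years/day")
    # The ocean only holds the atmosphere up if it is the slower component.
    print("1 ocean node keeps up:", ocean_1_node > atmos)
```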
Modifications
I've copied the job to u-ac407, and I'll start making modifications:
- Add in Paul Selwood's chunking branch,
branches/dev/paulselwood/vn10.3_aero@17775 (subset of Mark
Richardson's which didn't make it into UM10.4)
- First 10 days have taken: 14:21 (861s) and 12:43 (763s).
- Attach XIOS (set XIOS_NPROC=0)
- First 10 days have taken: 13:30 (810s) and 14:01 (841s).
- Reduce ocean nodes to 2
- 8x9 for ocean
- CICE_BLKX=45; (8*45=360=CICE_COL)
- CICE_BLKY=37; (9*37=333>CICE_ROW=330)
- First 10 days have taken: 12:57 (777s) and 14:10 (850s)
- Reduce ocean nodes to 1
- 9x4 for ocean
- CICE_BLKX=40; (9*40=360=CICE_COL)
- CICE_BLKY=83; (4*83=332>CICE_ROW=330)
- First 10 days have taken: 13:04 (784s) and 14:26 (866s)
- Move to a 30 minute timestep by setting steps_per_periodim=48
(maybe only for UKESM-lr)
- First 10 days have taken:
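The CICE_BLKX and CICE_BLKY values used for each ocean decomposition above all match the grid size divided by the processor count, rounded up. A minimal sketch of that rule follows; it is inferred from the numbers on this page rather than taken from the CICE documentation, and the function name is made up for illustration.

```python
CICE_COL, CICE_ROW = 360, 330  # grid size quoted in the notes


def cice_block_sizes(nproc_x, nproc_y, ncol=CICE_COL, nrow=CICE_ROW):
    """Smallest block sizes such that nproc_x * blkx >= ncol and
    nproc_y * blky >= nrow (integer ceiling division)."""
    blkx = -(-ncol // nproc_x)
    blky = -(-nrow // nproc_y)
    return blkx, blky


if __name__ == "__main__":
    for nx, ny in [(15, 12), (8, 9), (9, 4)]:
        blkx, blky = cice_block_sizes(nx, ny)
        print(f"{nx}x{ny}: CICE_BLKX={blkx}, CICE_BLKY={blky}")
```

The three decompositions tried so far (15x12, 8x9 and 9x4) give 24/28, 45/37 and 40/83, the same values as listed above.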
Moving to haswell cores
Conveniently, the 576 cores that Till has used on the broadwell
nodes divide not only by 36 (16 broadwell nodes) but also by
the 32 cores of a haswell node (18 nodes). If we assume that the ocean
will still be sufficiently fast on 32 haswell cores not to
slow things down (and it shouldn't much), we can do a direct
comparison.
I've created a new job for this, u-ac435, which is a copy of the
job above. The configuration is:
- ATMOS (offOxN96): 24*24 (18 haswell nodes)
- steps_per_periodim=72, so this is a 20 minute timestep
- Ocean: 8*4 (1 node)
- CICE_BLKX=45; (8*45=360=CICE_COL)
- CICE_BLKY=83; (4*83=332>CICE_ROW=330)
- XIOS attached
- Total: 19 nodes or 608 haswell cores
- First 10 days have taken: 12:21 (741s) and 14:43 (883s).
An average time of 812s (13:32).
- The total time for a 2 month run is 4,750s, which is an
average of 792s (13:12) per 10 days.
This suggests (calculations based on the 2 month average are
shown in parentheses):
- Speed of 2.96 model years/day (3.03 model years/day)
- One model year completed in 8.11 hours (7.92 hours)
- One model year needs 19 * 8.11 = 154 node-hours (150 node-hours)
- One model year needs 32 * 154 = 4,928 core-hours
(4,815 core-hours)
For ARCHER
- ATMOS: 576 cores is 24 ARCHER nodes
- OCEAN: one node might be enough, let's be safe and assume we
need 2 nodes or 48 cores.
- This is a total of 624 cores
- If this was a cray we expect the same speed as above, so
624 * 8.11 = 5,060 core-hours (624 * 7.92 = 4,942 core-hours)
- The latest information is that we need to multiply this by 2
for ARCHER, so 2 * 5,060 = 10,120 core-hours (9,884 core-hours)
- Hence one year needs 10,120 * 15 = 151,800 AUs or 0.152 MAUs
(0.148 MAUs)
- Incredibly, in Colin's e-mail on 9/3/16 he says his estimate
was 0.15 MAUs - almost exactly what we have here.
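The ARCHER cost estimate chains three factors together, and a short sketch makes it easy to redo for other core counts. The factor of 2 for ARCHER and the 15 AUs per core-hour are simply the values used in the bullets above; the function name is made up for illustration.

```python
ARCHER_CORES_PER_NODE = 24  # from the notes: 576 cores is 24 ARCHER nodes
ARCHER_SLOWDOWN = 2         # "multiply this by 2 for ARCHER"
AUS_PER_CORE_HOUR = 15      # conversion factor used in the notes


def archer_cost_maus(cores, hours_per_model_year):
    """Estimated ARCHER cost of one model year, in MAUs."""
    core_hours = cores * hours_per_model_year * ARCHER_SLOWDOWN
    return core_hours * AUS_PER_CORE_HOUR / 1e6


if __name__ == "__main__":
    cores = 576 + 2 * ARCHER_CORES_PER_NODE  # atmosphere plus two ocean nodes
    for hours in (8.11, 7.92):  # first-10-days average, 2 month average
        cost = archer_cost_maus(cores, hours)
        print(f"{cores} cores at {hours} h/model year: {cost:.3f} MAUs")
```

Because the intermediate core-hours are not rounded here, the AU totals differ from the bullets by a few tens of AUs, but the MAU figures agree to three decimal places.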
Running at about 2 model years/day
Running at about 3 model years/day might be considered a bit
fast for optimum resource efficiency. Let's try running at about
2 model years/day. It's probably more accurate to scale the
numbers above than to use my analytical prediction:
2.96 model years/day for 576 atmosphere cores, so we probably need
about (2/3)*576=384 cores for 2 model years/day.
- 384 cores is 12 haswell nodes
- 24*16 seems a reasonable PE decomposition
- First 10 days have taken: 16:14 (974s)
- Two months have taken a total of 1:31:12 (5,472s), which is
an average of 15:12 (912s) per 10 days.
- 912s for 10 days is 2.63 model years/day (a lot faster than
2 model years/day)
- One model year completed in 9.13 hours
- One model year needs 13 * 9.13 = 119 node-hours
- One model year needs 32 * 119 = 3,796 core-hours
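To put a number on "a lot faster", the measured speed can be compared with what linear scaling from the 576-core run would predict. This is a minimal sketch that assumes speed scales in proportion to atmosphere cores (the same assumption used to pick 384 cores); the function name is made up for illustration.

```python
def linear_speed(ref_speed, ref_cores, cores):
    """Speed predicted by scaling a reference run linearly with core count."""
    return ref_speed * cores / ref_cores


if __name__ == "__main__":
    predicted = linear_speed(2.96, 576, 384)  # scale down from the 576-core run
    measured = 2.63                           # from the 2 month average above
    print(f"predicted {predicted:.2f}, measured {measured:.2f} model years/day")
    print(f"measured/predicted = {measured / predicted:.2f}")
```

The ratio comes out at about 1.33, so the 384-core run gets roughly a third more throughput per atmosphere core than linear scaling from 576 cores would suggest.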
For ARCHER
- ATMOS: 384 cores is 16 ARCHER nodes
- OCEAN: one node is probably enough, so another 24 cores (I was
pessimistic before, so I'll be optimistic here)
- This is a total of 408 cores
- If this was a cray we expect the same speed as above, so
408 * 9.13 = 3,725 core-hours
- The latest information is that we need to multiply this by 2
for ARCHER, so 2 * 3,725 = 7,450 core-hours
- Hence one year needs 7,450 * 15 = 111,750 AUs or 0.11 MAUs
Analytical estimate of UKESM-lr
Take what I think will be a typical setup
- ATMOS: 512 cores looks a reasonable choice, and we know from
running fullChemN96 with 30 min timestep that this runs at about
2.51 model years/day. Assume a 10% slow down for coupling and we
have 2.26 model years/day.
- OCEAN: we expect MEDUSA to slow the ORCA1 code by a factor of 3.5,
but 2 nodes (64 cores) should still be enough to achieve 2.51 model
years/day
- This is a total of 576 cores for 2.26 model years/day
- One model year will need 10.62 hours
- One model year will need 10.62 * 576 = 6,117 core-hours
For ARCHER
- ATMOS: 512 doesn't divide by 24, so let's take 504 cores and
assume that a speed of 2.26 on XC40 architecture becomes
2.26 * 504/512 = 2.22 model years/day
- OCEAN: two nodes, which is 48 cores, should still be enough
- This is a total of 552 cores (504 + 48) for 2.22 model years/day
on XC40 architecture
- One model year needs 10.81 hours
- One model year needs 10.81 * 552 = 5,967 core-hours
- The latest information is that we need to multiply this by 2
for XC30 architecture (ARCHER), so 2 * 5,967 = 11,934 core-hours
- Hence one year needs 11,934 * 15 = 179,010 AUs or 0.18 MAUs
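The step of fitting the atmosphere onto whole ARCHER nodes and rescaling the speed can be sketched as follows. It assumes, as the bullets do, that speed scales linearly with atmosphere cores; the function names are made up for illustration.

```python
def round_down_to_nodes(cores, cores_per_node):
    """Largest whole number of nodes that fits in `cores`, and its core count."""
    nodes = cores // cores_per_node
    return nodes, nodes * cores_per_node


def scaled_speed(speed, cores_from, cores_to):
    """Speed after changing core count, assuming linear scaling."""
    return speed * cores_to / cores_from


if __name__ == "__main__":
    nodes, cores = round_down_to_nodes(512, 24)  # 24 cores per ARCHER node
    # Round to 2 decimal places first, as the notes do, before converting to hours.
    speed = round(scaled_speed(2.26, 512, cores), 2)
    hours = 24 / speed
    print(f"{nodes} nodes, {cores} cores, {speed:.2f} model years/day, "
          f"{hours:.2f} h/model year")
```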
Creating something similar to UKESM-lr
Can I create something similar to UKESM-lr, so I can time that?
Yongming has u-ab330, which has
but is missing
- JULES land surface (adding Nitrogen and Carbon cycle)
- TRIFFID (carbon cycle)
- MEDUSA
Can I swap the ORCA025 ocean in here for Till's ORCA1?