ARCHER vs XCS

UKESM0.6

The one job we currently have on both the XCS and ARCHER at N96/ORCA1 resolution is UKESM0.6. The suite says it's UKESM0.6-CN, but the -CN configuration doesn't have the full chemistry (aka StratTrop), whereas this configuration does.

I've copied the XCS version of UKESM0.6-CN, u-an130, to u-ar648, and the ARCHER version of UKESM0.6-CN, u-an561, to u-ar563. The one change I've made to this configuration is to detach XIOS, which makes the run a bit faster. Common configurations for both jobs are

  • UM10.6
  • (ATM_PROCX,ATM_PROCY)=(48,24) (The most possible with this halo setting is (48,28), so this is close to that)
  • OMPTHR_ATM=1 (one thread)
  • IOS_NPROC=0 (no use of IO server)
  • Total ATMOS tasks of 48*24=1,152, which is 32 XCS nodes and 48 ARCHER nodes
  • (NEMO_IPROC,NEMO_JPROC)=(12,9)
  • (CICE_BLKX,CICE_BLKY)=(30,37)
  • Total OCEAN tasks of 12*9=108, which is 3 XCS nodes and 5 ARCHER nodes (one only half used)
  • XIOS_NPROC=6, which is one incomplete XCS node and one incomplete ARCHER node
  • Total CPUs used is 1,296 for both, which is 36 XCS nodes and 54 ARCHER nodes (see the sketch after this list)
  • Run for one two-month cycle
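
The node counts above follow from dividing each component's task count by the cores per node (36 on the XCS, 24 on ARCHER) and rounding up, assuming each component sits on whole nodes of its own, as the list implies. A minimal sketch of that arithmetic in Python (the variable names are just for illustration):

  import math

  CORES_PER_NODE = {"XCS": 36, "ARCHER": 24}

  # Task counts from the list above.
  atmos_tasks = 48 * 24   # (ATM_PROCX, ATM_PROCY) = (48, 24) -> 1,152 tasks
  ocean_tasks = 12 * 9    # (NEMO_IPROC, NEMO_JPROC) = (12, 9) -> 108 tasks
  xios_tasks = 6          # XIOS_NPROC

  for machine, ppn in CORES_PER_NODE.items():
      # Each component gets whole nodes to itself, so round up per component.
      nodes = sum(math.ceil(t / ppn) for t in (atmos_tasks, ocean_tasks, xios_tasks))
      print(f"{machine}: {nodes} nodes, {nodes * ppn} cores allocated")
  # XCS: 36 nodes, 1296 cores allocated
  # ARCHER: 54 nodes, 1296 cores allocated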

The performance of the jobs before Dr Hook was added and afterwards is summarised below

                              XCS                 ARCHER
  Job suites                  u-ar648             u-ar563
  Copied from                 u-an130             u-an561
  Performance without Dr Hook
    Elapsed time              1:39:22 (5,962s)    1:50:19 (6,619s)
    Speed (model years/day)   2.42                2.18
  Performance with Dr Hook
    Elapsed time              2:02:54 (7,374s)    2:05:17 (7,517s)
    Speed (model years/day)   1.95                1.92

I've only run one two-month cycle in each case, and we know the noise on any given Cray run seems to be about 10%, so we can't use these numbers to give a precise comparison of the machines. However, for these runs ARCHER was 11% slower than the XCS without Dr Hook and 1.9% slower than the XCS with Dr Hook.
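
For completeness, the speeds and slowdown percentages quoted above follow directly from the elapsed times; a quick sketch of that arithmetic, assuming speed is simulated years per wall-clock day for a two-month cycle:

  # Speed in model years/day and ARCHER's slowdown relative to the XCS,
  # from the elapsed times in the table above.
  SECONDS_PER_DAY = 86400.0
  SIMULATED_YEARS = 2.0 / 12.0   # one two-month cycle

  def speed(elapsed_s):
      return SIMULATED_YEARS / (elapsed_s / SECONDS_PER_DAY)

  for label, xcs_s, archer_s in [("without Dr Hook", 5962, 6619),
                                 ("with Dr Hook", 7374, 7517)]:
      slowdown = 100.0 * (archer_s - xcs_s) / xcs_s
      print(f"{label}: XCS {speed(xcs_s):.2f}, ARCHER {speed(archer_s):.2f} "
            f"model years/day; ARCHER {slowdown:.1f}% slower")
  # without Dr Hook: XCS 2.42, ARCHER 2.18 model years/day; ARCHER 11.0% slower
  # with Dr Hook: XCS 1.95, ARCHER 1.92 model years/day; ARCHER 1.9% slower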

Comparing XCS and ARCHER hardware

                                XCS                        ARCHER
  Compute nodes                 Broadwell (2.1 GHz)        Ivy Bridge (two 2.7 GHz, 12-core E5-2697 v2)
  Local memory per NUMA region  64 GB (3.56 GB per core)   32 GB (2.67 GB per core)
  Aries interconnect            Dragonfly                  Dragonfly (4 compute nodes per Aries router; 188 nodes per cabinet; two cabinets per group)

My knowledge of hardware is rubbish, although gradually improving. The XCS is using a later version of the Intel chip and has more local memory per core (we know that memory bandwidth is a limiting factor, particularly for the dynamical core of the UM and for NEMO). The clock speed of the XCS is lower. We had expected the XCS to be quicker than ARCHER, and the runs above are a bit quicker on the XCS, but another comparison, for a GC3.1 N216/ORCA025 run, found ARCHER to be quicker (see p34 and p36 of my UKESM General IV).

Top level profiling

The times in the tables below are all mean total times, unless `itself' is specified. Total time is the time spent within a routine and in all the routines called by that routine; if `itself' is specified then the time is just the time spent in that routine. Each row in the tables sits one call-tree level below the row above it, and I'm only showing the main routines in terms of time.
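
As a concrete illustration of the total/itself split, using the GATHER_FIELD numbers from the XCS table below:

  # Total time of a routine = its 'itself' time plus the total times of the
  # routines it calls. E.g. GATHER_FIELD in the XCS profile below:
  gather_field_mpl = 389     # child total, quoted as 'itself' (no children shown)
  gather_field_itself = 177  # the 'Itself' entry for GATHER_FIELD
  gather_field_total = gather_field_itself + gather_field_mpl
  print(gather_field_total)  # 566, matching GATHER_FIELD (566s)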

XCS

Routines
  UM_SHELL (7,268s)
    U_MODEL_4A (7,264s)
      ATM_STEP_4A* (6,266s) | MEANCTL (483s) | DUMPCTL (326s) | OASIS3_PUTA2O (106s) | OASIS3_GETO2A (43s)
        ATMOS_PHYSICS1 (1,128s) | ATMOS_PHYSICS2 (365s) | EG_SL_HELMHOLTZ (373s) | TR_SET_PHYS_4A* (437s) | EG_CORRECT_TRACERS_PRIESTLEY (429s) | SL_TRACER1_4A (685s) | EG_SL_FULL_WIND (173s) | UKCA_MAIN1 (1,474s) | STASH (631s) | SWAP_BOUNDS routines (2,045s) | ACUMPS (426s) | UM_WRITDUMP (326s) | ICE_SHEET_MASS (103s) | OASIS3_GET (43s)
          STWORK (630s) | see profile for SWAP_BOUNDS routines | GENERAL_GATHER_FIELD (565s)
            STASH_GATHER_FIELD (585s)
              GATHER_FIELD (566s)
                GATHER_FIELD_MPL (389s, itself) | itself (177s)

ARCHER

Routines
  UM_SHELL (7,375s)
    U_MODEL_4A (7,369s)
      ATM_STEP_4A* (6,169s) | MEANCTL (571s) | DUMPCTL (388s) | OASIS3_PUTA2O (144s) | OASIS3_GETO2A (34s)
        ATMOS_PHYSICS1 (1,023s) | ATMOS_PHYSICS2 (334s) | EG_SL_HELMHOLTZ (461s) | TR_SET_PHYS_4A* (327s) | EG_CORRECT_TRACERS_PRIESTLEY (374s) | SL_TRACER1_4A (524s) | EG_SL_FULL_WIND (185s) | UKCA_MAIN1 (1,314s) | STASH (521s) | SWAP_BOUNDS routines (2,143s) | ACUMPS (502s) | UM_WRITDUMP (388s) | ICE_SHEET_MASS (137s) | OASIS3_GET (32s)
          STWORK (519s) | see profile for SWAP_BOUNDS routines | GENERAL_GATHER_FIELD (697s)
            STASH_GATHER_FIELD (712s)
              GATHER_FIELD (699s)
                GATHER_FIELD_MPL (396s, itself) | itself (302s)

Profiling SWAP_BOUNDS routines

XCS

Routines                                                                             Total mean time
  SWAP_BOUNDS & SWAP_BOUNDS_DP (1,288 + 677 = 1,965s)   SWAP_BOUNDS_MV (80s, itself)   2,045s
  SWAP_BOUNDS_NS_DP (1,361s)                            SWAP_BOUNDS_EW_DP (498s)       1,939s
  SWAP_BOUNDS_NS_DDT_DP (1,360s, itself)                SWAP_BOUNDS_EW_DDT_DP (497s)   1,937s

ARCHER

Routines                                                                                     Total mean time
  SWAP_BOUNDS & SWAP_BOUNDS_DP (1,029 + 1,037 = 2,066s)   SWAP_BOUNDS_MV (77s, itself)         2,143s
  SWAP_BOUNDS_NS_DP (1,325s)                              SWAP_BOUNDS_EW_DP (614s)             2,016s
  SWAP_BOUNDS_NS_DDT_DP (1,324s, itself)                  SWAP_BOUNDS_EW_DDT_DP (614s, itself) 2,015s
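
The 'Total mean time' in each row appears to include SWAP_BOUNDS_MV as well as the two routines listed; a quick consistency check of the two tables above, on that assumption:

  # Check: quoted total per row = the two routine times listed + SWAP_BOUNDS_MV
  # ('itself'). Times in seconds, taken from the two tables above.
  tables = {
      "XCS":    (80, [([1288, 677], 2045), ([1361, 498], 1939), ([1360, 497], 1937)]),
      "ARCHER": (77, [([1029, 1037], 2143), ([1325, 614], 2016), ([1324, 614], 2015)]),
  }
  for machine, (mv, rows) in tables.items():
      for parts, quoted_total in rows:
          assert sum(parts) + mv == quoted_total, (machine, parts, quoted_total)
  print("row totals consistent")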

Summary

  • The two profiles are fairly similar.
  • The time in ICE_SHEET_MASS really shows the delay caused by the ocean; it is bigger on ARCHER, but small on both machines.
  • The times in most routines are smaller on ARCHER than on the XCS, except for the following:
    • EG_SL_HELMHOLTZ and the SWAP_BOUNDS routines, where the extra time in the SWAP_BOUNDS routines probably accounts for the extra time in EG_SL_HELMHOLTZ.
      • This suggests that on fewer nodes, ARCHER may well be faster.
    • MEANCTL (climate meaning).
  • ARCHER makes greater use of SWAP_BOUNDS_DP than the XCS.

Reducing MPI tasks

The above suggests that ARCHER is slower than the XCS at message passing, but its compute seems to be faster. To test this I think it's worth running a job with half the MPI tasks of the jobs above, so there will be less message passing. As I'm only running with half the resources, I probably only have enough time to run for one month rather than two. The run configurations are

  • UM10.6
  • (ATM_PROCX,ATM_PROCY)=(24,24)
  • OMPTHR_ATM=1 (one thread)
  • IOS_NPROC=0 (no use of IO server)
  • Total ATMOS tasks of 24*24=576, which is 16 XCS nodes and 24 ARCHER nodes
  • (NEMO_IPROC,NEMO_JPROC)=(12,9)
  • (CICE_BLKX,CICE_BLKY)=(30,37)
  • Total OCEAN tasks of 12*9=108, which is 3 XCS nodes and 5 ARCHER nodes (one only half used)
  • XIOS_NPROC=6, which is one incomplete XCS node and one incomplete ARCHER node
  • Total CPUs used is 720 for both, which is 20 XCS nodes and 30 ARCHER nodes
  • Run for a one-month cycle

The performance of the jobs before Dr Hook was added and afterwards is summarised below

                              XCS                 ARCHER
  Job suites                  u-ar648             u-ar563
  Performance without Dr Hook
    Elapsed time              0:50:55 (3,055s)    1:07:26 (4,046s)
    Speed (model years/day)
  Performance with Dr Hook
    Elapsed time              1:07:53 (4,073s)    1:27:21 (5,241s)
    Speed (model years/day)

These jobs show that ARCHER is

  • 32% slower without Dr Hook, and
  • 29% slower with Dr Hook

As a percentage slowdown, this is a lot higher than when I used double the MPI tasks in the atmosphere.
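
The same slowdown arithmetic as before, applied to the one-month elapsed times in the table above:

  # ARCHER slowdown relative to the XCS for the one-month runs.
  runs = {"without Dr Hook": (3055, 4046), "with Dr Hook": (4073, 5241)}
  for label, (xcs_s, archer_s) in runs.items():
      print(f"{label}: ARCHER {100.0 * (archer_s - xcs_s) / xcs_s:.0f}% slower")
  # without Dr Hook: ARCHER 32% slower
  # with Dr Hook: ARCHER 29% slower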

Top level profiling

XCS

Routines
  UM_SHELL (3,958s)
    U_MODEL_4A (3,954s)
      ATM_STEP_4A* (3,361s) | MEANCTL (247s) | DUMPCTL (193s) | OASIS3_PUTA2O (55s) | OASIS3_GETO2A (31s)
        ATMOS_PHYSICS1 (589s) | ATMOS_PHYSICS2 (207s) | EG_SL_HELMHOLTZ (185s) | TR_SET_PHYS_4A* (244s) | EG_CORRECT_TRACERS_PRIESTLEY (214s) | SL_TRACER1_4A (342s) | EG_SL_FULL_WIND (83s) | UKCA_MAIN1 (749s) | STASH (315s) | SWAP_BOUNDS routines (1,223s) | ACUMPS (216s) | UM_WRITDUMP (193s) | ICE_SHEET_MASS (53s) | OASIS3_GET (30s)
          STWORK (315s) | see profile for SWAP_BOUNDS routines | GENERAL_GATHER_FIELD (324s)
            STASH_GATHER_FIELD (333s)
              GATHER_FIELD (324s)
                GATHER_FIELD_MPL (225s, itself) | itself (99s)

ARCHER

Routines
  UM_SHELL (6,260s)
    U_MODEL_4A (6,254s)
      ATM_STEP_4A* (5,350s) | MEANCTL (420s) | DUMPCTL (282s) | OASIS3_PUTA2O (123s) | OASIS3_GETO2A (29s)
        ATMOS_PHYSICS1 (1,012s) | ATMOS_PHYSICS2 (325s) | EG_SL_HELMHOLTZ (394s) | TR_SET_PHYS_4A* (266s) | EG_CORRECT_TRACERS_PRIESTLEY (323s) | SL_TRACER1_4A (436s) | EG_SL_FULL_WIND (137s) | UKCA_MAIN1 (1,187s) | STASH (397s) | SWAP_BOUNDS routines (1,767s) | ACUMPS (370s) | UM_WRITDUMP (282s) | ICE_SHEET_MASS (118s) | OASIS3_GET (27s)
          STWORK (396s) | see profile for SWAP_BOUNDS routines | GENERAL_GATHER_FIELD (511s)
            STASH_GATHER_FIELD (523s)
              GATHER_FIELD (512s)
                GATHER_FIELD_MPL (285s, itself) | itself (227s)

Profiling SWAP_BOUNDS routines

XCS

Routines                                                                             Total mean time
  SWAP_BOUNDS & SWAP_BOUNDS_DP (689 + 471 = 1,160s)     SWAP_BOUNDS_MV (63s, itself)   1,223s
  SWAP_BOUNDS_NS_DP (793s)                              SWAP_BOUNDS_EW_DP (315s)       1,171s
  SWAP_BOUNDS_NS_DDT_DP (792s, itself)                  SWAP_BOUNDS_EW_DDT_DP (314s)   1,169s

ARCHER

Routines                                                                             Total mean time
  SWAP_BOUNDS & SWAP_BOUNDS_DP (884 + 809 = 1,693s)     SWAP_BOUNDS_MV (74s, itself)   1,767s
  SWAP_BOUNDS_NS_DP (1,087s)                            SWAP_BOUNDS_EW_DP (508s)       1,669s
  SWAP_BOUNDS_NS_DDT_DP (1,086s, itself)                SWAP_BOUNDS_EW_DDT_DP (508s)   1,668s

Summary

Contrary to what I was expecting, as a percentage ARCHER is much slower than the XCS with fewer MPI tasks: around 30% slower, compared with roughly 2-11% when using double the atmosphere MPI tasks. ARCHER has less memory per core than the XCS, and this will matter more with fewer MPI tasks, because each PE has a larger domain to work on. I'm guessing this is why ARCHER's performance relative to the XCS is poorer with fewer MPI tasks.
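
To put a rough number on the memory argument, here is a sketch of the local subdomain per atmosphere MPI task for the two decompositions; the N96 grid size of 192 x 144 points is my assumption here, and the memory-per-core figures come from the hardware table earlier:

  # Rough sketch of per-task domain size for the two atmosphere decompositions.
  # ASSUMPTION: N96 is treated here as a 192 x 144 (longitude x latitude) grid.
  GLOBAL_NX, GLOBAL_NY = 192, 144

  for procx, procy in [(48, 24), (24, 24)]:
      nx, ny = GLOBAL_NX // procx, GLOBAL_NY // procy
      print(f"({procx},{procy}): {nx} x {ny} = {nx * ny} columns per task")
  # (48,24): 4 x 6 = 24 columns per task
  # (24,24): 8 x 6 = 48 columns per task, i.e. twice the data per task, so
  # ARCHER's smaller memory per core (2.67 GB vs 3.56 GB, from the hardware
  # table) should bite harder at the lower task count.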

GA7.1 + StratTrop

Luke Abraham has produced the following plot comparing the performance of GA7.1 + StratTrop on ARCHER with the XCS fully populated (36 PPN) and using only 24 cores per node (24 PPN).