The one job we currently have on both the XCS and ARCHER at N96/ORCA1 resolution is UKESM0.6. It is labelled UKESM0.6-CN, but the -CN configuration doesn't include the full chemistry (aka StratTrop), whereas this job does.
I've copied the XCS version of UKESM0.6-CN, u-an130, to u-ar648, and the ARCHER version, u-an561, to u-ar563. The one change I've made to this configuration is to detach XIOS, which makes the run a bit faster. The configuration common to both jobs is as follows.
The performance of the jobs before and after Dr Hook was added is summarised below.

|  | XCS | ARCHER |
|---|---|---|
| Job suites | u-ar648 | u-ar563 |
| Jobs copied from | u-an130 | u-an561 |
| Performance without Dr Hook | | |
| Elapsed time | 1:39:22 (5,962s) | 1:50:19 (6,619s) |
| Speed (model years/day) | 2.42 | 2.18 |
| Performance with Dr Hook | | |
| Elapsed time | 2:02:54 (7,374s) | 2:05:17 (7,517s) |
| Speed (model years/day) | 1.95 | 1.92 |
I've only run one two-month cycle in each case, and we know the noise on any given Cray run seems to be about 10%, so we can't use these numbers to make a precise comparison of the machines. However, for these runs ARCHER was 11% slower than the XCS without Dr Hook and 1.9% slower than the XCS with Dr Hook.
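As a sanity check on the figures above, here's a minimal sketch of the arithmetic; it assumes the speeds were derived from the two-month cycle length, which reproduces the numbers in the table:

```python
# Minimal check of the table above: each cycle is two model months, so
# speed in model years/day = (2/12 year) / (elapsed seconds / 86,400).

SECONDS_PER_DAY = 86_400
CYCLE_YEARS = 2.0 / 12.0   # two-month cycle

def years_per_day(elapsed_s):
    return CYCLE_YEARS / (elapsed_s / SECONDS_PER_DAY)

runs = {
    "without Dr Hook": {"XCS": 5_962, "ARCHER": 6_619},
    "with Dr Hook":    {"XCS": 7_374, "ARCHER": 7_517},
}

for label, t in runs.items():
    slowdown = 100.0 * (t["ARCHER"] / t["XCS"] - 1.0)
    print(f"{label}: XCS {years_per_day(t['XCS']):.2f}, "
          f"ARCHER {years_per_day(t['ARCHER']):.2f} model years/day; "
          f"ARCHER {slowdown:.1f}% slower")
# without Dr Hook: XCS 2.42, ARCHER 2.18 model years/day; ARCHER 11.0% slower
# with Dr Hook: XCS 1.95, ARCHER 1.92 model years/day; ARCHER 1.9% slower
```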

|  | XCS | ARCHER |
|---|---|---|
| Compute nodes | Broadwell (2.1 GHz) | Ivy Bridge (two 2.7 GHz, 12-core E5-2697 v2) |
| Local memory in NUMA region | 64 GB (3.56 GB per core) | 32 GB (2.67 GB per core) |
| Aries interconnect | Dragonfly | Dragonfly (4 compute nodes are connected to each Aries router; 188 nodes are grouped into a cabinet; and two cabinets make up a group) |
My knowledge of hardware is rubbish, although gradually improving. The XCS uses a later generation of Intel chip and has more local memory per core (we know that memory bandwidth is a limiting factor, particularly for the dynamical cores of the UM and NEMO), although its clock speed is lower. We had expected the XCS to be quicker than ARCHER, and the runs above are a bit quicker on the XCS, but another comparison for a GC3.1 N216/ORCA025 run found ARCHER to be quicker (see p34 and p36 of my UKESM General IV).
The times in the tables below are all mean total times unless `itself' is specified. Total time is the time spent within a routine plus all the routines called by that routine; where `itself' is specified, the time is just the time spent in that routine alone. I'm only showing the main routines in terms of time.
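To make the `total' versus `itself' distinction concrete, here's a one-line illustration (this is just the arithmetic, not Dr Hook output):

```python
# "Total" time includes a routine plus everything it calls; "itself" is
# what remains after subtracting its callees' totals (ignoring routines
# that have more than one caller).

def itself_time(total_s, callee_totals_s):
    return total_s - sum(callee_totals_s)

# Example from the XCS profile below: GATHER_FIELD totals 566s and its
# callee GATHER_FIELD_MPL totals 389s, leaving 177s in GATHER_FIELD itself.
print(itself_time(566, [389]))   # -> 177
```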
XCS (u-ar648):

| Routines | | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| UM_SHELL (7,268s) |
| U_MODEL_4A (7,264s) |
| ATM_STEP_4A* (6,266s) | MEANCTL (483s) | DUMPCTL (326s) | OASIS3_PUTA2O (106s) | OASIS3_GETO2A (43s) |
| ATMOS_PHYSICS1 (1,128s) | ATMOS_PHYSICS2 (365s) | EG_SL_HELMHOLTZ (373s) | TR_SET_PHYS_4A* (437s) | EG_CORRECT_TRACERS_PRIESTLEY (429s) | SL_TRACER1_4A (685s) | EG_SL_FULL_WIND (173s) | UKCA_MAIN1 (1,474s) | STASH (631s) | SWAP_BOUNDS routines (2,045s) | ACUMPS (426s) | UM_WRITDUMP (326s) | ICE_SHEET_MASS (103s) | OASIS3_GET (43s) |
| STWORK (630s) | See profile for SWAP_BOUNDS routines | GENERAL_GATHER_FIELD (565s) |
| STASH_GATHER_FIELD (585s) |
| GATHER_FIELD (566s) |
| GATHER_FIELD_MPL (389s, itself) | Itself (177s) |

ARCHER (u-ar563):

| Routines | | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| UM_SHELL (7,375s) |
| U_MODEL_4A (7,369s) |
| ATM_STEP_4A* (6,169s) | MEANCTL (571s) | DUMPCTL (388s) | OASIS3_PUTA2O (144s) | OASIS3_GETO2A (34s) |
| ATMOS_PHYSICS1 (1,023s) | ATMOS_PHYSICS2 (334s) | EG_SL_HELMHOLTZ (461s) | TR_SET_PHYS_4A* (327s) | EG_CORRECT_TRACERS_PRIESTLEY (374s) | SL_TRACER1_4A (524s) | EG_SL_FULL_WIND (185s) | UKCA_MAIN1 (1,314s) | STASH (521s) | SWAP_BOUNDS routines (2,143s) | ACUMPS (502s) | UM_WRITDUMP (388s) | ICE_SHEET_MASS (137s) | OASIS3_GET (32s) |
| STWORK (519s) | See profile for SWAP_BOUNDS routines | GENERAL_GATHER_FIELD (697s) |
| STASH_GATHER_FIELD (712s) |
| GATHER_FIELD (699s) |
| GATHER_FIELD_MPL (396s, itself) | Itself (302s) |

SWAP_BOUNDS profile, XCS:

| Routines | | Total mean time |
|---|---|---|
| SWAP_BOUNDS & SWAP_BOUNDS_DP (1,288 + 677 = 1,965s) | SWAP_BOUNDS_MV (80s, itself) | 2,045s |
| SWAP_BOUNDS_NS_DP (1,361s) | SWAP_BOUNDS_EW_DP (498s) | 1,939s |
| SWAP_BOUNDS_NS_DDT_DP (1,360s, itself) | SWAP_BOUNDS_EW_DDT_DP (497s) | 1,937s |

SWAP_BOUNDS profile, ARCHER:

| Routines | | Total mean time |
|---|---|---|
| SWAP_BOUNDS & SWAP_BOUNDS_DP (1,029 + 1,037 = 2,066s) | SWAP_BOUNDS_MV (77s, itself) | 2,143s |
| SWAP_BOUNDS_NS_DP (1,325s) | SWAP_BOUNDS_EW_DP (614s) | 2,016s |
| SWAP_BOUNDS_NS_DDT_DP (1,324s, itself) | SWAP_BOUNDS_EW_DDT_DP (614s, itself) | 2,015s |
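Both ARCHER runs spend more time in the SWAP_BOUNDS (halo-exchange) routines than the XCS runs do. As a rough picture of how that traffic depends on the MPI decomposition, and why halving the task count should reduce it, here's a minimal sketch; the N96 grid size, the one-point halo and the two example decompositions are my assumptions, not the suite's actual settings:

```python
# Rough model only: halo-exchange volume scales with the total perimeter of
# the subdomains, so halving the MPI tasks means fewer (and larger) messages
# and less data exchanged overall, at the cost of bigger local domains.

NX, NY = 192, 144    # assumed N96 grid (east-west x north-south points)
HALO = 1             # assumed halo width in points

def halo_points(px, py):
    """Approximate halo points exchanged per field per timestep, summed over PEs."""
    sub_x, sub_y = NX / px, NY / py
    return px * py * 2 * HALO * (sub_x + sub_y)   # N, S, E and W edges

full = halo_points(24, 18)   # hypothetical full decomposition: 432 tasks
half = halo_points(12, 18)   # hypothetical half decomposition: 216 tasks
print(f"full: {full:.0f}, half: {half:.0f}, ratio {half / full:.2f}")
# -> full: 13824, half: 10368, ratio 0.75
# i.e. roughly 25% less halo data and half as many messages per step,
# but each PE's local domain (and working set) doubles.
```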
The above suggests that ARCHER is slower than the XCS at message passing, but its compute seems to be faster. To test this, I think it's worth running a job with half the MPI tasks of the jobs above, so that there is less message passing. As I'm only running with half the resources, I probably only have enough time to run for one month rather than two. The run configurations are
The performance of the jobs before and after Dr Hook was added is summarised below.

|  | XCS | ARCHER |
|---|---|---|
| Job suites | u-ar648 | u-ar563 |
| Performance without Dr Hook | | |
| Elapsed time | 0:50:55 (3,055s) | 1:07:26 (4,046s) |
| Speed (model years/day) | | |
| Performance with Dr Hook | | |
| Elapsed time | 1:07:53 (4,073s) | 1:27:21 (5,241s) |
| Speed (model years/day) | | |
This shows that ARCHER is around 30% slower than the XCS when running with half the MPI tasks (about 32% without Dr Hook and 29% with it). As a percentage slowdown, this is a lot higher than when I used double the MPI tasks in the atmosphere.

XCS (u-ar648), half the MPI tasks:

| Routines | | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| UM_SHELL (3,958s) |
| U_MODEL_4A (3,954s) |
| ATM_STEP_4A* (3,361s) | MEANCTL (247s) | DUMPCTL (193s) | OASIS3_PUTA2O (55s) | OASIS3_GETO2A (31s) |
| ATMOS_PHYSICS1 (589s) | ATMOS_PHYSICS2 (207s) | EG_SL_HELMHOLTZ (185s) | TR_SET_PHYS_4A* (244s) | EG_CORRECT_TRACERS_PRIESTLEY (214s) | SL_TRACER1_4A (342s) | EG_SL_FULL_WIND (83s) | UKCA_MAIN1 (749s) | STASH (315s) | SWAP_BOUNDS routines (1,223s) | ACUMPS (216s) | UM_WRITDUMP (193s) | ICE_SHEET_MASS (53s) | OASIS3_GET (30s) |
| STWORK (315s) | See profile for SWAP_BOUNDS routines | GENERAL_GATHER_FIELD (324s) |
| STASH_GATHER_FIELD (333s) |
| GATHER_FIELD (324s) |
| GATHER_FIELD_MPL (225s, itself) | Itself (99s) |

ARCHER (u-ar563), half the MPI tasks:

| Routines | | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| UM_SHELL (6,260s) |
| U_MODEL_4A (6,254s) |
| ATM_STEP_4A* (5,350s) | MEANCTL (420s) | DUMPCTL (282s) | OASIS3_PUTA2O (123s) | OASIS3_GETO2A (29s) |
| ATMOS_PHYSICS1 (1,012s) | ATMOS_PHYSICS2 (325s) | EG_SL_HELMHOLTZ (394s) | TR_SET_PHYS_4A* (266s) | EG_CORRECT_TRACERS_PRIESTLEY (323s) | SL_TRACER1_4A (436s) | EG_SL_FULL_WIND (137s) | UKCA_MAIN1 (1,187s) | STASH (397s) | SWAP_BOUNDS routines (1,767s) | ACUMPS (370s) | UM_WRITDUMP (282s) | ICE_SHEET_MASS (118s) | OASIS3_GET (27s) |
| STWORK (396s) | See profile for SWAP_BOUNDS routines | GENERAL_GATHER_FIELD (511s) |
| STASH_GATHER_FIELD (523s) |
| GATHER_FIELD (512s) |
| GATHER_FIELD_MPL (285s, itself) | Itself (227s) |

SWAP_BOUNDS profile, XCS (half the MPI tasks):

| Routines | | Total mean time |
|---|---|---|
| SWAP_BOUNDS & SWAP_BOUNDS_DP (689 + 471 = 1,160s) | SWAP_BOUNDS_MV (63s, itself) | 1,223s |
| SWAP_BOUNDS_NS_DP (793s) | SWAP_BOUNDS_EW_DP (315s) | 1,171s |
| SWAP_BOUNDS_NS_DDT_DP (792s, itself) | SWAP_BOUNDS_EW_DDT_DP (314s) | 1,169s |

SWAP_BOUNDS profile, ARCHER (half the MPI tasks):

| Routines | | Total mean time |
|---|---|---|
| SWAP_BOUNDS & SWAP_BOUNDS_DP (884 + 809 = 1,693s) | SWAP_BOUNDS_MV (74s, itself) | 1,767s |
| SWAP_BOUNDS_NS_DP (1,087s) | SWAP_BOUNDS_EW_DP (508s) | 1,669s |
| SWAP_BOUNDS_NS_DDT_DP (1,086s, itself) | SWAP_BOUNDS_EW_DDT_DP (508s) | 1,668s |
Despite what I was expecting, as a percentage ARCHER is much slower than the XCS with fewer MPI tasks: around 30% slower, compared with something between about 2% and 11% with double the atmosphere MPI tasks. ARCHER has less memory per core than the XCS, and this will matter more with fewer MPI tasks because each PE has a larger domain to work on. I'm guessing this is why ARCHER's performance relative to the XCS is poorer with fewer MPI tasks.
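Some rough numbers behind that guess (the grid size is the standard N96 192 x 144 columns, the task counts are the same hypothetical ones as in the earlier sketch, and the cores-per-NUMA-region figures are implied by the hardware table above):

```python
# Illustrative only: halving the MPI tasks doubles the columns each PE owns,
# while ARCHER offers roughly 25% less local memory (and a smaller share of
# memory bandwidth) per core than the XCS.

NX, NY = 192, 144                                    # assumed N96 grid columns
GB_PER_CORE = {"XCS": 64 / 18, "ARCHER": 32 / 12}    # 18/12 cores per NUMA region implied by the table

for ntasks in (432, 216):                            # hypothetical full / half task counts
    cols = NX * NY / ntasks
    for machine, gb in GB_PER_CORE.items():
        print(f"{ntasks} tasks, {machine}: {cols:.0f} columns per PE, "
              f"{1024 * gb / cols:.0f} MB of local memory per column")
# The absolute memory-per-column numbers are generous either way; the point
# is that the per-PE working set doubles while ARCHER has the smaller
# per-core share of memory and memory bandwidth.
```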
Luke Abraham has produced the following plot comparing the performance of GA7.1 + StratTrop on ARCHER with the XCS fully populated (36 PPN) and using only 24 cores per node (24 PPN).
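The point of the 24 PPN runs is presumably memory bandwidth: with fewer active cores per node, each core gets a larger share of the node's bandwidth. A toy illustration (the node bandwidth figure below is a placeholder, not a measurement for either machine):

```python
# Toy illustration of under-populating nodes for a bandwidth-bound code:
# the node's memory bandwidth is shared between fewer active cores.
# NODE_BW_GBS is an assumed placeholder, not a measured value.

NODE_BW_GBS = 120.0   # assumed aggregate node memory bandwidth in GB/s

for ppn in (36, 24):
    print(f"{ppn} PPN: {NODE_BW_GBS / ppn:.1f} GB/s of node bandwidth per core")
# -> 36 PPN: 3.3 GB/s per core; 24 PPN: 5.0 GB/s per core (1.5x more per core,
#    at the cost of needing 50% more nodes for the same number of tasks).
```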