ORCA025 scaling (see Colin's e-mail from 18 Sep 2014). Data is originally from Tim Graham, so I assume it's on our current HPCs.
From the figure above, I'm getting the table below
Number of cores | Days for one model year, the inverse of the next column (seconds for one model year) | Speed (model years per day) |
---|---|---|
90 | 1.11 (96,000s) | 0.9 |
195 | 0.513 (44,308s) | 1.95 |
260 | 0.370 (32,000s) | 2.7 |
315 | 0.303 (26,182s) | 3.3 |
650 | 0.175 (15,158s) | 5.7 |
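
The first two data columns follow directly from the speed column, so the conversion can be sketched as below. This is only an illustration: the function names are made up for this note, and the speeds are the values read off the figure.

```python
SECONDS_PER_DAY = 86_400

def days_per_model_year(speed_model_years_per_day):
    """Wall-clock days needed for one model year (inverse of the speed)."""
    return 1.0 / speed_model_years_per_day

def seconds_per_model_year(speed_model_years_per_day):
    """Wall-clock seconds needed for one model year."""
    return SECONDS_PER_DAY / speed_model_years_per_day

# Speeds (model years per day) read off the figure, keyed by core count.
speeds = {90: 0.9, 195: 1.95, 260: 2.7, 315: 3.3, 650: 5.7}

for cores, speed in speeds.items():
    print(f"{cores} | {days_per_model_year(speed):.3f} "
          f"({seconds_per_model_year(speed):,.0f}s) | {speed} |")
```
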
We are running our own ocean runs. Richard has found me Dave Storky's G05.0 standard job (amhih), which is NEMO-CICE, and I've copied this to jabha. One month in this job appears to be 28 days, so I've explicitly run it for 30 days to represent one month.
When changing the `Number of PEs for NEMO East-West' and `Number of PEs for NEMO North-South', I need to make sure that, according to Richard,
(Number of columns for CICE East-West) / (Number of PEs for NEMO East-West) > 32/33
and
(Number of rows for CICE North-South) / (Number of PEs for NEMO North-South) > 32/33
which suggests that (Number of PEs for NEMO East-West) < 1440 / 32.5 = 44.3, and (Number of PEs for NEMO North-South) < 1020 / 32.5 = 31.4.
The (Number of columns for CICE East-West) is 1440 and the (Number of PEs for NEMO East-West) must be a factor of this. The factors of 1440 are 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 16, 18, 20, 24, 30, 32, 36, 40, 45, 48, ...
The (Number of rows for CICE North-South) is 1020 and the (Number of PEs for NEMO North-South) must be a factor of this. The factors of 1020 are 2, 3, 4, 5, 6, 10, 12, 15, 17, 20, 30, 34, ...
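The two requirements above (the PE count must be a factor of the grid size, and must stay under the limit from Richard's rule) can be combined in a few lines. This is a rough sketch rather than anything taken from the job setup: `divisors` and `allowed_pe_counts` are names invented here, and the 32.5 threshold is just the midpoint used in the estimate above.

```python
def divisors(n):
    """All positive divisors of n, in increasing order."""
    return [d for d in range(1, n + 1) if n % d == 0]

def allowed_pe_counts(grid_points, min_points_per_pe=32.5):
    """Divisors of grid_points that leave more than min_points_per_pe
    grid points per PE (assumed threshold, see the note above)."""
    limit = grid_points / min_points_per_pe
    return [d for d in divisors(grid_points) if d < limit]

east_west = allowed_pe_counts(1440)    # CICE columns, East-West
north_south = allowed_pe_counts(1020)  # CICE rows, North-South

print("East-West PE counts:  ", east_west)
print("North-South PE counts:", north_south)
```

With these assumptions the largest candidates are 40 East-West and 30 North-South, so values such as 48 East-West or 34 North-South fall outside the limit.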
(NEMO East-West)*(NEMO North South) (total) | Model run length | Total time (s) | Speed (model years per day) |
---|---|---|---|
16 * 12 (192) | 30 days | 4,991 | 1.44 |
32 * 12 (384) | 30 days | 2,633 | 2.73 |
32 * 20 (640) | 30 days | 1,707 | 4.22 |
32 * 30 (960) | 30 days | 1,319 | 5.46 |
48 * 20 (960) | 30 days | Illegal instruction | |
36 * 30 (1,080) | 30 days | 1,211 | 5.95 |
40 * 30 (1,200) | 30 days | Symbol resolution failed for nemo.exe | |
36 * 34 (1,224) | 30 days | | |
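
The speed column in this table is consistent with treating the 30-day run as one twelfth of a 360-day model year. That calendar is an assumption inferred from the numbers, not something stated in the job, and the function name below is made up for this note.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_MODEL_YEAR = 360  # assumption: 12 months of 30 days each

def model_years_per_day(run_days, wallclock_seconds):
    """Model years simulated per wall-clock day."""
    model_years = run_days / DAYS_PER_MODEL_YEAR
    wallclock_days = wallclock_seconds / SECONDS_PER_DAY
    return model_years / wallclock_days

# Wall-clock times (s) for the 30-day runs that completed.
timings = {"16 * 12": 4991, "32 * 12": 2633, "32 * 20": 1707,
           "32 * 30": 1319, "36 * 30": 1211}

for decomposition, seconds in timings.items():
    print(f"{decomposition}: {model_years_per_day(30, seconds):.2f}")
```
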
It seems that the code crashes when either
The plot at top of page, from Tim Graham, shows much faster times than I'm getting for my ORCA025. Maybe the plot above is for NEMO only, and doesn't include CICE?
Wrong: Tim's results do include CICE. He says the difference arises because some diagnostics take a long time to compute; see the next section.
According to Tim, `There were some extra diagnostics added at GO5 that weren’t in my runs. I suspect that calculating these may be quite slow. You can turn them off as follows:
I've created jabhb, although most of the times below are taken directly from Tim (/net/home/h05/hadtd/My_Code/Python_workspace/Ocean_resolution_plots/ORCA025_1month_times)
(NEMO East-West)*(NEMO North South) (total) | Model run length | Total time (s) | Speed (model years per day) |
---|---|---|---|
80+ | 1 month | 8,193 | 0.879 |
128+ | 1 month | 5,294 | 1.36 |
160+ | 1 month | 4,279 | 1.68 |
192+ | 1 month | 3,651 | 1.97 |
256+ | 1 month | 2,697 | 2.67 |
32 * 12 (384) | 30 days | 2,005 | 3.59 |
32 * 20 (640) | 30 days | 1,336 | 5.39 |
640+ | 1 month | 1,276 | 5.64 |
960+ | 1 month | 1,039 | 6.93 |
36 * 30 (1,080) | 30 days | 998 | 7.21 |
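
To put a number on the effect of the extra diagnostics, the three decompositions that appear in both tables can be compared directly. This sketch assumes that the rows with an explicit decomposition in this table are the jabhb runs, i.e. the same setup as jabha apart from the diagnostics; the dictionary and function names are made up here.

```python
# Wall-clock seconds for 30-day runs, keyed by total NEMO cores.
standard_job = {384: 2633, 640: 1707, 1080: 1211}  # GO5 standard job (jabha)
reduced_diag = {384: 2005, 640: 1336, 1080: 998}   # assumed: jabhb runs

def speedup(seconds_before, seconds_after):
    """How many times faster the second run is than the first."""
    return seconds_before / seconds_after

for cores in standard_job:
    factor = speedup(standard_job[cores], reduced_diag[cores])
    print(f"{cores} cores: {factor:.2f}x faster without the extra diagnostics")
```

On these three points the gain shrinks from roughly 1.3x at 384 cores to roughly 1.2x at 1,080 cores.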
The speeds after Tim's changes are shown below, where nemoCiceOrca025 is the job with Tim's changes and nemoCiceOrca025-2 is the GO5 standard job.
The questions are
I think we can almost follow the dark blue line, but multiply the cores by (1+1/32) to allow for XIOS.
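The (1+1/32) adjustment for XIOS can be applied to any NEMO core count as sketched below. Rounding up to a whole core is my own assumption, and `total_cores_with_xios` is a name invented for this note.

```python
import math

def total_cores_with_xios(nemo_cores, xios_fraction=1 / 32):
    """NEMO cores plus an allowance of one extra core per 32 for XIOS,
    rounded up to a whole number of cores (assumed rounding)."""
    return math.ceil(nemo_cores * (1 + xios_fraction))

for nemo_cores in (384, 640, 960, 1080):
    print(nemo_cores, "->", total_cores_with_xios(nemo_cores))
```
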
I've copied Tim Graham's amwmn to jabhc and combined some of my times with his (/net/home/h05/hadtd/My_Code/Python_workspace/Ocean_resolution_plots/ORCA1_2year_times)
(NEMO East-West)*(NEMO North South) (total) | Model run length | Total time (s) | Speed (model years per day) |
---|---|---|---|
64+ | 2 years | 9,183 | 18.8 |
64+ | 2 years | 9,046 | 19.1 |
128+ | 2 years | 5,327 | 32.4 |
192+ | 2 years | 4,037 | 42.8 |
12*16 (192) | 2 years | 3,678 | 47.0 |
256+ | 2 years | 3,449 | 50.1 |
320+ | 2 years | 3,004 | 57.5 |
The one run I've timed myself (12*16 = 192 cores) does look faster than Tim's 192+ run, possibly because of improvements made since then; I think Tim's runs were done a while ago.
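A quick check of how much faster the 192-core run is, using the two times from the table above. The helper name is made up for this note, and the comparison assumes both runs cover the same two model years.

```python
SECONDS_PER_DAY = 86_400

def model_years_per_day(model_years, wallclock_seconds):
    """Model years simulated per wall-clock day."""
    return model_years * SECONDS_PER_DAY / wallclock_seconds

tim_192_seconds = 4037   # Tim's 192+ run
mine_192_seconds = 3678  # my 12*16 run

tim_speed = model_years_per_day(2, tim_192_seconds)
my_speed = model_years_per_day(2, mine_192_seconds)
gain_percent = (tim_192_seconds / mine_192_seconds - 1) * 100

print(f"Tim: {tim_speed:.1f}, mine: {my_speed:.1f} model years per day "
      f"({gain_percent:.1f}% faster)")
```

So the newer run is a little under 10% faster at the same core count.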
And Tim's scaling plot