GLOMAP

OpenMP and SMT in the atmosphere-only model

All runs are CLASSIC at N96 (faced), with a PE grid of 16 * 8 (128). SMT is simultaneous multithreading; it is common to have two concurrent threads per CPU core.

Four threads

I've tried running on 4 threads, but at about timestep 681 the run failed with:

  Signal received: SIGFPE - Floating-point exception
    Signal generated for floating-point exception:
      FP invalid operation

Not clear, but probably crashing in HALO_EXCHANGE:SWAP_BOUNDS_NS_DP (see faced000.faced.d14322.t171306.leave).

I've tried running again, and this time got a floating-point exception at around timestep 71, possibly in routine UC_TO_UB (see faced000.faced.d14323.t085124.leave).

Two threads without SMT (128 tasks on 8 nodes/256 cores)

Similar problem to above around timestep 1,657 (see faced000.faced.d14349.t115553.leave).

JC thinks it might take very roughly 80% of the time of two threads with SMT. It completed 1,657/2160 = 76.7% of the run in 1,746s, which suggests it would have taken 1746*2160/1657 = 2,276s, i.e. 2276/2568 = 88.6% of the time for a normal two-threaded run with SMT.
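The extrapolation above can be reproduced with a short script. This is only a sketch: the function name `extrapolate_runtime` is made up for illustration, and it assumes every timestep costs the same, which the crashed run cannot confirm.

```python
def extrapolate_runtime(elapsed_s, steps_done, steps_total):
    """Scale elapsed time linearly to a full run (assumes constant cost per timestep)."""
    return elapsed_s * steps_total / steps_done


steps_done, steps_total = 1657, 2160   # timestep of the crash / length of a full run
elapsed_s = 1746.0                     # time taken before the crash
reference_s = 2568.0                   # UM_SHELL time for two threads with SMT

estimate_s = extrapolate_runtime(elapsed_s, steps_done, steps_total)

print(f"fraction completed:    {steps_done / steps_total:.1%}")
print(f"estimated full run:    {estimate_s:.0f}s")
print(f"relative to reference: {estimate_s / reference_s:.1%}")
```

Because the estimate comes from a single partial run, it is a rough guide rather than a measurement.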

The top profile tree

One thread with SMT (128 tasks on 2 nodes/64 cores)

Routines
UM_SHELL (4,661s)
U_MODEL_4A (4,659s)
ATM_STEP_4A* (4,559s)
ATMOS_PHYSICS1 (1,611s) | EG_CORRECT_TRACERS (258s) | ATMOS_PHYSICS2 (706s) | EG_SL_HELMHOLTZ (470s) | TR_SET_PHYS_4A* (100s) | EG_SISL_INIT (113s) | SL_TRACER1_4A (200s) | EG_SL_MOISTURE (79s) | EG_SL_FULL_WIND (259s) | EG_Q_TO_MIX (41s) | ATM_STEP_STASH (181s) | ⇓
See profile for ATMOS_PHYSICS1 and EG_CORRECT_TRACERS below | See profile for ATMOS_PHYSICS2 and EG_SL_HELMHOLTZ below | EG_SISL_INIT_UVW (107s) | EG_SL_WIND_U, EG_SL_WIND_V & EG_SL_WIND_W (75 + 75 + 88 = 238s) | EG_SWAP_BOUNDS_DP (207s) | STASH (208s)
Itself (71s) | EG_INTERPOLATION_ETA (353s) | DEPARTURE_POINT_ETA (161s) | See profile for SWAP_BOUNDS_DP below | STWORK (208s)
EG_CUBIC_LAGRANGE (159s, itself) | EG_VERT_WEIGHTS_ETA (39s, itself) | MONO_ENFORCE (20s, itself) | Itself (70s) | SPATIAL (53s) | PP_HEAD (73s) | EXPPXI (49s, itself)

One thread without SMT (128 tasks on 4 nodes/128 cores)

Routines
UM_SHELL (2,746s)
U_MODEL_4A (2,744s)
ATM_STEP_4A* (2,674s)
ATMOS_PHYSICS1 (1,031s) | EG_CORRECT_TRACERS (123s) | ATMOS_PHYSICS2 (438s) | EG_SL_HELMHOLTZ (228s) | TR_SET_PHYS_4A* (48s) | EG_SISL_INIT (49s) | SL_TRACER1_4A (107s) | EG_SL_MOISTURE (83s) | EG_SL_FULL_WIND (154s) | EG_Q_TO_MIX (27s) | ATM_STEP_STASH (115s) | ⇓
See profile for ATMOS_PHYSICS1 and EG_CORRECT_TRACERS below | See profile for ATMOS_PHYSICS2 and EG_SL_HELMHOLTZ below | EG_SISL_INIT_UVW (46s) | EG_SL_WIND_U, EG_SL_WIND_V & EG_SL_WIND_W (45 + 45 + 52 = 142s) | EG_SWAP_BOUNDS_DP (107s) | STASH (132s)
Itself (33s) | EG_INTERPOLATION_ETA (200s) | DEPARTURE_POINT_ETA (98s) | See profile for SWAP_BOUNDS_DP below | STWORK (131s)
EG_CUBIC_LAGRANGE (97s, itself) | EG_VERT_WEIGHTS_ETA (18s, itself) | MONO_ENFORCE (12s, itself) | Itself (50s) | SPATIAL (34s) | PP_HEAD (46s) | EXPPXI (31s, itself)

Two threads with SMT (128 tasks on 4 nodes/128 cores) - normal OpenMP & SMT selection

Routines
UM_SHELL (2,568s)
U_MODEL_4A (2,566s)
ATM_STEP_4A* (2,498s)
ATMOS_PHYSICS1 (949s) | EG_CORRECT_TRACERS (112s) | ATMOS_PHYSICS2 (404s) | EG_SL_HELMHOLTZ (240s) | TR_SET_PHYS_4A* (48s) | EG_SISL_INIT (50s) | SL_TRACER1_4A (102s) | EG_SL_MOISTURE (79s) | EG_SL_FULL_WIND (135s) | EG_Q_TO_MIX (21s) | ATM_STEP_STASH (112s) | ⇓
See profile for ATMOS_PHYSICS1 and EG_CORRECT_TRACERS below | See profile for ATMOS_PHYSICS2 and EG_SL_HELMHOLTZ below | EG_SISL_INIT_UVW (47s) | EG_SL_WIND_U, EG_SL_WIND_V & EG_SL_WIND_W (39 + 38 + 45 = 122s) | EG_SWAP_BOUNDS_DP (107s) | STASH (130s)
Itself (33s) | EG_INTERPOLATION_ETA (177s) | DEPARTURE_POINT_ETA (83s) | See profile for SWAP_BOUNDS_DP below | STWORK (130s)
EG_CUBIC_LAGRANGE (80s, itself) | EG_VERT_WEIGHTS_ETA (19s, itself) | MONO_ENFORCE (14s, itself) | Itself (36s) | SPATIAL (33s) | PP_HEAD (46s) | EXPPXI (30s, itself)
* Should also link to SWAP_BOUNDS_DP, like many other routines.
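The three configurations use different numbers of nodes, so the UM_SHELL times alone do not show which one is cheapest. The sketch below multiplies each UM_SHELL time by the node count from its heading. Treating node-seconds as the cost measure is an assumption made here, not something stated in the profiles.

```python
# (nodes, UM_SHELL wall-clock seconds), taken from the headings and trees above.
runs = {
    "one thread with SMT": (2, 4661),
    "one thread without SMT": (4, 2746),
    "two threads with SMT": (4, 2568),
}

# Node-seconds: wall-clock time multiplied by the number of nodes occupied.
node_seconds = {name: nodes * secs for name, (nodes, secs) in runs.items()}

baseline = node_seconds["two threads with SMT"]
for name, cost in node_seconds.items():
    nodes, secs = runs[name]
    print(f"{name:24s} {secs:5d}s on {nodes} nodes = {cost:6d} node-s "
          f"({cost / baseline:.2f} x two threads with SMT)")
```

On this measure, the one-thread SMT run on 2 nodes takes longer in wall-clock time but uses fewer node-seconds than either 4-node run.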

Profiling for ATMOS_PHYSICS1 and EG_CORRECT_TRACERS

One thread with SMT (128 tasks on 2 nodes/64 cores)

ATMOS_PHYSICS1 (1,610s) | EG_CORRECT_TRACERS (258s)
RAD_CTL (387s) | MICROPHYS_CTL (279s) | NI_GWD_CTL (303s) | ⇓ | ⇓ | EG_MASS_CONSERVATION (187s)
LW_RAD (285s) | SW_RAD (93s) | LS_PPN (259s) | G_WAVE_5A (257s) | GW_USSP (45s) | GLOBAL_2D_SUMS (48s, itself) | Itself (130s)
RADIANCE_CALC (366s) | LS_PPNC (241s) | SWAP_BOUNDS (see table below)
SOLVE_BAND_K_EQV (301s) | GREY_OPT_PROP (40s) | LSP_ICE (211s) | Itself (30s)
MCICA_SAMPLE (233s) | SCALE_ABSORB (42s) | OPT_PROP_AEROSOL (32s, itself) | LSP_INIT (46s) | LSP_FALL (40s)
MONOCHROMATIC_RADIANCE (203s) | Itself (32s) | Itself (22s) | Itself (35s)
MONOCHROMATIC_RADIANCE_TSEQ (176s)
MCICA_COLUMN (175s)
TWO_COEFF (118s)
TRANS_SOURCE_COEFF (66s) | Itself (20s)
Itself (42s)
  

One thread without SMT (128 tasks on 4 nodes/128 cores)

ATMOS_PHYSICS1 (1,031s) | EG_CORRECT_TRACERS (123s)
RAD_CTL (232s) | MICROPHYS_CTL (193s) | NI_GWD_CTL (219s) | ⇓ | ⇓ | EG_MASS_CONSERVATION (91s)
LW_RAD (171s) | SW_RAD (56s) | LS_PPN (180s) | G_WAVE_5A (184s) | GW_USSP (35s) | GLOBAL_2D_SUMS (32s, itself) | Itself (65s)
RADIANCE_CALC (218s) | LS_PPNC (171s) | SWAP_BOUNDS (see table below)
SOLVE_BAND_K_EQV (185s) | GREY_OPT_PROP (20s) | LSP_ICE (155s) | Itself (16s)
MCICA_SAMPLE (142s) | SCALE_ABSORB (30s) | OPT_PROP_AEROSOL (15s, itself) | LSP_INIT (31s) | LSP_FALL (32s)
MONOCHROMATIC_RADIANCE (124s) | Itself (26s) | Itself (16s) | Itself (29s)
MONOCHROMATIC_RADIANCE_TSEQ (108s)
MCICA_COLUMN (108s)
TWO_COEFF (77s)
TRANS_SOURCE_COEFF (44s) | Itself (s)
Itself (s)
  

Two threads with SMT (128 tasks on 4 nodes/128 cores) - normal OpenMP & SMT selection

ATMOS_PHYSICS1 (949s) | EG_CORRECT_TRACERS (122s)
RAD_CTL (215s) | MICROPHYS_CTL (186s) | NI_GWD_CTL (176s) | ⇓ | ⇓ | EG_MASS_CONSERVATION (90s)
LW_RAD (153s, 153s) | SW_RAD (53s, 53s) | LS_PPN (174s) | G_WAVE_5A (150s) | GW_USSP (25s) | GLOBAL_2D_SUMS (38s, itself) | Itself (63s)
RADIANCE_CALC (200s, 199s) | LS_PPNC (163s) | SWAP_BOUNDS (see table below)
SOLVE_BAND_K_EQV (155s, 154s) | GREY_OPT_PROP (33s, 33s) | LSP_ICE (95s, 96s) | Itself (68s)
MCICA_SAMPLE (122s, 122s) | SCALE_ABSORB (20s, 20s) | OPT_PROP_AEROSOL (29s, 29s, itself) | LSP_INIT (23s, 23s) | LSP_FALL (19s, 19s)
MONOCHROMATIC_RADIANCE (107s, 107s) | Itself (15s, 15s) | Itself (11s, 11s) | Itself (16s, 17s)
MONOCHROMATIC_RADIANCE_TSEQ (94s, 94s)
MCICA_COLUMN (94s, 94s)
TWO_COEFF (67s, 67s)
TRANS_SOURCE_COEFF (37s, 37s) | Itself (12s)
Itself (23s)
  

Profiling for ATMOS_PHYSICS2 and EG_SL_HELMHOLTZ

One thread with SMT (128 tasks on 2 nodes/64 cores)

ATMOS_PHYSICS2 (706s) | EG_SL_HELMHOLTZ (470s)
NI_CONV_CTL (325s) | NI_IMP_CTL (136s) | SWAP_BOUNDS, SWAP_BOUNDS_2D_MV & SWAP_BOUNDS_MV (see table below) | EG_BICGSTAB (272s) | EG_HELM_RHS_STAR (132s)
GLUE_CONV_5A (265s) | IMP_SOLVER (64s) | EG_PRECON (198s) | EG_SISL_INIT* (113s)
Itself (105s) | MID_CONV_5A (92s) | TRI_SOR_DP_DP (198s) | EG_SISL_INIT_UVW (107s)
Itself (31s) | Itself (143s) | Itself (71s)

One thread without SMT (128 tasks on 4 nodes/128 cores)

ATMOS_PHYSICS2 (438s) | EG_SL_HELMHOLTZ (228s)
NI_CONV_CTL (187s) | NI_IMP_CTL (87s) | SWAP_BOUNDS, SWAP_BOUNDS_2D_MV & SWAP_BOUNDS_MV (see table below) | EG_BICGSTAB (123s) | EG_HELM_RHS_STAR (74s)
GLUE_CONV_5A (154s) | IMP_SOLVER (42s) | EG_PRECON (86s) | EG_SISL_INIT* (49s)
Itself (56s) | MID_CONV_5A (54s) | TRI_SOR_DP_DP (86s) | EG_SISL_INIT_UVW (46s)
Itself (17s) | Itself (58s) | Itself (33s)

Two threads with SMT (128 tasks on 4 nodes/128 cores) - normal OpenMP & SMT selection

ATMOS_PHYSICS2 (404s) | EG_SL_HELMHOLTZ (240s)
NI_CONV_CTL (173s) | NI_IMP_CTL (79s) | SWAP_BOUNDS, SWAP_BOUNDS_2D_MV & SWAP_BOUNDS_MV (see table below) | EG_BICGSTAB (140s) | EG_HELM_RHS_STAR (69s)
GLUE_CONV_5A (126s, 123s) | IMP_SOLVER (38s) | EG_PRECON (95s) | EG_SISL_INIT* (50s)
Itself (47s, 46s) | MID_CONV_5A (42s, 42s) | TRI_SOR_DP_DP (95s) | EG_SISL_INIT_UVW (47s)
Itself (13s, 13s) | Itself (64s) | Itself (33s)
* EG_SISL_INIT is also called from ATM_STEP_4A.

Profiling for SWAP_* routines

One thread with SMT (128 tasks on 2 nodes/64 cores)

Routines Total mean time
EG_SWAP_BOUNDS_DP (207s) | ATMOS_PHYSICS1, ATMOS_PHYSICS2, G_WAVE_5A, ... | 207 + ...
SWAP_BOUNDS & SWAP_BOUNDS_DP (653 + 227 = 880s) | SWAP_BOUNDS_MV (169s, itself) | 1,049s
SWAP_BOUNDS_EW_DP (452s) | SWAP_BOUNDS_NS_DP (425s, itself) | 1,046s
SWAP_BOUNDS_EW_H1_DP (281s, itself) | Itself (171s) | 1,046s

One thread without SMT (128 tasks on 4 nodes/128 cores)

Routines Total mean time
EG_SWAP_BOUNDS_DP (107s) | ATMOS_PHYSICS1, ATMOS_PHYSICS2, G_WAVE_5A, ... | 107 + ...
SWAP_BOUNDS & SWAP_BOUNDS_DP (375 + 121 = 496s) | SWAP_BOUNDS_MV (119s, itself) | 615s
SWAP_BOUNDS_EW_DP (248s) | SWAP_BOUNDS_NS_DP (245s, itself) | 612s
SWAP_BOUNDS_EW_H1_DP (168s, itself) | Itself (80s) | 612s

Two threads with SMT (128 tasks on 4 nodes/128 cores) - normal OpenMP & SMT selection

Routines Total mean time
EG_SWAP_BOUNDS_DP (107s) | ATMOS_PHYSICS1, ATMOS_PHYSICS2, G_WAVE_5A, ... | 107 + ...
SWAP_BOUNDS & SWAP_BOUNDS_DP (339 + 118 = 457s) | SWAP_BOUNDS_MV (103s, itself) | 560s
SWAP_BOUNDS_EW_DP (234s) | SWAP_BOUNDS_NS_DP (222s, itself) | 559s
SWAP_BOUNDS_EW_H1_DP (153s, itself) | Itself (81s) | 559s
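The "Total mean time" column in the SWAP_* tables appears to be the sum of the entries at each level of the tree plus SWAP_BOUNDS_MV. That reading is an assumption, but it reproduces every total listed. The sketch below checks it, using short dictionary keys that are made up for illustration.

```python
# Mean times in seconds, copied from the three SWAP_* tables above.
# Keys are illustrative shorthand: sb = SWAP_BOUNDS, sb_dp = SWAP_BOUNDS_DP,
# mv = SWAP_BOUNDS_MV, ew = SWAP_BOUNDS_EW_DP, ns = SWAP_BOUNDS_NS_DP,
# ew_h1 = SWAP_BOUNDS_EW_H1_DP, ew_self = time in SWAP_BOUNDS_EW_DP itself.
swap_times = {
    "one thread with SMT": dict(sb=653, sb_dp=227, mv=169, ew=452, ns=425, ew_h1=281, ew_self=171),
    "one thread without SMT": dict(sb=375, sb_dp=121, mv=119, ew=248, ns=245, ew_h1=168, ew_self=80),
    "two threads with SMT": dict(sb=339, sb_dp=118, mv=103, ew=234, ns=222, ew_h1=153, ew_self=81),
}


def row_totals(t):
    """Return the totals for tree levels 2 to 4 under the assumed summation rule."""
    return (
        t["sb"] + t["sb_dp"] + t["mv"],
        t["ew"] + t["ns"] + t["mv"],
        t["ew_h1"] + t["ew_self"] + t["ns"] + t["mv"],
    )


for name, t in swap_times.items():
    print(name, row_totals(t))
```

The small drop between the first total and the other two in each table (for example 1,049s against 1,046s) is presumably time spent in SWAP_BOUNDS and SWAP_BOUNDS_DP themselves, or rounding. The tables do not say which.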