Submission with --install-only option

The options to submit 2UM jobs

Coupling two UMs, called Snr and Jnr, has presented a fairly unusual problem in that both coupling components share the same variable names. This is a problem for the rose submission, because the variable name is used to provide the meta name, which determines

  • which namelist variable is written to
  • where the variable is displayed in the rose gui, and also the trigger conditions for whether the variable should be displayed or not

So something is needed to determine whether a variable is intended for Snr or Jnr. Two options were explored before the one involving --install-only was tried.

1) Two rose apps for two UMs and using namelists from Jnr recon

This is the first option which has been used since we moved away from the UMUI. Each UM has its own rose app, and so it's fairly straightforward to create the Snr and Jnr code and run their reconfiguration jobs, [recon] and [recon2]. The Snr rose app is used for the [coupled] job, but we still require the namelist files for Jnr. These are copied from the Jnr reconfiguration, where any occurrence of the variable `recon2' is replaced with `coupled', and the run_target_end is altered to agree with that of Snr (found in SHARED). This is all done through an extra function in um_script_functions called usf_cp_recon2_input. This function also deals with renaming the ancillary environment variables (although this doesn't seem to have been necessary for the next two options, so probably wasn't necessary here).
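
The copy-and-rewrite step can be pictured with a minimal sketch. This is not the contents of usf_cp_recon2_input, which isn't reproduced here. The function name cp_recon2_namelist, the file names and the sample namelist lines are all made up for illustration. The only things taken from the description above are that `recon2' is replaced with `coupled' and that run_target_end is overwritten with the Snr value.

```shell
#!/bin/sh
# Hypothetical stand-in for the copy step: rewrite a namelist file written by
# the Jnr reconfiguration so that it can be read by the [coupled] job.
#   $1 = namelist file from the Jnr reconfiguration
#   $2 = destination file for the coupled job
#   $3 = run_target_end value taken from the Snr SHARED file
cp_recon2_namelist() {
    sed -e 's/recon2/coupled/g' \
        -e "s/^ *run_target_end *=.*/ run_target_end=$3/" "$1" > "$2"
}

# Illustrative input only; the real namelist files hold many more entries
# and some_path is not a real UM variable.
workdir=$(mktemp -d)
cat > "$workdir/SHARED_recon2" <<'EOF'
&nlstcall
 run_target_end=0,0,0,6,0,0,
 some_path='output/recon2',
/
EOF

cp_recon2_namelist "$workdir/SHARED_recon2" "$workdir/SHARED_JNR" "0,0,1,0,0,0,"
cat "$workdir/SHARED_JNR"
```

The weakness described in the next paragraph is visible here: the rewritten file is only as up to date as the last Jnr reconfiguration that produced its input.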

Apart from clearly being a bodge, this has recently caused problems when trying to do a restart, or when modifying any of the Jnr variables and wrongly thinking that a Jnr recon isn't required - because without a Jnr recon the Jnr namelists aren't updated.

The other thing about these jobs is that they used an extra branch, vn10.2_take_version2, to modify the UM code for Jnr only, so that it read a different version of namelists, e.g. reading IDEALISE2 rather than IDEALISE, and some different environment variables, e.g. STASHMASTER2 rather than STASHMASTER.

Jobs for this have been created with ~mstringe/bin/rose/vn10.2/extraUm.scr, and include mi-ah650, mi-ai256 and mi-aj628.

2) One rose app for two UMs

Typically all components of a coupled run are contained within one app, and this ensures that all the namelist data are created at the same time and within the same work directory. We've been having meetings with Paul Creswell from the UM systems team and Andrew Clark and Matt Shin from the Rose team, and they were keen to have Snr and Jnr in one app, and they wanted one executable to be used for both Snr and Jnr (hence removing the vn10.2_take_version2 branch).

Paul coupled an N96 Snr to an N96 Jnr and showed that it was possible to use one executable for both. He added the `env' argument to the Jnr aprun option to change any environment variables which needed to be different for Jnr compared to Snr. He used the same namelist data for both Snr and Jnr, so this still left the problem of how to handle a variable having a different value for Snr compared to Jnr.
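
As a rough illustration of why the `env' argument works, the sketch below runs the same command twice, once with the job's exported variables and once through env. It doesn't use aprun or the UM at all: a small sh command stands in for the executable, and the values atmos, junio, 16 and 8 are just placeholders.

```shell
#!/bin/sh
# Stand-in for the UM executable: it only reports what it sees in its environment.
report='echo "RUNID=$RUNID UM_ATM_NPROCX=$UM_ATM_NPROCX"'

# Values exported for the job as a whole (placeholders, as Snr would see them).
export RUNID=atmos UM_ATM_NPROCX=16

snr_out=$(sh -c "$report")
# env changes the variables for this one command only.
jnr_out=$(env RUNID=junio UM_ATM_NPROCX=8 sh -c "$report")
# The job's own environment is untouched afterwards.
after_out=$(sh -c "$report")

echo "Snr sees:   $snr_out"
echo "Jnr sees:   $jnr_out"
echo "Afterwards: $after_out"
```

Because the override only applies to the command that env launches, the same executable can be started twice from one job with different settings, which is the point of Paul's test.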

A significant proportion of the variables for Snr and Jnr are likely to be different, as Jnr is designed to run the full chemistry scheme and Snr is not. As soon as Snr is moved to N216, which we want to do, the variables for Snr and Jnr in namelist SIZES will be very different. Matt suggested that two variables with the same name could be used in the same app if the category options {snr} and {jnr} are added to the namelist names.

I was able to show that this worked, in the job mi-ai467, which was created with ~mstringe/bin/rose/vn10.2/extraUm_1app.scr. Besides the many duplicate error messages, the main problem with this is that the addition of {snr} and {jnr} stops the STASH menu options from even being available, because the ->um option is missing from ->rose edit ->Metadata. The STASH options are fairly essential, so modifications to rose edit would need to be made for this to work (and the error messages should be removed even if they don't seem to affect anything).

A slight drawback of having both Snr and Jnr in the same app is that, when running either the Snr or the Jnr reconfiguration, the environment variables for the other UM still need defining if they appear in rose-app.conf, or the job will crash because they are not defined.

Ancillary environment variables

I'm not sure of this but it seems that - while the ancillary environment variables need to be defined (to avoid an undefined crash, I seem to remember?) - the ancillaries required for [coupled] are stored in atmos.xhist for Snr and junio.xhist for Jnr. If this is the case, and it seems to be, it isn't necessary to redefine all the ancillary environment variables. If it turned out that a few needed to be different for Snr or Jnr, this could be handled by adding the ancillary environment variable to the env arguments in the Jnr part of the aprun command.

Environment variables for rose-app.conf

Adding the {snr} and {jnr} category options caters for namelist variables having the same name for both Snr and Jnr. It doesn't cater for the environment variables, found under section [env] in app/coupled/rose-app.conf and app/jnr_um/rose-app.conf. These include INPUT_DATA, which either needs renaming in app/jnr_um/rose-app.conf or replacing with its actual value. Another variable is SPECTRAL_FILE_DIR, but it seems to be the same for both Snr and Jnr so hasn't been a problem so far.

Another variable which has been a problem is ASTART, although this appears lower down in the rose-app.conf files rather than being defined in the [env] section.

3) Two rose apps for 2 UMs and using the --install-only option

Matt dragged Ben Fitz into one of our regular meetings, which also included Richard and me, and Ben and Matt were able to suggest another option. This is to have two apps for Snr and Jnr, but include a command something like

rose task-run --app-key=jnr_um --install-only

in the [coupled] job near the start. (The command above is what I eventually came up with after some rough notes from Matt.) This command allows the namelist files for Jnr to be created without running the rest of the job. It avoids the bodge used in our previous two-rose-apps option, option 1), and the problems associated with that bodge. And it still uses much of the aprun work from the second option, while retaining the familiar UM app interface for both Snr and Jnr, and having the STASH menu available for both.

Jobs done this way include mi-al072 (N96<->N96) and mi-al137 (N216<->N96), and are created using ~mstringe/bin/rose/vn10.2/extraUm_instOnly.scr.

Jnr namelist files

It does mean that it's important that the namelist files for Jnr have a different name to those for Snr in app/jnr_um/rose-app.conf - I've added the extension _JNR to them.

The env arguments for [recon_jnr]

The environment variables which need changing for the Jnr reconfiguration are summarised by its entry in suite.rc

    [[recon_jnr]]
        inherit = None, CRAY
        pre-command scripting = . {{ANCIL_VERSIONS_JNR}}
        [[[environment]]]
            RUNID_JNR        = {{ RUNID_USR_JNR }}
            RECONA_FNAME     = RECONA_JNR
            SHARED_FNAME     = SHARED_JNR
            SIZES_FNAME      = SIZES_JNR
            ATMOS_LAUNCHER   = aprun
            RECON_EXEC       = um2-recon.exe
            ROSE_TASK_APP    = jnr_um

where RECONA_FNAME, SHARED_FNAME and SIZES_FNAME are used in the bin/um-recon script, and RECON_EXEC and ATMOS_LAUNCHER are used to alter some default options. RUNID_JNR is used in app/jnr_um/rose-app.conf to define astart, streqlog and some filenames.

The env arguments for Jnr in [coupled]

The environment variables which need changing for Jnr are currently set under [[ATMOS_JNR]] and collected in the environment variable ENV_JNR, as shown in the code below

    [[ATMOS_JNR]]
        inherit = None, UM_DIRECTIVES
        [[[environment]]]
            UM_ATM_NPROCX_JNR   = {{ATM_PROCX_JNR}}
            UM_ATM_NPROCY_JNR   = {{ATM_PROCY_JNR}}
            FLUME_IOS_NPROC_JNR = {{IOS_NPROC_JNR}}
            RUNID_JNR           = {{ RUNID_USR_JNR }}
            STDOUT_FILE_JNR     = "pe_output_jnr/junio.fort6.pe"
            SIZES_JNR           = SIZES_JNR
            ENV_JNR             = "ATMOS_JUNIOR=true UM_ATM_NPROCX=$UM_ATM_NPROCX_JNR UM_ATM_NPROCY=$UM_ATM_NPROCY_JNR IOS_NPROC=$FLUME_IOS_NPROC_JNR RUNID=$RUNID_JNR STDOUT_FILE=$STDOUT_FILE_JNR IDEALISE=IDEALISE_JNR IOSCNTL=IOSCNTL_JNR NAMELIST=NAMELIST_JNR STASHC=STASHC_JNR ERROR_FLAG=errflag_jnr HISTORY=$DATAM/$RUNID_JNR.xhist HISTORY_TEMP=thist_jnr HOUSEKEEP=hkfile_jnr"

The reason the environment variables RUNID_JNR, STDOUT_FILE_JNR and SIZES_JNR are also defined outside the ENV_JNR argument is that they are also required by bin scripts in the general environment.

The code below shows how the $ENV_JNR argument is put into the aprun argument for Jnr.

    [[coupled]]
        inherit = None, ATMOS_NEMOCICE
        pre-command scripting = "module load cray-netcdf-hdf5parallel; module load stat; . {{ANCIL_VERSIONS}}; rose task-run --app-key=jnr_um --install-only"
        [[[directives]]]
            -l select={{N_NODES}}:coretype={{CORE}}
        [[[environment]]]
            RUNID = {{ RUNID_USR }}
            RUNID_JNR = {{ RUNID_USR_JNR }}
            L_JNR = {{ RUN_JNR }}
            L_OCEAN = {{ RUN_OCEAN }}
            ROSE_TASK_APP = coupled
            CONTINUE      = $( \
                if [[ $CYLC_TASK_CYCLE_POINT == $CYLC_SUITE_INITIAL_CYCLE_POINT ]] && \
                [[ $CYLC_TASK_TRY_NUMBER -eq 1 ]]; then echo ""; else echo "true"; fi )
            LAUNCH_MPI_ATMOS="-n $ATM_NPROC -ss -d 1 -j 1"
            LAUNCH_MPI_ATMOS_JNR="-n $ATM_NPROC_JNR -d 1 -j 1 env $ENV_JNR"
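
Packing all the Jnr settings into one string relies on the shell splitting $ENV_JNR into separate NAME=value words when it is expanded without quotes. The sketch below shows that behaviour on its own, with a shortened, illustrative ENV_JNR and a small sh command standing in for the executable (no aprun involved).

```shell
#!/bin/sh
# Shortened, illustrative version of ENV_JNR.
ENV_JNR="ATMOS_JUNIOR=true RUNID=junio STASHC=STASHC_JNR"

# Left unquoted on purpose: the shell splits $ENV_JNR on whitespace, so env
# receives three separate NAME=value arguments followed by the command.
jnr_seen=$(env $ENV_JNR sh -c 'echo "$ATMOS_JUNIOR $RUNID $STASHC"')
echo "$jnr_seen"

# Quoted, the whole string reaches env as a single argument, so everything
# after the first "=" ends up as the value of ATMOS_JUNIOR.
quoted_seen=$(env "$ENV_JNR" sh -c 'echo "$ATMOS_JUNIOR"')
echo "$quoted_seen"
```

One consequence is that no value inside ENV_JNR can itself contain a space, since it would be split into a separate word.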

The scripts needed in bin

As with all the options, the perl script create_namcouple.pl is needed to create the namcouple file. All the options also require um-coupled_2um to submit and couple both UMs, although this script is different for each option. The scripts um2-recon, um-env-jnr and um-env-ocean are only needed for the first submit option. Dummy versions of these files are retained for submit options 2) and 3) because it means we can use the build branches vn10.2_snr_build_cfg and vn10.2_jnr_build_cfg for all options.

Problem with aprun when using both env and -ss options

For some unknown reason using the -ss option in the aprun command when the env argument is present causes the aprun command to fail. Hence, we've had to ditch the -ss option for Jnr for now for submit options 2) and 3).

Variables which should be shared between Snr and Jnr

I think the fields that should be shared are

  • Start and end times
    • BASIS/MODELBASIS/model_basis_time (integer)
    • RUNLEN/TASKEND/run_target_end (integer)
  • The coupler environment
    • COUPLER
  • The fields in [env] in rose-app.conf, e.g.
    • DATAM
    • DATAW
    • SPECTRAL_FILE_DIR (NOT for HISTORY, which becomes HISTORY_JNR for Jnr)
  • Coupling fields
    • oasis_couple_freq_ac, oasis_couple_freq_ca (BUT NOT oasis_couple_freq_ao, ...)
  • Dumping fields
    • i_dump_output (values 1,2,3)
    • dump_packim (values 1,2,3)
    • dump_frequency_units (values 1,2,3)
    • dumpfreqim (integer)

There are problems with sharing SPECTRAL_FILE_DIR:

[FAIL] {% set UM_UM_SPECTRAL_FILE_DIR=/home/h01/frum/vn10.2/ctldata/spectral/ga3_0 %}	<-- Jinja2Error

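There are two problems: the obvious Jinja error, and the fact that the value has been expanded to a fixed path, whereas we want it to stay as $UMDIR so that it is resolved on the HPC.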

If I try combining COUPLER there is also a problem

[INFO] install: suite.rc
[INFO] No suites unregistered.
[INFO] REGISTER mi-al072: /home/h06/mstringe/cylc-run/mi-al072
[INFO] symlink: /home/h06/mstringe/cylc-run/mi-al072 <= /home/h06/mstringe/.cylc/mi-al072
[FAIL] cylc validate -v --strict mi-al072 # return-code=1, stderr=
[FAIL] Jinja2Error:
[FAIL]   File "<template>", line 54, in top-level template code
[FAIL] TypeError: unsupported operand type(s) for -: 'StrictUndefined' and 'StrictUndefined'

I think the problems are probably the `/' in the SPECTRAL_FILE_DIR and `-' in OASIS3-MCT for COUPLER.

The general problems with sharing variables for Snr and Jnr by storing them in rose-suite.conf are

  • It breaks some of the trigger arguments in the main rose-meta.conf file (the trigger arguments don't recognise the values of environment variables, e.g. dumptimesim should only be available if i_dump_output == 3, but it is shown because i_dump_output=$I_DUMP_OUTPUT_UM_UM)
  • There are fail-if arguments which don't work because we've moved variables out of rose-app.conf into rose-suite.conf, such as the one for UM_UM_DUMP_FREQUENCY_UNITS
    fail-if=this == 1 and namelist:nlstcgen=steps_per_periodim % 24 != 0;
    #       =# Model steps must coincide with whole hours to use Hours unit
    #       =this == 2 and namelist:nlstcgen=steps_per_periodim % 24 != 0;
    #       =# Model steps must coincide with whole hours to use Days unit
    

Sharing OASIS_COUPLE_FREQ_AC and OASIS_COUPLE_FREQ_CA has also been a problem because the entries

  • In rose-suite.conf: UM_UM_OASIS_COUPLE_FREQ_AC=1,0
  • In suite.rc: OASIS_COUPLE_FREQ_AC_UM_UM = {{UM_UM_OASIS_COUPLE_FREQ_AC}}
  • In app/.../rose-app.conf: oasis_couple_freq_ac=$OASIS_COUPLE_FREQ_AC_UM_UM

have resulted in the following entries in SHARED

oasis_couple_freq_ac=(1, 0),
oasis_couple_freq_ao=1,0,
oasis_couple_freq_aw=0,0,
oasis_couple_freq_ca=(1, 0),

where the brackets cause an error when the namelist is read.

Given all these current problems, it probably isn't worth trying to share variables between Snr and Jnr at this stage. I've renamed my script that does this to ~/bin/rose/vn10.2/extraUm_instOnly_sharedVars.scr.

Follow-up

It's possible to get round most of these problems by putting quotes around the value, which means turning OASIS_COUPLE_FREQ_AC and OASIS_COUPLE_FREQ_CA into strings rather than integer arrays. And to preserve the $ in UM_UM_SPECTRAL_FILE_DIR='$UMDIR/vn10.2/ctldata/spectral/ga3_0' I've needed to add a backslash.

Sharing the dump variables gets very complicated, because so many triggers are involved. For example, nlstcgen=l_meaning_sequence is triggered by the value of nlstcgen=i_dump_output, and the filenames in the meaning will want different names for Snr and Jnr (and possibly different settings), so we don't want these variables shared between Snr and Jnr. Maybe share the other variables, but don't share dump variables for now.

Do we need to identify Snr/Jnr/Neither within rose-app.conf?

Should I have some environment variable, such as ATM_COMP, which identifies the UM in Rose as either Snr, Jnr or neither of these? The rose-meta.conf entry would be something like

[env=ATM_COMP]
compulsory=true
description=Select component
ns=env/coupled
trigger=namelist:coupling_control=l_overw_jnr2snr: senior;
       =namelist:coupling_control=l_overw_snr2jnr: junior;
       =namelist:coupling_control=l_strip_ukca: senior;
value-titles=Normal, Senior, Junior
values=, senior, junior

The ATM_COMP option seems to work well, so I've gone with this.

Running hybrid model again in the same directory

This has been a problem because Snr has started from timestep 0, as it should, but Jnr has started from where it last finished. Consequently, Jnr finishes its first cycle almost immediately and Snr is left waiting for data to be passed from Jnr and eventually it reaches its wallclock limit.

I believe the problem is that the history file for Jnr is not being deleted, so I'll add some code in for this.
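
A minimal sketch of what that clean-up might look like is given below. It is not the code that went into the job: the function name remove_stale_jnr_history is made up, and it assumes that the Jnr history file is $DATAM/$RUNID_JNR.xhist (as set through HISTORY in ENV_JNR) and that CONTINUE is empty for a new run and "true" otherwise (as set in [[coupled]]). A temporary directory and the placeholder value junio stand in for the real settings.

```shell
#!/bin/sh
# Hypothetical helper: remove the Jnr history file at the start of a new run,
# but leave it alone when the job is continuing from a previous cycle.
remove_stale_jnr_history() {
    if [ -z "$CONTINUE" ]; then
        rm -f "$DATAM/$RUNID_JNR.xhist"
    fi
}

# Illustrative set-up: a temporary directory in place of the real $DATAM.
DATAM=$(mktemp -d)
RUNID_JNR=junio
touch "$DATAM/$RUNID_JNR.xhist"

# Continuation run: the history file must survive.
CONTINUE=true
remove_stale_jnr_history
[ -f "$DATAM/$RUNID_JNR.xhist" ] && kept_on_continue=yes || kept_on_continue=no

# New run in the same directory: the old history file is removed.
CONTINUE=""
remove_stale_jnr_history
[ -f "$DATAM/$RUNID_JNR.xhist" ] && kept_on_new_run=yes || kept_on_new_run=no

echo "kept on continuation: $kept_on_continue, kept on new run: $kept_on_new_run"
```

Tying the removal to CONTINUE means the history file is only discarded in the case described above, where the whole run is started again from timestep 0.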