Draft CF data model
Proposed version 0.8
In CF trac ticket 88, proposed by Mark Hedley and accepted on 5th August
2012, it has been decided that CF should adopt a data model. The data
model will be a logical abstraction of the concepts of CF data and
metadata, and the relationships that exist between these concepts, but
will not define an application programming interface (API) for CF.
Adopting a data model is believed to offer the following benefits:
- Providing an orientation guide to the CF Conventions Document.
- Guiding the development of software compatible with CF.
- Facilitating the creation of an API which "behaves" or "feels" like
CF and is intuitive to use.
- Providing a reference point for gap analysis and conflict analysis of
the CF specification.
- Providing a communication tool for discussing CF concepts and
proposals for changes to the CF specification.
- Setting the groundwork to expand CF beyond netCDF files.
The present document proposes a data model corresponding to the CF
metadata standard (version 1.5). The data model avoids prescribing
more than is needed for interpreting CF as it stands, in order to
avoid inconsistency with future developments of CF. This document is
illustrated by the accompanying UML diagram of the data model.
As well as describing the CF data model, this document also
comments on how it is implemented in netCDF. Since the CF data model
could be implemented in file formats other than netCDF, it would be
logically better to put the information about CF-netCDF in a separate
document, but when introducing the data model for the first time, we
feel that this document would be harder to understand if it omitted
reference to the netCDF information. We propose that these functions
should be separated in a later version of the data model. Some parts
of the CF standard arise specifically from the requirements or
restrictions of the netCDF file format, or are concerned with
efficient ways of storing data on disk; these parts are not logically
part of the data model and are only briefly mentioned in this
document.
In this document, we use the word "construct" because we feel
it to be a more language-neutral term than "object" or "structure".
The constructs of this data model might correspond to objects in an OO
language.
Field construct
The central concept of the data model is a field construct. In
a dataset contained in a single netCDF file, each data variable
usually corresponds to a field construct, but a field construct might
be a combination of several data variables. In a dataset comprising
several netCDF files, a field construct may span data variables in
more than one file, for instance from different ranges of a time
coordinate (to be introduced by Gridspec in CF version 1.7). Rules for
aggregating data variables from one or several files into a single
field construct are needed but are not defined by CF version 1.5; such
rules are regarded as the concern of data processing software.
This data model makes a central assumption that each field
construct is independent. Data variables stored in CF-netCDF files are
often not independent, because they share coordinate variables.
However, we view this solely as a means of saving disk space, and we
assume that software will be able to alter any field construct in
memory without affecting other field constructs. For instance, if the
coordinates of one field construct are modified, it will not affect
any other field construct. Explicit tests of equality will be required
to establish whether two data variables have the same
coordinates. Such tests are necessary in general if CF is applied to a
dataset comprising more than one file, because different variables may
then reside in different files, with their own coordinate
variables. In a netCDF file, tests for the equality of coordinates
between different data variables may be simplified if the data
variables refer to the same coordinate variable.
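The independence assumption can be illustrated with a minimal Python sketch. The dictionaries and the helper names make_field and same_coordinates are hypothetical stand-ins invented for this illustration; they are not part of CF or of any library. Each field takes a private copy of a coordinate that is shared on disk, so equality has to be tested explicitly by content.

```python
import copy

# Hypothetical in-memory stand-in for a coordinate variable shared by
# two data variables in one netCDF file.
shared_time = {"standard_name": "time", "values": [0.0, 1.0, 2.0]}

def make_field(name, coord):
    """Build a field that owns a private copy of its coordinate."""
    return {"name": name, "coords": {"time": copy.deepcopy(coord)}}

def same_coordinates(field_a, field_b):
    """Explicit equality test: compare coordinate contents, not identity."""
    return field_a["coords"] == field_b["coords"]

temperature = make_field("air_temperature", shared_time)
pressure = make_field("air_pressure", shared_time)
equal_before = same_coordinates(temperature, pressure)

# Altering one field's coordinates leaves the other field untouched.
temperature["coords"]["time"]["values"][0] = -1.0
equal_after = same_coordinates(temperature, pressure)
```

Software reading a single netCDF file could shortcut the comparison when both data variables name the same coordinate variable, as noted above, but the in-memory constructs would still be separate copies.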
Each field construct may have
- An ordered list of zero or more domain axis
constructs.
- A data array whose shape is determined by the domain axes in
the order listed, optionally omitting any domain axes of size
one. If there are no domain axes of greater size than one, the
data array may be a scalar. If there are no domain axes then the data
array must be a scalar. Domain axes of size one can be omitted
because their position in the order of domain axes makes no
difference to the order of data elements in the array. The
elements of the data array must all be of the same data type,
which may be numeric, character or string.
- An unordered collection of dimension coordinate constructs.
- An unordered collection of auxiliary coordinate constructs.
- An unordered collection of cell measure constructs.
- A cell methods construct, which refers to the domain
axes (but not their sizes).
- An unordered collection of coordinate reference
constructs.
- Other properties, which are metadata that do not refer to
the domain axes, and serve to describe the data the field
contains. Properties may be of any data type (numeric, character
or string) and can be scalars or arrays. They are attributes in
the netCDF file, but we use the term "property" instead because
not all CF-netCDF attributes are properties in this sense.
- A list of ancillary fields. This corresponds to the
CF-netCDF ancillary_variables attribute, which identifies
other fields that provide metadata.
All the components of the field construct bar the data array are
optional.
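The rule relating the data array's shape to the ordered domain axes can be sketched as follows. This is only an illustration under the assumption that a scalar is represented by the empty shape (); the function name permitted_shapes is hypothetical.

```python
from itertools import combinations

def permitted_shapes(axis_sizes):
    """Return every data array shape allowed for the ordered axis sizes.

    Each axis of size one may be kept or dropped; all other axes are
    kept, and the relative order of the axes never changes.
    """
    ones = [i for i, n in enumerate(axis_sizes) if n == 1]
    shapes = set()
    for k in range(len(ones) + 1):
        for dropped in combinations(ones, k):
            shapes.add(tuple(n for i, n in enumerate(axis_sizes)
                             if i not in dropped))
    return shapes

# A field with a size-one time axis and a 73 x 96 horizontal grid.
example = permitted_shapes([1, 73, 96])
```

For the example, the data array may have shape (1, 73, 96) or (73, 96). With no domain axes at all, the only permitted shape is the scalar ().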
Collectively, the domain axis, dimension coordinate, auxiliary
coordinate, cell measure and cell methods constructs describe
the domain in which the data resides. Thus a field construct
can be regarded as a domain with data in that domain.
The CF-netCDF formula_terms (see also coordinate
reference constructs) and
ancillary_variables attributes make links between field constructs.
These links are fragile.
If a field construct is written to a file, it is not required that any
other field constructs to which it is linked are also written to the file.
If an operation alters one field
construct in a way which could invalidate a relationship with another field
construct, the link should be broken. The user of software will have to be
aware of these relationships and remake them if applicable and useful.
Domain axis construct
A domain axis construct must contain
- A size (an integer greater than zero), which can be equal
to one.
Dimension coordinate construct
A dimension coordinate construct indicates the physical meaning and
locations of the cells for a unique domain axis of the field.
A dimension coordinate construct may contain
- A scalar or one-dimensional numerical coordinate array of
the size specified for the domain axis. The elements of the
coordinate array must all be of the same numeric data type,
must all have different non-missing values, and must be
monotonically increasing or decreasing. Dimension coordinate
constructs cannot have string-valued coordinates. In this data
model, a CF-netCDF string-valued coordinate variable or
string-valued scalar coordinate variable corresponds to an
auxiliary coordinate construct (not a dimension coordinate
construct), with a domain axis which is not associated with a
dimension coordinate construct.
- A two-dimensional boundary coordinate array, whose
slow-varying (second in Fortran) dimension equals the size
specified by the domain axis construct, and whose fast-varying
dimension is two, indicating the extent of the cell. For
climatological time dimensions, the bounds are interpreted in a
special way indicated by the cell methods.
- Properties (in the same sense as for the field construct) serving
to describe the coordinates.
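The array rules listed above could be checked along the following lines. This is a sketch under stated assumptions: values are held in plain Python lists, missing values are not represented at all, and the function name is_valid_dimension_coordinate is hypothetical.

```python
def is_valid_dimension_coordinate(values, bounds=None):
    """Check the array rules for a dimension coordinate.

    `values` is a non-empty list of numbers; `bounds`, if given, is a
    list of [lower, upper] pairs with one pair per coordinate value.
    """
    if not values:
        return False
    # Coordinates must be numeric (string values are not permitted).
    if not all(isinstance(v, (int, float)) and not isinstance(v, bool)
               for v in values):
        return False
    # Strict monotonicity also guarantees that all values are distinct.
    pairs = list(zip(values, values[1:]))
    increasing = all(a < b for a, b in pairs)
    decreasing = all(a > b for a, b in pairs)
    if not (increasing or decreasing):
        return False
    # Bounds: one cell per coordinate value, two vertices per cell.
    if bounds is not None:
        if len(bounds) != len(values):
            return False
        if any(len(cell) != 2 for cell in bounds):
            return False
    return True
```

A string-valued array such as ["atlantic", "pacific"] fails the check, which is consistent with treating it as an auxiliary coordinate construct instead.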
In this data model we permit a domain axis not to have a coordinate
array if there is no appropriate numeric monotonic coordinate. That is
the case for a dimension that runs over ocean basins or area types,
for example, or for a domain axis that indexes timeseries at scattered
points. Such domain axes do not correspond to a continuous physical
quantity. (They will be called index dimensions in CF version
1.6.)
Auxiliary coordinate construct
An auxiliary coordinate construct provides auxiliary information for
interpreting the cells of an ordered list of one or more domain
axes of the field.
An auxiliary coordinate construct must contain
- A coordinate array whose shape is determined by the domain axes in
the order listed, optionally omitting any domain axes of size one. The
elements of the coordinate array must all be of the same data type
(numeric, character or string), but they do not have to be distinct or
monotonic. Missing values are not allowed (in CF version 1.5).
and may also contain
- A boundary coordinate array with all the dimensions, in the same
order, as the coordinate array, and a fastest-varying dimension (first
dimension in Fortran) equal to the number of vertices of each cell.
- Properties serving to describe the coordinates.
Auxiliary coordinate constructs correspond to auxiliary coordinate
variables named by the coordinates attribute of a data
variable in a CF-netCDF file. CF recommends that auxiliary
coordinate constructs of latitude and longitude be provided if there is
two-dimensional horizontal variation but the horizontal coordinates
are not latitude and longitude. As for dimension coordinate constructs,
auxiliary coordinate constructs of different field constructs are
independent in the data model.
Cell measure construct
A cell measure construct provides information about the size, shape or
location of the cells defined by an ordered list of one or
more domain axes of the field.
A cell measure construct may contain
- Properties to describe itself.
and must contain
- A measure property, which indicates which metric of the space
it supplies e.g. cell areas.
- A units property consistent with the measure property
e.g. m2.
- A numeric array of metric values whose shape is determined by the
domain axes in the order listed, optionally omitting any domain
axes of size one. The elements of the array must all be of the same data type. It
is assumed that the metric does not depend on any of the domain
axes of the field which are not specified, along which the values
are implicitly propagated.
In CF-netCDF files, cell measure constructs correspond to variables
named by the cell_measures attribute of the data variable.
As for dimension coordinate constructs, cell measure constructs of
different field constructs are independent in the data model.
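The implicit propagation of a metric along unspecified domain axes can be sketched as a lookup that ignores the axes the cell measure does not span. The axis names, the nested-list representation and the function name measure_value are assumptions made for this illustration only.

```python
def measure_value(measure_axes, measure_array, field_axes, index):
    """Return the cell measure that applies to one element of a field.

    `field_axes` names the field's domain axes in order and `index` gives
    a position along each of them. Only the positions along
    `measure_axes` are used, so the metric is the same at every position
    along the remaining axes.
    """
    position = dict(zip(field_axes, index))
    value = measure_array
    for axis in measure_axes:
        value = value[position[axis]]
    return value

# Cell areas (m2) defined over (lat, lon) only; invented numbers.
area = [[1.0, 2.0],
        [3.0, 4.0]]
field_axes = ("time", "lat", "lon")
first_step = measure_value(("lat", "lon"), area, field_axes, (0, 1, 0))
later_step = measure_value(("lat", "lon"), area, field_axes, (5, 1, 0))
```

Both lookups return the same area because the time axis is not among the axes of the cell measure.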
Cell methods construct
The cell methods construct describes how the data values represent
variation of the quantity within cells. It corresponds to
the cell_methods attribute of the data variable in CF-netCDF
files. It is an ordered list, because the methods specified are not
necessarily commutative. Each entry of the list specifies either one
or more dimensions, or a CF standard name (to describe variation with
respect to a quantity which is not recorded as a dimension of the
field), and a method e.g. mean (CF Appendix E). Special
methods indicate climatological time processing.
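As an illustration of the ordered-list structure, a simple cell_methods string could be parsed as sketched below. The sketch assumes only the basic "name: method" form, with several names allowed before one method; qualifiers such as "within", "over" and "where", and parenthesised comments, are deliberately not handled. The function name parse_cell_methods is hypothetical.

```python
def parse_cell_methods(text):
    """Parse a simple cell_methods string into an ordered list of entries.

    Each entry records the dimension names (or standard name) it applies
    to and the method. Order is preserved because the methods are not
    necessarily commutative.
    """
    entries = []
    names = []
    for token in text.split():
        if token.endswith(":"):
            names.append(token[:-1])
        else:
            entries.append({"axes": names, "method": token})
            names = []
    return entries

parsed = parse_cell_methods("time: mean lat: lon: maximum")
```

Here the time mean is applied first and the maximum over lat and lon second; reversing the two entries would describe a different quantity.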
Coordinate reference construct
A coordinate reference construct relates the field's coordinate values
to locations in a planetary reference frame.
The field's domain may contain various coordinate systems, each of
which is constructed from a subset of the field's coordinate
constructs. For example, the domain of a four-dimensional (X-Y-Z-T)
field may contain horizontal, vertical and temporal coordinate
systems. There may be more than one of each of these, if there is more
than one coordinate construct applying to a particular spatiotemporal
dimension (for example, there could be both latitude-longitude and
projection X-Y horizontal coordinate systems). In general, a
coordinate system may be constructed from any subset of the coordinate
constructs, yet the data model does not require coordinate constructs
to be explicitly or exclusively associated with any coordinate system.
Each of the field's coordinate systems can optionally be associated
with a coordinate reference construct. This contains the dimension or
auxiliary coordinate constructs to which it applies and provides
additional information that is not contained within the coordinate
system's dimension or auxiliary coordinate constructs.
Contents of a coordinate reference construct
A coordinate reference construct contains:
- The field's dimension and auxiliary coordinate constructs that
define the coordinate system to which the coordinate reference
construct applies.
- The coordinate values are not relevant to the coordinate
reference construct, only their properties.
- Zero or one definitions of a datum, defining the zeros of the
dimension and auxiliary coordinate constructs which define the
coordinate system. The datum may be explicitly indicated via
parameters, or it may be implied by the metadata of the contained
dimension and auxiliary coordinate constructs.
- The datum may contain the definition of a geophysical surface
which corresponds to the zero of a vertical coordinate
construct, and this may be required for both horizontal and
vertical coordinate systems.
- Zero or one coordinate conversions, which define a formula for
converting coordinate values taken from the dimension or auxiliary
coordinate constructs to a different coordinate system. The
conversion formula's definition may comprise scalars (which may be
descriptive strings or may have units); any dimension or auxiliary
coordinate constructs of the field; or other field constructs.
- In the case of horizontal (X-Y) coordinates this conversion is
either a map projection, which converts between Cartesian
coordinates and spherical or ellipsoidal coordinates on the
vertical datum, or a conversion between different spherical
coordinate systems (as in the case of rotated-pole
coordinates) or different ellipsoidal coordinate systems.
In the case of vertical (Z) coordinates the conversion is
between a dimensionless coordinate construct and a dimensional
coordinate construct (such as height, depth or pressure),
again with respect to the vertical datum.
Only some parts of the coordinate reference construct's definition may
be relevant to any given dimension or auxiliary coordinate construct
contained within it. The relevant parts are determined by inspection. For
example, for a coordinate reference construct which contains
horizontal projection, latitude and longitude coordinate constructs, a
datum comprising a reference ellipsoid would apply to all of them, but
projection parameters would only apply to projection coordinate
constructs.
- In CF-netCDF, the additional information of a coordinate reference
construct that is not found in the dimension and auxiliary
coordinate constructs is stored in a grid mapping variable or a
formula_terms coordinate attribute, for horizontal or
vertical coordinate variables respectively. Although these two
cases are arranged differently in CF-netCDF, each one contains a
datum or a coordinate conversion formula (or both) and so may be
mapped to a coordinate reference construct.
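As an example of a vertical coordinate conversion, the sketch below applies the atmosphere sigma coordinate formula of CF Appendix D, p = ptop + sigma * (ps - ptop), at a single horizontal location. The function and argument names are hypothetical, and the numbers are invented for illustration.

```python
def sigma_to_pressure(sigma, surface_pressure, top_pressure):
    """Convert dimensionless sigma levels to pressure at one location.

    Implements p(k) = ptop + sigma(k) * (ps - ptop), where ps is the
    surface pressure and ptop the pressure at the top of the model.
    """
    return [top_pressure + s * (surface_pressure - top_pressure)
            for s in sigma]

# Three levels from the surface (sigma = 1) to the model top (sigma = 0).
pressures = sigma_to_pressure([1.0, 0.5, 0.0], 100000.0, 0.0)
```

In the data model, sigma would come from a dimension coordinate construct, the surface pressure from another field construct, and the top pressure from a scalar, which matches the three kinds of input a conversion formula may draw on.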
Relationship to ISO 19111 coordinate reference systems
The coordinate reference construct is closely related to the concept
of a coordinate reference system (CRS) as used by the ISO 19111
definition of geographic information systems. A CRS also anchors
coordinate values to the real world (or some other reference frame)
and consists, in general, of three pieces of information:
- A reference point, line or surface defining the location(s) where coordinate values are zero, i.e. the datum.
- An indicator of the direction away from this datum (such as "up", "east", "north").
- A unit of measure that relates the coordinate values to the distance from the datum.
In the CF data model all the information required to construct a CRS
is present, although the information may be spread across a number of
CF constructs. This partitioning is partly for historical reasons and
partly for the convenience of applications that may wish to consume
CF-compliant data without the need to understand a full CRS. For
example, in the CF data model the direction and unit of measure parts
of the CRS are typically defined as part of the coordinate
constructs. The remaining information required to construct the CRS
(i.e. any required datums and coordinate conversions) is provided by
the coordinate reference construct.
Other properties
The other properties recognised by this CF data model correspond to attributes
listed in CF Appendix A.
For field constructs, the allowed properties are
comment,
history,
institution,
long_name,
references,
source,
standard_error_multiplier,
standard_name,
title,
units.
Some of these can be global attributes in a CF-netCDF file.
In this data model, it is assumed that any relevant global attribute
is also an attribute of every data variable, although it is superseded
if the data variable has its own attribute.
Each field construct in the model has its own independent set of properties.
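The treatment of global attributes can be sketched as a merge in which the data variable's own attributes take precedence. The set GLOBAL_CAPABLE is an assumption about which of the field properties listed above may appear as global attributes; the function name field_properties and the attribute values are hypothetical.

```python
# Assumption: the field properties that may also be global attributes.
GLOBAL_CAPABLE = {"comment", "history", "institution",
                  "references", "source", "title"}

def field_properties(global_attrs, variable_attrs):
    """Return an independent property dictionary for one field."""
    # Start from the relevant file-level attributes ...
    props = {k: v for k, v in global_attrs.items() if k in GLOBAL_CAPABLE}
    # ... then let the data variable's own attributes supersede them.
    props.update(variable_attrs)
    return props

file_attrs = {"Conventions": "CF-1.5",
              "institution": "Example Centre",
              "source": "model A"}
tas = field_properties(file_attrs, {"standard_name": "air_temperature",
                                    "source": "model B"})
pr = field_properties(file_attrs, {"standard_name": "precipitation_flux"})

# Each field owns its properties: changing one does not affect the other.
tas["institution"] = "Another Centre"
```

Conventions is dropped because it has a structural function in the file rather than being a property of the field.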
For dimension and auxiliary coordinate constructs, the allowed properties are
axis,
calendar,
leap_month,
leap_year,
long_name,
month_lengths,
positive,
standard_name,
units.
Coordinate constructs of time are optionally climatological;
this property is indicated by the presence of the climatology
attribute.
In any field, any given value of the axis attribute can occur
no more than once among all the dimension and auxiliary coordinates of
that field.
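The uniqueness rule for the axis attribute could be checked as sketched below, assuming each coordinate is represented by a dictionary of its properties. The function name duplicated_axis_values is hypothetical.

```python
from collections import Counter

def duplicated_axis_values(coordinates):
    """Return the axis values that occur more than once in one field.

    `coordinates` holds the property dictionaries of all the dimension
    and auxiliary coordinates of the field. Coordinates without an axis
    property are ignored.
    """
    counts = Counter(c["axis"] for c in coordinates if "axis" in c)
    return sorted(value for value, n in counts.items() if n > 1)

valid = [{"standard_name": "time", "axis": "T"},
         {"standard_name": "latitude", "axis": "Y"},
         {"standard_name": "longitude", "axis": "X"},
         {"standard_name": "forecast_reference_time"}]
invalid = valid + [{"standard_name": "projection_x_coordinate", "axis": "X"}]
```

An empty result means the field satisfies the rule; for the second list the result names the axis value that is used twice.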
The CF data model allows field, dimension and auxiliary coordinate
constructs to have other properties not defined by CF, provided they do
not conflict with CF, but since they are not part of the CF standard,
the data model does not provide any interpretation of them.
The attributes valid_max, valid_min and valid_range of data variables
and coordinate variables are checks on the validity of the values,
which could be verified on input and written on output. In this CF data
model we assume they do not constrain any manipulations which might be
done on the data in memory, and they are not part of the data model.
The attributes _FillValue and missing_value of data variables specify
how missing data is indicated in the data array. This data model
supports the idea of missing data, but does not depend on any
particular method of indicating it, so these attributes are not part of
the model.
The attributes add_offset, compress, flag_masks, flag_meanings,
flag_values and scale_factor are all used in methods of compressing the
data to save space in CF-netCDF files, with or without loss of
information. They are not part of this data model because these
operations do not logically alter the data, except that the compress
attribute implies two alternative interpretations of coordinates
(compressed or uncompressed).
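For the packing case, the usual CF-netCDF relationship is unpacked = packed * scale_factor + add_offset. The sketch below, with hypothetical function names and invented numbers, shows why the data model can ignore these attributes: the field holds the unpacked values, and the packed integers are only a storage detail.

```python
def pack(values, scale_factor=1.0, add_offset=0.0):
    """Pack floating-point values into integers for storage."""
    return [round((v - add_offset) / scale_factor) for v in values]

def unpack(packed, scale_factor=1.0, add_offset=0.0):
    """Recover the logical data values from their packed form."""
    return [p * scale_factor + add_offset for p in packed]

temperatures = [273.15, 283.15]          # logical values held by the field
stored = pack(temperatures, scale_factor=0.01, add_offset=273.15)
restored = unpack(stored, scale_factor=0.01, add_offset=273.15)
```

The restored values agree with the originals to within the precision set by scale_factor, which is the sense in which packing may lose information without changing what the data logically represent.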
The "feature type" attribute and associated new conventions,
to be introduced in CF version 1.6,
will provide a way of packing multiple
fields of the same kind of discrete sampling geometry
(timeseries, trajectories, etc.) into a single CF-netCDF data variable,
in order to save space, since a multidimensional representation with
common coordinate variables is typically very wasteful in such cases.
This is a kind of compression. The data model would regard each instance
of the feature type as an independent field construct.
However, the "feature type" attribute itself is also a metadata property
that would be a property of the field construct and part of the data model.
The attributes bounds, cell_measures, cell_methods, climatology,
Conventions, coordinates, formula_terms and grid_mapping have various
special or structural functions in the CF-netCDF file format. Their
functions and the relationships they indicate are reflected in the
structure of this data model, and these attributes do not correspond
directly to properties in the data model.
9th September 2014
Version 0.7 of 17th December 2012
Version 0.6 of 12th December 2012
Version 0.5 of 16th October 2012
Version 0.4 of 5th August 2012
Version 0.3 of 6th February 2012
Version 0.2 of 1st August 2011
Original version 0.1 of 10th January 2011
Jonathan Gregory,
David Hassell, Mark Hedley, Jon Blower