Making the World Climate Research Programme’s CMIP6 accessible via Eratos

Monica Liu
May 22, 2023

Updated: May 4, 2024

“Climate models are one of the primary means for scientists to understand how the climate has changed in the past and may change in the future. These models simulate the physics, chemistry and biology of the atmosphere, land and oceans in great detail, and require some of the largest supercomputers in the world to generate their climate projections.”

- Carbon Brief

CMIP stands for the Coupled Model Intercomparison Project and has been operating since 1995. As part of the World Climate Research Programme (WCRP), CMIP is currently in its sixth phase and coordinates climate model experiments across international modelling teams worldwide. The main goal of CMIP is to advance the scientific understanding of Earth, a goal the team at Eratos could not be more on board with!

However, as is the case with many large and complex data models, accessing them is not always simple, let alone applying them. That’s why a small group at Eratos, led by Monica Liu, decided to map CMIP6’s ontology to Eratos, making CMIP6 climate models simpler to access and use, and helping us transform the world by understanding it. The applications are diverse, ranging from government, industry and smart cities to research and beyond. This article explains the process Monica and the team undertook.

High-level Ontology Mapping

To begin creating the underlying ontology for CMIP datasets in Eratos, our team had to create a high-level mapping of the important CMIP6 labels into core Eratos concepts.

Why we created a high-level ontology map

It was important to do an initial high-level mapping to ensure that we included all of the Earth System Grid Federation (ESGF) CMIP6 labels in our discovery to cover every label type that’s used in the CMIP6 datasets. This step allowed us to map or create any required concept in Eratos to be sure we were covering every detail that was necessary to fully describe the datasets. It also encouraged us to complete preliminary research into each CMIP6 label and gather the necessary disparate resources.

Unit and variable creation: building context from the ground up

In order to build out the CMIP6 ontology, the base units and variables were first analysed and created as Eratos ‘units’ and ‘variables’. The CMIP6-specific variables are available through a platform provided by the ‘World Climate Research Programme’ shown in the image below. This tool allowed us to export the list of CMIP6 variable objects.

CMIP6 Data Request (Variable Search)

Each variable contains an ID, a CMIP6-specific key, the Climate and Forecast (CF) standard name (used across a majority of climate datasets), the unit of measurement (or scale factor/ratio) and a description. Additionally, some of the CMIP6-specific keys map directly to variables in the Coordinated Regional Climate Downscaling Experiment (CORDEX), a WCRP framework that many climate scientists adhere to. Following these observations, we realised we would also need to link the CMIP6 variables to their CF standard names and descriptions, as well as to any matching CORDEX variables, for extra context.

CF Standard Name Table

CORDEX Variables

Our research then led us to create the below dependency mapping for the CMIP6 variables.

Eratos allows for intelligent linking of resources through the use of unique ‘Eratos Resource Names’ (ERNs), so we began by creating the individual units needed for each variable. By interrogating the units included in the CMIP6 variable export, we found 67 unique units or scale factors used for measurement. Many resources were used to define the International System of Units (SI) base and derived units (see 'Unit references' at the bottom of the page).
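As a rough illustration of the unit count described above (not the team's actual code), distinct unit strings can be collected from a variable export with a set. The CSV layout and column names below are assumptions made for the example; the four rows are a tiny stand-in for the real export.

```python
import csv
import io

# Illustrative stand-in for the CMIP6 variable export.
# The column names are assumptions, not the export's real headers.
EXPORT_CSV = """variable,cf_standard_name,units
tas,air_temperature,K
ts,surface_temperature,K
pr,precipitation_flux,kg m-2 s-1
clt,cloud_area_fraction,%
"""


def unique_units(csv_text: str) -> list[str]:
    """Return the distinct unit strings found in the export, sorted."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return sorted({row["units"].strip() for row in rows})


units = unique_units(EXPORT_CSV)
print(len(units), units)
```

Each distinct string would then become one Eratos ‘unit’ before any variable that depends on it is created.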

After creating the dependent units in Eratos, we created the CMIP6 variables and linked each one back to its CF standard name, its description and, where one exists, its CORDEX key. To do this we used the CMIP6 data request variable export and queried the CF conventions standard name table and the CORDEX variables requirement table.
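The sketch below shows one way the units-first, variables-second linking could be expressed. It is an illustration only: `make_ern` is a hypothetical identifier scheme (not the real ERN format), `VariableRecord` and `build_variable` are made-up names, and treating a CORDEX match as "same key in both lists" is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Optional


def make_ern(kind: str, key: str) -> str:
    """Hypothetical identifier scheme, NOT the real Eratos ERN format."""
    return f"ern:example:{kind}:{key.replace(' ', '_')}"


@dataclass(frozen=True)
class VariableRecord:
    ern: str
    cmip6_key: str
    cf_standard_name: str
    unit_ern: str
    description: str
    cordex_key: Optional[str] = None


def build_variable(row: dict, unit_erns: dict, cordex_keys: set) -> VariableRecord:
    """Link one export row to its unit, CF name and optional CORDEX key."""
    # Units are created first, so an unknown unit raises KeyError
    # instead of silently producing an unlinked variable.
    unit_ern = unit_erns[row["units"]]
    key = row["variable"]
    return VariableRecord(
        ern=make_ern("variable", key),
        cmip6_key=key,
        cf_standard_name=row["cf_standard_name"],
        unit_ern=unit_ern,
        description=row["description"],
        # Assumption: a CORDEX match means the same key appears in both lists.
        cordex_key=key if key in cordex_keys else None,
    )


# Step 1: create the units. Step 2: create variables that reference them.
unit_erns = {u: make_ern("unit", u) for u in ["K", "kg m-2 s-1"]}
row = {
    "variable": "tas",
    "cf_standard_name": "air_temperature",
    "units": "K",
    "description": "Near-surface air temperature",
}
record = build_variable(row, unit_erns, cordex_keys={"tas", "pr"})
print(record)
```

Failing loudly on a missing unit mirrors the dependency order in the text: a variable cannot be described until its unit exists.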

By linking contextual data we built a knowledge base agnostic of naming convention

In order to begin building out the ontology in Eratos, we needed to start with the lowest level of dependent data for the CMIP6 datasets, the units and variables, and work our way up.

As a result, we were able to identify all of the described CMIP6 variables and their corresponding units, both of which are needed to describe the datasets and the types of measurement or scaling factors they use.

This step also allows us to build further knowledge and context into the provided CMIP6 variables by incorporating additional resources.

The CF standard name conventions and the CORDEX variables are both widely used in the climate sector, so it’s important to link them with context. That way, a climate scientist only needs to know one naming convention to find the dataset they’re looking for through Eratos.
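To make the "know one name, find the dataset" idea concrete, here is a small sketch of a lookup index keyed by every known name of a variable. The function name and record layout are hypothetical, and this is not a description of how Eratos search works internally. Note that one CF standard name can correspond to several CMIP6 variables, so each name maps to a list.

```python
from collections import defaultdict


def build_name_index(variables: list[dict]) -> dict[str, list[str]]:
    """Map every known name (CMIP6 key, CF name, CORDEX key) to CMIP6 keys."""
    index = defaultdict(list)
    for var in variables:
        # A set avoids double entries when two conventions share a name.
        names = {var["cmip6_key"], var["cf_standard_name"], var.get("cordex_key")}
        names.discard(None)
        for name in names:
            index[name].append(var["cmip6_key"])
    return dict(index)


VARIABLES = [
    {"cmip6_key": "tas", "cf_standard_name": "air_temperature", "cordex_key": "tas"},
    {"cmip6_key": "tasmax", "cf_standard_name": "air_temperature", "cordex_key": "tasmax"},
    {"cmip6_key": "pr", "cf_standard_name": "precipitation_flux", "cordex_key": "pr"},
]

index = build_name_index(VARIABLES)
print(index["air_temperature"])  # every CMIP6 variable sharing this CF name
print(index["pr"])
```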

Dependency mapping of CMIP6 Institution ID, Source ID, Experiment ID and Sub-experiment ID

Following the creation of CMIP6 variables in Eratos, we proceeded to create a dependency map from the CMIP6 source and experiment IDs found under the CMIP6 controlled vocabularies GitHub repo. (See screenshot below of the dependency map.)

(For a more detailed and interactive view of the above image, see our mapping here.)

From the diagram, we were able to create the dependent context for each source and experiment in CMIP6.

We then began to build out the CMIP6 source ID ontology. Because CMIP is a collaboration of multiple organisations, creating the CMIP6 source IDs as Eratos ‘models’ first required us to create and map the CMIP6 institutions in Eratos as ‘creators’, and to map the licences into Eratos ‘licence’ objects. The CMIP6 institution IDs are available via this GitHub repo, and the licences are available via this GitHub repo. After generating the unique ERNs for the CMIP6 institutions and licences, we were able to begin creating the source IDs.

We used the string denoting the activity IDs, and created CMIP-specific fields in our model schema for the cohort, licence exceptions contact, release year and model components, including the realm and nominal resolutions. To generate the CMIP6 source IDs, we used the objects provided on GitHub and mapped them back to the unique ERNs created for the institutions and licences.
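A minimal sketch of this mapping follows. The entry is fictional and its field names reflect our reading of the layout of the source ID objects in the controlled vocabularies, so treat them as assumptions; `make_ern` and `to_model` are hypothetical helpers, and the output dictionary is not the actual Eratos model schema.

```python
def make_ern(kind: str, key: str) -> str:
    """Hypothetical identifier scheme, NOT the real Eratos ERN format."""
    return f"ern:example:{kind}:{key.replace(' ', '_')}"


# Fictional entry; field names are assumed to resemble the CV source objects.
SOURCE_ENTRY = {
    "source_id": "EXAMPLE-ESM",
    "institution_id": ["EXAMPLE-INST"],
    "activity_participation": ["CMIP", "ScenarioMIP"],
    "cohort": ["Registered"],
    "release_year": "2019",
    "license_info": {"id": "CC BY 4.0", "exceptions_contact": "contact@example.org"},
    "model_component": {
        "atmos": {"description": "example atmosphere", "native_nominal_resolution": "250 km"},
        "ocean": {"description": "example ocean", "native_nominal_resolution": "100 km"},
        "aerosol": {"description": "none", "native_nominal_resolution": "none"},
    },
}


def to_model(entry: dict, institution_erns: dict, licence_erns: dict) -> dict:
    """Turn one source entry into a model record linked to existing ERNs."""
    components = {
        realm: {
            "description": comp["description"],
            "nominal_resolution": comp["native_nominal_resolution"],
        }
        for realm, comp in entry["model_component"].items()
        if comp["description"] != "none"  # skip realms the model does not have
    }
    return {
        "ern": make_ern("model", entry["source_id"]),
        # Institutions and licences must already exist, hence plain lookups.
        "creators": [institution_erns[i] for i in entry["institution_id"]],
        "licence": licence_erns[entry["license_info"]["id"]],
        "licence_exceptions_contact": entry["license_info"]["exceptions_contact"],
        "activities": list(entry["activity_participation"]),
        "cohort": list(entry["cohort"]),
        "release_year": int(entry["release_year"]),
        "components": components,
    }


institution_erns = {"EXAMPLE-INST": make_ern("creator", "EXAMPLE-INST")}
licence_erns = {"CC BY 4.0": make_ern("licence", "CC BY 4.0")}
model = to_model(SOURCE_ENTRY, institution_erns, licence_erns)
print(model["creators"], sorted(model["components"]))
```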

Next, we moved on to create the ontology for the CMIP6 experiment IDs.

To create the experiments as Eratos ‘scenarios’, we first had to create the CMIP6 sub-experiment IDs found on GitHub, which we also created as Eratos ‘scenarios’. We saved the unique ERN mapping for the sub-experiments and then moved on to create the experiments.

Once again, we used the string denoting the activity IDs and then created CMIP-specific fields in our scenario schema to capture the start year, end year, tier, the minimum number of years per simulation, and the required and additional model components, including the source types. To generate the CMIP6 experiment IDs, we used this document and linked the unique ERNs for the sub-experiments and parent experiments.
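Because each experiment links to its parent experiment, parents need to exist before their children. One way to get a safe creation order is a topological sort, sketched here with Python's standard `graphlib` module. This is an illustration rather than the team's method; the sample entries are heavily trimmed, and the `parent_experiment_id` field name and the parent relationships shown are our understanding of the controlled vocabulary, not a verified copy of it.

```python
from graphlib import TopologicalSorter

# Trimmed, illustrative entries; only the parent link is kept.
EXPERIMENTS = {
    "ssp585": {"parent_experiment_id": ["historical"]},
    "historical": {"parent_experiment_id": ["piControl"]},
    "piControl": {"parent_experiment_id": ["piControl-spinup"]},
    "piControl-spinup": {"parent_experiment_id": ["no parent"]},
}


def creation_order(experiments: dict) -> list[str]:
    """Order experiment IDs so that every parent precedes its children."""
    graph = {
        exp_id: {
            parent
            for parent in entry["parent_experiment_id"]
            # Ignore placeholders such as "no parent" and self-references.
            if parent in experiments and parent != exp_id
        }
        for exp_id, entry in experiments.items()
    }
    # TopologicalSorter takes {node: predecessors} and yields predecessors first.
    return list(TopologicalSorter(graph).static_order())


order = creation_order(EXPERIMENTS)
print(order)
```

With the order in hand, each experiment can be created as a ‘scenario’ and linked to the ERN already saved for its parent.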

Completing the CMIP picture

In this step, we were able to highlight any dependencies amongst the remaining CMIP6 labels to ensure they were all included in our Eratos ontology. Not only is this context powerful in its own right, but it also allowed us to fill in the final pieces of important information needed to best describe the CMIP6 datasets.
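A simple coverage check captures the spirit of this step. The label-to-concept pairs below restate the mappings described in this article (institutions as ‘creators’, source IDs as ‘models’, experiments and sub-experiments as ‘scenarios’, and so on); the label spellings and the helper name are assumptions made for the example.

```python
# Label names are assumed spellings; the concepts come from this article.
LABEL_TO_CONCEPT = {
    "institution_id": "creator",
    "license": "licence",
    "source_id": "model",
    "experiment_id": "scenario",
    "sub_experiment_id": "scenario",
    "variable_id": "variable",
}


def unmapped_labels(dataset_labels: list[str], mapping: dict) -> list[str]:
    """Return the labels on a dataset that have no mapped concept yet."""
    return sorted(set(dataset_labels) - set(mapping))


labels_on_dataset = [
    "institution_id",
    "source_id",
    "experiment_id",
    "variable_id",
    "activity_id",
    "mip_era",
]
remaining = unmapped_labels(labels_on_dataset, LABEL_TO_CONCEPT)
print(remaining)
```

Anything the check returns is a label that still needs a home in the ontology.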

So far our focus has been on the CMIP6 ontology, and there's still a way to go to map the rest of the CMIP versions. We are not the experts in this field, so we want to facilitate engagement and collaboration around the use of CMIP, which is currently occurring with the Australian Federated Climate Data Initiative (FCDI).

So if you are an expert and would like to contribute to our Community by helping create more utility from CMIP, CORDEX and other climate science standards, please get in touch. We welcome any chance to collaborate and contribute to the furthering of climate science.

The results

This process has allowed us to build the knowledge and context required to fully describe the ESGF CMIP6 datasets and make them accessible via the Eratos platform.

When describing datasets, it is important to build out the base ontology so that as much relevant information as possible is available to potential users. This provides a more complete understanding of the dataset they are dealing with in an efficient and effective manner.

As we described, many disparate resources describe the CMIP6 core labels used in the datasets, and an inexperienced user would need an enormous amount of time and effort to find and understand them. By creating a high-context ontology in Eratos, we are able to provide that information to users, ready to use and in one location. The end result is a more comprehensive and efficient understanding of the desired data, a greatly improved user experience when using the datasets, and a faster journey into analysis and solutions.

The process so far has been an exciting challenge and we’re only just getting started. In the future, we plan to move the CMIP6 activities and ‘mip eras’ into Eratos collections and to create unique linking for the source types, which is still under investigation. This will be another step forward in creating even greater context and knowledge surrounding the CMIP6 datasets, and may even allow us to link to the earlier CMIP projects.

Through this exercise, we found that the CMIP6 variable export was missing 66 variables that are used throughout the CMIP6 datasets, and so we’ll continue researching to bring context to those missing variables and possible units. However, now that the majority of the initial ontology has been created on Eratos, we are ready to make the CMIP6 datasets accessible for preliminary use, which will link back to all of the underlying information and data.

To experiment with the CMIP6 ontology via Eratos, please register for an account. If you already have an account and want to suggest other datasets or models that would be useful to our Community, please let our support team know.

Authors	Monica Liu
Cover	USGS
CMIP6 Specific References	CMIP6 Guidance for Data Users General Documentation CMIP6 Data Request (Variable Search) World Climate Research Programme CMIP6 Controlled Vocabularies Documents
Unit references	UDUNITS Manual & Database Guide for the Use of the International System of Units (SI) International System of Units (SI) Catalogue (explanations & tables) Units of measurement definitions Flux Density Units Lists of SI base, derived and other units: Ibiblio list of SI Derived & Compatible Units ChemistryGod lists of SI units POSC Units of Measure Dictionary v2.2 Wikipedia: SI derived unit Wikipedia: Template SI other units US National Institute of Standards and Technology Gordon England SI Derived Units
Variable (non-CMIP6) references	CORDEX Variables CF Standard Naming Conventions NetCDF Climate and Forecast (CF) Metadata Conventions

The Eratos Platform utilises Amazon Web Services (AWS) to leverage best-in-class cloud scaling architecture, including Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Elastic File System (Amazon EFS) and Amazon Simple Storage Service (Amazon S3).