Automated analysis of massive climate model data

Lead Research Organisation: Imperial College London
Department Name: Mathematics

Abstract

As we move into an era where climate models produce tens of petabytes of data, how do we turn that ocean of data into scientific insight?

Conducting new analyses of climate simulations is a core mechanism for developing understanding of the climate system. As computers grow larger and the models behind these simulations become ever more sophisticated, scientists' ability to work effectively with the resulting data is increasingly frustrated. CMIP5 was estimated to produce 3.3 petabytes of data (equivalent to 1000 state-of-the-art hard drives), and CMIP6 has a projected data volume of 18 petabytes. Key countries, including the UK, the US and Germany, are currently rebuilding their climate model software on the basis of more sophisticated numerics. This will produce more accurate simulations, but also data sets that are more complex to process correctly.

However, scientific advances strongly depend on diversity of effort: it is essential that small groups of scientists and students in diverse institutions can test innovative ideas against climate model data sets by computing new derived quantities. This becomes ever harder as data volumes grow and the numerics become more complex.

A climate statistic is a mathematical statement that a climate scientist can typically express in a few lines of mathematics. Under the current approach, evaluating such a statement requires a scientist to spend weeks or months developing a bespoke script and tuning it to the particular data structure of each climate model to which it is applied. This is labour-intensive and requires reworking for each new statistic and each new model. Most critically, there is no effective mechanism for users of the results to verify that the statistic has been correctly evaluated. Furthermore, this approach typically requires each research group to download the data, an increasingly infeasible task.
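
To give a concrete, hypothetical example of such a statement (not one named by the project), the area-weighted global mean of a surface temperature field T over the model domain Omega is a single line of mathematics:

    \bar{T} = \frac{\int_\Omega T \,\mathrm{d}A}{\int_\Omega \mathrm{d}A}

Evaluating even this simple statistic correctly depends on how each model discretises T and Omega, which is exactly where the bespoke per-model scripting effort arises.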

The missing link in this process is the ability to take the mathematical statement of the statistic and automatically, efficiently and correctly evaluate it in the light of the discrete data representation of each model. The student on this project will make a major contribution to the solution of this problem by producing a system that generates climate data query software from the high-level mathematical specification of the diagnostic to be calculated. They will leverage the existing Firedrake project (http://firedrakeproject.org) to automatically derive mathematically correct parallel algorithms; a sketch of how a statistic might be expressed in this style follows the list of properties below. The resulting system will be:

Efficient: rather than spending months on coding, climate scientists will be able to move directly from formulating the question to studying the outputs.

Model portable: the same mathematical statement can be run on different models. This is essential for reliable and trustworthy intercomparisons.

Verifiably correct: the statistics will be correctly calculated from the underlying numerics; this will be testable through extensive test suites, and scientists will be able to publish the actual mathematical code in their papers, so that the provenance of their results is established and testable.

Distributed: statistics can be calculated and processed where the data is archived, without downloading huge data sets.
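
As an illustration of the intended workflow, the following is a minimal sketch, not the project's actual system: it assumes the model output has already been loaded into a Firedrake Function, and uses a unit square mesh with a made-up temperature field as a stand-in for real model data. The statistic (the global mean defined above) is stated once as a UFL expression, and Firedrake generates the parallel code that evaluates it consistently with the discretisation.

    from firedrake import *

    # Stand-in for a climate model's grid and discretised output:
    # in practice the mesh and field would come from the model's own data.
    mesh = UnitSquareMesh(32, 32)
    V = FunctionSpace(mesh, "CG", 1)
    x, y = SpatialCoordinate(mesh)
    T = Function(V).interpolate(280 + 20 * sin(pi * x))  # made-up temperature field (K)

    # The statistic, stated once as mathematics (a UFL expression).
    # Firedrake compiles this into correct parallel evaluation code.
    area = assemble(Constant(1.0) * dx(domain=mesh))
    mean_T = assemble(T * dx) / area
    print(mean_T)

The same two-line statement of the statistic would apply unchanged to any model whose discretisation can be described to the system, which is the essence of the model portability claimed above.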

If individual scientists are to continue to do innovative work with climate model data that the users of climate science can rely on, solving the problems this project addresses is essential.

Studentship Projects

Project Reference: NE/S007415/1
Start: 01/10/2019
End: 30/09/2027

Project Reference: 1650310
Relationship: Studentship
Related To: NE/S007415/1
Start: 01/10/2019
End: 05/02/2024
Student Name: Reuben Nixon-Hill