Using data to improve public health: COVID-19 secondment
Lead Research Organisation:
University of Aberdeen
Department Name: Physics
Abstract
Coronavirus disease 2019 (COVID-19), caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has led to a worldwide increase in hospitalisations and deaths since it emerged in December 2019. The effects of COVID-19 vary widely between patients, ranging from asymptomatic to fatal cases. The duration of symptoms is also very heterogeneous, lasting a few days for some patients and several weeks for others who develop so-called 'long COVID'. Older age is a well-known risk factor for both severe and long COVID, which has been associated with an immune response weakened by ageing. Pre-existing diseases such as hypertension, diabetes, cardiovascular disease, or cancer also increase the risk of severe infection in patients with COVID-19. However, severe COVID-19 has also been observed in many seemingly healthy middle-aged individuals. Understanding of the risk factors for severe COVID-19 remains limited, and the reasons why susceptibility to the virus varies so widely in the population are poorly understood. More research is needed to uncover the biological mechanisms of severity so that highly susceptible individuals and pathways to novel treatments can be identified.
Recent studies have shown that the molecules in biofluids such as blood, urine or faeces are altered in people with cardiovascular disease, diabetes, or chronic inflammation. These conditions represent risk factors for severe COVID-19 and we hypothesise that biofluid molecules can be used as metabolic biomarkers to predict whether a patient infected by SARS-CoV-2 is likely to be seriously affected.
The central idea of the proposed research is to use metabolic biomarkers to predict the severity of COVID-19 and the likelihood of long COVID for individuals who have not necessarily been diagnosed with a pre-existing health condition. To this end, we will use pre-pandemic data from several cohort studies which, in addition to basic information on age, sex, ethnicity, etc., contain hundreds of metabolic biomarkers for thousands of individuals. To understand the link between these characteristics and the impact of COVID-19, we will use symptoms data for those individuals in the cohort studies who had COVID-19. The data will be analysed with statistical methods to identify associations between the characteristics of individuals before the pandemic and the severity of the disease. This analysis will be complemented with computer programs developed to predict whether an individual's infection will have serious effects, based on their characteristics before the pandemic. Machine learning techniques will be used to train computer programs to automatically recognise metabolic features that represent a risk for severe COVID-19.
The project can be beneficial both in terms of basic science and applications. Indeed, the proposed research will enhance our understanding of how metabolic biomarkers may explain the susceptibility to severe COVID-19. From an applied viewpoint, using the information encoded by numerous metabolic biomarkers to train machine learning models can improve our ability to identify individuals for whom COVID-19 may have serious consequences.
Technical Summary
This project will develop computational methods to predict the severity and duration of COVID-19 using data on metabolic biomarkers from cohort studies and machine learning. Highly accurate predictions are crucial to identify the individuals that are most at risk of serious effects of COVID-19. The data to be used consists of sociodemographic information (age, sex, ethnicity, etc), information on health conditions before COVID-19, and metabolic markers from biofluids including blood, urine and faeces. Incorporating metabolomic data into the analysis is expected to significantly enhance our ability to predict the severity of COVID-19 compared to methods that focus on, e.g., sociodemographic data only.
The project will study both the severity of COVID-19 and the duration of symptoms. The specific aims of the project are the following:
Aim 1. To identify metabolic biomarkers associated with severe COVID-19 and long COVID.
Aim 2. To train computer programs to predict the susceptibility of individuals to severe COVID-19 and long COVID.
In practice, the aims will be separately addressed for the severity of COVID-19 and the duration of symptoms. The aim of the project, however, is to integrate the results for both characteristics and provide a general view on how metabolomics can help understand the manifestations of COVID-19.
The severity of COVID-19 will be quantified in terms of whether or not patients show symptoms. For Aim 1, associations between the characteristics of individuals and the presence/absence of symptoms will be explored using statistical methods such as graphical visualisation, hypothesis testing and logistic regression. Feature selection and dimensionality reduction strategies will be used to identify the features most relevant to symptoms. For Aim 2, machine learning models will be trained to automatically classify individuals into symptomatic and asymptomatic classes. A variety of machine learning techniques will be implemented; partial least squares discriminant analysis, support vector machines and artificial neural networks are expected to be particularly suitable for the high dimensionality and correlated character of metabolomic data.
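As a rough illustration of the symptomatic/asymptomatic classification described above, the sketch below fits a logistic regression by plain gradient descent. It is not the project's code: the data are synthetic, the two "biomarkers" are hypothetical, and the learning rate and iteration count are arbitrary choices. In practice an established statistics or machine learning library would be used, and accuracy would be assessed on held-out data rather than on the training set.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(X, y, lr=0.1, n_iter=500):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            # Difference between predicted probability and observed label.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(p):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(X, w, b):
    """Classify as symptomatic (1) when the predicted probability is at least 0.5."""
    return [1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5 else 0
            for xi in X]


# Synthetic, hypothetical data: biomarker 0 drives symptom status, biomarker 1 is noise.
rng = random.Random(0)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
y = [1 if xi[0] + 0.3 * rng.gauss(0, 1) > 0 else 0 for xi in X]

w, b = fit_logistic(X, y)
accuracy = sum(p == t for p, t in zip(predict(X, w, b), y)) / len(y)
print("weights:", w, "bias:", b, "training accuracy:", accuracy)
```

The fitted weight on the informative biomarker should dominate the weight on the noise biomarker, which is the kind of signal an association analysis under Aim 1 would look for.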
Several descriptions of the duration of symptoms will be considered, each requiring a different degree of statistical power to be feasible. If the data give enough statistical power, the most natural approach will be to treat the duration as a continuous random variable. In this case, Aim 1 will be fulfilled by using regression methods to assess the statistical significance of the different predictor variables for each individual. A range of machine learning methods will be explored to train a predictor for the duration of symptoms. Suitable candidates may include partial least squares regression, principal component regression or artificial neural networks. An alternative description that requires less statistical power consists of discretising the duration into several categories, for example into short (≤10 days) and long (>10 days) durations to describe short and long COVID, respectively. In this case, Aims 1 and 2 can be achieved using methods similar to those described above for the analysis of the presence or absence of symptoms.
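The two descriptions of duration can be sketched side by side. The example below is only an illustration under stated assumptions: the biomarker levels and durations are made-up values, a single-predictor least squares fit stands in for the regression methods mentioned, and the 10-day threshold follows the example in the text, with durations of exactly 10 days counted as short (an assumption).

```python
def ols_slope_intercept(x, y):
    """Ordinary least squares fit of y = intercept + slope * x for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def duration_category(days, threshold=10):
    """Label a duration as 'short' (<= threshold days) or 'long' (> threshold days)."""
    return "short" if days <= threshold else "long"


# Hypothetical values: one biomarker level and symptom duration (days) for four people.
biomarker = [0.0, 1.0, 2.0, 3.0]
duration_days = [5.0, 7.0, 9.0, 11.0]

# Continuous description: duration regressed on the biomarker.
slope, intercept = ols_slope_intercept(biomarker, duration_days)

# Categorical description: the same durations reduced to two classes.
labels = [duration_category(d) for d in duration_days]
print(slope, intercept, labels)
```

The categorical version discards information about how long symptoms lasted, which is why it needs less statistical power but also says less than the continuous model.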
Organisations
- University of Aberdeen (Lead Research Organisation)
- University of Oxford (Collaboration)
- London School of Hygiene and Tropical Medicine (LSHTM) (Collaboration)
- University College London (Collaboration)
- King's College London (Collaboration)
- University of Edinburgh (Collaboration)
- University of Glasgow (Collaboration)
- University of Bristol (Collaboration)
People
- Francisco Perez Reche (Principal Investigator / Fellow)
Publications
- Pérez-Reche F (2024). Age-specific all-cause mortality trends in the UK: Pre-pandemic increases and the complex impact of COVID-19. Public Health.
- Pérez-Reche F (2025). ESPClust: unsupervised identification of modifiers for the effect size profile in omics association studies. Bioinformatics.
- Taylor K (2024). Incidence of diabetes after SARS-CoV-2 infection in England and the implications of COVID-19 vaccination: a retrospective cohort study of 16 million people. The Lancet Diabetes & Endocrinology.
- Walker VM (2024). COVID-19 and Mental Illnesses in Vaccinated and Unvaccinated People. JAMA Psychiatry.
| Description | Used by Scottish Government in relation to excess deaths associated with COVID-19. |
| First Year Of Impact | 2022 |
| Sector | Healthcare; Government, Democracy and Justice |
| Impact Types | Policy & public services |
| Description | Evidence to the COVID-19 Recovery Committee of the Scottish Parliament |
| Geographic Reach | National |
| Policy Influence Type | Contribution to a national consultation/review |
| URL | https://www.scottishparliament.tv/meeting/covid-19-recovery-committee-march-10-2022 |
| Description | Impact & Engagement Accelerator Fund |
| Amount | £10,000 (GBP) |
| Organisation | University of Aberdeen |
| Sector | Academic/University |
| Country | United Kingdom |
| Start | 02/2025 |
| End | 07/2025 |
| Description | Institutional Research Leave |
| Amount | £20,000 (GBP) |
| Organisation | University of Aberdeen |
| Sector | Academic/University |
| Country | United Kingdom |
| Start | 01/2024 |
| End | 06/2024 |
| Title | A machine learning classification pipeline for metabolomic data |
| Description | A pipeline has been developed to identify sets of metabolites with high discriminatory power between two or more groups of individuals. This can be viewed as a method for feature selection optimized for metabolomic data. More explicitly, the method involves a feature selection procedure that results in a hierarchical sequence of models. The end of the pipeline gives a minimal model based on a small selection of metabolites with the highest power to distinguish between groups. We are applying our pipeline to distinguish groups of individuals from several cohorts that were infected by SARS-CoV-2 and had different symptom characteristics. The analysis pipeline will be published when the analyses of symptoms have been finalised. |
| Type Of Material | Data analysis technique |
| Year Produced | 2022 |
| Provided To Others? | No |
| Impact | N/A |
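The pipeline itself is unpublished, so the following is only a guess at the general shape of "a feature selection procedure that results in a hierarchical sequence of models". It is a minimal sketch, assuming a simple univariate score (absolute standardised mean difference) and stepwise removal of the weakest metabolite; the function names, the scoring rule and the toy values are all hypothetical and are not the authors' method.

```python
import math
from statistics import mean, pvariance


def effect_size(values_a, values_b):
    """Absolute standardised mean difference between two groups for one metabolite."""
    pooled_sd = math.sqrt((pvariance(values_a) + pvariance(values_b)) / 2)
    if pooled_sd == 0:
        return 0.0
    return abs(mean(values_a) - mean(values_b)) / pooled_sd


def nested_feature_sets(group_a, group_b, names):
    """Drop the least discriminating metabolite at each step.

    Returns nested feature-name lists, from the full set down to one metabolite.
    """
    scores = {
        name: effect_size([row[j] for row in group_a], [row[j] for row in group_b])
        for j, name in enumerate(names)
    }
    current = sorted(names, key=scores.get, reverse=True)
    sequence = []
    while current:
        sequence.append(list(current))
        current = current[:-1]  # remove the weakest remaining metabolite
    return sequence


# Toy data (hypothetical): rows are individuals, columns are metabolites m0, m1, m2.
group_a = [[1.0, 5.0, 3.0], [1.2, 5.5, 2.0], [0.8, 4.5, 4.0]]
group_b = [[3.0, 5.6, 3.1], [3.2, 6.0, 2.1], [2.8, 5.0, 3.9]]

sequence = nested_feature_sets(group_a, group_b, ["m0", "m1", "m2"])
print(sequence)
```

Each list in the sequence would correspond to one candidate model; the last entry plays the role of the minimal model built on the most discriminating metabolites.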
| Title | ESPClust: unsupervised identification of modifiers for the effect size profile in omics association studies |
| Description | Omics association studies conventionally employ univariate statistical analyses to explore the relationship between an outcome and individual omics variables. Commonly, adjustments are made for confounding effects linked to non-omics covariates, such as age or BMI. However, most methods neglect the potential of these covariates to act as effect modifiers. This oversight may give only a partial understanding of how certain non-omics covariates bear on the omics data. In response to this challenge, an unsupervised method has been devised to identify non-omics covariates capable of simultaneously modifying the effect size of associations between multiple omics variables and an outcome. The method also helps identify omics variables whose associations are notably influenced by non-omics covariates, contributing to a more comprehensive understanding of the interplay between these variables. |
| Type Of Material | Data analysis technique |
| Year Produced | 2025 |
| Provided To Others? | Yes |
| Impact | N/A. The method is currently being internally tested on several omics datasets. |
| URL | https://github.com/fjpreche/ESPClust |
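To make the notion of an effect modifier concrete, the sketch below estimates the association between one omics variable and an outcome separately within two age strata. This is not the ESPClust algorithm and does not use its API; it is a minimal illustration with made-up values, a hypothetical age cut of 50 years, and a single-predictor least squares slope as the effect size.

```python
def ols_slope(x, y):
    """Least squares slope of y on x for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def stratified_slopes(records, age_cut):
    """Estimate the omics-outcome slope separately below and at/above an age cut."""
    strata = {"below_cut": [], "at_or_above_cut": []}
    for age, omics, outcome in records:
        key = "below_cut" if age < age_cut else "at_or_above_cut"
        strata[key].append((omics, outcome))
    return {
        key: ols_slope([o for o, _ in rows], [y for _, y in rows])
        for key, rows in strata.items()
    }


# Hypothetical records: (age, omics variable, outcome).
records = [
    (30, 0.0, 1.0), (35, 1.0, 1.0), (40, 2.0, 1.0), (45, 3.0, 1.0),
    (60, 0.0, 1.0), (65, 1.0, 3.0), (70, 2.0, 5.0), (75, 3.0, 7.0),
]

slopes = stratified_slopes(records, age_cut=50)
print(slopes)
```

In this toy example the omics variable has no association with the outcome in the younger stratum and a clear one in the older stratum, so age modifies the effect size rather than merely confounding it. Adjusting for age as a covariate alone would not reveal this pattern.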
| Description | Longitudinal Health & Wellbeing National Core Study for COVID-19 |
| Organisation | King's College London |
| Country | United Kingdom |
| Sector | Academic/University |
| PI Contribution | Analysis of data from several study cohorts to understand several aspects of COVID-19. In particular, I lead a project to understand the link between pre-pandemic metabolomics of COVID-19 patients and experienced symptoms. This project directly contributes to the Convalescence long COVID Study. |
| Collaborator Contribution | Provide sociodemographic and biological data for participants in the cohort studies. Also provide expertise in medical, biological and data science. |
| Impact | No outputs are to be reported yet. |
| Start Year | 2021 |
| Description | Longitudinal Health & Wellbeing National Core Study for COVID-19 |
| Organisation | London School of Hygiene and Tropical Medicine (LSHTM) |
| Country | United Kingdom |
| Sector | Academic/University |
| PI Contribution | Analysis of data from several study cohorts to understand several aspects of COVID-19. In particular, I lead a project to understand the link between pre-pandemic metabolomics of COVID-19 patients and experienced symptoms. This project directly contributes to the Convalescence long COVID Study. |
| Collaborator Contribution | Provide sociodemographic and biological data for participants in the cohort studies. Also provide expertise in medical, biological and data science. |
| Impact | No outputs are to be reported yet. |
| Start Year | 2021 |
| Description | Longitudinal Health & Wellbeing National Core Study for COVID-19 |
| Organisation | University College London |
| Country | United Kingdom |
| Sector | Academic/University |
| PI Contribution | Analysis of data from several study cohorts to understand several aspects of COVID-19. In particular, I lead a project to understand the link between pre-pandemic metabolomics of COVID-19 patients and experienced symptoms. This project directly contributes to the Convalescence long COVID Study. |
| Collaborator Contribution | Provide sociodemographic and biological data for participants in the cohort studies. Also provide expertise in medical, biological and data science. |
| Impact | No outputs are to be reported yet. |
| Start Year | 2021 |
| Description | Longitudinal Health & Wellbeing National Core Study for COVID-19 |
| Organisation | University of Bristol |
| Country | United Kingdom |
| Sector | Academic/University |
| PI Contribution | Analysis of data from several study cohorts to understand several aspects of COVID-19. In particular, I lead a project to understand the link between pre-pandemic metabolomics of COVID-19 patients and experienced symptoms. This project directly contributes to the Convalescence long COVID Study. |
| Collaborator Contribution | Provide sociodemographic and biological data for participants in the cohort studies. Also provide expertise in medical, biological and data science. |
| Impact | No outputs are to be reported yet. |
| Start Year | 2021 |
| Description | Longitudinal Health & Wellbeing National Core Study for COVID-19 |
| Organisation | University of Edinburgh |
| Country | United Kingdom |
| Sector | Academic/University |
| PI Contribution | Analysis of data from several study cohorts to understand several aspects of COVID-19. In particular, I lead a project to understand the link between pre-pandemic metabolomics of COVID-19 patients and experienced symptoms. This project directly contributes to the Convalescence long COVID Study. |
| Collaborator Contribution | Provide sociodemographic and biological data for participants in the cohort studies. Also provide expertise in medical, biological and data science. |
| Impact | No outputs are to be reported yet. |
| Start Year | 2021 |
| Description | Longitudinal Health & Wellbeing National Core Study for COVID-19 |
| Organisation | University of Glasgow |
| Country | United Kingdom |
| Sector | Academic/University |
| PI Contribution | Analysis of data from several study cohorts to understand several aspects of COVID-19. In particular, I lead a project to understand the link between pre-pandemic metabolomics of COVID-19 patients and experienced symptoms. This project directly contributes to the Convalescence long COVID Study. |
| Collaborator Contribution | Provide sociodemographic and biological data for participants in the cohort studies. Also provide expertise in medical, biological and data science. |
| Impact | No outputs are to be reported yet. |
| Start Year | 2021 |
| Description | Longitudinal Health & Wellbeing National Core Study for COVID-19 |
| Organisation | University of Oxford |
| Country | United Kingdom |
| Sector | Academic/University |
| PI Contribution | Analysis of data from several study cohorts to understand several aspects of COVID-19. In particular, I lead a project to understand the link between pre-pandemic metabolomics of COVID-19 patients and experienced symptoms. This project directly contributes to the Convalescence long COVID Study. |
| Collaborator Contribution | Provide sociodemographic and biological data for participants in the cohort studies. Also provide expertise in medical, biological and data science. |
| Impact | No outputs are to be reported yet. |
| Start Year | 2021 |
| Description | Cafe Controversial: Ideas of Death |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | Regional |
| Primary Audience | Public/other audiences |
| Results and Impact | A monologue on the history of mortality due to infectious diseases and recent results on mortality trends in the UK. |
| Year(s) Of Engagement Activity | 2024 |
| URL | https://www.explorathon.co.uk/events-programme/cabaret-of-dangerous-ideas-2/ |
| Description | Interviews for national news on mortality trends |
| Form Of Engagement Activity | A press release, press conference or response to a media enquiry/interview |
| Part Of Official Scheme? | No |
| Geographic Reach | National |
| Primary Audience | Public/other audiences |
| Results and Impact | Interviews by BBC Scotland and That's TV. |
| Year(s) Of Engagement Activity | 2024 |
| Description | Press coverage on mortality trends |
| Form Of Engagement Activity | A press release, press conference or response to a media enquiry/interview |
| Part Of Official Scheme? | No |
| Geographic Reach | National |
| Primary Audience | Public/other audiences |
| Results and Impact | Daily Mail, Wales Online and other online sources. |
| Year(s) Of Engagement Activity | 2024 |
| Description | Radio coverage of research on mortality trends |
| Form Of Engagement Activity | A press release, press conference or response to a media enquiry/interview |
| Part Of Official Scheme? | No |
| Geographic Reach | National |
| Primary Audience | Media (as a channel to the public) |
| Results and Impact | Coverage by BBC Radio Scotland, Radio Orkney, BBC Radio Shetland. |
| Year(s) Of Engagement Activity | 2024 |
| Description | What do Gases, Epidemics, Society and Life Have in Common? |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | Local |
| Primary Audience | Public/other audiences |
| Results and Impact | Engaging and hands-on talk on complex systems. |
| Year(s) Of Engagement Activity | 2024 |
| URL | https://techfest.org.uk/images/Aberdeen_Science_Festival_2024_Programme_.pdf |