A translational data integration platform for the stratification of patients based on clinical, laboratory and magnetic resonance imaging

Lead Research Organisation: Queen Mary, University of London
Department Name: William Harvey Research Institute

Abstract

Translational biomedicine studies depend on the integration of multiple datasets that, together, represent the complex range of features of patients transitioning between health and disease states. The UK has several initiatives that investigate disease onset and progression on a longitudinal basis and are therefore particularly well suited to research. The UK Biobank (UKB) holds a clinical data collection covering more than 500,000 healthy individuals and aims to collect 100,000 magnetic resonance imaging scans of body parts such as the brain, heart and abdomen, as well as information about bone tissue structure and ultrasound of the carotid arteries. These imaging data are being integrated with genetic data and detailed clinical information derived from subject assessments and linked electronic health records (EHR). In contrast to the mainly healthy UKB cohort, the Barts Heart Centre has recruited over 14,000 patients since 2014 to create the Barts BioResource (BBR), a rich information resource for cardiovascular research that links omics, imaging and EHR data. To speed up translational research using these unprecedented datasets, it is essential to record the provenance of each dataset and the precise methods by which it was collected, and to integrate the datasets into a unified database system. The UKB and BBR cohorts collectively represent the full spectrum between health and cardiovascular disease.
In parallel, the European Commission (EC), together with the European Federation of Pharmaceutical Industries and Associations (EFPIA), funded the eTRIKS project (2012-2018) to deploy a sustainable open-source data and knowledge management platform to support translational research: tranSMART. This system supports a wide variety of data and has been successfully applied to projects within the UK (e.g. U-BIOPRED and the MRC Stratified Medicine projects PSORT, MATURA, RA-MAP, IMID-BIO, CLUSTER and MASTERPLANS) and beyond (e.g. AETIONOMY). The new capabilities of tranSMART allow the integration of study metadata and various categorical and numerical data (e.g. red blood cell counts) along with omics data (e.g. gene expression, genomic copy number variation, single nucleotide polymorphisms, and peptide and metabolite profiling). tranSMART also allows programmatic data access, so that computational workflows can be built with a large variety of software.
From the collaboration between the eTRIKS and AETIONOMY projects, a new software concept called BrainMesh emerged. It won the best-poster award at the tranSMART Foundation Annual Meeting (2016) at the University of California, San Diego (US), where it was featured as a promising future technology for the tranSMART environment. Together with the new visual analytics features that tranSMART gains from the newly developed SmartR component, BrainMesh adds a new dynamic visual analytics concept to tranSMART, allowing clinical and image-derived data to be analysed visually in an integrated fashion.
In this proposal, we aim to load the complete UKB and BBR cardiovascular MRI cohorts into dedicated (distinct) tranSMART environments in which multiple analytical workflows can be executed to stratify patients who share common health data features, paving the way for data mining and discovery in these cohorts and in future projects that wish to use the platform.

Technical Summary

The UKB and BBR cohorts collectively represent the full spectrum between health and cardiovascular disease. Both cohorts will be analysed using parallel tranSMART data warehouse infrastructures, enabling a comparison between healthy and diseased subjects and the integration of high-level findings from both cohorts. In this way, we will establish a foundation for translational cardiovascular research in the UK with a detailed data provenance schema and a common analytical pipeline.
To achieve this, Unified Medical Language System (UMLS) coding standards will be used to standardise EHR data into official nomenclature. Analysing and potentially integrating multiple datasets across tranSMART instances will require extensive data curation; prior experience from the eTRIKS and AETIONOMY IMI projects will mitigate the risks associated with this process. The data will be made available to other applications via a flexible tranSMART API, including a data constructor that feeds a machine learning software layer called Ada. We will use Ada's machine learning algorithms to stratify patients and, by adapting the BrainMesh package for heart data, we will investigate the MR images collected by UKB and BBR.
Docker containers will help to create reproducible and portable data warehouse instances, while Git version control will provide a clear record of changes. Software and datasets will be made available in public repositories where appropriate, or via Zenodo, and referenced by a top-level unique Digital Object Identifier (DOI). Applying the FAIR (Findable, Accessible, Interoperable and Reusable) principles, within the broader scope of each resource's access conditions, will support the sustainable long-term use of these tools and datasets by future researchers. This will also allow us to create a critical mass of highly skilled scientists dedicated to health data research.
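As a minimal sketch of the EHR standardisation step described above, the Python snippet below normalises free-text diagnosis labels to UMLS Concept Unique Identifiers (CUIs). The lookup table, the function names and the sample labels are assumptions made for illustration only; a real pipeline would query the UMLS Metathesaurus rather than a hand-written dictionary, and this is not the project's actual curation code.

```python
# Hypothetical lookup from normalised local EHR labels to UMLS CUIs.
# Hand-written for illustration; not derived from the project's data.
LOCAL_TERM_TO_CUI = {
    "hypertension": "C0020538",           # Hypertensive disease
    "high blood pressure": "C0020538",
    "myocardial infarction": "C0027051",  # Myocardial Infarction
    "heart attack": "C0027051",
    "atrial fibrillation": "C0004238",    # Atrial Fibrillation
}


def normalise_label(label):
    """Lower-case a label and collapse runs of whitespace."""
    return " ".join(label.lower().split())


def code_diagnoses(labels):
    """Return (coded, unmapped): label-to-CUI pairs and labels left for manual curation."""
    coded, unmapped = {}, []
    for label in labels:
        cui = LOCAL_TERM_TO_CUI.get(normalise_label(label))
        if cui is None:
            unmapped.append(label)
        else:
            coded[label] = cui
    return coded, unmapped


if __name__ == "__main__":
    coded, unmapped = code_diagnoses(["Heart  attack", "HYPERTENSION", "chest pain"])
    print(coded)
    print(unmapped)
```

Returning the unmapped labels separately, instead of dropping them, reflects the point made above that extensive manual curation is expected before datasets can be integrated.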

Publications

Description CORBEL - Coordinated Research Infrastructures
Amount € 5,000 (EUR)
Funding ID PID 5815 
Organisation European Commission 
Sector Public
Country European Union (EU)
Start 11/2018 
End 07/2019
 
Title Python machine learning module 
Description We created a preliminary Jupyter notebook, written in Python, that uses machine learning methods from the scikit-learn library to support the early diagnosis of Alzheimer's disease. We intend to use this module to analyse the UKBB datasets to which we have recently gained access. 
Type Of Material Computer model/algorithm 
Year Produced 2018 
Provided To Others? No  
Impact No impact yet, but when the final version of the notebook is published (mid 2019), we expect that other users will be able to run their own predictions using our method. Users who have collected the same type of data (variables) as we used will be able to apply our method directly. However, users who characterise their patients with different variables will have to adapt the values during the data preparation step (see outcome: R data manipulation notebook). We intend to use this module to analyse the UKBB datasets to which we have recently gained access. 
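
A minimal sketch of the kind of scikit-learn workflow described in this entry is given below. It is not the authors' notebook: the data are synthetic (generated with make_classification as a stand-in for clinical variables), and the choice of logistic regression, feature scaling and five-fold cross-validation is an assumption made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def evaluate_classifier(n_subjects=200, n_features=10, seed=0):
    """Cross-validate a simple diagnostic classifier on synthetic data."""
    # Synthetic stand-in for a table of clinical variables with a binary diagnosis.
    X, y = make_classification(
        n_samples=n_subjects,
        n_features=n_features,
        n_informative=5,
        random_state=seed,
    )
    # Scaling inside the pipeline means the scaler is fitted on training folds only.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    # Five-fold cross-validation; the default score for a classifier is accuracy.
    return cross_val_score(model, X, y, cv=5)


if __name__ == "__main__":
    scores = evaluate_classifier()
    print(f"Mean accuracy over 5 folds: {scores.mean():.2f}")
```

Users with different variables would replace the synthetic table with their own prepared data, which is the adaptation step mentioned in the Impact field above.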
 
Title R data manipulation notebook 
Description We created an R Markdown notebook that allows users to import ADNI (http://adni.loni.usc.edu/) datasets into a routine that prepares them as input to a machine learning notebook, and subsequently imports the results of the ML classification for analysis with data science libraries such as the tidyverse. 
Type Of Material Data handling & control 
Year Produced 2018 
Provided To Others? No  
Impact Together with the Python notebook described in the other outcome, this notebook will allow users to prepare ADNI datasets, or other clinical research datasets, for machine learning classification tasks with little effort. We intend to use this module to analyse the UKBB datasets to which we have recently gained access. 
 
Title UKBB tranSMART data model 
Description We developed a preliminary data model to store UK Biobank (UKBB) datasets within the i2b2/tranSMART system. Using this model, users can convert UKBB flat files into tranSMART standard-format files that map onto the tree topology we defined for this dataset. 
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? No  
Impact No impact yet. Once this data model is ready for sharing outside our Application Access, we expect that all users interested in accessing UK Biobank datasets will be able to map their files to our data model and deploy it on tranSMART with little effort. 
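
The conversion described in this entry can be sketched as follows. This is an illustrative sketch, not the authors' data model: it assumes UK Biobank's "field-instance.array" column naming (for example 31-0.0), and the tree branches, the study name and the function names are hypothetical. Concept paths are written in the backslash-separated i2b2 style.

```python
import csv
import io
import re

# Hypothetical mapping from UK Biobank field IDs to branches of the tree.
FIELD_TO_BRANCH = {
    "31": ("Demographics", "Sex"),
    "21003": ("Demographics", "Age at assessment"),
}

# UK Biobank column headers have the form <field>-<instance>.<array index>.
COLUMN_RE = re.compile(r"^(\d+)-(\d+)\.(\d+)$")


def column_to_path(column, study="UKBB"):
    """Map a flat-file column header to a concept path, or None if unmapped."""
    match = COLUMN_RE.match(column)
    if match is None:
        return None
    field_id, instance, array_index = match.groups()
    branch = FIELD_TO_BRANCH.get(field_id)
    if branch is None:
        return None
    parts = [study, *branch, f"Instance {instance}"]
    if array_index != "0":
        parts.append(f"Array {array_index}")
    return "\\" + "\\".join(parts) + "\\"


def flat_file_to_rows(text):
    """Turn a wide CSV into (participant id, concept path, value) rows."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        for column, value in record.items():
            path = column_to_path(column)
            # Skip unmapped columns (including eid itself) and empty cells.
            if path is not None and value != "":
                rows.append((record["eid"], path, value))
    return rows


if __name__ == "__main__":
    sample = "eid,31-0.0,21003-0.0\n1000001,1,54\n1000002,0,\n"
    for row in flat_file_to_rows(sample):
        print(row)
```

Keeping the field-to-branch mapping in a single table means the tree topology can be revised without touching the conversion logic.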
 
Description ELIXIR 
Organisation ELIXIR
Country United Kingdom 
Sector Charity/Non Profit 
PI Contribution We included ELIXIR as a Research Infrastructure provider on our CORBEL grant.
Collaborator Contribution ELIXIR collaborates with us via the associated CORBEL grant. ELIXIR will support the development of our data management plan and provide services and tools for the integration of datasets based on a tranSMART data warehouse.
Impact No output available yet, as the project is still ongoing.
Start Year 2018
 
Description European Clinical Research Infrastructure Network (ECRIN) 
Organisation European Clinical Research Infrastructure Network
Country France 
Sector Charity/Non Profit 
PI Contribution We included ECRIN as a Research Infrastructure provider on our CORBEL grant.
Collaborator Contribution ECRIN collaborates with us via the associated CORBEL grant. ECRIN sets up the legal framework under which European cardiovascular datasets can be reused for our project.
Impact No output available yet, as the project is still ongoing.
Start Year 2018
 
Description European infrastructure for translational medicine (EATRIS) 
Organisation European infrastructure for translational medicine
Country Netherlands 
Sector Academic/University 
PI Contribution We included EATRIS as a Research Infrastructure provider on our CORBEL grant.
Collaborator Contribution EATRIS collaborates with us via the associated CORBEL grant. EATRIS will provide support in setting up a dedicated translational bioinformatics platform (tranSMART) for the storage of imaging data linked to the corresponding raw files held on XNAT.
Impact No output available yet, as the project is still ongoing.
Start Year 2018
 
Description European research infrastructure for biobanking and biomolecular resources (BBMRI-ERIC) 
Organisation Biobanking and Biomolecular Resources Research Infrastructure
Country Austria 
Sector Charity/Non Profit 
PI Contribution We included BBMRI-ERIC as a Research Infrastructure provider on our CORBEL grant.
Collaborator Contribution BBMRI-ERIC collaborates with us via the associated CORBEL grant. BBMRI-ERIC will search across different biobanks and data collections for datasets containing cardiovascular data (EHR, omics, CT/MRI images).
Impact BBMRI provided us with a list of biobanks containing relevant datasets for our CORBEL grant.
Start Year 2018
 
Description Fiocruz: INOVA-COVID-19 
Organisation Oswaldo Cruz Foundation (Fiocruz)
Country Brazil 
Sector Public 
PI Contribution FIOCRUZ-INOVA: Molecular basis of severe COVID-19 associated comorbidities, a systems biology approach. Coordinator: Dr. Fabricio Alves (PROCC-IOC-FIOCRUZ). This project aims to understand the molecular basis of the comorbidities associated with severe COVID-19 using systems biology. The project will access public and private RNA-Seq and protein-protein interaction data and, using network analysis and graph theory algorithms, will identify which biological networks are altered during severe infection for different comorbidities. The main components (genes/proteins) of each network will be flagged as potential druggable targets. Finally, a list of drugs compatible with these targets will be proposed, with the aim of repurposing them for severe COVID-19. QMUL: We will provide QMUL high-performance computing facilities (Apocrita) for storing omics data and for running the network analyses that predict targets.
Collaborator Contribution Collaboration to obtain COVID-19-associated transcriptomics and proteomics data that are publicly available in the literature and in biological databases. Application of machine learning algorithms for tissue and comorbidity clustering.
Impact Github repository containing software for analysis: https://github.com/adrianobioinfo/inova
Start Year 2020
 
Description UFPE-CAPES/COVID-19: PlatMAMP 
Organisation Federal University of Pernambuco
Country Brazil 
Sector Academic/University 
PI Contribution This project aims to evaluate previously selected anti-SARS-CoV-2 peptides derived from native Brazilian plants. These include peptides reported to inhibit the growth of HIV, Influenza A and Dengue viruses. The group has an in-house pipeline for the rational design of such peptides to increase their affinity for microbial targets, and has achieved promising results with similar approaches against multidrug-resistant bacteria. The project aims to evaluate over 200 peptides in silico and to generate 20 modified peptides to be tested in vitro and in vivo against SARS-CoV-2. This includes protein structure analysis; mutagenicity, cytotoxicity and teratogenesis tests; and NB3-controlled models for sepsis and immunity tests in SARS-CoV-2-infected mice.
Collaborator Contribution QMUL: We will provide QMUL high-performance computing facilities (Apocrita) and analytical pipelines for protein docking simulations of the modified peptides against their putative targets.
Impact Collaboration is still ongoing.
Start Year 2020
 
Description UKBB 
Organisation UK Biobank
Country United Kingdom 
Sector Charity/Non Profit 
PI Contribution I am developing a data model based on i2b2/tranSMART to represent UKBB data within tranSMART.
Collaborator Contribution UK Biobank provided me with access to the cardiovascular data collection in order to develop an i2b2/tranSMART data model.
Impact I am still in the process of creating data models for the storage of cardiovascular datasets within tranSMART.
Start Year 2018