A translational data integration platform for the stratification of patients based on clinical, laboratory and magnetic resonance imaging

Lead Research Organisation: Queen Mary University of London

Department Name: William Harvey Research Institute

Abstract

Translational biomedicine studies depend on the integration of multiple datasets that, together, represent the complex range of features of patients transitioning between health and disease states. The UK has several initiatives that investigate disease onset and progression on a longitudinal basis and are therefore particularly suited to such research. The UK Biobank (UKB) holds a clinical data collection covering more than 500,000 healthy individuals and aims to collect 100,000 magnetic resonance image scans of body parts such as the brain, heart and abdomen, as well as information about bone tissue structure and ultrasound of the carotid arteries from participants. This imaging data is being integrated with genetic data and detailed clinical information derived from subject assessments and linked electronic health records (EHR). In contrast to the mainly healthy UKB cohort, the Barts Heart Centre has recruited over 14,000 patients since 2014 to create the Barts BioResource (BBR), which aims to provide a rich information resource for cardiovascular research, linking omics, imaging and EHR. To speed up translational research using these unprecedented datasets, it is essential to record the origin of these datasets and the precise methods by which they were collected, and to integrate them in a unified database system. The UKB and BBR cohorts collectively represent the full spectrum between health and cardiovascular disease.
In parallel, the European Commission (EC), together with the European Federation of Pharmaceutical Industries and Associations (EFPIA), funded the eTRIKS project (2012-2018) to deploy a sustainable open-source data and knowledge management platform to support translational research: tranSMART. This system supports a wide variety of data and has been successfully applied to various projects within the UK (e.g. U-BIOPRED and the MRC Stratified Medicine projects PSORT, MATURA, RA-MAP, IMID-BIO, CLUSTER and MASTERPLANS) and beyond (e.g. AETIONOMY). The new capabilities of tranSMART allow the integration of study metadata and various categorical and numerical data (e.g. red blood cell counts) along with omics data (e.g. gene expression, genomic copy number variation, single nucleotide polymorphisms, and peptide and metabolite profiling). tranSMART also allows programmatic data access, so that computational workflows can be built using a large variety of software.
From the collaboration between the eTRIKS and AETIONOMY projects, a new software concept called BrainMesh emerged and won the best-poster award at the tranSMART Foundation Annual Meeting (2016) at the University of California (San Diego, US), where it was featured as a promising future technology for the tranSMART environment. Together with the new visual analytical features of tranSMART provided by the newly developed software component SmartR, BrainMesh adds a new dynamic visual analytics concept to tranSMART, allowing the visual analysis of clinical and image-derived data in an integrated fashion.
In this proposal, we aim to load the complete UKB and BBR cardiovascular MRI cohorts into dedicated (distinct) tranSMART environments where multiple analytical workflows can be executed to stratify patients who share common health data features, paving the way for data mining and discovery in these cohorts and in future projects that wish to use the platform.

Technical Summary

The UKB and BBR cohorts collectively represent the full spectrum between health and cardiovascular disease. Both cohorts will be analysed using parallel tranSMART data warehouse infrastructures, enabling a comparison between healthy and diseased subjects and the integration of high-level findings from both cohorts. In this way, we will establish a foundation for translational cardiovascular research in the UK with a detailed data provenance schema and a common analytical pipeline.
To achieve this, Unified Medical Language System (UMLS) coding standards will be used to standardise EHR data into official nomenclature. Analysing and potentially integrating multiple datasets across tranSMART instances will require extensive data curation; prior experience from the eTRIKS and AETIONOMY IMI projects will mitigate the risks associated with this process. The data will be made available to other applications via a flexible tranSMART API, including a data constructor feeding a machine learning software layer called Ada. Using Ada's machine learning algorithms, we will stratify patients and, by adapting the BrainMesh package for heart data, investigate the MR images collected by UKB and BBR.
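As an illustration of what stratifying patients by shared health data features can look like, the following is a minimal sketch using plain k-means clustering on standardised features. It is not Ada's algorithm or API, which this record does not describe; the feature names (age, systolic blood pressure, ejection fraction), the six example patients and the helper functions are hypothetical.

```python
import math
import random
from statistics import mean, pstdev


def standardise(rows):
    """Scale every feature column to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    # Fall back to 1.0 for constant columns to avoid division by zero.
    sds = [pstdev(c) or 1.0 for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def kmeans(points, k, iters=50, seed=0):
    """Plain k-means: returns one cluster label per input point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append([mean(c) for c in zip(*members)])
            else:
                updated.append(centroids[j])  # keep an empty cluster in place
        if updated == centroids:
            break  # converged
        centroids = updated
    return labels


# Hypothetical patients: [age, systolic BP (mmHg), LV ejection fraction (%)]
patients = [
    [52, 118, 62], [48, 121, 60], [55, 125, 64],
    [67, 150, 35], [71, 158, 30], [64, 146, 38],
]
labels = kmeans(standardise(patients), k=2)
for features, label in zip(patients, labels):
    print(features, "-> stratum", label)
```

Standardising first keeps a feature with a large numeric range, such as blood pressure, from dominating the distance calculation. A real analysis would need to choose the number of strata and validate them clinically.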
Docker containers will help to create reproducible and portable data warehouse instances, while Git will provide clear version tracking. Software and datasets will be made available in public repositories where appropriate, or via Zenodo, and referenced by top-level unique Digital Object Identifiers (DOIs). Application of the FAIR (Findable, Accessible, Interoperable and Reusable) principles, within the broader scope of each resource's access conditions, will support the sustainable long-term use of these tools and datasets by future researchers. This will allow us to create a critical mass of highly skilled scientists dedicated to health data research.
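One simple way to picture a data provenance record of the kind mentioned above is a checksum stored alongside descriptive metadata, so that a dataset file can later be verified against its documented origin. The sketch below is an assumption for illustration only: the field names, the example bytes and both functions are hypothetical and do not reflect the project's actual provenance schema.

```python
import hashlib
import json


def provenance_record(payload: bytes, source: str, method: str, version: str) -> dict:
    """Describe a dataset file: content hash plus where and how it was collected."""
    return {
        "sha256": hashlib.sha256(payload).hexdigest(),
        "size_bytes": len(payload),
        "source": source,
        "collection_method": method,
        "version": version,
    }


def verify(payload: bytes, record: dict) -> bool:
    """Check that the bytes still match the hash stored in the record."""
    return hashlib.sha256(payload).hexdigest() == record["sha256"]


# Hypothetical example content standing in for an exported dataset file.
example = b"eid,age,sex\n1000001,52,F\n"
record = provenance_record(
    example,
    source="hypothetical cohort export",
    method="assessment-centre questionnaire",
    version="v1.0.0",
)
print(json.dumps(record, indent=2))
print("unchanged:", verify(example, record))
```

A record like this could be committed to Git next to the loading scripts, so that the dataset version and the code version are tracked together.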

Funded Value:

£299,527

Funded Period:

Mar 18 - Mar 21

Funder:

MRC

Project Status:

Closed

Project Category:

Fellowship

Project Reference:

MR/S003827/1

Principal Investigator:

Adriano Barbosa Da Silva

Health Category:

Unclassified

Organisations

People	ORCID iD
Adriano Barbosa Da Silva (Principal Investigator / Fellow)	http://orcid.org/0000-0002-5260-2607

Publications

Barbosa-Silva A (2019) Presenting and sharing clinical data using the eTRIKS Standards Master Tree for tranSMART in Bioinformatics

Gu W (2019) Data and knowledge management in translational research: implementation of the eTRIKS platform for the IMI OncoTrack consortium. in BMC bioinformatics

Further Funding
Research Databases and Models
Collaboration


Description	CORBEL - Coordinated Research Infrastructures
Amount	€ 5,000 (EUR)
Funding ID	PID 5815
Organisation	European Commission
Sector	Public
Country	Belgium
Start	11/2018
End	07/2019


Title	Python machine learning module
Description	We created a preliminary Jupyter notebook, written in Python, that uses machine learning methods from the scikit-learn library to give an early diagnosis of Alzheimer's disease. We intend to use this module to analyse the UKBB datasets to which we have just been granted access.
Type Of Material	Computer model/algorithm
Year Produced	2018
Provided To Others?	No
Impact	No impact yet, but when the final version of the notebook is published (mid 2019), we expect that other users will be able to run their own predictions using our method. Users who collected the same type of data (variables) as we did will be able to use our method directly. Users who characterise their patients with different variables, however, will have to adapt the values during the data preparation step (see outcome: R data manipulation notebook). We intend to use this module to analyse the UKBB datasets to which we have just been granted access.


Title	R data manipulation notebook
Description	We created an R Markdown notebook that allows users to import ADNI (http://adni.loni.usc.edu/) datasets into a routine that prepares the dataset as input for a machine learning notebook, and subsequently imports the results of the ML classification for analysis with data science libraries such as the tidyverse.
Type Of Material	Data handling & control
Year Produced	2018
Provided To Others?	No
Impact	Together with the Python notebook described in the other outcome, this notebook will allow users to prepare ADNI datasets, or other clinical research datasets, for machine learning classification tasks with little effort. We intend to use this module to analyse the UKBB datasets to which we have just been granted access.


Title	UKBB tranSMART data model
Description	We developed a preliminary data model to store UK Biobank (UKBB) datasets within the i2b2/tranSMART system. Using this model, users can convert UKBB flat files into tranSMART standard format files that map onto the tree topology we defined for this dataset.
Type Of Material	Database/Collection of data
Year Produced	2019
Provided To Others?	No
Impact	No impact yet. Once this data model is ready for sharing outside our Application Access, we expect that all users interested in accessing UK Biobank datasets will be able to map the files to our data model and deploy it on tranSMART with little effort.
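To make the idea of mapping UKBB flat-file columns onto a tree topology more concrete, here is a minimal sketch. It assumes UKBB column headers of the form field-instance.array (for example 31-0.0) and emits rows in a four-column layout (filename, category path, column number, data label) modelled on the tab-separated column-mapping files used by tranSMART loaders; the exact layout should be checked against the loader in use. The FIELD_TREE dictionary, the branch names and the sample file are hypothetical and are not the project's actual data model.

```python
import csv
import io

# Hypothetical tree placement for a few UK Biobank field IDs:
# field ID -> (branch of the tree, label shown at the leaf).
FIELD_TREE = {
    "31": ("Demographics", "Sex"),
    "21003": ("Demographics", "Age at assessment"),
    "21001": ("Body measures", "Body mass index"),
}


def column_mapping(header, filename):
    """Build one mapping row per known column: [file, category, position, label]."""
    rows = []
    for position, column in enumerate(header, start=1):
        if column == "eid":
            # Participant identifier column, mapped to the subject ID label.
            rows.append([filename, "", position, "SUBJ_ID"])
            continue
        field_id, _, rest = column.partition("-")
        if field_id not in FIELD_TREE:
            continue  # fields without a defined place in the tree are skipped
        branch, label = FIELD_TREE[field_id]
        instance = rest.split(".")[0]  # assessment visit index
        category = f"{branch}+Instance {instance}".replace(" ", "_")
        rows.append([filename, category, position, label])
    return rows


# Hypothetical two-line flat file; 99999 stands for an unmapped field.
flat_file = io.StringIO(
    "eid,31-0.0,21003-0.0,21001-0.0,99999-0.0\n"
    "1000001,0,52,24.7,x\n"
)
header = next(csv.reader(flat_file))
mapping = column_mapping(header, "ukb_sample.csv")
for row in mapping:
    print("\t".join(str(cell) for cell in row))
```

Keeping the field-to-tree dictionary separate from the conversion logic means the topology can be revised without touching the code that reads the flat files.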


Description	ELIXIR
Organisation	ELIXIR
Country	United Kingdom
Sector	Charity/Non Profit
PI Contribution	We included ELIXIR as a Research Infrastructure provider on our CORBEL grant.
Collaborator Contribution	ELIXIR collaborates with us via the associated CORBEL grant. ELIXIR will support the development of our data management plan and provide services and tools for the integration of datasets based on a tranSMART data warehouse.
Impact	No output available yet, as the project is still ongoing.
Start Year	2018


Description	European Clinical Research Infrastructure Network (ECRIN)
Organisation	European Clinical Research Infrastructure Network
Country	France
Sector	Charity/Non Profit
PI Contribution	We included ECRIN as a Research Infrastructure provider on our CORBEL grant.
Collaborator Contribution	ECRIN collaborates with us via the associated CORBEL grant. ECRIN sets up the legal framework under which European cardiovascular datasets can be reused for our project.
Impact	No output available yet, as the project is still ongoing.
Start Year	2018


Description	European infrastructure for translational medicine (EATRIS)
Organisation	European infrastructure for translational medicine
Country	Netherlands
Sector	Academic/University
PI Contribution	We included EATRIS as a Research Infrastructure provider on our CORBEL grant.
Collaborator Contribution	EATRIS collaborates with us via the associated CORBEL grant. EATRIS will provide support in setting up a dedicated translational bioinformatics platform (tranSMART) for the storage of imaging data linked to the corresponding raw files held on XNAT.
Impact	No output available yet, as the project is still ongoing.
Start Year	2018


Description	European research infrastructure for biobanking and biomolecular resources (BBMRI-ERIC)
Organisation	Biobanking and Biomolecular Resources Research Infrastructure
Country	Austria
Sector	Charity/Non Profit
PI Contribution	We included BBMRI-ERIC as a Research Infrastructure provider on our CORBEL grant.
Collaborator Contribution	BBMRI-ERIC collaborates with us via the associated CORBEL grant. BBMRI-ERIC will search across different biobanks and data collections for datasets containing cardiovascular data (EHR, OMICS, CT/MRI images).
Impact	BBMRI provided us with a list of biobanks containing relevant datasets for our CORBEL grant.
Start Year	2018


Description	Fiocruz: INOVA-COVID-19
Organisation	Oswaldo Cruz Foundation (Fiocruz)
Country	Brazil
Sector	Public
PI Contribution	FIOCRUZ-INOVA: Molecular basis of severe COVID-19 associated comorbidities, a systems biology approach. Coordinator: Dr. Fabricio Alves (PROCC-IOC-FIOCRUZ). This project aims to understand the molecular basis of the comorbidities associated with severe COVID-19 using systems biology. The project will access public and private RNA-Seq and protein-protein interaction data and, using network analysis and graph theory algorithms, will identify the biological networks altered during severe infection for different comorbidities. The main components (genes/proteins) of each network will be flagged as potential druggable targets. Finally, a list of drugs compatible with these targets will be proposed, with the aim of repurposing them for severe COVID-19. QMUL: We will provide QMUL high-performance computing facilities (Apocrita) for the storage of omics data and for the network analysis computations used to predict targets.
Collaborator Contribution	Collaboration to obtain transcriptomics/proteomics data associated with COVID-19 publicly available in literature and biological databases. Application of machine learning algorithms for tissue and comorbidity clustering.
Impact	Github repository containing software for analysis: https://github.com/adrianobioinfo/inova
Start Year	2020


Description	UFPE-CAPES/COVID-19: PlatMAMP
Organisation	Federal University of Pernambuco
Country	Brazil
Sector	Academic/University
PI Contribution	This project aims to evaluate previously selected anti-SARS-CoV-2 peptides derived from native Brazilian plants. These include peptides reported to inhibit the growth of HIV, Influenza A and Dengue viruses. The group has an in-house pipeline for the rational design of such peptides in order to increase their affinity to microbial targets, and has achieved promising results with similar approaches against multi-drug resistant bacteria. The project aims to evaluate over 200 peptides in silico and to generate 20 modified ones to be tested in vitro and in vivo against SARS-CoV-2. This includes protein structure analysis; mutagenicity, cytotoxicity and teratogenesis tests; and NB3-controlled models for sepsis and immunity tests in SARS-CoV-2 infected mice.
Collaborator Contribution	QMUL: We will provide QMUL high-performance computing facilities (Apocrita) as well as analytical pipelines for protein docking simulations of the modified peptides against their putative targets.
Impact	Collaboration is still ongoing.
Start Year	2020


Description	UKBB
Organisation	UK Biobank
Country	United Kingdom
Sector	Charity/Non Profit
PI Contribution	I am developing a data model based on i2b2/tranSMART to represent UKBB data within the tranSMART software.
Collaborator Contribution	UK Biobank provided me with access to the cardiovascular data collection in order to develop an i2b2/tranSMART data model.
Impact	I am still in the process of creating data models for the storage of cardiovascular datasets within tranSMART.
Start Year	2018
