BBSRC-NSF/BIO: Integrative analysis and Visualisation of Fly Cell Atlas datasets to enable cross-species comparisons

Lead Research Organisation: European Bioinformatics Institute
Department Name: OMICs

Abstract

The fruit fly, Drosophila melanogaster, has for the last century been fundamental to the study of genetics. It is used in many areas of research as the model organism of choice, as it provides the ability to study genetics in the laboratory and apply findings to human genetics. Its use as a model is due to two factors. First, its genetic code can be relatively easily manipulated in the laboratory and this, coupled with a short life cycle, provides a means by which a gene or pathway function can be rapidly studied. Second, the vast majority of the fundamental biochemical mechanisms and pathways are conserved between fly and humans. Indeed, 75% of the genes that cause human disease are found in fly and, thus, the data collected in the fly can be used to provide insights into the same processes within humans.

The emergence of a new technology, single cell RNA sequencing (scRNA-seq), has provided information as to which genes are switched on or most active within a single cell. Within the fly community this provides the ability to quickly map clusters of cells and cell types to the whole anatomy and link this to both phenotype and function. The increasing number of scRNA-seq datasets from different species has resulted in the development of the Single Cell Expression Atlas (scEA). This is a web portal which enables users to more easily visualise and interpret this data. It is anticipated that the volume of fly single cell data will increase from 10 datasets to ~100 in 2020, with a further two-fold increase in 2021. Key to the scientific exploitation of this data will be the ability of users not only to analyse the fly data effectively but also to examine the interconnections between fly data and human or mouse datasets.

In this project we will provide the means by which fly datasets can be easily interpreted and also linked to mouse and human datasets via scEA. The scEA currently hosts scRNA-seq data for over 500K assays and this includes data for the Human Cell Atlas (HCA) and Mouse Cell Atlas (MCA), amongst others. This project will enable analysis pipelines to be developed to combine the available and emerging datasets, alongside the necessary computational infrastructure to host the Fly Cell Atlas (FCA) datasets. ScEA will provide users with an easy-to-navigate web service with exploratory querying capability, in addition to data download capabilities for further data analysis. The service will be fully integrated with the established fly resources: FlyBase, Virtual Fly Brain and the Drosophila Resources at Harvard University. This project will also develop a process for annotation of the datasets. This annotation step adds scientific information to the data, which gives the user a greater level of biological understanding and so aids interpretation and analysis. The annotation will expand on the existing FlyBase anatomy ontology, which is a structure of controlled vocabularies used to describe the anatomy of the fly. This will ensure that there is full compatibility across new and existing resources.

The scEA will develop and provide the means by which the data can be easily visualised and mined for cell types, while also providing the fly community with the ability to contribute their scientific expertise to the annotation. The scEA user interface will be further developed to provide a greater level of cross-species query ability: the resulting FCA will be linked within scEA to the HCA, MCA and any further datasets, enabling cross-species comparisons which will aid in the discovery of novel biological insights.

This project aims to provide the fly community with practical solutions for connecting, re-using and reanalysing datasets and so will close the gap in translating biological discoveries in model organisms, such as the fruit fly, to humans and vice versa. This project will make the results of this comparative analysis rapidly available to the growing user community.

Technical Summary

This proposal comprises three main aims. The first is to develop the computational analysis pipelines for scRNA-seq data in Drosophila melanogaster, including batch correction, cell clustering, marker gene detection, trajectory and differential analysis, in addition to cell type annotation. This will create standardised workflows which can be run across the different Fly Cell Atlas (FCA) datasets, and the metadata will be curated using ontology terms and genetic feature identifiers. The annotation stage will encourage and capture curation from the fly community by scientists with expertise in various tissues and cell types.

The second is to develop the fly-specific functionality of scExpression Atlas, allowing easy identification of both the raw and processed data, as well as functionality to visualise cell type expression data in FlyBase and the Drosophila resources at Harvard University. Key to this is the enhancement of FCA data visualisation: gene set enrichment analysis tools will be developed, and Anatomograms will be available as embeddable widgets, allowing specific experiments to be easily embedded by different websites.

Lastly, comparative analyses will be performed using datasets from FCA, Mouse Cell Atlas and the Human Cell Atlas. Orthologous relationships will be used to map genes from one species to another to generate a combined, integrated dataset. Different methodologies will be explored for dataset comparison, and both mappings and ontologies will be extended and improved. The scExpression Atlas user interface will be further developed to enable users to interrogate the data across species, in addition to analysis by cell type and tissue. In collaboration with the FCA community we will extend the scExpression Atlas APIs, download formats and associated software to allow data re-use and re-analysis and so promote Open Science.
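
As a rough illustration of the ortholog-based mapping step described above, the following minimal Python sketch renames fly genes to their human orthologs so that two expression tables share one gene space. It is not the project's pipeline: the function names, the gene identifiers and the expression values are all hypothetical placeholders, and keeping only one-to-one orthologs is an assumption made here for simplicity (other homology types could equally be included).

```python
from collections import defaultdict


def one_to_one_orthologs(pairs):
    """Keep only pairs in which each gene appears exactly once on its side."""
    fly_counts = defaultdict(int)
    other_counts = defaultdict(int)
    for fly_gene, other_gene in pairs:
        fly_counts[fly_gene] += 1
        other_counts[other_gene] += 1
    return {
        fly_gene: other_gene
        for fly_gene, other_gene in pairs
        if fly_counts[fly_gene] == 1 and other_counts[other_gene] == 1
    }


def to_shared_gene_space(expression, mapping):
    """Rename genes in a {cell: {gene: value}} table, dropping unmapped genes."""
    return {
        cell: {mapping[gene]: value for gene, value in genes.items() if gene in mapping}
        for cell, genes in expression.items()
    }


# Made-up ortholog table (fly gene, human gene); identifiers are placeholders.
ORTHOLOG_PAIRS = [
    ("fly_gene_A", "human_gene_A"),
    ("fly_gene_B", "human_gene_B1"),
    ("fly_gene_B", "human_gene_B2"),  # one-to-many, so excluded in this sketch
    ("fly_gene_C", "human_gene_C"),
]

# Made-up expression values per cell; fly_gene_D has no ortholog in the table.
FLY_EXPRESSION = {
    "fly_cell_1": {"fly_gene_A": 5.0, "fly_gene_B": 2.0, "fly_gene_D": 1.0},
    "fly_cell_2": {"fly_gene_C": 3.0},
}

mapping = one_to_one_orthologs(ORTHOLOG_PAIRS)
fly_in_human_space = to_shared_gene_space(FLY_EXPRESSION, mapping)
print(fly_in_human_space)
```

Once both species are expressed over the same gene names, the tables can be concatenated and passed to whichever integration method is being compared. Restricting to one-to-one orthologs discards genes with more complex homology, which is one reason different mapping choices are worth exploring.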

Planned Impact

The fruit fly, Drosophila melanogaster, has for the last century been fundamental to the study of genetics. It is used in many areas of research as the model organism of choice, as it provides the ability to study genetics in the laboratory and apply findings to human genetics. The vast majority of the fundamental biochemical mechanisms and pathways are conserved between fly and humans. Indeed, 75% of the genes that cause human disease are found in fly and, thus, fly data provide insights into the same processes within humans.
The emergence of a new technology, single cell RNA sequencing (scRNA-seq), has provided information as to which genes are switched on or most active within a single cell. These data are generating fundamental new insights into how cells differentiate into specific cell types, and what a cell type represents at the molecular level. The increasing number of scRNA-seq datasets from different species encouraged us to develop the Single Cell Expression Atlas (scEA). This is a web portal which enables users to more easily access and interpret these data. It is anticipated that Drosophila single cell data will increase from 10 datasets to ~100 in 2020, with a further two-fold increase in 2021. Key to the scientific exploitation of these data will be the effective analysis of the fly data and the ability to explore interconnections between fly data and human and mouse data.

In this project we will provide the means by which fly datasets can be easily interpreted and linked to mouse and human datasets via scEA. This project will enable analysis pipelines to be developed to combine the available and emerging datasets, alongside the necessary computational infrastructure to host the Fly Cell Atlas (FCA) datasets. ScEA will provide users with an easy to navigate web service with exploratory querying capability, in addition to data download capabilities for further data analysis. The service will be fully integrated with the established fly resources, FlyBase and the Drosophila Resources at Harvard University.

Data sets and derived analysis results will be easily accessible in standard formats to be reused by: (1) wet-lab biologists investigating new experimental hypotheses, and comparing published datasets to their own new results; (2) computational biologists engaged in new development of analysis tools or machine learning applications where access to well curated and standardised data sets is essential.

The availability of the combined Fly Cell Atlas through user-friendly interfaces at Harvard and EMBL-EBI will contribute greatly to all projects investigating transcription at the single cell level. By providing molecular signatures of each cell type, the Fly Cell Atlas data will aid the identification of the cell types altered when genes are mutated, including models of human diseases. Mapping cell types across species will permit verification of the similarities in the underlying cellular defects caused by loss of similar gene function in human and fly.
Establishing a robust atlas of cell types in Drosophila will also aid projects aimed at controlling insects that are vectors of disease or agricultural pests, by providing basic knowledge of the cell types that can be used to target novel control strategies. With a rise in pesticide resistance and the negative environmental impact of pesticides, the understanding of Drosophila biology underpins development of new strategies. Functional interpretation of the genomes of disease-carrying insects and crop pests relies heavily on the extensive experimental data from Drosophila.

Methods developed within this project will be applicable to biological and bioinformatics communities beyond researchers working in fly, mouse or human. With the dissemination of analysis tools in containerised form and their availability in public registries, we expect their usage to expand over a wider spectrum of computational biologists.
 
Description Through this award we have released the first version of the Fly Cell Atlas dataset, making it easily available and queryable by the whole research community. This is highly impactful, as researchers can now interact with these data in an easy-to-comprehend way and use them to inform their research hypotheses.
Secondly, we have established an ongoing data flow from Single Cell Expression Atlas into FlyBase, so that transcriptomic data can reach more users from the research community.
Finally, we have developed a pipeline to compare strategies for integrating data across species. The software is freely available on GitHub and there is also a preprint describing the work.
Exploitation Route This project is still ongoing, but the Fly Cell Atlas data are being interrogated by the research community through both the Single Cell Expression Atlas and FlyBase, to inform new hypotheses on fly biology. Through cross-species comparisons, the Fly Cell Atlas data can be compared to other species, tissue by tissue and cell type by cell type, helping to uncover important differences and contributing to investigations of cell type evolution.
Sectors Environment, Healthcare, Pharmaceuticals and Medical Biotechnology

 
Title Single Cell Expression Atlas - Fly Cell Atlas first release 
Description The latest Single Cell Expression Atlas (SCEA) release included the first data from the Fly Cell Atlas (dataset accessions E-MTAB-10628 and E-MTAB-10519), which brings together resources including gene expression data to make a comprehensive cell-level map of the fly, an important model organism for researchers. 
Type Of Material Database/Collection of data 
Year Produced 2022 
Provided To Others? Yes  
Impact This release enables researchers to access, free of charge, single cell Fly Cell Atlas data that have been reanalysed to enable greater levels of cross-experiment comparison. This is a value-added resource, so the data are more easily usable by any researcher and include links out to other databases. 
URL https://www.ebi.ac.uk/gxa/sc/experiments/E-MTAB-10519/results/tsne
 
Title BENGAL: BENchmarking inteGration strAtegies for cross-species anaLysis of single-cell transcriptomics data 
Description A Nextflow DSL2 pipeline to perform cross-species single-cell RNA-seq data integration and assessment of integration results. 
Type Of Technology Software 
Year Produced 2022 
Open Source License? Yes  
Impact Guidelines for choosing the best cross-species integration strategy. Between evolutionarily distant species: include one-to-many and many-to-many in-paralogs if possible; choose stronger algorithms, such as SeuratV4 methods. Between evolutionarily close species: scVI or Harmony balance species mixing and biology conservation well; SeuratV4 methods might overfit. For whole-body atlases and species without high-quality homology annotation: SAMap outperforms almost all other methods. For analysis between adult mammals and non-mammals: data integration no longer generates candid results suitable for joint downstream analysis; independent analysis, such as expression correlation, might be more informative. 
URL https://github.com/Functional-Genomics/CrossSpeciesIntegration
 
Description Delicious DNA 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Schools
Results and Impact Yr3 school event talking about DNA and science.
Year(s) Of Engagement Activity 2021
 
Description Greek Teacher Training 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Schools
Results and Impact Provision of a workshop for training Greek secondary school teachers on DNA, genomics, Expression Atlas and Fly Cell Atlas.
Year(s) Of Engagement Activity 2021