A computational cloud framework for the study of gene families

Lead Research Organisation: Earlham Institute
Department Name: Research Faculty

Abstract

Life science research is increasingly becoming a data-intensive discipline. New high-throughput sequencing technologies produce vast amounts of digital data that need to be analyzed efficiently in order to find patterns that lead to new biological discoveries. The large volume of data creates a problem of its own, as it must be stored and analyzed using large computing resources and sophisticated computing skills. Many biology labs struggle to own and maintain large computing clusters for their computing needs. Cloud computing frameworks have emerged as a feasible alternative for accessing large computing power on a pay-as-you-go model and are increasingly making inroads into mainstream biological data analysis. iPlant-UK is a cloud initiative funded by BBSRC to make large computing resources available free of charge to UK researchers. The iPlant-UK cloud is specifically tailored to meet the computing requirements of the life sciences community and provides access to large computing infrastructures through a web browser. Through this proposal, we want to develop a computational toolkit for the analysis of gene family datasets. An example of a gene family in plants is the R genes, also known as resistance genes, which are responsible for pathogen recognition and disease resistance responses in plants. To understand a gene family in a species, one must first catalogue all members of the family, and then understand their function with respect to each other, as well as to those in related species. These datasets are generated through next-generation sequencing techniques and are usually large in volume. We aim to develop specialized software for the analysis of gene family datasets on the iPlant-UK compute cloud. This way, we can provide researchers with access to a specialized tool on a large and free computing resource.
Further, we want to simplify the use of the toolkit by providing graphical user interfaces that can be accessed through web browsers, so that wet-lab biologists can focus on their core research rather than worry about complex computation on a cloud platform. The code developed in this project will be freely available to the public under an open-source license. By building the workflow in iPlant, we will ensure its sustainability and visibility beyond this proposal.

Technical Summary

Genomics is moving from studying single reference genomes to the study of multiple genomes from the same species, allowing us to uncover the pan-genome. The pan-genome, or supra-genome, comprises a core set of genes common to all strains and a non-core set found in only a subset of strains or in a single strain. The non-core set is commonly seen as an expansion of an existing gene family. The study of these gene families is important because each family can be understood in terms of its function and its evolution across strains. As an example, NB-LRR genes, or resistance genes, in plants are responsible for disease resistance and show expansion and contraction across accessions. Critically, new R genes have the potential to be important targets for plant breeders. As the field moves towards the study of pan-genomes using next-generation sequencing techniques, there is a timely need for appropriate software for data analysis. Given the large size of the datasets, it is preferable that the software runs on community-accessible cloud resources. We propose to build an open-source, cloud-enabled software toolkit to analyze gene family datasets, using the BBSRC-funded iPlant-UK compute infrastructure as the cloud platform. iPlant-UK is maintained by a dedicated team of experts and offers large compute resources with easy-to-use graphical interfaces for bioinformaticians and bench biologists. Further, we plan to use the software for two case studies: 1) extracting the pan-NB-LRRome for bread wheat and 2) studying gene families in the Tsetse fly in the context of their role as disease vectors.
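
The core and non-core distinction described above can be illustrated with a short sketch. This is only an illustration under stated assumptions and not the toolkit proposed here: the function name `partition_pan_genome`, the strain names, and the gene identifiers are all hypothetical, and gene presence per strain is assumed to be already known.

```python
from typing import Dict, Set, Tuple


def partition_pan_genome(
    genes_by_strain: Dict[str, Set[str]],
) -> Tuple[Set[str], Set[str]]:
    """Split a pan-genome into core and non-core gene sets.

    core: genes present in every strain.
    non-core: genes present in at least one strain but not in all.
    """
    if not genes_by_strain:
        return set(), set()
    gene_sets = list(genes_by_strain.values())
    pan = set().union(*gene_sets)        # every gene seen in any strain
    core = set.intersection(*gene_sets)  # genes shared by all strains
    return core, pan - core


# Hypothetical presence data: strain name -> genes detected in that strain.
example = {
    "strain_A": {"g1", "g2", "g3"},
    "strain_B": {"g1", "g2", "g4"},
    "strain_C": {"g1", "g2"},
}
core, non_core = partition_pan_genome(example)
print(sorted(core))      # ['g1', 'g2']
print(sorted(non_core))  # ['g3', 'g4']
```

In practice, deciding whether a gene is present in a strain depends on assembly and alignment quality, which this sketch does not address.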

Planned Impact

The principal beneficiaries of this grant are research scientists in academia and industry engaged in the study of gene families in various species. Completeness and functional understanding of gene families have many important applications in academia and in industry, particularly in pharmaceutical and agricultural settings. The availability of a toolkit on a publicly accessible cloud will increase its usability by the global research community, while tackling many hurdles posed by Big Data analysis. The toolkit includes various components that can be used independently of this proposal by researchers analyzing next-generation sequencing datasets. All the components will be hosted in a high-performance computing environment, which makes them attractive because of greatly reduced execution time. The proposed toolkit includes analysis pipelines that are fully traceable, supporting the sharing and reproducibility of results; this will benefit collaborators and reviewers. The iPlant cloud used in this proposal is free for researchers and is maintained by a dedicated team, resulting in substantial cost benefits to research institutions. This proposal enables the sharing of tools and execution platforms, in addition to the standard sharing of data, meeting an important goal of funding bodies.

Publications

 
Description As more genomes are sequenced it is becoming apparent that gene duplication and deletion are important drivers in evolution, with rapidly expanding gene families often a signature of their role in an organism's adaptation to the environment. To identify these events, we are building a gene family analysis toolkit which will be deployed on the Cyverse cloud infrastructure for use by the scientific community. Central to its purpose will be the ability to distinguish evolutionary relationships between genes within gene families for currently and newly sequenced species, both at the intra- and inter-species level.

As a pilot study, we have used a collection of landrace bread wheats (the Watkins collection) sequenced by exome capture to explore the diversity of the large Nucleotide Binding Leucine Rich Repeat (NLR) family of plant intracellular immune receptors in the collection. Illumina read data from each wheat line have been assembled using various tools. The resulting contigs have been aligned to their corresponding subgroups within the NLR family. Looking at subgroups that are expanded relative to other monocot species, we can demonstrate that further novel gene duplication events have occurred in specific lines of the Watkins collection.
Exploitation Route This diversity in NLR genes may be driving resistance to pathogens
Sectors Agriculture, Food and Drink

 
Title Watkins core collection re-sequencing data 
Description Re-sequence data for the Watkins collection 
Type Of Material Database/Collection of data 
Year Produced 2018 
Provided To Others? Yes  
Impact International collaboration 
URL https://grassroots.tools/data/under_license/toronto/
 
Description Computational biology for Genomics 
Organisation IBM
Department IBM UK Labs Ltd
Country United Kingdom 
Sector Private 
PI Contribution We have had scoping meetings and work with Ritesh Krishna on the project
Collaborator Contribution Initial sharing of expertise
Impact Paper https://doi.org/10.1101/2021.02.04.429826 Code https://github.com/JoshuaColmer/HallCircadian
Start Year 2017
 
Description Paul Bailey talk at PAG at Cyverse workshop 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact As more genomes are sequenced it is becoming apparent that gene duplication and deletion are important drivers in evolution, with rapidly expanding gene families often a signature of their role in an organism's adaptation to the environment. To identify these events, we are building a gene family analysis toolkit which will be deployed on the Cyverse cloud infrastructure for use by the scientific community. Central to its purpose will be the ability to distinguish evolutionary relationships between genes within gene families for currently and newly sequenced species, both at the intra- and inter-species level.
As a pilot study, we are using a collection of landrace bread wheats (the Watkins collection) which have been sequenced by exome capture to explore the diversity of the large Nucleotide Binding Leucine Rich Repeat (NLR) family of plant resistance genes in the collection. Illumina read data from each wheat line have been assembled using various tools. The resulting contigs have been aligned to their corresponding subgroups within the NLR family. Looking at subgroups that are expanded relative to other monocot species, we can demonstrate that further novel gene duplication events have occurred in specific lines of the Watkins collection. The next step will be to understand whether specific genes in the family are under positive selection and therefore which genes have particular functional significance. The assembly and downstream procedures will be placed into a Docker container for use on Cyverse as a tool for exploring the diversity of any gene family in any species with sequence data.
Year(s) Of Engagement Activity 2018
URL https://pag.confex.com/pag/xxvi/meetingapp.cgi/Paper/29841
 
Description Talk at the Cyverse workshop at the Plant and Animal Genome conference 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Presented work on using Cyverse
Year(s) Of Engagement Activity 2018