Genome Annotation for the Masses

Lead Research Organisation: Queen Mary University of London
Department Name: Sch of Biological and Chemical Sciences

Abstract

The hereditary information carried by each living thing is its genome. It is stored in the form of DNA sequences of As, Cs, Gs, and Ts, and between 1 and 5% of the genome sequence consists of genes. These genes contain instruction sets for small protein machines that accomplish specific tasks and ultimately determine the organism's shape, size, behavior, lifespan and disease susceptibility.

Determining the genome sequence of an organism is now straightforward. But understanding which genes are responsible for the unique characteristics of the organism remains challenging. This is due in particular to the difficulty of correctly finding the genes in the genome and determining which parts of their sequence encode proteins. Indeed, automatic gene identification software performs poorly, so the evidence for each potential gene model needs to be visually inspected and corrected. Preparing the data for even a small research project can therefore take months.

Luckily there is a solution. Thousands of members of the general public have used the internet to contribute their time to scientific projects such as GalaxyZoo and FoldIt, be it out of curiosity, a desire to help the greater good, to gain peer recognition or simply to have fun. Results of their contributions include the identification of previously unknown galaxy types and determination of the 3D structures of AIDS proteins.

The proposed project uses a similar approach to encourage members of the general public to help identify genes in the genome and refine their borders. We are constructing a game in which contributors use pattern recognition skills to improve gene models. Contributors will be able to choose to focus their efforts on particular species (e.g. ants, humans, elephants) or research topics (e.g. cancer, immunity, longevity, taste or odor perception, behavior). They will earn points and thus peer recognition for their contributions, and may be acknowledged in scientific publications or even financially compensated.

This project will thus allow members of the general public to have fun while helping to make the world a better place and facilitate scientific discovery.

Technical Summary

Genomes of emerging model organisms are now being sequenced at almost no cost. The major bottleneck has become obtaining accurate gene models, because automated gene prediction programs incorrectly predict start sites and intron-exon boundaries and may even miss or merge whole genes, even if large amounts of RNA sequence are available. Fixing and refining gene models is thus required before rigorous analyses can be performed. However, refining a single gene model can take up to several hours and thus remains difficult to justify beyond exceptional cases.

Tasks from other research areas that require human brainpower but are similarly repetitive have been successfully crowd-sourced to members of the general public. GalaxyZoo volunteers have categorized millions of photos of galaxies and thus triggered the characterization of multiple previously unknown galaxy types and other stellar objects. Similarly, players of the FoldIt game earn points by minimizing the free energy of putative protein structures and in some cases perform better than specialized structure prediction algorithms or even expert protein modelers.

Contributors to such projects may be motivated by the intellectual challenge, the desire to learn new skills, to contribute to the greater good, to compete or earn recognition among peers, or in some cases even to earn small amounts of financial compensation. The project proposed here takes inspiration from such crowd-sourcing initiatives. We aim to create an online game that crowd-sources gene model refinement. In doing this our game will provide a key service to biologists by rapidly generating high-quality gene annotations at little or no cost.

Planned Impact

Members of the general public who use our software will learn new biological knowledge and skills. This capacity building will occur thanks to use of educational material we put on the website and to the thought processes required for refining gene models. Additionally, the visibility this project obtains among contributors and the general public will increase public engagement with biological research.

Visibility will be obtained:
* initially through a small online advertising campaign and use of our tool in coursework, and subsequently by strongly encouraging users to advertise their participation to peers on social networks such as Facebook,
* through our international, interdisciplinary team of collaborators,
* thanks to the public relations office at Queen Mary University of London, the Swiss Institute of Bioinformatics (SwissProt) and other collaborating institutes.

Our project will also:
* contribute toward changing organizational culture and practices by showing that crowdsourcing practices work,
* accelerate discoveries in fundamental bioscience including those relating to food security and improving human quality of life and health,
* improve the effectiveness of researchers thus indirectly improving society.

Publications

 
Description Having accurate gene predictions is essential for much modern biological research. Unfortunately, this is only possible after visual inspection and manual fixing (curation). This makes projects requiring high quality predictions for thousands of genes impossible beyond work on humans and fruit flies.
We built a tool that aims to bring gene feature visualisation and improvement to a larger group of people. With this "crowd-sourcing" approach, we 1. obtain improved gene predictions (which in turn improve analyses that depend on them) and 2. educate contributors (currently university-level students).

The software is currently able to offer a complete crowd-sourcing approach for contributors who already have some basic biological knowledge. We hope to expand it on at least three levels:
1. so that it is used in other institutions,
2. to better deal with complex gene predictions (when contributors provide conflicting information),
3. to flatten the learning curve by improving tutorials for non-biologists.

Additionally, we have created and published a tool that helps visualise problems with gene predictions (GeneValidator).
Exploitation Route We have begun to collaborate with others who want to build upon our approach to 1. improve teaching curricula, 2. improve gene prediction quality and 3. add more biological relevance to crowd-sourcing initiatives.
We have received a small grant (Drapers' Fund for Innovation in Learning and Teaching; 5000 GBP) to push key features of this further.
Sectors Digital/Communication/Information Technologies (including Software),Education,Other

URL http://afra.sbcs.qmul.ac.uk
 
Description Thanks to the 10,000-fold drop in DNA sequencing costs since 2007, it is far easier to obtain a genome sequence than before. Obtaining high quality gene predictions remains complex because individual gene predictions need to be verified and often improved by humans. We have developed basic software to "crowd-source" gene prediction verification and improvement. We have already used it in educating undergraduate and masters-level students, teaching them about 1. gene structure, 2. the tradeoffs in automated analysis and 3. comparative genomics. It is being used in multiple institutions world-wide. While the students learn, they contribute to research; in particular, they have contributed improved gene models for fire ant genomes. We are reaching out to other communities that will be able to take advantage of this tool. Furthermore, our project is open source and the computer code has already been used in several additional projects.
First Year Of Impact 2013
Sector Digital/Communication/Information Technologies (including Software),Education,Other
Impact Types Cultural,Societal

 
Description Nescent working group - curriculum development
Geographic Reach Multiple continents/international 
Policy Influence Type Influenced training of practitioners or researchers
 
Description Software Development best practices in bioinformatics
Geographic Reach Multiple continents/international 
Policy Influence Type Influenced training of practitioners or researchers
Impact We have actively advocated for the respect of best practices for software development in scientific research. This is in line with partner efforts at http://software.ac.uk. This has broad impacts throughout the sciences in which software is used (i.e. almost all of them!), and in particular in genomics/bioinformatics, where such approaches remain undervalued (and the potential risk of not pursuing best practices is not yet widely known).
 
Description BBSRC NPIF Case Studentship
Amount £107,034 (GBP)
Funding ID BB/S507556/1 
Organisation Biotechnology and Biological Sciences Research Council (BBSRC) 
Sector Public
Country United Kingdom
Start 12/2018 
End 11/2022
 
Description Google Summer of Code
Amount $5,500 (USD)
Funding ID Monica Dragan 
Organisation Google 
Department Google Summer of Code
Sector Charity/Non Profit
Country United States
Start 05/2013 
End 10/2013
 
Description Google Summer of Code - Hiten
Amount $5,000 (USD)
Organisation Google 
Department Google Summer of Code
Sector Charity/Non Profit
Country United States
Start 06/2016 
End 09/2016
 
Description Google Summer of Code - Julian Mazzitelli
Amount $5,000 (USD)
Organisation Google 
Department Google Summer of Code
Sector Charity/Non Profit
Country United States
Start 06/2016 
End 09/2016
 
Description Marie Sklodowska Curie Incoming Fellowship H2020-MSCA-IF-2018
Amount € 224,933 (EUR)
Organisation European Commission H2020 
Sector Public
Country Belgium
Start 10/2019 
End 10/2021
 
Description Marie Sklodowska Curie Incoming Fellowship H2020-MSCA-IF-2018 (another)
Amount € 212,933 (EUR)
Funding ID EvolvAnt 
Organisation European Commission H2020 
Sector Public
Country Belgium
Start 03/2020 
End 02/2022
 
Description Marie Curie
Amount € 221,606 (EUR)
Funding ID 623713 
Organisation European Commission 
Department Seventh Framework Programme (FP7)
Sector Public
Country European Union (EU)
Start 02/2015 
End 02/2017
 
Description NE/P012574/1
Amount £648,559 (GBP)
Funding ID NE/P012574/1 
Organisation Natural Environment Research Council 
Sector Public
Country United Kingdom
Start 06/2017 
End 04/2020
 
Description NERC big capital
Amount £500,000 (GBP)
Organisation Natural Environment Research Council 
Sector Public
Country United Kingdom
Start 09/2013 
End 03/2015
 
Description Nescent working group
Amount $50,000 (USD)
Organisation National Science Foundation (NSF) 
Department National Evolutionary Synthesis Center
Sector Academic/University
Country United States
Start 01/2013 
End 11/2015
 
Description QMUL - Drapers' Fund for Innovation in Learning and Teaching
Amount £5,000 (GBP)
Organisation Queen Mary University of London 
Sector Academic/University
Country United Kingdom
Start 01/2017 
End 07/2017
 
Description Software sustainability Fellowship
Amount £3,000 (GBP)
Organisation University of Edinburgh 
Department UK Software Sustainability Institute
Sector Academic/University
Country United Kingdom
Start 01/2013 
End 03/2015
 
Title Afra: Gene curation crowdsourcing platform 
Description See software: afra 
Type Of Material Technology assay or reagent 
Year Produced 2014 
Provided To Others? Yes  
Impact See software: afra 
URL Http://afra.sbcs.qmul.ac.uk
 
Title Bionode 
Description See software: Bionode 
Type Of Material Technology assay or reagent 
Year Produced 2014 
Provided To Others? Yes  
Impact See software: Bionode 
URL Http://www.bionode.io
 
Title Flo 
Description Software: flo to transfer gene predictions from one genome assembly to another genome assembly (from same species) 
Type Of Material Technology assay or reagent 
Year Produced 2016 
Provided To Others? Yes  
Impact Makes it easier to use new (higher quality) genome assemblies 
URL https://github.com/wurmlab/flo
 
Title Genevalidator 
Description See software: Genevalidator 
Type Of Material Technology assay or reagent 
Year Produced 2014 
Provided To Others? Yes  
Impact See software: Genevalidator
URL http://genevalidator.sbcs.qmul.ac.uk
 
Description Bioinformatics for the classroom - Raspberry Pi 
Organisation University of St Andrews
Department School of Biology
Country United Kingdom 
Sector Academic/University 
PI Contribution Joint project development - not yet funded.
Collaborator Contribution Joint project development - not yet funded.
Impact Grant development to bring bioinformatics skills to high-school level students- not yet funded
Start Year 2013
 
Description Chris Dessimoz 
Organisation Swiss Institute of Bioinformatics (SIB)
Country Switzerland 
Sector Charity/Non Profit 
PI Contribution I initiated the collaboration to obtain new PhD funds
Collaborator Contribution Engaged constructively. Has extensive expertise needed for joint project
Impact Recently funded BBSRC NPIF grant
Start Year 2015
 
Description Collab Marc Robinson Rechavi 
Organisation Swiss Institute of Bioinformatics (SIB)
Country Switzerland 
Sector Charity/Non Profit 
PI Contribution New collaboration - we obtained samples, dissected, extracted RNA. We are leading bioinformatic analysis
Collaborator Contribution Partner contributed funds for field sampling (which we did), and for gene expression sequencing. They are helping with bioinformatic analysis
Impact not yet
Start Year 2016
 
Description Fellowship at Alan Turing Institute for data science and artificial intelligence
Organisation Alan Turing Institute
Country United Kingdom 
Sector Academic/University 
PI Contribution I am a fellow - interacting with data-centric peers from other fields
Collaborator Contribution Expertise of others in data science techniques, carried over into our research. This has led to synergistic grant application and project ideas.
Impact Collaborative BBSRC grant submission
Start Year 2018
 
Description NERC EOS Cloud 
Organisation Cardiff University
Country United Kingdom 
Sector Academic/University 
PI Contribution Joint grant proposal (NERC) put together and awarded (500,000) - including 85k for our side. Partners contributed equally in terms of vision development - more of the leadership in terms of implementation came from our partners. We are developing a sub aspect of the project.
Collaborator Contribution Joint grant proposal (NERC) put together and awarded (500,000) - including 85k for our side. Partners contributed equally in terms of vision development - more of the leadership in terms of implementation came from our partners. We are developing a sub aspect of the project.
Impact In progress. But we are in talks with related projects funded by other RCUK members.
Start Year 2014
 
Description NERC EOS Cloud 
Organisation UK Centre for Ecology & Hydrology
Country United Kingdom 
Sector Public 
PI Contribution Joint grant proposal (NERC) put together and awarded (500,000) - including 85k for our side. Partners contributed equally in terms of vision development - more of the leadership in terms of implementation came from our partners. We are developing a sub aspect of the project.
Collaborator Contribution Joint grant proposal (NERC) put together and awarded (500,000) - including 85k for our side. Partners contributed equally in terms of vision development - more of the leadership in terms of implementation came from our partners. We are developing a sub aspect of the project.
Impact In progress. But we are in talks with related projects funded by other RCUK members.
Start Year 2014
 
Description NERC EOS Cloud 
Organisation University of Oxford
Department Oxford E-Research Centre
Country United Kingdom 
Sector Academic/University 
PI Contribution Joint grant proposal (NERC) put together and awarded (500,000) - including 85k for our side. Partners contributed equally in terms of vision development - more of the leadership in terms of implementation came from our partners. We are developing a sub aspect of the project.
Collaborator Contribution Joint grant proposal (NERC) put together and awarded (500,000) - including 85k for our side. Partners contributed equally in terms of vision development - more of the leadership in terms of implementation came from our partners. We are developing a sub aspect of the project.
Impact In progress. But we are in talks with related projects funded by other RCUK members.
Start Year 2014
 
Description Nescent: Building non-model species genome curation communities 
Organisation Commonwealth Scientific and Industrial Research Organisation
Country Australia 
Sector Public 
PI Contribution We have contributed software, ideas, meeting time (3 one week workshops), writing time.
Collaborator Contribution The workshops were oriented around curriculum development & software need identification for genomics on emerging model organisms. The partners contributed code, writing, documentation & ideas. This was extremely productive and helped our project in terms of feedback, in-kind contributions (code), effort optimisation (the partners include developers of Apollo, on which our work is based), and visibility/impact of our developed tool. Additionally, the meetings resulted in our writing a review article together and submitting a joint grant application (BBSRC+NSF joint bid).
Impact * joint review publication * source code * joint grant application
Start Year 2013
 
Description Nescent: Building non-model species genome curation communities 
Organisation Lawrence Berkeley National Laboratory
Country United States 
Sector Public 
PI Contribution We have contributed software, ideas, meeting time (3 one week workshops), writing time.
Collaborator Contribution The workshops were oriented around curriculum development & software need identification for genomics on emerging model organisms. The partners contributed code, writing, documentation & ideas. This was extremely productive and helped our project in terms of feedback, in-kind contributions (code), effort optimisation (the partners include developers of Apollo, on which our work is based), and visibility/impact of our developed tool. Additionally, the meetings resulted in our writing a review article together and submitting a joint grant application (BBSRC+NSF joint bid).
Impact * joint review publication * source code * joint grant application
Start Year 2013
 
Description Nescent: Building non-model species genome curation communities 
Organisation National Science Foundation (NSF)
Department National Evolutionary Synthesis Center
Country United States 
Sector Academic/University 
PI Contribution We have contributed software, ideas, meeting time (3 one week workshops), writing time.
Collaborator Contribution The workshops were oriented around curriculum development & software need identification for genomics on emerging model organisms. The partners contributed code, writing, documentation & ideas. This was extremely productive and helped our project in terms of feedback, in-kind contributions (code), effort optimisation (the partners include developers of Apollo, on which our work is based), and visibility/impact of our developed tool. Additionally, the meetings resulted in our writing a review article together and submitting a joint grant application (BBSRC+NSF joint bid).
Impact * joint review publication * source code * joint grant application
Start Year 2013
 
Title Afra: Crowdsourcing genome annotation 
Description As described elsewhere on researchfish - this tool aims to bring gene feature visualisation and improvement to a larger group of people... with two aims: 1. to improve gene predictions (and analyses that depend on them) and 2. to help educate contributors. The software is currently able to offer a complete approach for contributors who already have some basic biological knowledge. 
Type Of Technology Software 
Year Produced 2014 
Open Source License? Yes  
Impact * deployed to students (improvements in learning experience) * new collaborations created (Dundee, NESCent (US and Australia), TGAC) * creating better gene curations for ants
URL http://afra.sbcs.qmul.ac.uk
 
Title Bionode 
Description Major challenges when doing bioinformatics work include eliminating redundancy and having to juggle heterogeneous technologies. To facilitate our work (specific aims of funded project) while creating an environment with broader impact, we started Bionode. Bionode provides pipeable UNIX command line tools and JavaScript APIs for bioinformatic analysis workflows. This means that a library written once is available on the command line, on the client side (web app) and on a high performance compute cluster. Furthermore, this software library is built using the Node.js technology, allowing it to take advantage of large amounts of work by people in internet startups.
Type Of Technology Software 
Year Produced 2014 
Open Source License? Yes  
Impact This project has attracted users and contributors from around the world (others are improving what we set up), while facilitating development & improving maintainability and robustness of the main funded project. 
URL http://www.bionode.io
 
Title GeneValidator 
Description Genomes of emerging model organisms are now being sequenced at very low cost. However, obtaining accurate gene predictions remains challenging. Even the best gene prediction algorithms make substantial errors, leading to further erroneous analysis. Therefore, many predicted genes need to be visually inspected and manually curated, a time consuming process. Here we propose GeneValidator, a tool to identify problematic gene predictions and to guide curation efforts. For each newly predicted protein-coding gene, GeneValidator finds similar sequences in databases of known genes and performs general gene-characteristic comparisons. The resulting report highlights differences between each putative protein-coding gene and similar genes from the database. This allows rapid identification of curation need and guides curators in performing their work. We thus expect GeneValidator to greatly accelerate and enhance the work of biocurators and researchers working with recently sequenced genomes. 
Type Of Technology Software 
Year Produced 2014 
Open Source License? Yes  
Impact Publication is in prep. 
URL https://github.com/monicadragan/GeneValidator/
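
Illustration (not part of the GeneValidator code base): the description above mentions "general gene-characteristic comparisons" between a predicted gene and similar known genes. One such comparison could be protein length. The Python sketch below is a minimal, hypothetical example of that idea only; the function name `check_length`, the `tolerance` parameter and the verdict labels are assumptions made here, and the real tool's checks and thresholds may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LengthCheck:
    predicted_length: int
    shortest_hit: int
    longest_hit: int
    verdict: str  # "ok", "too_short" or "too_long"


def check_length(predicted_length: int,
                 hit_lengths: List[int],
                 tolerance: float = 0.1) -> LengthCheck:
    """Compare a predicted protein's length with lengths of similar known proteins.

    hit_lengths: lengths (in amino acids) of similar sequences found in a
    database of known genes. tolerance widens the accepted range by a
    fraction of the shortest and longest hit (an arbitrary, assumed value).
    """
    if not hit_lengths:
        raise ValueError("need at least one similar sequence to compare against")
    shortest, longest = min(hit_lengths), max(hit_lengths)
    low = shortest * (1 - tolerance)
    high = longest * (1 + tolerance)
    if predicted_length < low:
        verdict = "too_short"   # e.g. a truncated or split gene prediction
    elif predicted_length > high:
        verdict = "too_long"    # e.g. two neighbouring genes merged into one
    else:
        verdict = "ok"
    return LengthCheck(predicted_length, shortest, longest, verdict)


if __name__ == "__main__":
    # Hypothetical lengths, for illustration only.
    print(check_length(200, [480, 500, 520]).verdict)
    print(check_length(1100, [480, 500, 520]).verdict)
```

A prediction flagged "too_short" or "too_long" would be a candidate for manual curation; predictions marked "ok" on this one criterion could still fail other comparisons.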
 
Title Sequenceserver 
Description Makes it easier to perform BLAST 
Type Of Technology Software 
Year Produced 2012 
Open Source License? Yes  
Impact (development has continued). 
URL http://www.sequenceserver.com
 
Company Name PRAGMATIC GENOMICS LIMITED 
Description Software as a service company - for our bioinformatics genome analysis software 
Year Established 2021 
Impact Customers in private (biotech, agroindustry), public and third sectors.
Website https://pragmaticgenomics.com
 
Description Passive web recruitment 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact Simply having our website up, even before the software was ready, led to ~30 unknown people signing up to contribute curations. We don't know how they found us other than through Google. We weren't able to cater to them as well as we would have wanted because our software platform was still too young.
In any case this shows the potential of our crowd-sourcing approach to recruit participants via our online presence.
Year(s) Of Engagement Activity 2014
URL http://afra.sbcs.qmul.ac.uk