Visual Interactive Pedigree ExploreR (VIPER)

Lead Research Organisation: Edinburgh Napier University
Department Name: Computing

Abstract

This project aims to produce a tool to remove errors in animal pedigree information caused by administrative and data handling faults. Large amounts of animal pedigree and characteristic data are logged and stored during the course of animal breeding studies. However, to be of any use for further programmes or analysis the data needs to be as free of error as possible. Errors in data storage, such as recording the wrong father for an animal or an unnoticed change in associated gene data, are easy to introduce when hundreds or thousands of individual animals are being dealt with. Unfortunately, while it is relatively easy to process this data to find the existence of errors, finding and correcting the cause of the errors is more difficult. For example, it isn't straightforward to know whether an error lies in the pedigree (i.e. the child-parent relationships) or in the characteristics associated with the animals. An animal may be recorded as having a certain characteristic that, on examination, could not have been inherited from its two recorded parents. So is the recording of one or both of the parents wrong, the recording of the characteristic in the child animal incorrect, or the characteristic in one of the parent animals wrong? To answer this question further examination of the problem animals' relations in the pedigree is necessary. However, in a text or spreadsheet-based document this quickly becomes tedious and confusing even when the operations to detect and show errors in the data are available. If we were instead to switch to a more graphical, user-friendly style of displaying the data then it would be easier to follow relationships in the pedigree. If we added on top the capability to interactively show where errors occurred and what could have caused them, we would have a way of examining the pedigree data and asking questions that would clear up or narrow down errors.
Such a way of displaying and interacting with data is called Information Visualisation (IV). Unlike human family trees, most recorded animal pedigrees have a large degree of in-breeding as scientists and breeders try to encourage certain characteristics through selective breeding. This makes the drawing of animal pedigrees more complex as two individuals may end up being related through two or more routes. By extending current IV techniques for this type of data this project will make the interface less complex by interactively showing only selected individuals and their relationships. On top of this the scientists will also wish to view some display of the characteristics associated with the animals and again the complexity can be reduced by viewing only a handful of characteristics at a time. Even so, one male animal can easily sire dozens of children who are in turn related to dozens of female parents and then in turn again may have children of their own - and there may be several characteristics at a time that a scientist is interested in exploring for these animals. Methods for seamlessly moving from showing one part of a pedigree to another will be developed to help scientists explore massive pedigrees. Once an initial interface is built then a means for exploring errors by asking 'what-if' questions will be developed. Possibilities include the ability to 'mask out' problem individuals or problem characteristics to see what effect that has on the pedigree and errors, or to actually edit information and recalculate the effect on the pedigree again. The ability to redo and undo past actions will be needed and in the end the scientist will produce a set of actions that lead to a clean data set, or as close as can be achieved. Throughout the course of the project the work will be tested with scientists who use pedigree data.
In the end we will produce a tool that will benefit scientists who work with pedigrees by allowing them to readily clean their data, allowing them to share it usefully with other scientists.
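
The 'what-if' editing with undo and redo described above could be organised as a history of reversible actions. The following is a minimal Python sketch under stated assumptions, not VIPER's implementation: the names ReassignSire and EditHistory are hypothetical, and the pedigree is reduced to a dictionary mapping each animal to its recorded sire.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ReassignSire:
    """Hypothetical edit: record a different father for one animal."""
    animal: str
    new_sire: str
    old_sire: Optional[str] = None  # filled in when the edit is applied

    def apply(self, sires: Dict[str, Optional[str]]) -> None:
        self.old_sire = sires.get(self.animal)
        sires[self.animal] = self.new_sire

    def revert(self, sires: Dict[str, Optional[str]]) -> None:
        sires[self.animal] = self.old_sire


class EditHistory:
    """Keeps applied edits so they can be undone, redone and reported."""

    def __init__(self, sires: Dict[str, Optional[str]]) -> None:
        self.sires = sires
        self.done: List[ReassignSire] = []
        self.undone: List[ReassignSire] = []

    def do(self, action: ReassignSire) -> None:
        action.apply(self.sires)
        self.done.append(action)
        self.undone.clear()  # a fresh edit discards the redo stack

    def undo(self) -> None:
        if self.done:
            action = self.done.pop()
            action.revert(self.sires)
            self.undone.append(action)

    def redo(self) -> None:
        if self.undone:
            action = self.undone.pop()
            action.apply(self.sires)
            self.done.append(action)


# Made-up sample data: two calves, both recorded against the same sire.
sires: Dict[str, Optional[str]] = {"calf1": "sire1", "calf2": "sire1"}
history = EditHistory(sires)

history.do(ReassignSire("calf1", "sire2"))  # what if calf1's father is sire2?
after_edit = dict(sires)
history.undo()
after_undo = dict(sires)
history.redo()
after_redo = dict(sires)
```

In this sketch the list history.done plays the role of the "set of actions" that the scientist ends up with; a fuller version would add further action types, such as masking a characteristic or editing a recorded value.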

Technical Summary

Pedigree genotype data produced from animal breeding experiments are the basis for the genetic mapping of markers and phenotypes that underpin selective breeding programmes. To be useful for such work the pedigree genotypes need to be free from error, but the size of the datasets means that some pedigree errors, mis-typings and sample mis-identifications are inevitable. Current tools for identifying such errors may show where these problems manifest, but tracing the cause of the errors is more complex and, in current text- and table-based tools, becomes intractable, especially when multiple errors may be at work. To this end, we propose a new Information Visualisation (IV) tool to aid geneticists. IV is the use of graphical and interactive techniques to display and query abstract data sets such as pedigree genotypes. A first phase will develop a tool that shows the pedigree and associated marker data in an intuitive graphical representation and integrate it with existing back-end data cleaning algorithms to show where errors occur in a pedigree. A second phase will incorporate interactive techniques for dynamic feedback that allow geneticists to hypothesise as to the source of data errors within a pedigree and view the effect on the state of the erroneous data. For example, this may include reassigning parent animals, changing specific marker values or masking entire sets of markers. Ultimately a geneticist will be able to arrive at a set of actions that produce an error-free data set. The outcome of this research project will therefore be a tool that allows a geneticist to interactively clean up pedigree genotype datasets. Beneficiaries will include the geneticists who produce the initial data and other specialists who will now be able to use the cleaned data set for their own analyses and research.
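
As an illustration of the kind of inheritance check that shows where errors manifest, the sketch below flags any marker genotype that could not have come from the two recorded parents, and allows markers to be masked. It is a simplified Python example, not VIPER's code or the back-end cleaning algorithms it integrates: the Animal class, the find_errors function and the sample data are hypothetical, and it assumes simple two-allele autosomal markers and skips any marker that is untyped in either parent.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional, Tuple

Genotype = Tuple[str, str]  # the two alleles recorded at one marker


@dataclass
class Animal:
    name: str
    sire: Optional[str]             # recorded father, None if unknown
    dam: Optional[str]              # recorded mother, None if unknown
    genotypes: Dict[str, Genotype]  # marker name -> alleles


def is_consistent(child: Genotype, sire: Genotype, dam: Genotype) -> bool:
    """True if one child allele can come from the sire and the other from the dam."""
    a, b = child
    return (a in sire and b in dam) or (b in sire and a in dam)


def find_errors(
    pedigree: Dict[str, Animal],
    masked: FrozenSet[str] = frozenset(),
) -> List[Tuple[str, str]]:
    """Return (animal, marker) pairs that break the inheritance rule.

    Markers listed in `masked` are ignored, mimicking a 'mask out' action.
    """
    errors = []
    for animal in pedigree.values():
        sire = pedigree.get(animal.sire) if animal.sire else None
        dam = pedigree.get(animal.dam) if animal.dam else None
        if sire is None or dam is None:
            continue  # cannot check animals without two recorded parents
        for marker, child_gt in animal.genotypes.items():
            if marker in masked:
                continue
            if marker not in sire.genotypes or marker not in dam.genotypes:
                continue  # marker not typed in both parents
            if not is_consistent(child_gt, sire.genotypes[marker], dam.genotypes[marker]):
                errors.append((animal.name, marker))
    return errors


# Made-up sample data: calf1 is G/G at marker M1 although its sire is A/A.
pedigree = {
    "sire1": Animal("sire1", None, None, {"M1": ("A", "A"), "M2": ("C", "T")}),
    "dam1": Animal("dam1", None, None, {"M1": ("A", "G"), "M2": ("C", "C")}),
    "calf1": Animal("calf1", "sire1", "dam1", {"M1": ("G", "G"), "M2": ("C", "T")}),
}

errors = find_errors(pedigree)
errors_masked = find_errors(pedigree, masked=frozenset({"M1"}))
```

A flagged pair such as (calf1, M1) only locates the inconsistency. It does not say whether the recorded sire, the recorded dam or one of the three genotypes is wrong, which is the question the interactive tool is meant to help a geneticist answer.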

Planned Impact

The main beneficiaries of this project will be geneticists dealing with large pedigree genotype datasets, who will have a useful tool that will enable them to interactively find and eliminate the cause of error in their data. Previously, errors in such data have reduced its usability and shareability between researchers, and weeding them out is laborious, difficult and time-consuming work. Geneticists can be found in a wide range of establishments, both publicly and privately funded, ranging from research institutes such as Roslin to commercial animal breeding concerns that generate and turn over millions of pounds through livestock yield and quality improvements. Whilst the target of the research is currently animal pedigrees, researchers in other domains with breeding programmes have shown interest in the planned results of VIPER, for example the Scottish Crop Research Institute (SCRI) at Dundee. Further beneficiaries include user interface developers who will gain new techniques for visualising complex pedigree-style structures; we envisage that one of the benefits could be improved data representation for genealogy software that allows members of the general public to explore their family trees and the associated inherited characteristics in families. From a public policy perspective, the need to develop data checking and cleaning tools and mechanisms for data submitted to public data repositories could be demonstrated by the use and success of this tool. This is especially important as such repositories become centralised locations for sourcing research data. The primary means of communicating outputs from the research will be through a project website to enable interested parties to download usable versions of the prototype pedigree cleaning tool along with requisite instructions. Publications such as journals and conferences in the bioinformatics and visualisation areas will be the appropriate conduits to disseminate research findings.
Geneticists will be contacted to volunteer as testers of the software, both formally and informally, as development of the tool must be responsive to the needs of those most likely to utilise it. Other fora such as public and industrial outreach events are also available, such as university open days and student recruitment fairs where visualisation-based tools make attractive visual talking points. Edinburgh Napier University is also a member of SICSA, a collection of Scottish computer science departments that aims to publicise and commercialise research, and at whose events we frequently present our work. The Scottish Bioinformatics Forum (SBF) is also an avenue for exposing research to other interested researchers, and Edinburgh Napier has conducted workshops there in the past. It is expected that the software will be released under an open-source licence; however this still leaves scope for licensing and producing bespoke versions of the application with enhanced or tailored capabilities for interested parties in future. Both Edinburgh Napier and Roslin's parent institute, the University of Edinburgh, have full-time commercialisation teams that would be used in this case to advise on suitable licensing and contracting terms. The Edinburgh Napier PI has experience of working on commercialising software, supervising a Proof of Concept award for micro-array analysis and visualisation.
 
Description A well-designed visual representation aids in revealing and removing inheritance errors in pedigree genotypes, specifically by showing family-based patterns to the errors.

The work led to a PhD exploring the possibilities for visualising plant pedigrees - plants have more complicated inheritance mechanisms and family trees than animals, and often there is no concept of an individual plant - only varieties.



Pedigree-based errors are best solved by a human expert, as automatic data cleaning simply pushes the error reporting down the inheritance path. Automatic cleaning is useful for removing sporadic errors that would be tedious to fix manually. In short, clean the worst areas with expert help and the rest automatically.
Exploitation Route Potential use for commercial animal breeders in finding errors within their breeding programmes.
Sectors Agriculture, Food and Drink; Digital/Communication/Information Technologies (including Software)

URL http://www.iidi.napier.ac.uk/c/grants/grantid/13258147
 
Description The tool that has been developed is used by biologists at Roslin for cleaning their pedigree genomic data sets.
First Year Of Impact 2011
Sector Agriculture, Food and Drink; Digital/Communication/Information Technologies (including Software)
Impact Types Economic

 
Description Roslin Institute 
Organisation University of Edinburgh
Department The Roslin Institute
Country United Kingdom 
Sector Academic/University 
PI Contribution VIPER was a joint enterprise between Roslin and Edinburgh Napier. Roslin supplied bioinformatics expertise and datasets.
Collaborator Contribution Edinburgh Napier developed the visualisations for VIPER based on the data and feedback supplied by the Bioinformatics team at Roslin.
Impact Publications from VIPER are attached to project.
Start Year 2010
 
Title VIPER visualisation software 
Description Java-based application for viewing and cleaning pedigree genotype data.
Type Of Technology Software 
Year Produced 2012 
Impact No actual Impacts realised to date 
URL http://www.viper-project.org.uk/