Visual Exploration of Species-referenced Repositories (VESpeR)
Lead Research Organisation:
Edinburgh Napier University
Department Name: Computing
Abstract
Today a multitude of biological databases hold a wide array of information about different species, including museum specimen records, occurrence information, genome sequence and expression data, and image data, to name a few. A common feature of these databases is that the information normally corresponds to a particular species (or taxon), so the databases tend to employ some taxonomy to structure the information and access the data. However, as yet there is no common taxonomy used across these databases to enable reliable linking between them.
Matching species across databases is challenging. Different databases can and do use different classifications, which use different names to represent the same underlying species or taxa. Tools that aid integration of data from these sources will benefit biologists by allowing them to incorporate additional data into their analyses and to improve the quality of the data and the accuracy of results.
The utility of visualising data is well established for tasks such as presentation of information. Visualisations are also effective for a range of other tasks, such as acting as ad-hoc error checks for data: for example, spotting a record of a lion in the middle of the Pacific Ocean in a geographic visualisation clearly suggests an error in the data. However, the true advantage of visualisation lies not in static presentation but in allowing users to interactively explore the data and view the effects of changes to constraints and variables. Suitable tools are frequently not available to biologists where they could be most useful.
This project will build on the biological standards developed for taxonomic information and develop a set of web-based visualisation tools for use by a wide range of biologists and end-users of these databases, to help them clean, explore and compare the data contained within. The resulting tools will have a wide-ranging impact on the quality of data made available and the accessibility of the data to a wide range of users.
Technical Summary
Visualisation techniques have been recognised as one of the major directions in future research when handling and querying biological data, offering the ability to find patterns and outliers in data which traditional query interfaces cannot match. A case in point is the multitude of species-referenced databases, covering data from genomic to biodiversity data linked by taxonomic classifications, that hold geographic and temporal-faceted data alongside other data. Many online databases hold collections of such data, often in archive format, but visual querying tools are invariably limited to a map interface of spatial distribution, neglecting the fact that biologists may wish to query or explore other facets of the data such as the classification or temporal distribution. Add to this the problem of many complementary databases using different taxonomic classifications to reference their specimens, and we have a situation where much of the potential utility of this data remains unused.
We therefore propose to develop a suite of web-based visualisation components for taxonomic, temporal and geographic aspects of these data sets that can be placed directly into the workflow of biologists who use such data. These components will be co-ordinated such that selections and actions in one component will be reflected in the data shown in other components. Further, we will build a novel cross-taxonomy viewer that will allow users to crosswalk different classifications, allowing them to accurately match specimens between data from different sources. These components will allow biologists to perform tasks such as sanity checking of data, viewing patterns in geographical, taxonomic or temporal aspects in an inter-related context, and accurately viewing data even when it spans conflicting taxonomic classifications. This work will thus make a significant contribution to the efficiency and usability of online catalogues for both the providers and end-users of the data they hold.
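The co-ordination between components described above can be illustrated with a minimal sketch, assuming a shared selection model that notifies every registered view. The class names `SelectionModel` and `View`, and the record identifiers, are hypothetical and are not taken from the VESpeR code base.

```python
from typing import Callable, Iterable, List, Set


class SelectionModel:
    """Holds the currently selected record ids and notifies registered listeners."""

    def __init__(self) -> None:
        self._selected: Set[str] = set()
        self._listeners: List[Callable[[Set[str]], None]] = []

    def subscribe(self, listener: Callable[[Set[str]], None]) -> None:
        self._listeners.append(listener)

    def select(self, record_ids: Iterable[str]) -> None:
        # Replace the selection, then pass a copy to each listener
        # so that no view can mutate the shared state by accident.
        self._selected = set(record_ids)
        for listener in self._listeners:
            listener(set(self._selected))


class View:
    """Stand-in for a taxonomy, map or timeline component (hypothetical)."""

    def __init__(self, name: str, model: SelectionModel) -> None:
        self.name = name
        self.highlighted: Set[str] = set()
        model.subscribe(self._on_selection)

    def _on_selection(self, record_ids: Set[str]) -> None:
        # A real component would redraw here; the sketch only records the ids.
        self.highlighted = record_ids


model = SelectionModel()
taxonomy_view = View("taxonomy", model)
map_view = View("map", model)
timeline_view = View("timeline", model)

# A selection made in any one component goes through the shared model...
model.select({"rec-1", "rec-7"})
# ...and every other component now highlights the same records.
```

Routing every selection through one shared model, instead of wiring views to each other directly, is one common way to keep such components independent of one another.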
Planned Impact
VESpeR's work will impact individuals and groups who supply and utilise the data stored within large online species resource databases. The advantages we claim for visualisation are as a presentation and communication medium, for error-checking, and for knowledge discovery.
Used as a presentation medium, the main non-academic beneficiaries of these tools will be the users of species-referenced databases such as GBIF, Catalogue of Life, Barcode of Life and the EMBL database, to whom this data will be communicated through graphical visualisations. Information Visualisation (IV) is becoming commonplace and is now used by the public at large: for example, IV techniques are frequently used on financial websites to track shares, or to communicate voting results. The IV techniques developed in this project will add to the range of user interface techniques available for communicating and exploring information. In this way it can be argued that visualisation is the channel of communication through which data is presented to any and all possible users of the datasets we are targeting, and thus contributes to an increasing public awareness of species-related information including biodiversity and its associated effects.
Projects with a large proportion of visualisation work such as VESpeR also make attractive visual material for public engagement at exhibitions and open days. Such projects are also interesting to undergraduate and Masters students looking for rewarding projects to undertake and increase training in this important area.
In order to engage business, we participate in outreach events, for example those organised through the Scottish Informatics and Computer Science Alliance (SICSA), of which Edinburgh Napier is a member. These events attract many delegates from industry looking for potential collaborative ventures with academic research. We have attended all of these events, and our posters and demonstrations have attracted commercial interest. We also plan to disseminate the research described in this proposal at similar bioinformatics events such as VIZBI, an annual meeting organised by biologists to promote visualisation to the biological community, at which Prof. Kennedy is this year's keynote speaker. Similarly, in late 2010 Kennedy was a speaker at the BBSRC/AHRC Workshop on 'The challenges of Visualising Biological Data' held in Bristol.
The work will strengthen existing links between Edinburgh Napier and the partners and supporting institutes, specifically GBIF in Copenhagen and Reading University, who hope to deploy the visualisations resulting from this project. The PI will be responsible for building the network of collaborators to continue to build new relationships and form new partnerships.
In addition the proposed visualisation tools will enhance collaboration between existing providers of species resource databases such as those in the i4Life project to allow them to more easily understand the overlap and differences in content of their repositories thereby improving the provision for the end-users of these databases.
Using the visualisation as an error-checking medium will allow cleaner and more precise data sets to be stored in the databases and thus reduce the potential for error in onward analyses. Biodiversity data is used for a wide range of non-academic purposes such as conservation planning, eco-tourism, public outreach, infrastructure planning, and land management and planning processes, and it is only logical that fewer errors in the data will lead to fewer errors in subsequent decision-making on such issues. Similarly, it is also in these fields that any knowledge discovery made using the visualisations may result in statutory policy, for example in biodiversity, which ends up affecting the general public as a whole.
The tools developed will allow better exploitation of data across these different repositories by helping reconcile species references.
Description | We have developed a visualisation tool that allows data creators to visually pre-check specimen collection data before it is uploaded to sites that host such data. Often such data is semantically malformed even if it passes syntactic data checks, the classic example being longitude inversions that place American species in the middle of China. For hosted datasets, the tool similarly allows users to visually assess data for suitability for their own research objectives and to check for otherwise hidden errors. Vesper itself has revealed that many standard taxonomies and smaller specimen collections in fact contain a high level of error: unknown time points, mis-mapped positions and undetermined positions in taxonomic hierarchies. This shows that tools beyond the current syntax-only checking mechanisms are needed, and that visualisation-based tools allow users to bring their own knowledge to bear on the data involved. Vesper has generated interest from GBIF partner nodes and data publishers such as Canadensys. |
Exploitation Route | Vesper can be used by others within the ecological community to examine and assess data sets that otherwise remain opaque. Data correctness and cleaning is a central concern of data publishers at the moment, as a large proportion of published data is heavily contaminated with error. Botanical garden and natural history museum collections can also be assessed this way if they use the data format that Vesper acts upon. Tools such as Vesper may even encourage further take-up of that data format, Darwin Core Archives. Vesper's visual mode of presentation can also be used to illustrate clearly that errors often occur within such datasets, and that these errors then propagate through to whichever analyses and policies use that data. In this way, like many visualisation-based tools, it has an educational role to play. |
Sectors | Agriculture, Food and Drink; Education; Environment; Culture, Heritage, Museums and Collections |
URL | http://www.vesper.org.uk |
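The longitude-inversion case mentioned in the description above lends itself to a small worked example. The following is a minimal sketch, assuming a hypothetical expected bounding box for a dataset. It is not Vesper's own method, which relies on visual inspection, and the names `Occurrence`, `classify` and `NORTH_AMERICA` are illustrative only.

```python
from dataclasses import dataclass
from typing import Tuple

# Expected range as (min_lat, max_lat, min_lon, max_lon) in decimal degrees.
BBox = Tuple[float, float, float, float]


@dataclass
class Occurrence:
    record_id: str
    latitude: float
    longitude: float


def in_box(lat: float, lon: float, box: BBox) -> bool:
    min_lat, max_lat, min_lon, max_lon = box
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def classify(occ: Occurrence, expected: BBox) -> str:
    """Label a record by comparing its position with the expected range."""
    # Syntactic check: the coordinates must be valid at all.
    if not (-90 <= occ.latitude <= 90 and -180 <= occ.longitude <= 180):
        return "invalid"
    # Semantic checks: valid coordinates can still be in the wrong place.
    if in_box(occ.latitude, occ.longitude, expected):
        return "ok"
    if in_box(occ.latitude, -occ.longitude, expected):
        return "possible longitude inversion"
    return "outside expected range"


# Rough, hypothetical box for a collection of North American specimens.
NORTH_AMERICA: BBox = (25.0, 50.0, -125.0, -65.0)

good = Occurrence("a", 40.0, -100.0)
flipped = Occurrence("b", 40.0, 100.0)  # sign of longitude lost: lands in Asia

labels = [classify(o, NORTH_AMERICA) for o in (good, flipped)]
```

Both records pass a purely syntactic range check; only the comparison against an expected region separates them, which is the kind of domain knowledge a map view lets a user apply by eye.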
Description | The tool that has been developed has been adopted by GBIF as part of their publication toolkit. |
First Year Of Impact | 2014 |
Sector | Digital/Communication/Information Technologies (including Software),Environment |
Impact Types | Societal |
Description | Global Biodiversity Information Facility |
Organisation | Global Biodiversity Information Facility (GBIF) |
Country | Global |
Sector | Charity/Non Profit |
PI Contribution | Development of tools to aid access and understanding of GBIF's datasets |
Collaborator Contribution | Supply of data and feedback to prototypes, hosting for developed tools on their website. |
Impact | Dissemination of VESPER website to larger community that congregate around GBIF publishing hub |
Start Year | 2006 |
Title | Vesper Software Repository |
Description | GitHub repository of Vesper Software for distribution (as per BBSRC directives on open sourcing of funded projects) https://github.com/martingraham/vesper A demonstration of the software is available at: http://www.vesper.org.uk/vesperDemo/vesper/demoNew.html |
Type Of Technology | Software |
Year Produced | 2013 |
Open Source License? | Yes |
Impact | No actual Impacts realised to date |
URL | https://github.com/martingraham/vesper |
Description | Presentation at TDWG Conference |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Slides of the presentation given at the Taxonomic Databases Working Group (TDWG) Conference, Florence, Italy on Tuesday 29th November, 2013: http://www.tdwg.org/fileadmin/2013conference/slides/Graham_VESPER.ppt. No actual impacts realised to date. |
Year(s) Of Engagement Activity | 2013 |
URL | http://www.tdwg.org/fileadmin/2013conference/slides/Graham_VESPER.ppt |