Data Mining and Integration Strategies for Ecotoxicogenomics

Lead Research Organisation: Imperial College London
Department Name: Dept of Surgery and Cancer


Modern biological science is increasingly based on finding out how organisms work at the molecular level. In the last decade, new 'post-genomic' techniques have become available that enable researchers to simultaneously measure the levels of large numbers of biological molecules in a given condition, such as a disease state. These measurements relate to important biological processes for which there are many types of molecule (e.g. which of the many thousands of genes are 'turned on' or 'off' in a cell). In environmental science, these methods have become increasingly popular because of their ability to measure how organisms respond to changes in the environment, for example increases in levels of toxic pollutants. However, the large numbers of measurements made by the new technologies pose challenges to current methods of data analysis, and new techniques are required that can handle the huge data sets produced. This is a particular problem in environmental research, since the data are often more difficult to interpret than those from lab-based experiments. Furthermore, studies employing more than one post-genomic technique are becoming increasingly common in this area, despite the lack of methods to investigate the relationships between the data sets produced. This project aims to develop new methods for the analysis of such 'multi-omic' data sets in the area of environmental toxicology, so that meaningful biological information can be obtained from them. Our project will employ four environmental toxicity data sets to develop and apply the new techniques. We will focus on one particular project, the 'EcoWorm' consortium, which produced high-quality data relating to the effects of three environmental pollutants on earthworms and nematode worms. Towards the end of the project we will apply the new methods to the other data sets to gain new biological knowledge in each specific area.
We will develop new techniques based on existing statistical methods to detect relationships between data sets and will particularly focus on addressing the problems inherent in environmental data. We will also adapt methods to visualise the results of the analysis so that researchers can quickly generate ideas for further experiments. At the end of the project we will produce a software package containing the new methods, which we will make available to all scientists via the web. The benefits of the proposed research will principally be the generation of new biological information from existing and future environmental data. This not only furthers the cause of basic scientific research, but also improves the value for money obtained from past and future programmes funded by the NERC and other agencies. The improved methods will also be useful to industries interested in the effects of chemicals on the environment, as well as to government and non-governmental organisations that have an interest in monitoring environmental hazards and risks. While developed in the environmental field, the new techniques will also be applicable to other areas of science and commerce where post-genomic methods are used, such as the pharmaceutical industry. Ultimately, the improved understanding of biology, in particular how organisms respond to changes in the environment, will contribute to advancing industrial competitiveness and overall quality of life in the UK.


Description In this project, we developed new tools for analysing ecotoxicogenomics data. These tools were applied to data from several previously funded NERC projects. For example, we developed a method for discovering non-linear relationships in omics data using the mutual information statistic (see publication).
Exploitation Route The methods generated here could be further developed and refined by others for use with other multivariate data in almost any area of science or engineering.
Sectors Agriculture, Food and Drink; Digital/Communication/Information Technologies (including Software); Environment; Healthcare

Description The methods developed in this research have been used in a number of further investigations (see publication).
First Year Of Impact 2009
Description New grant
Amount £447,432 (GBP)
Funding ID NE/H009973/1 
Organisation Natural Environment Research Council 
Sector Public
Country United Kingdom
Start 04/2010 
End 12/2013
Title Software for detecting nonlinear relationships 
Description The software embodies a new algorithm for detecting nonlinear relationships in biological data. We used a comparison between the mutual information and Pearson correlation statistics to highlight relationships which might be nonlinear in nature, and applied resampling techniques to determine statistical significance. The approach will be especially useful in the analysis of high-throughput 'omics' data such as that from transcriptomics or metabolomics, where such relationships are expected (e.g. the response of gene expression to an environmental pollutant) but not easy to identify from the many thousands of measurements made.
Type Of Technology Software 
Year Produced 2009 
Open Source License? Yes  
Impact The software was published in 2009 and is freely available to other researchers.
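
The comparison described in the software entry above (mutual information against Pearson correlation, with resampling for significance) can be illustrated with a minimal Python sketch. This is not the published software or its algorithm: the histogram estimator, the permutation test, the function names, and the thresholds `r_max` and `alpha` are all assumptions chosen for illustration.

```python
import numpy as np


def mutual_information(x, y, bins=8):
    """Plug-in (histogram) estimate of mutual information, in nats."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()              # joint probabilities
    px = pxy.sum(axis=1, keepdims=True)    # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)    # marginal of y, shape (1, bins)
    nonzero = pxy > 0                      # skip empty cells (0 * log 0 = 0)
    ratio = pxy[nonzero] / (px @ py)[nonzero]
    return float(np.sum(pxy[nonzero] * np.log(ratio)))


def permutation_p_value(x, y, n_perm=500, bins=8, seed=0):
    """Resampling test: how often does shuffled y give MI >= the observed MI?"""
    rng = np.random.default_rng(seed)
    observed = mutual_information(x, y, bins)
    null = np.array([
        mutual_information(x, rng.permutation(y), bins) for _ in range(n_perm)
    ])
    # Add-one correction keeps the p-value strictly above zero.
    return float((1 + np.sum(null >= observed)) / (n_perm + 1))


def flag_possibly_nonlinear(x, y, r_max=0.3, alpha=0.05,
                            n_perm=500, bins=8, seed=0):
    """Flag a pair whose MI is significant but whose Pearson |r| is small.

    r_max and alpha are illustrative thresholds, not values from the source.
    """
    r = float(np.corrcoef(x, y)[0, 1])
    p = permutation_p_value(x, y, n_perm=n_perm, bins=bins, seed=seed)
    return {
        "pearson_r": r,
        "mutual_information": mutual_information(x, y, bins),
        "p_value": p,
        "flagged": bool(abs(r) < r_max and p < alpha),
    }
```

Under these assumptions, a symmetric quadratic dependence (for example `y = x**2` with `x` centred on zero) would typically be flagged, because its Pearson correlation is close to zero while its mutual information is well above the permutation null; a strongly linear pair would not, because its correlation already captures the relationship. Histogram estimates of mutual information are biased for small samples, so the bin count would need tuning for real omics data.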