Data mining: A Large Scale Re-analysis of Designed Microarray Experiments

Lead Research Organisation: Lancaster University
Department Name: Mathematics and Statistics


**A pipeline for arbitrarily designed experiments** Several NERC-funded projects have been making use of innovative, more efficient microarray designs. These designs use the standard two-channel platform but avoid reference samples, and can be up to 50% more efficient than traditional reference-based designs. The main difficulty is that commercial microarray analysis software is not geared up to analyse such experiments. An alternative, freely available research tool will be developed to deal with these as well as traditionally designed experiments.

*Being able to deal with the complexities of sample availability.* Genomic experiments in general, and especially within environmental settings, are often designed in an ad hoc way that depends on both the availability of genomic material and the availability of the measuring platform. In practice this means that technical replicates or biological samples are measured, or that biological samples are pooled into a single measurement sample. Ignoring these difficulties in further analysis, which is standard practice today, leads to biased and inaccurate results. The analysis tool, based on linear mixed-effect models, should give accurate results whatever the sample preparation, whether it involves pooling of samples or repeated measurement of the same sample.

*Extending the concept of differential expression to arbitrarily complex treatment allocations.* The experiments considered by the NERC collaborators on this project are all much more complex than the initial comparisons of two treatments. For example, in the Paterson project (NE/D000602/1), three different stages of nematodes are sampled from the same host, which can be treated or untreated. In the Pottinger project (Small grant MGF107, Molecular Genetics Facility Steering Committee), gene expression within the brains of stickleback fish is measured after giving males and females two different doses of oestrogen. A complicating issue is that it is unknown whether the phenotypically female fish treated with the high oestrogen dose are females or sex-reversed males. In the Tyler projects (NER/T/S/2002/00182 and NE/C507696/1), males and females of two types of fish are exposed to different doses of two different environmental pollutants and are sampled across strategically selected developmental time points. In all of these examples there is a complicated factorial treatment structure, which makes pairwise comparisons both unattractive and scientifically uninteresting.

Another aspect of the new biology is the need to explore much more subtle phenotypes. In the Sneddon project (NER/I/S/2001/00768, NE/C000889/1, NE/C002164/1), the phenotypes compared on the microarray are subtly different, e.g. bold trout and disheartened bold trout that have lost a dominance challenge. Analysing such comparisons falls outside the capabilities of all but the specialist. The analysis tool should be able to deal with arbitrarily complex factorial covariate structures and produce meaningful results that can be interpreted in terms of those covariate structures.

*Providing added value to a large number of NERC-funded studies.* Between them, the five collaborating groups are responsible for a database of some 2,000 microarrays across tens of different studies. This is an invaluable resource that deserves further analysis; all too often, only a first-pass analysis has been applied to the data. All of the studies have their own complexities, so a close working relationship between analysts and environmental scientists is required. This will allow the analysis tool, which will be developed by the PI and the named researcher on the grant, to be applied to a large fraction of these NERC-funded studies and to the majority of the microarray data therein.

