The application of support vector machine feature selection to cross sectional studies in epidemiology

Lead Research Organisation: University of Liverpool
Department Name: Veterinary Clinical Science

Abstract

Why are some kids fat? Newspapers round up junk food, too many sweets and eating too much, but are these the only things that are important? What about the skinny cross country winner who eats like a horse or the big rugby player who eats hardly anything? What controls their weight? Also, why do children eat too much? Is it because they're bored, the fridge is full of their favourite food, dad or mum has cooked some delicious chocolate cake or because 'they have nothing better to do'? Many things contribute to obesity. To prevent it, we must identify those that are most important. Doing this is the science of EPIDEMIOLOGY - a big word that comes from the Greek EPI meaning upon and DEMOS meaning people. Epidemiologists compare populations with a condition, e.g. obesity, or a disease, e.g. leukaemia, with those without the problem. They collect data about things which they can measure and which may be important, e.g. what and how much food is eaten. These are called VARIABLES. Epidemiologists use these variables in statistical tests, run on computers, to identify which ones increase (or decrease) the risk of getting disease. This type of study has resulted in anti-smoking campaigns and the recommended '5 pieces of fruit a day'. Because of the importance of their results, epidemiologists design studies carefully and use the best statistical tests available. One of the most common is LOGISTIC REGRESSION - a mouthful more easily referred to as LR. LR is powerful; it can tease out important factors from a complicated mass of data by estimating the effect of one variable when adjusted for the effects of all others. In addition to LR, epidemiologists investigate different methods of analysis. Just as a sick person might seek a 'second opinion' from another doctor, epidemiologists need tests that either confirm or question their results. At the moment, there is no easily available and generally accepted alternative to LR.
This proposal aims to alter this by using one of the most exciting developments in computing in the last 10 years. This development is SUPPORT VECTOR MACHINE learning - another mouthful best referred to as SVM. Although SVM sounds like a gadget, it is not. It is a technique of training computers to tell the difference between things. In this case, we are interested in training the SVM to tell us the difference between diseased and non-diseased groups, but the method has also been used to tell faces, voices and handwriting apart. It does this by selecting features that are important in differentiating or classifying the groups. So, SVM, like LR, identifies variables associated with disease but in a completely different way. Having two tests, working in different ways, assists epidemiologists in the same way that an X-ray and MRI scan are more helpful to a doctor than two X-rays. We have already used SVM, not on obese children, or smoking adults, but on a new disease of meat chickens. We studied chickens because we are veterinarians interested in protecting their health and that of people who eat them. We first heard of SVM at a meeting held to introduce epidemiologists to new methods in mathematics and computing and were awarded a small amount of money to begin collaboration. We have shown that SVM is a useful technique but we need to test it in the field and develop an easy way for epidemiologists to use it. So, we are going to develop a user-friendly SVM program. Whilst doing this, we will write about SVM, talk about it at meetings, try it out on different diseases and train other people to use and evaluate it. Computer scientists and veterinarians may seem a strange combination. It is! We have each had to learn new jargon, just to talk to each other... we even use the same words for completely different things. But, if epidemiologists are to use the power of modern computing to help prevent disease, teams such as these are essential. They are also a lot of fun!

Technical Summary

This proposal builds on an EPSRC small grant GR/S73631/01, graded 'tending to outstanding' in its final report. It is a resubmission from March 2005 of application GR/EP/DO3O684, sent to the EPSRC Life Sciences Interface as a continuation of GR/S73631/01 and recommended for submission to BBSRC, with a contribution in financial support from EPSRC if successful. Epidemiologists from Liverpool Veterinary School and computer scientists from DIMACS, a National Science Foundation institute at Rutgers, New Jersey, USA, have recently reported the use of Support Vector Machine learning as a method of identifying risk factors for disease from observational epidemiological data. Support Vector Machine classification was developed in the mid-1990s and, although related to neural networks, the technique is simpler, more robust and founded on statistical learning theory. In particular, the use of SVM helps to overcome the overfitting associated with empirical risk minimisation (ERM), which aims to minimise the error on the training data set but can result in poor generalisation (i.e. poor performance on unseen datasets). SVM are arguably the single most important development in supervised classification in recent years. They are known to generalise well in high dimensional space, even with small training samples and noisy data. SVM are not only good classifiers but are also good feature selection techniques. SVM has been used for the verification and recognition of faces, speech and handwriting, and for such diverse tasks as goal detection in football matches and financial forecasting. In life sciences it has been applied to gene expression, proteomics and disease diagnosis. With the exception of a recent report using single nucleotide polymorphisms (SNPs) to predict an increased risk of breast cancer, there have been no published reports of the application of SVM in epidemiology.
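As an illustration of how a linear SVM can act as a feature selection technique, the following is a minimal sketch and not the method used in the project. It assumes simulated data, hypothetical variable names ("informative", "noise_a", "noise_b"), a plain subgradient-descent trainer, and ranking of variables by the absolute size of their weights; the kernel-based approach described in this summary may differ.

```python
import random


def make_synthetic_flocks(n=200, seed=0):
    """Simulate n flocks: one informative variable and two pure-noise variables.

    Entirely artificial data, used only to illustrate the idea.
    """
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = 1 if i % 2 == 0 else -1           # +1 = affected, -1 = unaffected
        X.append([2.0 * label + rng.gauss(0, 1),  # informative variable
                  rng.gauss(0, 1),                # noise
                  rng.gauss(0, 1)])               # noise
        y.append(label)
    return X, y


def standardise(X):
    """Scale each column to mean 0 and sd 1 so weight sizes are comparable."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    sds = [(sum((row[j] - means[j]) ** 2 for row in X) / n) ** 0.5
           for j in range(d)]
    return [[(row[j] - means[j]) / sds[j] for j in range(d)] for row in X]


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=300):
    """Fit a linear soft-margin SVM by full-batch subgradient descent.

    Minimises lam/2 * ||w||^2 + mean(max(0, 1 - y_i * (w.x_i + b))).
    The lam term penalises large weights, which is what distinguishes this
    objective from pure empirical risk minimisation.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1.0:  # inside the margin: hinge loss is active
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Classify one flock as +1 (affected) or -1 (unaffected)."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


names = ["informative", "noise_a", "noise_b"]  # hypothetical variable names
X, y = make_synthetic_flocks()
Xs = standardise(X)
w, b = train_linear_svm(Xs, y)

# Rank variables by absolute weight: larger means more influence on the split.
ranking = sorted(zip(names, w), key=lambda pair: -abs(pair[1]))
accuracy = sum(predict(w, b, xi) == yi for xi, yi in zip(Xs, y)) / len(y)

for name, weight in ranking:
    print(f"{name:>12}  weight = {weight:+.3f}")
print(f"training accuracy = {accuracy:.2f}")
```

Standardising the variables first matters in this sketch, because weight sizes are only comparable when the variables share a scale. Training accuracy is reported for illustration only; generalisation has to be judged on data not used for training.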
This project aims to further develop the application of SVM, to improve kernel selection and to produce a user-friendly SVM program for wider epidemiological use. Data for the development of this program will be provided from an epidemiological study of an emerging disease of poultry (wet litter). Because meat birds live for only 6-7 weeks, broiler flocks provide the opportunity to validate the classifications made by SVM during the period of study. The final program will provide a new paradigm in epidemiology and act as an easily applicable 'second opinion' for statistical models generated using the 'epidemiological standard' of logistic regression. It will also, as a by-product, improve our understanding of wet litter in poultry. The final program will be applicable to observational studies of non-infectious and infectious human and animal disease.
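The two uses described above, a 'second opinion' on the variables chosen by logistic regression and validation of SVM classifications against later flocks, can be sketched as follows. This is a hedged illustration only: the function names, the variable names and the example values are all hypothetical and do not come from the study, and flock status is assumed to be coded +1 (affected) or -1 (unaffected).

```python
def compare_selections(lr_selected, svm_selected):
    """Split variables into those both methods flag and those only one flags."""
    lr_set, svm_set = set(lr_selected), set(svm_selected)
    return {
        "agree": sorted(lr_set & svm_set),
        "lr_only": sorted(lr_set - svm_set),
        "svm_only": sorted(svm_set - lr_set),
    }


def validation_metrics(predicted, observed):
    """Sensitivity, specificity and accuracy for +1/-1 flock classifications.

    Assumes both affected and unaffected flocks occur in `observed`;
    otherwise one of the denominators below would be zero.
    """
    pairs = list(zip(predicted, observed))
    tp = sum(p == 1 and o == 1 for p, o in pairs)    # affected, flagged
    tn = sum(p == -1 and o == -1 for p, o in pairs)  # unaffected, not flagged
    fp = sum(p == 1 and o == -1 for p, o in pairs)   # unaffected, flagged
    fn = sum(p == -1 and o == 1 for p, o in pairs)   # affected, missed
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(pairs),
    }


# Hypothetical variables retained by each method (illustrative names only).
lr_selected = ["variable_a", "variable_b", "variable_c"]
svm_selected = ["variable_b", "variable_c", "variable_d"]
agreement = compare_selections(lr_selected, svm_selected)

# Hypothetical SVM predictions versus the status later observed in 8 flocks.
predicted = [1, 1, -1, -1, 1, -1, -1, 1]
observed = [1, 1, -1, 1, -1, -1, -1, 1]
metrics = validation_metrics(predicted, observed)

print(agreement)
print(metrics)
```

Variables flagged by both methods gain support, while those flagged by only one method are the ones a second opinion would prompt an epidemiologist to re-examine.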

Description: That chicken health and welfare could be improved dramatically and incrementally worldwide by the incorporation of machine learning algorithms into data management systems designed to monitor the production of meat birds.
Exploitation Route: This could be incorporated into software and has a worldwide market. It will be the subject of a follow-on proposal.
Sectors: Agriculture, Food and Drink; Creative Economy; Digital/Communication/Information Technologies (including Software); Pharmaceuticals and Medical Biotechnology