Developing new analytic techniques for profiling language phenotypes in genetic research

Lead Research Organisation: CARDIFF UNIVERSITY
Department Name: School of English, Communication and Philosophy

Abstract

This project takes a first step towards investigating how our genetic make-up, as distinct from our cultural and psychological environment, determines our use of language. Such research could ultimately help geneticists screen for a variety of conditions in which language is affected, including Alzheimer's, Specific Language Impairment, autism, schizophrenia and depression. It could also contribute to our understanding of how human language evolved, and the profiling methodology could be applied to forensic linguistic analyses. For practical reasons, easily administered quantitative tests of language (e.g. mean length of utterance/sentence; scores on multiple-choice tests; reaction times) have been favoured in psychological and clinical research into language, but linguists have long recognised that much of what characterises the human linguistic ability is qualitative in nature rather than quantitative. The means currently available for characterising the qualitative aspects of language in ways that geneticists can use are inadequate. This project aims to identify analytic approaches that can translate the complexity and richness of language into something countable, so that mathematical probabilities can be calculated (see Objective 1).

The data are 600-word exam essays written at the age of 17 by members of a large Twin Study cohort in Queensland, Australia. These essays will be used to test a wide range of analytic techniques for the quantification or pseudo-quantification of features of the texts, resulting in a linguistic profile for each subject. The essays constitute an excellent linguistic resource in their own right, but their value is multiply enhanced by their Twin Study provenance. Twin studies are a highly effective way to explore the role of genes and environment in relation to a variable. Sophisticated statistical methods compare the variation within pairs of identical twins (who share all their DNA) with the variation within pairs of non-identical twins (whose DNA varies at the same rate as in non-twin siblings), while factoring in their shared experiences as siblings.
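The logic of the twin comparison can be illustrated with Falconer's formula, a classical back-of-envelope estimate that is much simpler than the multivariate modelling a twin study team would actually run. It is named here as a stand-in for illustration only, not as the project's method. The sketch below assumes one numeric score per twin (for example, a single linguistic measure); the function names and the scores are invented for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def falconer(r_mz, r_dz):
    """Split variance into rough components from twin-pair correlations.

    r_mz: correlation between identical (monozygotic) co-twins
    r_dz: correlation between non-identical (dizygotic) co-twins
    """
    return {
        "heritability": 2 * (r_mz - r_dz),       # additive genetic share
        "shared_environment": 2 * r_dz - r_mz,   # experiences common to both twins
        "unique_environment": 1 - r_mz,          # everything else, incl. measurement error
    }


# Invented scores for illustration: (twin A, twin B) for each pair.
mz_pairs = [(4.1, 4.3), (5.0, 4.8), (3.2, 3.5), (6.1, 5.9), (4.7, 4.6)]
dz_pairs = [(4.0, 4.9), (5.2, 4.4), (3.1, 3.9), (6.0, 5.1), (4.8, 4.2)]

r_mz = pearson([a for a, _ in mz_pairs], [b for _, b in mz_pairs])
r_dz = pearson([a for a, _ in dz_pairs], [b for _, b in dz_pairs])
print(falconer(r_mz, r_dz))
```

With only a handful of pairs, estimates like these are unstable and can fall outside the 0 to 1 range, which is one reason larger samples and more careful models are needed before drawing conclusions.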

Individual subject profiles, produced using the analyses found to be most effective and practical to administer, will be passed to the Twin Study collaborators for trial multivariate analyses, to see if they have potential as external markers (phenotypes) of genetic differences (Objective 2). These trial analyses will be too small to reveal genuine effects, but they will establish the robustness of the input values. The outcome of the project will be a small set of analyses judged to have potential for phenotyping on a larger scale in a future international collaboration (Objective 3).

The project aims to maximise the range of expert input by inviting sub-field specialists to engage in small-scale trial analyses of the data using existing or new methods. The external contributions will be supplemented by those of in-house experts and early career staff, including a part-time research assistant, to create an exciting research environment that fosters creativity and provides opportunities and mentoring for junior colleagues. The project will push the boundaries of the emerging activity in Medical Humanities (Objective 5), while breaking new ground in the quantification of qualitative features of language (Objective 4).

Publications

Wray A (2008) Genes and the conceptualisation of language knowledge in Genomics, Society, and Policy

Mollet E (2010) Choosing the best tools for comparative analyses of texts in International Journal of Corpus Linguistics

Mollet E; Wray A; Fitzpatrick T (2011) The Phraseological View of Language: A Tribute to John Sinclair

 
Description We used essays written by identical and non-identical 17-year-old twins to trial a range of linguistic profiling tools. These tools capture different features quantitatively, many of them qualitative in nature. We were interested in which tools were both meaningful in linguistic terms and usable in research that requires reliable quantitative data, in this case genetics. We examined the results of our analyses in relation to a range of cognitive measures collected by our collaborators (epidemiologists at the Queensland Medical Research Institute) as part of their wider twin study. Although the data set was too small to determine directly whether there is a genetic component to linguistic variation, we did establish which tools would be most appropriate for ascertaining that on a larger sample in the future. The variables that we judged most useful are: idea density (using CPIDR), syntactic complexity (using Fichtner's C), latent semantic analysis (using Coh-Metrix), and two measures of lexical diversity, Advanced Guiraud 1000 and Baayen's P.
Exploitation Route We have developed this research further through a series of other grants, most notably an ESRC project (PI Fitzpatrick) that examined word association responses in teenage and over-65-year-old twins, and a BRACE-funded project (PI Wray) that is using the same profiling techniques plus word association responses to look at markers of future risk of Alzheimer's disease.
Sectors Digital/Communication/Information Technologies (including Software); Education; Healthcare; Culture, Heritage, Museums and Collections
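
The two lexical diversity measures named in the description above can be sketched as follows, using the definitions commonly given for them: Advanced Guiraud divides the number of word types that fall outside a high-frequency word list by the square root of the token count, and Baayen's P divides the number of words occurring exactly once by the token count. This is a minimal sketch under those assumptions, not the project's own implementation. The tokenizer, the function names and the tiny `COMMON_WORDS` set are hypothetical; "Advanced Guiraud 1000" would presumably use a list of the 1000 most frequent words instead.

```python
import re
from collections import Counter
from math import sqrt

# Hypothetical stand-in for a real list of the 1000 most frequent words.
COMMON_WORDS = {"the", "a", "and", "saw", "cat", "dog"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def advanced_guiraud(tokens, common_words):
    """Word types outside the common-word list, divided by sqrt(tokens)."""
    advanced_types = {t for t in tokens if t not in common_words}
    return len(advanced_types) / sqrt(len(tokens))


def baayen_p(tokens):
    """Words occurring exactly once (hapax legomena), divided by tokens."""
    counts = Counter(tokens)
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / len(tokens)


sample = tokenize("The cat saw the dog and the dog saw a heron")
print(advanced_guiraud(sample, COMMON_WORDS))
print(baayen_p(sample))
```

Both measures depend on text length, so comparisons are most meaningful between texts of similar size, such as the 600-word essays used here.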

 
Description College of Arts, Humanities and Social Sciences: Research Bid Support
Amount £5,000 (GBP)
Funding ID AJ33001001 
Organisation Cardiff University 
Sector Academic/University
Country United Kingdom
Start 09/2014 
End 08/2015
 
Description Research grant
Amount £39,995 (GBP)
Organisation BRACE (Alzheimer's disease charity) 
Sector Charity/Non Profit
Country United Kingdom
Start 11/2014 
End 10/2015
 
Description Spoken interviews with adolescent twins and over-65-year-old twins, currently being collected by Australian collaborators. Collaborator: Queensland Medical Research Institute Genetic Epidemiology Unit.
Organisation University of Queensland
Department Queensland Institute of Medical Research
Country Australia 
Sector Academic/University 
PI Contribution Information taken from Final Report
Collaborator Contribution Data collection
Impact accumulated dataset
Start Year 2007