Computational natural language processing and the neuro-cognition of language

Lead Research Organisation: University of Cambridge
Department Name: Psychology

Abstract

Scientific understanding of the human language system is one of the main challenges facing cognitive science, and is of interest to fields as diverse as linguistics, psychology, anthropology, philosophy, biology and computer science. An adequate theory of this complex system would need to integrate scientific knowledge from several fields, many of which are still in a state of rapid development. One such field is cognitive neuroscience, which investigates language function in the human brain, with the aim of developing neurobiologically and cognitively plausible accounts of the human capacity for dynamic comprehension and production of language. A key input to this research is linguistic information about the core properties of language. This information is typically obtained from conventional resources (dictionaries and grammars) which provide useful generalisations about language, but which do not include statistical information about language use or capture the considerable variation that linguistic items undergo across time, age, data type and genre. Such information would be an invaluable resource for neuro-cognitive experiments, increasing the plausibility of neurobiological models of language, but it can only be obtained by analysing linguistic patterns and their frequencies in the specific human language data (e.g. patient data or a spoken language corpus) of experimental interest. Manual analysis of linguistic data is prohibitively expensive. Automated language analysis using computational Natural Language Processing (NLP) is now a viable alternative. The last decades have seen a massive expansion in the application of statistical and machine learning methods to NLP. This work has made large-scale processing of human language data possible and yielded impressive results in speech and language processing tasks such as speech recognition, morphological analysis, parsing and semantic interpretation.
Although the same methods could be used to provide realistic, data-driven linguistic input to neuro-cognitive studies involving language, there have been no systematic attempts to do this. The basic NLP technology is available, but it is inaccessible to researchers without considerable computing skills and requires further development for optimal integration with neuro-cognitive research. In this new interdisciplinary project we will integrate research in cognitive neuroscience, experimental psycholinguistics and NLP with the aim of providing the infrastructure for more realistic models of language structure for input into theoretically-driven empirical studies of language in the mind and brain. We will conduct a series of neuro-cognitive experiments which focus on the processing of the core components of language at the levels of morphology, syntax and semantics, using linguistic input automatically extracted from relevant human language data. NLP techniques will be improved and extended to deal with a wider range of constructions, domains and text types as required. An easy-to-use tool will then be designed which will enable effective search, extraction and summarisation of the linguistic information in the annotated data and optimal integration with neuro-cognitive experiments. We expect this project (i) to improve the quality of neuro-cognitive experiments by rooting them in a much more realistic linguistic analysis, (ii) to advance research in NLP by extending existing techniques to enable richer and deeper analysis, and (iii) to provide an important case study for the integration of NLP into critical experimental research in the cognitive sciences.
The long-term goal of this investigation is improved scientific understanding of human language processing, which can benefit several disciplines and place researchers in a better position to develop more useful language models and NLP technology, as well as treatments and rehabilitation for various language disorders in the future.
 
Description Scientific understanding of the human language system is one of the main challenges facing cognitive science. An adequate theory of this complex system would need to integrate scientific knowledge from several fields. One such field is cognitive neuroscience, which investigates language function in the human brain, with the aim of developing plausible accounts of the human capacity for dynamic comprehension and production of language. A key input to this research is linguistic information about the core properties of language. Although highly important, statistical information about language use across time, age and genre is not used because manual gathering of such information has proven prohibitively expensive. However, automated analysis using Natural Language Processing (NLP) is a viable alternative. The aim of this interdisciplinary project was to integrate research in cognitive neuroscience, experimental psycholinguistics and NLP in order to provide the infrastructure for more realistic models of language structure for input into theoretically-driven empirical studies of language in the mind and brain. A series of neuro-cognitive experiments was conducted which focused on the processing of the core components of language, using linguistic input automatically extracted with NLP from relevant human language data. Novel NLP techniques were developed and existing ones extended where required. The first set of experiments was aimed at constructing a mechanistic account of how language is realised in the brain, combining cognitive and neuropsychological methods with multi-modal neuro-imaging, and exploring how a bilaterally organised system supporting basic lexical access gives way to an increasingly left-lateralised network as the processing demands of the speech input change as a function of linguistic complexity.
We examined lexical and grammatical complexity and their interaction during processing, using statistical information gathered in relevant datasets. Lexical statistics for verbs predicted resolution of local syntactic ambiguities during sentence processing, while measures of verb syntactic complexity correlated significantly with the activation patterns for inflected forms and verb phrases in a left-lateralised language network. The second set of experiments investigated language change following brain damage and as a function of normal, healthy ageing. A corpus collected by CSLB was transcribed and analysed using NLP, and differences in linguistic production between healthy participants and brain-damaged patients, and across the life-span, were quantified. Statistical data from parsers were related to structural MR brain imaging data. The results brought us a step forward in understanding the basis of language change following both gradual and catastrophic neural change. The final set of experiments focused on conceptual representation. We developed an NLP technique capable of automatically extracting feature-based conceptual representations from corpora and a semi-automatic method for collecting novel features in a property norming study, and demonstrated that these methods can be used to enrich existing models of conceptual representation. The project has shown that NLP can be used to improve the quality of neuro-cognitive experiments. Better quality experiments can lead to improved scientific understanding of human language processing, which can benefit several disciplines and place researchers in a better position to develop more useful language models, NLP technology, and treatments for various language disorders.
Exploitation Route The CSLB norms (Devereux et al., 2014) have been used and adapted by many other groups of researchers around the world and provide a level of detail that will be very useful for researchers in the field of natural language processing.
Sectors Digital/Communication/Information Technologies (including Software), Healthcare

 
Description The focus of the completed project was to investigate a new, more meaningful approach to academic research. Therefore, the specific experimental nature of this project resulted in academic impact but did not lend itself to providing societal or economic impact during the lifetime of the project or even in the medium term. However, we have made the CSLB property norm data (Devereux et al., 2014) freely and conveniently available to other researchers via the web. We deliberately made this data available in a format likely to be of use to a wide variety of researchers, and to date it has been accessed by around 180 researchers from around the world (including India, Japan, USA, Finland, China and Israel) in a variety of areas including linguistics, psychology and clinical environments. We know, for example, that one group of computational linguists, Schwartz, Reichart & Rappoport (2014), have used the CSLB norms to train and test a model for the semantic classification of nouns, and the norms have now been used and cited in about 15 publications. Ultimately it is hoped that other researchers will use the norm data and potentially incorporate it into computational language applications or other tools of benefit for clinical application, e.g. systems for automatically processing and extracting meaning from human speech and text, which could provide economic and societal value.
First Year Of Impact 2014
Sector Healthcare
Impact Types Societal, Economic

 
Description Further Funding - Newton Trust
Amount £55,643 (GBP)
Organisation University of Cambridge 
Department Isaac Newton Trust
Sector Academic/University
Country United Kingdom
Start 01/2011 
End 06/2012