Modeling the Development of Phonetic Representations

Lead Research Organisation: University of Edinburgh
Department Name: Sch of Informatics

Abstract

Listeners differ cross-linguistically in the cues, or perceptual dimensions, they rely on when perceiving speech. For example, Japanese listeners categorizing English [l] and [r] do not rely on the same acoustic features of the speech signal (e.g., the third formant) that native English listeners do. These cross-linguistic differences are typically attributed to listeners' knowledge of sound categories. For example, English listeners know that [l] and [r] are two categories, whereas Japanese listeners know that [l] and [r] are part of the same category, and this knowledge is hypothesized to shape which dimensions they rely on.

The proposed research tests the hypothesis that category knowledge is not necessary for perceptual dimension learning to occur. Drawing on representation learning methods that have performed well in low-resource automatic speech recognition, where extensive labeled training data are not available, we propose two models that learn dimensions without relying on knowledge of sound categories. The first relies on temporal information as a proxy for category knowledge, while the second relies on top-down information from similar words, which infants have been shown to use. These models are evaluated on how well they predict listeners' discrimination judgments from speech perception experiments on native and non-native contrasts, when trained on speech from the same language background as the listeners.
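To make the evaluation concrete, the sketch below shows a minimal ABX-style discrimination test over learned speech representations, of the kind commonly used to compare unsupervised models against listeners' discrimination data. This is an illustrative sketch rather than the project's actual evaluation code: the `token_distance` and `abx_score` functions, the mean-pooled cosine distance, and the randomly generated example tokens are all hypothetical stand-ins.

```python
# Illustrative ABX-style discrimination test (not the project's actual code).
# Each token is assumed to be a (frames x dims) array of features produced by
# some trained representation model.
import numpy as np
from itertools import product

def token_distance(a, b):
    """Mean-pooled cosine distance between two variable-length tokens.
    (A fuller evaluation would typically align frames with DTW instead.)"""
    va, vb = a.mean(axis=0), b.mean(axis=0)
    return 1.0 - np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb))

def abx_score(cat1_tokens, cat2_tokens):
    """Fraction of (A, B, X) triples in which X (same category as A) is
    closer to A than to B: 0.5 is chance, 1.0 is perfect discrimination."""
    correct, total = 0, 0
    for a, x in product(cat1_tokens, repeat=2):
        if a is x:  # X must be a different token than A
            continue
        for b in cat2_tokens:
            correct += token_distance(a, x) < token_distance(b, x)
            total += 1
    return correct / total

# Hypothetical usage with random features standing in for model outputs.
rng = np.random.default_rng(0)
r_tokens = [rng.normal(size=(20, 16)) for _ in range(5)]
l_tokens = [rng.normal(size=(20, 16)) for _ in range(5)]
print(f"ABX discrimination score: {abx_score(r_tokens, l_tokens):.2f}")
```

In a real evaluation, tokens of a contrast such as [l] versus [r] would be encoded by models trained on different languages, and a higher score for the English-trained model than for the Japanese-trained model would mirror the cross-linguistic differences observed in listeners.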

Planned Impact

Building models of perceptual dimension learning can significantly impact science and health. For example, greater insight into dimension learning can help us to understand the difficulties that adults face when acquiring a second language, and possibly to develop better methods for teaching second languages. In addition, research suggests that dyslexia and some language learning impairments can be caused by problems with low-level speech processing. A better understanding of perceptual learning could lead to better diagnosis and treatment in these cases.

Our work could also lead to improved speech technology for low-resource languages. State-of-the-art systems rely on hundreds of hours of hand-transcribed data, which are time-consuming and expensive to create. Consequently, high-quality systems are available for only a few languages, and speech recognition researchers are increasingly looking for ways to develop systems that learn from audio alone. Our proposed work draws on existing methods, but explores them in ways that have not been done within the ASR community, including more carefully controlled comparisons between methods and comparisons against human perceptual data. We anticipate that our investigation will lead to insights, and perhaps new techniques, that can transfer directly to the field of speech recognition and ultimately lead to systems that learn more effectively from little or no transcribed audio. Such systems could become important tools for documenting and analyzing endangered and minority languages, and could help make speech technology more universally available, not just to majority-language speakers in rich countries, as most systems are today.
 
Description Research conducted under this project used computational models to build theories of what infants learn about the sounds in their native language(s) during their first year of life. Several models were built to simulate infant learners. These models were exposed to natural speech in a more realistic way than previous models, and they were tested on their linguistic knowledge in ways similar to how infants are tested.

The results from these modeling simulations could have a very large impact on theories of infant language learning. Previous theories have interpreted cross-linguistic differences in infants' perception - for example, English-learning 12-month-olds can discriminate 'r' (as in rock) from 'l' (as in lock) better than Japanese-learning infants can - as evidence that infants already know which phonetic categories, like 'r' or 'l', are used in their language. Several models built as part of this project showed the same cross-linguistic differences that infants do, but did not have the same type of phonetic category knowledge that had previously been hypothesized. For example, models trained on English speech did not know categories like 'r' or 'l', but were still better than the models trained on Japanese speech at discriminating these sounds. This means that the observed cross-linguistic differences in infants' discrimination are not necessarily evidence that infants know phonetic categories like 'r' and 'l'. That finding could radically impact theories of what infants know at the beginning of their second year of life.

Building models that learn from natural speech also led to other types of advances. For example, it led to the discovery of new information sources - present in highly variable natural speech, but absent in the controlled laboratory speech that researchers often study - that children could use when learning about the speech sounds of their native language(s). It also facilitated the development of new techniques for making robust model-based predictions about how infants are likely to behave in laboratory experiments. Such techniques could increase the scientific impact of any future modeling simulations, by making it easier to test hypotheses about what infants know. Finally, the project led to the development of new techniques for building speech technology without large quantities of annotated speech data. This can potentially lead to improved speech technology for a wider range of languages, including endangered and minority languages, in the future.

The project has led to enhanced opportunities for student training. Results from the project have already been incorporated into linguistics courses at the University of Maryland. Members of the project team published a "perspectives" article in Open Mind, a freely available open-access journal, that lays out the new theory in a way that is likely to be accessible to advanced undergraduates, and can be used in courses at other universities. Postdoctoral researchers and students who were directly involved in the project learned new, state-of-the-art methods for conducting research in language acquisition using techniques from computer science for working with large-scale data; their expertise has already been helpful in training others in the community to conduct research at the boundary between these disciplines.
Exploitation Route Showing that perceptual changes do not require category knowledge has a potentially very large impact on research in language acquisition. This finding opens up a new set of questions regarding what the early learning process consists of, what drives it, and how it relates to phoneme and word learning. We hope this finding will stimulate further behavioral and computational work to demonstrate and test empirical differences between the standard explanation of early perceptual development and the one advocated here.

Our project also provides one of the first examples of how one can use speech corpora for large-scale models of perceptual development. This is potentially very valuable to the field as a methodological advance, because it would allow cognitive science research to take advantage of extensive resources that already exist in the engineering community, and to more easily test alternative models against behavioral data.
Sectors Digital/Communication/Information Technologies (including Software), Education, Healthcare

 
Description Feldman 
Organisation University of Maryland, College Park
Department Department of Linguistics
Country United States 
Sector Academic/University 
PI Contribution Postdoctoral and PI research time, grant and paper writing.
Collaborator Contribution (Early) Consultation on aspects of the research and paper writing; (Later) Co-PI of research team.
Impact Papers: Elsner et al. 2013, Frank et al. 2014 (see publications list). Multi-disciplinary collaboration: the collaborator is a developmental linguist.
Start Year 2012
 
Description Kamper (2019-20) 
Organisation University of Stellenbosch
Country South Africa 
Sector Academic/University 
PI Contribution Cognitive science and bilingualism expertise, modelling
Collaborator Contribution Code and pretrained models, speech technology expertise
Impact 'Multilingual acoustic word embedding models for processing zero-resource languages', accepted at ICASSP 2020
Start Year 2019