A cross-linguistic investigation of meaning-driven combinatorial restrictions in clausal embedding

Lead Research Organisation: University of Edinburgh
Department Name: Sch of Philosophy Psychology & Language

Abstract

There is a broad consensus among linguists and cognitive scientists that human language involves a "combinatorial system" that determines how words can be combined to form grammatical structures, and a "semantic system" that interprets such grammatical structures. However, there is no consensus on the precise relationship between these two systems. In particular, it is still unclear whether the combinatorial and the semantic system operate completely independently, or whether certain combinatorial patterns are in fact meaning-driven.

To address this question, "clause-embedding predicates" such as "think" and "wonder" provide an important test case. These predicates vary in the types of complements they can combine with. For instance, "think" only combines with declarative clauses (e.g., "that Jo left") while "wonder" only combines with interrogative clauses (e.g., "who left"). Also, only some of them can combine with a noun phrase (e.g., "believe the rumour" is grammatical but "think the rumour" is not). In languages like Spanish, they furthermore vary with respect to the grammatical mood of their complements (indicative or subjunctive). There is preliminary evidence for a systematic link between such combinatorial patterns and the fine-grained semantic properties of these predicates. For example, predicates expressing an unfulfilled desire (e.g., "hope", "wish") never combine with interrogative clauses (e.g., "Jo hopes who left" is ungrammatical). However, the lack of a systematic investigation of such connections across multiple languages, and of a unified theoretical framework, has so far prevented a thorough understanding of such potential meaning-driven combinatorial restrictions.

This project will pursue an integrated approach to investigate the relation and interaction between the combinatorial and the semantic system in the area of clause-embedding, by combining multi-lingual data-collection and psycho-linguistic experiments with the development of unified theoretical analyses. It will make use of recent developments in the field: novel semantic theories that make it possible to articulate very precise hypotheses about the relationship between meaning and combinatorial restrictions, and novel psycho-linguistic methods that deliver fine-grained data to evaluate such hypotheses.

Concretely, the project will 1) collect data from 14 languages around the world on the combinatorial and semantic properties of clause-embedding predicates, 2) develop sophisticated theoretical hypotheses about the mechanisms underlying potential meaning-driven combinatorial patterns in clausal embedding, and 3) quantitatively evaluate these hypotheses based on psycho-linguistic experiments. The results will shed new light on the fundamental question of how the combinatorial and the semantic system operate together in the human language faculty. The resulting data will be made publicly available to serve further research and computational applications.
 
Description The overarching aim of the project is to investigate the nature of an aspect of human language that is common across all languages, i.e., correlations between the meanings (semantics) of linguistic expressions and the regularities in how these expressions combine with other expressions in a sentence (combinatorial properties). To achieve this goal, the project focuses on particular expressions which linguists call "clause-embedding predicates", such as "believe", "know" and "wonder", and similar expressions in other languages. Linguists have hypothesised that the semantics of clause-embedding predicates correlates with the types of clauses they can combine with (e.g., declarative/statement clauses like "that it is raining" and interrogative/question clauses like "whether it is raining"). Focusing on this domain, the project (i) collects data from 14 languages regarding the semantic and combinatorial properties of clause-embedding predicates; (ii) evaluates correlations between semantic and combinatorial properties of different predicates; and (iii) constructs a unifying theoretical account of the correlations.

The most significant achievement from the project so far is the construction of the database consisting of detailed semantic and combinatorial properties of clause-embedding predicates in 15 languages (Catalan, Dutch, English, French, German, Greek, Hebrew, Hindi, Italian, Japanese, Kîîtharaka, Mandarin, Spanish, Swedish and Turkish). The database contains information about ~50 clause-embedding predicates in each language. Each predicate is annotated with respect to ~15 semantic properties and ~12 combinatorial properties. It enables the evaluation of correlations between semantic and combinatorial properties of clause-embedding predicates in a cross-linguistic setting. The data collection methodology is published open access so other researchers can contribute to the database.
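
As a rough illustration of the kind of evaluation such annotations make possible, the following Python sketch cross-tabulates one semantic property against one combinatorial property and lists any counterexamples to a candidate generalisation. It is not the project's own tooling: the property names (`unfulfilled_desire`, `declarative`, `interrogative`) and the dictionary layout are assumptions made for this example, and the three toy entries follow the English examples in the abstract rather than the actual database.

```python
from itertools import product

# Toy annotations only, not taken from the database. The combinatorial values
# follow the English examples in the abstract; the property names and the
# "unfulfilled_desire" values for "think" and "wonder" are assumptions.
PREDICATES = {
    "think":  {"unfulfilled_desire": False, "declarative": True,  "interrogative": False},
    "wonder": {"unfulfilled_desire": False, "declarative": False, "interrogative": True},
    "hope":   {"unfulfilled_desire": True,  "declarative": True,  "interrogative": False},
}


def contingency(data, semantic, combinatorial):
    """Count predicates for each (semantic value, combinatorial value) pair."""
    counts = {pair: 0 for pair in product((True, False), repeat=2)}
    for props in data.values():
        counts[(props[semantic], props[combinatorial])] += 1
    return counts


def counterexamples(data, semantic, combinatorial):
    """Predicates that have the semantic property yet allow the complement type."""
    return sorted(
        name for name, props in data.items()
        if props[semantic] and props[combinatorial]
    )


if __name__ == "__main__":
    print(contingency(PREDICATES, "unfulfilled_desire", "interrogative"))
    print(counterexamples(PREDICATES, "unfulfilled_desire", "interrogative"))
```

An empty counterexample list for a given language would be consistent with the generalisation that predicates expressing an unfulfilled desire do not combine with interrogative clauses; running the same check language by language is one simple way to compare patterns cross-linguistically.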

There are two novel aspects of this database. First, this is the first cross-linguistic database of its kind, and it allows precise evaluation of the correlations between semantic properties and combinatorial properties of clause-embedding predicates common across languages. Second, the methodology of collecting semantic properties integrates techniques from psycholinguistics and fieldwork linguistics. Following practices in psycholinguistics, we use the inferences speakers draw from linguistic expressions as primitive data points. At the same time, following practices in fieldwork linguistics, we also record refined data using qualitative judgments derived from interactions between researchers and linguistic consultants.

The project currently evaluates theoretical insights from the collected data, and the full results will be reported at the time of the project conclusion. At this point, two primary results are available. 1) Across languages, so-called emotive factives (e.g., surprise, annoy, be happy etc. and their counterparts in non-English languages) are compatible with interrogative complements in general but not with interrogative complements involving "whether" and its equivalents in non-English languages. 2) Across languages, aside from several systematic exceptions, a significant majority of predicates that are compatible with both declarative and interrogative complements impose a systematic semantic relationship between sentences involving a declarative complement and an interrogative complement. In the rest of the project, we will pursue theoretical explanations of these and other potential generalisations contained in our dataset.

There are practical limitations in the current database. The data collection process was time-intensive. Each language required a total of 60 to 100 hours of work by a native speaker with a background in linguistics, typically over the course of 3 to 4 months with regular consultation sessions with one of the authors of the present paper. Because of this, the current database only features introspective judgments coming from a single speaker per language. While this is a good place to start, the database is not yet equipped to address issues pertaining to within-speaker and across-speaker variability.
Exploitation Route The data available in the database will be beneficial for foreign-language education as they illustrate the exact differences between corresponding clause-embedding predicates across languages. They will also be beneficial for the teaching of logical-reasoning skills since they provide natural language examples for discussing logical reasoning in the classroom.

The data concerning the semantic properties of logical words gathered in the project will include a large set of entailment patterns licensed by a variety of clause-embedding predicates in a diverse set of languages. Such patterns are essential for the development of computational systems for natural language inference (NLI), which has applications in Question-Answering, Information Extraction, and Machine Translation. Data on inference patterns are specifically useful as benchmarks against which the quality of NLI systems is evaluated.
Sectors Digital/Communication/Information Technologies (including Software),Education

 
Title Clause-embedding predicates questionnaire 
Description This questionnaire has been designed to collect linguistic data on the semantic and selectional properties of clause-embedding predicates across languages. For each language, the intended end result is a collection of linguistic data (which, for brevity, we will refer to as a database) with two components:
- A spreadsheet summarising the semantic and selectional properties of the relevant clause-embedding predicates in the language. In this spreadsheet, each row registers the semantic and selectional properties of a certain predicate: the first n columns register its relevant semantic properties, and the next m columns its selectional properties.
- A text document which, for each predicate, provides detailed empirical arguments (including example sentences) for the properties assigned to it in the spreadsheet. Such arguments cannot be included in the spreadsheet itself because examples will have to be glossed and there will be accompanying text as well.
Researchers will be able to use these two components of the database for different purposes. For extracting generalisations and performing statistical analysis they will use the spreadsheets. For concrete examples and linguistic arguments, they will turn to the text documents. 
Type Of Material Improvements to research infrastructure 
Year Produced 2022 
Provided To Others? No  
Impact Based on the questionnaire, preliminary data collection for five languages (Dutch, Greek, German, and French) has been completed. The datasets can be used as language resources to train natural language processing (NLP) systems. 
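
The spreadsheet layout described in this entry (one row per predicate, the first n columns for semantic properties, the next m columns for selectional properties) could be read programmatically along the following lines. This is a minimal sketch under stated assumptions, not the project's file format: the column names, the 0/1 cell encoding, the values of n and m, and the two sample rows are all hypothetical placeholders chosen for illustration, and the real spreadsheets may use richer cell values.

```python
import csv
import io

# Hypothetical sizes: n semantic columns followed by m selectional columns.
N_SEMANTIC = 2
M_SELECTIONAL = 3

# Placeholder sheet for illustration only; column names and the 0/1 encoding
# are assumptions, and the rows are not taken from the actual database.
SHEET = """predicate,unfulfilled_desire,factive,declarative,interrogative,noun_phrase
think,0,0,1,0,0
believe,0,0,1,0,1
"""


def load_sheet(text, n_semantic, m_selectional):
    """Split each row into its semantic and selectional property blocks."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    # Column 0 holds the predicate; the property columns start at index 1.
    sem_names = header[1:1 + n_semantic]
    sel_names = header[1 + n_semantic:1 + n_semantic + m_selectional]
    table = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        values = [cell == "1" for cell in row[1:]]
        table[row[0]] = {
            "semantic": dict(zip(sem_names, values[:n_semantic])),
            "selectional": dict(
                zip(sel_names, values[n_semantic:n_semantic + m_selectional])
            ),
        }
    return table


TABLE = load_sheet(SHEET, N_SEMANTIC, M_SELECTIONAL)

if __name__ == "__main__":
    for predicate, blocks in TABLE.items():
        print(predicate, blocks)
```

Keeping the two blocks separate mirrors the description above and makes it straightforward to pair any semantic column with any selectional column for statistical analysis, while the glossed examples and arguments stay in the accompanying text document.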
 
Description Partnership with University of Konstanz and University of Amsterdam 
Organisation University of Konstanz
Department Department of Linguistics
Country Germany 
Sector Academic/University 
PI Contribution The project is funded by the AHRC-DFG joint collaboration grant. As part of the project, my research team is collaborating with the team in Konstanz to (a) develop the methodology of data collection on clause-embedding predicates, (b) construct theories to account for cross-linguistic generalisations observed in the data, and (c) evaluate the theories by means of behavioural experiments. The Edinburgh team will be responsible for the data collection of Dutch, Greek, French, Mandarin, Japanese, Hebrew, Kîîtharaka, English, Polish and German. The rest of the partnership is fully collaborative and will not involve any predefined division of labor.
Collaborator Contribution The Konstanz team will be responsible for the data collection of Italian, Spanish, Swedish, Hungarian, Turkish and Akan. The rest of the partnership is fully collaborative and will not involve any predefined division of labor.
Impact - Clause-embedding predicates questionnaire - Clause-embedding predicates datasets
Start Year 2018