IntelliText - Intelligent Tools for Creating and Analysing Electronic Text Corpora for Humanities Research

Lead Research Organisation: University of Leeds
Department Name: School of Modern Languages and Cultures

Abstract

Much humanities research relies on, or would benefit from, analysis of electronic corpora - representative collections of texts in computer-readable format (such as books, newspaper articles and technical manuals), which may also be annotated with linguistic or domain information. The main advantage of using corpora over hand-picked examples is the ability to collect data systematically, to assess the centrality of certain features to the research material, and to establish experimentally potential trends in the data. Projects which rely on electronic corpora can be expected to have greater academic and social impact, thanks to increased consistency in data analysis.

However, the major difficulty faced by corpus-based studies in humanities research is that creating and annotating a new corpus and designing an appropriate search engine for textual analysis require complex technical support, such as expertise in programming and web development. Such a level of technical expertise is often unavailable to smaller humanities projects; but even larger corpus-based projects often miss opportunities for data analysis because of inadequate methodological or technological support for relevant computational aspects. Even when a corpus already exists, the task of building appropriate computational tools for analysing, intelligently searching and visualising the data still remains too challenging for many potential humanities projects.

Humanities researchers' lack of awareness of modern computational techniques for corpus-based studies can seriously limit the scope and the impact of any planned research projects. Moreover, computer scientists who design corpus-based tools frequently do not understand the specific needs of humanities research; their tools are often difficult to adapt to a specific project, or lack an intuitive interface and documentation. As a result, the existence of several non-trivial computational techniques with the power to collect and prepare corpus material and reveal new dependencies and patterns in the data has been overlooked in the humanities. Thus important potential synergies for research have been neglected.

IntelliText's novel contribution will be to tune advanced tools and methods from computer science to the needs of humanities researchers, integrating them into a single software application with a simple interface and good documentation. This will allow humanities researchers with no specialised background in computer science or corpus linguistics to take advantage of powerful methods of text collection and analysis. It will enable them to collect new project corpora from the web, have them enriched automatically with linguistic and other annotations, and then easily uncover interesting patterns of usage, starting either from their own intuitions and hypotheses, or from expressions and patterns identified as potentially noteworthy by the system.

The software will be designed and tested in novel applications by researchers interested in the stylistic features of translated text, in language learning and contrastive linguistics, and in detecting and describing shifts in sentiment and opinion. These will demonstrate its generalisability for addressing the needs of a wide spectrum of humanities researchers, including historians and specialists in literature, media and corporate or government communications, all of whom are represented on the Project Board. IntelliText will be made freely available for research purposes as Open Source software, introducing these tools and methods into fresh areas and permitting further extensions by the user community after funding ends.

In short, the impact of IntelliText will be to strengthen the theoretical foundations of many humanities disciplines by enabling a much larger community of researchers than hitherto to make testable predictions, and then to verify them by reference to solid corpus evidence uncovered by advanced and automated analytical techniques.

Planned Impact

We expect that the IntelliText project will have a direct industrial impact in the area of linguistic engineering (speech and language technology): the deliverables will be useful for industrial companies which build software systems for machine translation, speech recognition, text-to-speech, information extraction, and automatic question answering. Nowadays such technologies rely on large annotated corpora of electronic texts for development and evaluation of the software. However, the quality of such systems is limited by the quality of the corpora used. For such industrial users the IntelliText project will deliver a platform for targeted fine-grained collection and annotation of corpora: e.g., the system will integrate state-of-the-art tools that can harvest corpora in a specific subject domain or genre. These corpora can greatly enhance the quality of the linguistic engineering technologies designed for specialised domains, since the data then becomes cleaner and less ambiguous. In addition, the IntelliText system will contain a coherent set of tools that work together, which will save much time and effort for industrial users by automating the most common workflows and tasks, such as collecting, annotating and aligning parallel corpora.

Specialists working in government communications (DH, DWP, HMRC, HSE) or in critical areas such as financial services will be helped in detecting explanations and instructions that are difficult for the citizen or consumer to understand and in finding simpler rewordings. Linguistic analysis of tone of voice for corporate branding is another area where the IntelliText system is expected to make a commercial impact. The proposed technologies will improve branded communications in this area by automating key stages of the collection, annotation and, most importantly, analysis of the corpus-based linguistic data. The system will enable language analysts to view the data from multiple perspectives and to draw more accurate conclusions on the connotations and expected effect of the language used in corporate publicity materials.

The professional translation community will also benefit from using the proposed system. Nowadays the translation industry makes extensive use of computer-assisted technologies, such as translation memories, electronic terminological databases and dictionaries, and post-editing of machine translation output. Underlying all these technologies is a corpus-based analysis of how linguistic expressions are used in the source and target languages. Professional translators already use large monolingual and bilingual corpora to analyse this usage, but there is a growing need for specialised corpora (domain-specific or even user-specific), to focus the translator's search for terminology or linguistic patterns. By using IntelliText, translators will be able to collect and interrogate data for their own projects.

We will take a number of steps to ensure that the project reaches users beyond academe. First, the Project Board includes members active in knowledge transfer who will advise on channels for outreach. Second, we will actively approach the industrial community via the Centre's industrial partners from the UK, Europe and the US in other research projects with a knowledge transfer component, such as the Translation Automation User Society (TAUS), Linguatec Language Technologies and Tilde Language Technologies. Third, we will present the system at conferences with industrial participation (Translating and the Computer, ITI Conference). Finally, we will promote IntelliText on the enterprise and knowledge transfer section of the University of Leeds website.
