Efficient Techniques for learning high-quality Multilingual Sense Representations

Lead Research Organisation: University of Cambridge
Department Name: Linguistics

Abstract

Learning accurate multilingual word representations is crucial for a wide range of NLP applications, including Word Sense Disambiguation, cross-lingual information retrieval [3], plagiarism detection, document classification, and knowledge transfer from resource-rich to resource-poor languages [10]. Following this trend, my PhD work will focus on developing computationally efficient ways to learn high-quality multilingual embeddings, such that semantically similar words in the same language (monolingual objective) and in different languages (cross-lingual objective) lie close together in a shared multilingual semantic space [6]. Most state-of-the-art multilingual embedding techniques rely on seed lexica to learn a mapping function between two separately trained monolingual semantic spaces [11]. More recently proposed methods jointly optimize the two objectives mentioned above using both monolingual and parallel corpora [6], or exploit document-aligned corpora as a bilingual signal to induce a unified multilingual space [10].
However, the majority of these approaches model a purely lexical semantic space in which each word type is represented by a single vector, thus conflating all possible meanings of a term and ignoring polysemy and homonymy. Following the promising path recently proposed in [9], one key objective of my PhD will be to investigate the use of large multilingual semantic networks to de-conflate multilingual embeddings into their respective sense representations. Applied to a multilingual semantic space, this methodology would place vectors referring to similar meanings close together, irrespective of language. Note that in this scenario the multilingual semantic space is shared by word embeddings as well as sense embeddings.
Extending the example proposed in [9] for the English word digit, with its numerical and anatomical meanings, to a bilingual English-Italian semantic space, the representation of the numerical sense of digit would lie near the numerical sense of the corresponding Italian word cifra, while the vector representing the anatomical sense of digit would lie near the word embedding of dito.
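
The digit example can be made concrete with a minimal sketch. It assumes hand-made three-dimensional toy vectors rather than learned embeddings; the vector values, the key names such as `en:digit#numerical`, and the helpers `cosine`, `nearest` and `average` are all hypothetical and exist only for illustration. The sketch shows that sense-level vectors pick out the expected Italian neighbour, whereas a single conflated vector for digit (here simply the average of its two senses) cannot tell cifra and dito apart.

```python
import math

# Toy vectors invented for illustration only (NOT learned embeddings).
# Axis 0 is "number-like", axis 1 is "body-part-like", axis 2 is noise.
SPACE = {
    "en:digit#numerical": [0.9, 0.1, 0.1],
    "en:digit#anatomical": [0.1, 0.9, 0.2],
    "it:cifra": [0.8, 0.2, 0.1],
    "it:dito": [0.2, 0.8, 0.1],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(query_vec, candidates, space=SPACE):
    """Return the candidate key whose vector is most similar to query_vec."""
    return max(candidates, key=lambda name: cosine(query_vec, space[name]))


def average(u, v):
    """Element-wise mean, used here as a stand-in for a conflated word vector."""
    return [(a + b) / 2 for a, b in zip(u, v)]


italian = ["it:cifra", "it:dito"]

# Sense-level vectors separate the two meanings of "digit".
numerical_match = nearest(SPACE["en:digit#numerical"], italian)
anatomical_match = nearest(SPACE["en:digit#anatomical"], italian)

# A single word-level vector conflates both senses and sits between the two.
conflated = average(SPACE["en:digit#numerical"], SPACE["en:digit#anatomical"])
gap = abs(cosine(conflated, SPACE["it:cifra"]) - cosine(conflated, SPACE["it:dito"]))

print(numerical_match, anatomical_match, round(gap, 6))
```

In a real setting the vectors would come from a trained multilingual model and the sense vectors from a de-conflation step such as the one proposed in [9]; the averaging used here is only a convenient way to mimic a conflated word vector.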

Selected References

1. Al-Rfou, R., et al. 2013. Polyglot: Distributed word representations for multilingual NLP. arXiv preprint arXiv:1307.1662.
2. Camacho-Collados, J., Pilehvar, M. T., and Navigli, R. 2015. A framework for the construction of monolingual and cross-lingual word similarity datasets. ACL-IJCNLP.
3. Camacho-Collados, J., Pilehvar, M. T., and Navigli, R. 2016a. Semantic Representations of Word Senses and Concepts. arXiv preprint arXiv:1608.00841.
4. Camacho-Collados, J., Pilehvar, M. T., and Navigli, R. 2016b. Nasari: Integrating explicit knowledge and corpus statistics for a multilingual representation of concepts and entities. Artificial Intelligence, 240.
5. Choi, E., et al. 2016. Multi-layer Representation Learning for Medical Concepts. arXiv preprint arXiv:1602.05568.
6. Gouws, S., Bengio, Y., and Corrado, G. 2015. BilBOWA: Fast Bilingual Distributed Representations without Word Alignments. Proc. of ICML.
7. Kors, J. A., et al. 2015. A multilingual gold-standard corpus for biomedical concept recognition: the Mantra GSC. Journal of the American Medical Informatics Association, ocv037.
8. Navigli, R., and Ponzetto, S. P. 2012. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193.
9. Pilehvar, M. T., and Collier, N. 2016. De-conflated semantic representations. arXiv preprint arXiv:1608.01961.
10. Vulic, I., and Moens, M. F. 2016. Bilingual Distributed Word Representations from Document-Aligned Comparable Data. Journal of Artificial Intelligence Research, 55, 953-994.
11. Vulic, I., and Korhonen, A. 2016. On the role of seed lexicons in learning bilingual word embeddings. ACL.

Studentship Projects

Project Reference  Relationship  Related To    Start       End         Student Name
NE/M009009/1       -             -             05/10/2015  31/12/2022  -
1945246            Studentship   NE/M009009/1  01/10/2017  30/09/2020  Costanza Conforti
 
Description: Collaboration with the Joint Research Centre (JRC) in Ispra
Organisation: European Commission
Department: Joint Research Centre (JRC)
Country: European Union (EU)
Sector: Public
PI Contribution: We are planning a deep learning model for multi-task, multilingual stance detection and emotion detection on the topic of vaccination perception.
Collaborator Contribution: The JRC in Ispra is providing us with data on the topic of vaccination through its Enterprise API.
Impact: The collaboration is multi-disciplinary. The disciplines involved are Artificial Intelligence (University of Cambridge) and policymaking (JRC).
Start Year: 2019