ReproHum: Investigating Reproducibility of Human Evaluations in Natural Language Processing

Lead Research Organisation: University of Aberdeen

Department Name: Computing Science

Abstract

Over the past few months, we have laid the groundwork for the ReproHum project (summarised in the 'pre-project' column in the Work Plan document) with (i) a study of 20 years of human evaluation in NLG which reviewed and labelled 171 papers in detail, (ii) the development of a classification system for NLP evaluations, (iii) a proposal for a shared task for reproducibility of human evaluation in NLG, and (iv) a proposal for a workshop on human evaluation in NLP. We have built an international network of 20 research teams currently working on human evaluation who will actively contribute to this project (see Track Record section), making combined contributions in kind of over £80,000. This pre-project activity has created an advantageous starting position for the proposed work, and means we can 'hit the ground running' with the scientifically interesting core of the work.

In this foundational project, our key goals are the development of a methodological framework for testing the reproducibility of human evaluations in NLP, and of a multi-lab paradigm for carrying out such tests in practice, carrying out the first study of this kind in NLP. We will (i) systematically diagnose the extent of the human evaluation reproducibility problem in NLP and survey related current work to address it (WP1); (ii) develop the theoretical and methodological underpinnings for reproducibility testing in NLP (WP2); (iii) test the suitability of the shared-task paradigm (uniformly popular across NLP fields) for reproducibility testing (WP3); (iv) create a design for multi-test reproducibility studies, and run the ReproHum study, an international large-scale multi-lab effort conducting 50+ individual, coordinated reproduction attempts on human evaluations in NLP from the past 10 years (WP4); and (v) nurture and build international consensus regarding how to address the reproducibility crisis, via technical meetings and growing our international network of researchers (WP5).

Funded Value:

£227,201

Funded Period:

Apr 22 - Feb 24

Funder:

EPSRC

Project Status:

Active

Project Category:

Research Grant

Project Reference:

EP/V05645X/1

Principal Investigator:

Anya Belz

Research Subject:

Info. & commun. Technol. (60%)

Linguistics (40%)

Research Topic:

Artificial Intelligence (60%)

Computational Linguistics (40%)

Organisations

People	ORCID iD
Anya Belz (Principal Investigator)
Ehud Reiter (Co-Investigator)

Publications

Author Name Title

Publication Date Published

10 25 50

Belz A (2022) A Metrological Perspective on Reproducibility in NLP* in Computational Linguistics

A. Belz (2022) The 2022 ReproGen Shared Task on Reproducibility of Evaluations in NLG: Overview and Results

A. Belz (2022) The 2nd Workshop on Human Evaluation of NLP Systems (HumEval 2022)

Collaboration


Description	Collaboration on ReproHum MLMT Study
Organisation	Bielefeld University
Country	Germany
Sector	Academic/University
PI Contribution	The ReproHum project team coordinate the work of 20 partner labs on this multi-lab multi-test study of factors affecting reproducibility of human evaluations in NLP.
Collaborator Contribution	Each partner lab carry out between one and three individual reproduction experiments.
Impact	Internal progress reports on MLMT study. Lab reports and overall results report to be published at HumEval 2023 Workshop.
Start Year	2022


Description	Collaboration on ReproHum MLMT Study
Organisation	Bocconi University
Country	Italy
Sector	Academic/University
PI Contribution	The ReproHum project team coordinate the work of 20 partner labs on this multi-lab multi-test study of factors affecting reproducibility of human evaluations in NLP.
Collaborator Contribution	Each partner lab carry out between one and three individual reproduction experiments.
Impact	Internal progress reports on MLMT study. Lab reports and overall results report to be published at HumEval 2023 Workshop.
Start Year	2022


Description	Collaboration on ReproHum MLMT Study
Organisation	Charles University
Country	Czech Republic
Sector	Academic/University
PI Contribution	The ReproHum project team coordinate the work of 20 partner labs on this multi-lab multi-test study of factors affecting reproducibility of human evaluations in NLP.
Collaborator Contribution	Each partner lab carry out between one and three individual reproduction experiments.
Impact	Internal progress reports on MLMT study. Lab reports and overall results report to be published at HumEval 2023 Workshop.
Start Year	2022


Description	Collaboration on ReproHum MLMT Study
Organisation	Darmstadt University of Applied Sciences
Country	Germany
Sector	Academic/University
PI Contribution	The ReproHum project team coordinate the work of 20 partner labs on this multi-lab multi-test study of factors affecting reproducibility of human evaluations in NLP.
Collaborator Contribution	Each partner lab carry out between one and three individual reproduction experiments.
Impact	Internal progress reports on MLMT study. Lab reports and overall results report to be published at HumEval 2023 Workshop.
Start Year	2022


Description	Collaboration on ReproHum MLMT Study
Organisation	Edinburgh Napier University
Country	United Kingdom
Sector	Academic/University
PI Contribution	The ReproHum project team coordinate the work of 20 partner labs on this multi-lab multi-test study of factors affecting reproducibility of human evaluations in NLP.
Collaborator Contribution	Each partner lab carry out between one and three individual reproduction experiments.
Impact	Internal progress reports on MLMT study. Lab reports and overall results report to be published at HumEval 2023 Workshop.
Start Year	2022


Description	Collaboration on ReproHum MLMT Study
Organisation	Georgia Institute of Technology
Country	United States
Sector	Academic/University
PI Contribution	The ReproHum project team coordinate the work of 20 partner labs on this multi-lab multi-test study of factors affecting reproducibility of human evaluations in NLP.
Collaborator Contribution	Each partner lab carry out between one and three individual reproduction experiments.
Impact	Internal progress reports on MLMT study. Lab reports and overall results report to be published at HumEval 2023 Workshop.
Start Year	2022


Description	Collaboration on ReproHum MLMT Study
Organisation	Google
Country	United States
Sector	Private
PI Contribution	The ReproHum project team coordinate the work of 20 partner labs on this multi-lab multi-test study of factors affecting reproducibility of human evaluations in NLP.
Collaborator Contribution	Each partner lab carry out between one and three individual reproduction experiments.
Impact	Internal progress reports on MLMT study. Lab reports and overall results report to be published at HumEval 2023 Workshop.
Start Year	2022


Description	Collaboration on ReproHum MLMT Study
Organisation	Heriot-Watt University
Department	School of Mathematical and Computer Sciences
Country	United Kingdom
Sector	Academic/University
PI Contribution	The ReproHum project team coordinate the work of 20 partner labs on this multi-lab multi-test study of factors affecting reproducibility of human evaluations in NLP.
Collaborator Contribution	Each partner lab carry out between one and three individual reproduction experiments.
Impact	Internal progress reports on MLMT study. Lab reports and overall results report to be published at HumEval 2023 Workshop.
Start Year	2022


Description	Collaboration on ReproHum MLMT Study
Organisation	McGill University
Country	Canada
Sector	Academic/University
PI Contribution	The ReproHum project team coordinate the work of 20 partner labs on this multi-lab multi-test study of factors affecting reproducibility of human evaluations in NLP.
Collaborator Contribution	Each partner lab carry out between one and three individual reproduction experiments.
Impact	Internal progress reports on MLMT study. Lab reports and overall results report to be published at HumEval 2023 Workshop.
Start Year	2022


Description	Collaboration on ReproHum MLMT Study
Organisation	Peking University
Country	China
Sector	Academic/University
PI Contribution	The ReproHum project team coordinate the work of 20 partner labs on this multi-lab multi-test study of factors affecting reproducibility of human evaluations in NLP.
Collaborator Contribution	Each partner lab carry out between one and three individual reproduction experiments.
Impact	Internal progress reports on MLMT study. Lab reports and overall results report to be published at HumEval 2023 Workshop.
Start Year	2022


Description	Collaboration on ReproHum MLMT Study
Organisation	Pompeu Fabra University
Department	Department of Information and Communication Technologies
Country	Spain
Sector	Academic/University
PI Contribution	The ReproHum project team coordinate the work of 20 partner labs on this multi-lab multi-test study of factors affecting reproducibility of human evaluations in NLP.
Collaborator Contribution	Each partner lab carry out between one and three individual reproduction experiments.
Impact	Internal progress reports on MLMT study. Lab reports and overall results report to be published at HumEval 2023 Workshop.
Start Year	2022


Description	Collaboration on ReproHum MLMT Study
Organisation	Technological University Dublin
Country	Ireland
Sector	Academic/University
PI Contribution	The ReproHum project team coordinate the work of 20 partner labs on this multi-lab multi-test study of factors affecting reproducibility of human evaluations in NLP.
Collaborator Contribution	Each partner lab carry out between one and three individual reproduction experiments.
Impact	Internal progress reports on MLMT study. Lab reports and overall results report to be published at HumEval 2023 Workshop.
Start Year	2022


Description	Collaboration on ReproHum MLMT Study
Organisation	Trivago NV
Country	Germany
Sector	Private
PI Contribution	The ReproHum project team coordinate the work of 20 partner labs on this multi-lab multi-test study of factors affecting reproducibility of human evaluations in NLP.
Collaborator Contribution	Each partner lab carry out between one and three individual reproduction experiments.
Impact	Internal progress reports on MLMT study. Lab reports and overall results report to be published at HumEval 2023 Workshop.
Start Year	2022


Description	Collaboration on ReproHum MLMT Study
Organisation	University of Chicago
Country	United States
Sector	Academic/University
PI Contribution	The ReproHum project team coordinate the work of 20 partner labs on this multi-lab multi-test study of factors affecting reproducibility of human evaluations in NLP.
Collaborator Contribution	Each partner lab carry out between one and three individual reproduction experiments.
Impact	Internal progress reports on MLMT study. Lab reports and overall results report to be published at HumEval 2023 Workshop.
Start Year	2022


Description	Collaboration on ReproHum MLMT Study
Organisation	University of Groningen
Country	Netherlands
Sector	Academic/University
PI Contribution	The ReproHum project team coordinate the work of 20 partner labs on this multi-lab multi-test study of factors affecting reproducibility of human evaluations in NLP.
Collaborator Contribution	Each partner lab carry out between one and three individual reproduction experiments.
Impact	Internal progress reports on MLMT study. Lab reports and overall results report to be published at HumEval 2023 Workshop.
Start Year	2022


Description	Collaboration on ReproHum MLMT Study
Organisation	University of North Carolina at Charlotte
Country	United States
Sector	Academic/University
PI Contribution	The ReproHum project team coordinate the work of 20 partner labs on this multi-lab multi-test study of factors affecting reproducibility of human evaluations in NLP.
Collaborator Contribution	Each partner lab carry out between one and three individual reproduction experiments.
Impact	Internal progress reports on MLMT study. Lab reports and overall results report to be published at HumEval 2023 Workshop.
Start Year	2022


Description	Collaboration on ReproHum MLMT Study
Organisation	University of Santiago de Compostela
Country	Spain
Sector	Academic/University
PI Contribution	The ReproHum project team coordinate the work of 20 partner labs on this multi-lab multi-test study of factors affecting reproducibility of human evaluations in NLP.
Collaborator Contribution	Each partner lab carry out between one and three individual reproduction experiments.
Impact	Internal progress reports on MLMT study. Lab reports and overall results report to be published at HumEval 2023 Workshop.
Start Year	2022


Description	Collaboration on ReproHum MLMT Study
Organisation	University of Tilburg
Country	Netherlands
Sector	Academic/University
PI Contribution	The ReproHum project team coordinate the work of 20 partner labs on this multi-lab multi-test study of factors affecting reproducibility of human evaluations in NLP.
Collaborator Contribution	Each partner lab carry out between one and three individual reproduction experiments.
Impact	Internal progress reports on MLMT study. Lab reports and overall results report to be published at HumEval 2023 Workshop.
Start Year	2022


Description	Collaboration on ReproHum MLMT Study
Organisation	Utrecht University
Country	Netherlands
Sector	Academic/University
PI Contribution	The ReproHum project team coordinate the work of 20 partner labs on this multi-lab multi-test study of factors affecting reproducibility of human evaluations in NLP.
Collaborator Contribution	Each partner lab carry out between one and three individual reproduction experiments.
Impact	Internal progress reports on MLMT study. Lab reports and overall results report to be published at HumEval 2023 Workshop.
Start Year	2022


Description	Collaboration on ReproHum MLMT Study
Organisation	Zurich University of Applied Sciences
Country	Switzerland
Sector	Academic/University
PI Contribution	The ReproHum project team coordinate the work of 20 partner labs on this multi-lab multi-test study of factors affecting reproducibility of human evaluations in NLP.
Collaborator Contribution	Each partner lab carry out between one and three individual reproduction experiments.
Impact	Internal progress reports on MLMT study. Lab reports and overall results report to be published at HumEval 2023 Workshop.
Start Year	2022

Abstract

Organisations

People

ORCID iD

Publications