ReproHum: Investigating Reproducibility of Human Evaluations in Natural Language Processing
Lead Research Organisation:
University of Aberdeen
Department Name: Computing Science
Abstract
Over the past few months, we have laid the groundwork for the ReproHum project (summarised in the 'pre-project' column in the Work Plan document) with (i) a study of 20 years of human evaluation in NLG which reviewed and labelled 171 papers in detail, (ii) the development of a classification system for NLP evaluations, (iii) a proposal for a shared task for reproducibility of human evaluation in NLG, and (iv) a proposal for a workshop on human evaluation in NLP. We have built an international network of 20 research teams currently working on human evaluation who will actively contribute to this project (see Track Record section), making combined contributions in kind of over £80,000. This pre-project activity has created an advantageous starting position for the proposed work, and means we can 'hit the ground running' with the scientifically interesting core of the work.
In this foundational project, our key goals are the development of a methodological framework for testing the reproducibility of human evaluations in NLP, and of a multi-lab paradigm for carrying out such tests in practice, culminating in the first study of this kind in NLP. We will (i) systematically diagnose the extent of the human evaluation reproducibility problem in NLP and survey related current work to address it (WP1); (ii) develop the theoretical and methodological underpinnings for reproducibility testing in NLP (WP2); (iii) test the suitability of the shared-task paradigm (uniformly popular across NLP fields) for reproducibility testing (WP3); (iv) create a design for multi-test reproducibility studies, and run the ReproHum study, an international large-scale multi-lab effort conducting 50+ individual, coordinated reproduction attempts on human evaluations in NLP from the past 10 years (WP4); and (v) nurture and build international consensus on how to address the reproducibility crisis, via technical meetings and by growing our international network of researchers (WP5).
Organisations
- University of Aberdeen (Lead Research Organisation)
- Zurich University of Applied Sciences (Collaboration)
- University of Groningen (Collaboration, Project Partner)
- Edinburgh Napier University (Collaboration, Project Partner)
- University of North Carolina at Charlotte (Collaboration, Project Partner)
- Technological University Dublin (Collaboration, Project Partner)
- Darmstadt University of Applied Sciences (Collaboration)
- Pompeu Fabra University (Collaboration, Project Partner)
- Heriot-Watt University (Collaboration, Project Partner)
- Bielefeld University (Collaboration)
- Utrecht University (Collaboration, Project Partner)
- University of Santiago de Compostela (Collaboration, Project Partner)
- Georgia Institute of Technology (Collaboration)
- Charles University (Collaboration, Project Partner)
- McGill University (Collaboration, Project Partner)
- University of Chicago (Collaboration)
- Bocconi University (Collaboration)
- Peking University (Collaboration, Project Partner)
- Tilburg University (Collaboration, Project Partner)
- Google (Collaboration)
- Trivago NV (Collaboration, Project Partner)
- University of Malta (Project Partner)
- University of Michigan–Ann Arbor (Project Partner)
- TU Darmstadt (Project Partner)
- University of Manchester (Project Partner)
- Heidelberg University (Project Partner)
- Trinity College Dublin (Project Partner)
- VU Amsterdam (Project Partner)
People | ORCID iD
Anya Belz (Principal Investigator) |
Ehud Reiter (Co-Investigator) |
Publications


Belz A. (2022) A Metrological Perspective on Reproducibility in NLP, in Computational Linguistics

Thomson C. (2024) Common Flaws in Running Human Evaluation Experiments in NLP, in Computational Linguistics
Description | The ReproHum project set out to address the following challenges in the field of NLP:
1. We do not currently have any way of testing the validity of human evaluations other than attempting to reproduce them;
2. Human evaluations play a central role in NLP and are widely assumed to yield 'true' estimates of quality [10]: new automatic evaluation metrics are routinely meta-evaluated through correlation with human evaluation results, yet the latter are not themselves subject to any form of validation;
3. We do not currently have a solid basis for meta-evaluation or reproducibility testing: the most fundamental prerequisite for these is determining whether two evaluations are comparable in the sense that they assess the same aspect of quality; in fact, we have found plenty of evidence that evaluations using the same term (e.g. Fluency) often do not measure the same thing [1, 3];
4. The validity of large swathes of conclusions in NLP is unknown because they are drawn on assumptions of comparability: given (2) and (3) above, any conclusion based on the assumption that a set of evaluations are comparable, without evidence for that assumption, is open to doubt.
The ReproHum project was designed to address these challenges head on. The project's targeted outcomes and their state of completion are as follows:
A. A formal system for assessing the comparability of any two human evaluations: Quantitative Reproducibility Assessment (QRA) and its extension QRA++. STATUS: complete.
B. Experimental design and reporting templates for enhancing the reproducibility of individual evaluation experiments: the HEDS datasheet and its ReproNLP extension. STATUS: complete.
C. A theoretical and methodological framework for carrying out (i) individual reproducibility assessments and (ii) multi-test studies of human evaluation results in NLP, both validated through a shared task on reproducibility of human evaluations and a large-scale multi-lab study of reproducibility of human evaluations: the Common Approach to Reproduction in NLP, the design of the ReproHum Multi-Lab Multi-Test (MLMT) Study, and the ReproNLP Shared Task 2021-2024. STATUS: MLMT Study ongoing; otherwise complete.
Apart from completing and reporting the MLMT Study, we have thus achieved the ambitious overall goal of the ReproHum project: to find workable solutions that comprehensively address the reproducibility crisis in human evaluation in NLP. |
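As an illustration of the kind of measure QRA provides (outcome A above; see Belz (2022) under Publications), the following is a minimal sketch, assuming the headline degree-of-reproducibility statistic is a small-sample-corrected coefficient of variation computed over repeated measurements of the same evaluation score. The function name, correction factor, and example numbers are illustrative assumptions, not the project's reference implementation.

```python
import math

def reproducibility_cv(scores):
    """Degree-of-reproducibility sketch in the spirit of QRA:
    coefficient of variation (in percent) over repeated measurements
    of the same evaluation result, with a standard small-sample
    correction (assumed here; see Belz (2022) for exact definitions)."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two measurements")
    mean = sum(scores) / n          # assumes the mean score is non-zero
    sd = math.sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))  # sample std dev
    cv = 100.0 * sd / abs(mean)     # coefficient of variation, in percent
    return cv * (1 + 1 / (4 * n))   # small-sample correction

# Hypothetical example: an original Fluency score of 4.1 and two
# reproductions scoring 3.8 and 4.3.
print(round(reproducibility_cv([4.1, 3.8, 4.3]), 2))
```

On this kind of measure, lower values indicate a higher degree of reproducibility, with 0 corresponding to identical results across all repeat measurements.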
Exploitation Route | We have created a number of reusable resources for reproducibility (the ReproHum Reproducibility Toolbox):
- HEDS Human Evaluation Datasheet
- Common Approach to Reproduction
- Quantitative Reproducibility Assessment
- Repository of annotated papers with full details about the experiments carried out, together with completed HEDS sheets
- Benchmark datasets/tasks |
Sectors | Digital/Communication/Information Technologies (including Software) |
URL | https://humeval.github.io/, https://repronlp.github.io, https://reprohum.github.io
Description | Collaboration on ReproHum MLMT Study |
Organisation | Bielefeld University |
Country | Germany |
Sector | Academic/University |
PI Contribution | The ReproHum project team coordinate the work of 20 partner labs on this multi-lab multi-test study of factors affecting reproducibility of human evaluations in NLP. |
Collaborator Contribution | Each partner lab carries out between one and three individual reproduction experiments. |
Impact | Internal progress reports on MLMT study. Lab reports and overall results report to be published at HumEval 2023 Workshop. |
Start Year | 2022 |
Description | Collaboration on ReproHum MLMT Study |
Organisation | Bocconi University |
Country | Italy |
Sector | Academic/University |
PI Contribution | The ReproHum project team coordinate the work of 20 partner labs on this multi-lab multi-test study of factors affecting reproducibility of human evaluations in NLP. |
Collaborator Contribution | Each partner lab carries out between one and three individual reproduction experiments. |
Impact | Internal progress reports on MLMT study. Lab reports and overall results report to be published at HumEval 2023 Workshop. |
Start Year | 2022 |
Description | Collaboration on ReproHum MLMT Study |
Organisation | Charles University |
Country | Czech Republic |
Sector | Academic/University |
PI Contribution | The ReproHum project team coordinate the work of 20 partner labs on this multi-lab multi-test study of factors affecting reproducibility of human evaluations in NLP. |
Collaborator Contribution | Each partner lab carries out between one and three individual reproduction experiments. |
Impact | Internal progress reports on MLMT study. Lab reports and overall results report to be published at HumEval 2023 Workshop. |
Start Year | 2022 |
Description | Collaboration on ReproHum MLMT Study |
Organisation | Darmstadt University of Applied Sciences |
Country | Germany |
Sector | Academic/University |
PI Contribution | The ReproHum project team coordinate the work of 20 partner labs on this multi-lab multi-test study of factors affecting reproducibility of human evaluations in NLP. |
Collaborator Contribution | Each partner lab carries out between one and three individual reproduction experiments. |
Impact | Internal progress reports on MLMT study. Lab reports and overall results report to be published at HumEval 2023 Workshop. |
Start Year | 2022 |
Description | Collaboration on ReproHum MLMT Study |
Organisation | Edinburgh Napier University |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | The ReproHum project team coordinate the work of 20 partner labs on this multi-lab multi-test study of factors affecting reproducibility of human evaluations in NLP. |
Collaborator Contribution | Each partner lab carries out between one and three individual reproduction experiments. |
Impact | Internal progress reports on MLMT study. Lab reports and overall results report to be published at HumEval 2023 Workshop. |
Start Year | 2022 |
Description | Collaboration on ReproHum MLMT Study |
Organisation | Georgia Institute of Technology |
Country | United States |
Sector | Academic/University |
PI Contribution | The ReproHum project team coordinate the work of 20 partner labs on this multi-lab multi-test study of factors affecting reproducibility of human evaluations in NLP. |
Collaborator Contribution | Each partner lab carries out between one and three individual reproduction experiments. |
Impact | Internal progress reports on MLMT study. Lab reports and overall results report to be published at HumEval 2023 Workshop. |
Start Year | 2022 |
Description | Collaboration on ReproHum MLMT Study |
Organisation | Google |
Country | United States |
Sector | Private |
PI Contribution | The ReproHum project team coordinate the work of 20 partner labs on this multi-lab multi-test study of factors affecting reproducibility of human evaluations in NLP. |
Collaborator Contribution | Each partner lab carries out between one and three individual reproduction experiments. |
Impact | Internal progress reports on MLMT study. Lab reports and overall results report to be published at HumEval 2023 Workshop. |
Start Year | 2022 |
Description | Collaboration on ReproHum MLMT Study |
Organisation | Heriot-Watt University |
Department | School of Mathematical and Computer Sciences |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | The ReproHum project team coordinate the work of 20 partner labs on this multi-lab multi-test study of factors affecting reproducibility of human evaluations in NLP. |
Collaborator Contribution | Each partner lab carries out between one and three individual reproduction experiments. |
Impact | Internal progress reports on MLMT study. Lab reports and overall results report to be published at HumEval 2023 Workshop. |
Start Year | 2022 |
Description | Collaboration on ReproHum MLMT Study |
Organisation | McGill University |
Country | Canada |
Sector | Academic/University |
PI Contribution | The ReproHum project team coordinate the work of 20 partner labs on this multi-lab multi-test study of factors affecting reproducibility of human evaluations in NLP. |
Collaborator Contribution | Each partner lab carries out between one and three individual reproduction experiments. |
Impact | Internal progress reports on MLMT study. Lab reports and overall results report to be published at HumEval 2023 Workshop. |
Start Year | 2022 |
Description | Collaboration on ReproHum MLMT Study |
Organisation | Peking University |
Country | China |
Sector | Academic/University |
PI Contribution | The ReproHum project team coordinate the work of 20 partner labs on this multi-lab multi-test study of factors affecting reproducibility of human evaluations in NLP. |
Collaborator Contribution | Each partner lab carries out between one and three individual reproduction experiments. |
Impact | Internal progress reports on MLMT study. Lab reports and overall results report to be published at HumEval 2023 Workshop. |
Start Year | 2022 |
Description | Collaboration on ReproHum MLMT Study |
Organisation | Pompeu Fabra University |
Department | Department of Information and Communication Technologies |
Country | Spain |
Sector | Academic/University |
PI Contribution | The ReproHum project team coordinate the work of 20 partner labs on this multi-lab multi-test study of factors affecting reproducibility of human evaluations in NLP. |
Collaborator Contribution | Each partner lab carries out between one and three individual reproduction experiments. |
Impact | Internal progress reports on MLMT study. Lab reports and overall results report to be published at HumEval 2023 Workshop. |
Start Year | 2022 |
Description | Collaboration on ReproHum MLMT Study |
Organisation | Technological University Dublin |
Country | Ireland |
Sector | Academic/University |
PI Contribution | The ReproHum project team coordinate the work of 20 partner labs on this multi-lab multi-test study of factors affecting reproducibility of human evaluations in NLP. |
Collaborator Contribution | Each partner lab carries out between one and three individual reproduction experiments. |
Impact | Internal progress reports on MLMT study. Lab reports and overall results report to be published at HumEval 2023 Workshop. |
Start Year | 2022 |
Description | Collaboration on ReproHum MLMT Study |
Organisation | Trivago NV |
Country | Germany |
Sector | Private |
PI Contribution | The ReproHum project team coordinate the work of 20 partner labs on this multi-lab multi-test study of factors affecting reproducibility of human evaluations in NLP. |
Collaborator Contribution | Each partner lab carries out between one and three individual reproduction experiments. |
Impact | Internal progress reports on MLMT study. Lab reports and overall results report to be published at HumEval 2023 Workshop. |
Start Year | 2022 |
Description | Collaboration on ReproHum MLMT Study |
Organisation | University of Chicago |
Country | United States |
Sector | Academic/University |
PI Contribution | The ReproHum project team coordinate the work of 20 partner labs on this multi-lab multi-test study of factors affecting reproducibility of human evaluations in NLP. |
Collaborator Contribution | Each partner lab carries out between one and three individual reproduction experiments. |
Impact | Internal progress reports on MLMT study. Lab reports and overall results report to be published at HumEval 2023 Workshop. |
Start Year | 2022 |
Description | Collaboration on ReproHum MLMT Study |
Organisation | University of Groningen |
Country | Netherlands |
Sector | Academic/University |
PI Contribution | The ReproHum project team coordinate the work of 20 partner labs on this multi-lab multi-test study of factors affecting reproducibility of human evaluations in NLP. |
Collaborator Contribution | Each partner lab carries out between one and three individual reproduction experiments. |
Impact | Internal progress reports on MLMT study. Lab reports and overall results report to be published at HumEval 2023 Workshop. |
Start Year | 2022 |
Description | Collaboration on ReproHum MLMT Study |
Organisation | University of North Carolina at Charlotte |
Country | United States |
Sector | Academic/University |
PI Contribution | The ReproHum project team coordinate the work of 20 partner labs on this multi-lab multi-test study of factors affecting reproducibility of human evaluations in NLP. |
Collaborator Contribution | Each partner lab carries out between one and three individual reproduction experiments. |
Impact | Internal progress reports on MLMT study. Lab reports and overall results report to be published at HumEval 2023 Workshop. |
Start Year | 2022 |
Description | Collaboration on ReproHum MLMT Study |
Organisation | University of Santiago de Compostela |
Country | Spain |
Sector | Academic/University |
PI Contribution | The ReproHum project team coordinate the work of 20 partner labs on this multi-lab multi-test study of factors affecting reproducibility of human evaluations in NLP. |
Collaborator Contribution | Each partner lab carries out between one and three individual reproduction experiments. |
Impact | Internal progress reports on MLMT study. Lab reports and overall results report to be published at HumEval 2023 Workshop. |
Start Year | 2022 |
Description | Collaboration on ReproHum MLMT Study |
Organisation | Tilburg University |
Country | Netherlands |
Sector | Academic/University |
PI Contribution | The ReproHum project team coordinate the work of 20 partner labs on this multi-lab multi-test study of factors affecting reproducibility of human evaluations in NLP. |
Collaborator Contribution | Each partner lab carries out between one and three individual reproduction experiments. |
Impact | Internal progress reports on MLMT study. Lab reports and overall results report to be published at HumEval 2023 Workshop. |
Start Year | 2022 |
Description | Collaboration on ReproHum MLMT Study |
Organisation | Utrecht University |
Country | Netherlands |
Sector | Academic/University |
PI Contribution | The ReproHum project team coordinate the work of 20 partner labs on this multi-lab multi-test study of factors affecting reproducibility of human evaluations in NLP. |
Collaborator Contribution | Each partner lab carries out between one and three individual reproduction experiments. |
Impact | Internal progress reports on MLMT study. Lab reports and overall results report to be published at HumEval 2023 Workshop. |
Start Year | 2022 |
Description | Collaboration on ReproHum MLMT Study |
Organisation | Zurich University of Applied Sciences |
Country | Switzerland |
Sector | Academic/University |
PI Contribution | The ReproHum project team coordinate the work of 20 partner labs on this multi-lab multi-test study of factors affecting reproducibility of human evaluations in NLP. |
Collaborator Contribution | Each partner lab carries out between one and three individual reproduction experiments. |
Impact | Internal progress reports on MLMT study. Lab reports and overall results report to be published at HumEval 2023 Workshop. |
Start Year | 2022 |