Understanding the annotation process: annotation for Big data
Lead Research Organisation:
University of Sheffield
Department Name: Information School
Abstract
Data is being collected and created at the fastest rate in human history, and by far the vast majority of it is in digital format. Allied with this, what was previously "offline" information, such as old manuscripts and maps, can now be digitised quickly and cheaply. This vast collection of existing and new information creates new opportunities as well as new difficulties. For much of this information to be useful it must be categorised and annotated in some way, so that sense can be made of the data and so that the correct data can be accessed more easily. It is possible to complete this categorisation by hand with human annotators, but this effort can be expensive in terms of time, money and resources. This is especially true for large data sets, or for data sets that require niche expertise to annotate. With this expense in mind, many have turned to machine learning to annotate data; however, machine learning approaches still require human intervention both to create training sets for algorithms and to judge the output of those algorithms. It is therefore inevitable that humans are involved at some stage of the categorisation and annotation process. In this project we aim to gain a better understanding of this annotation process so that we can provide guidelines, approaches and processes for producing the most cost-effective and accurate annotations for data sets.
We propose to work with the three main types of unstructured data faced in big data: text, image and video. The first challenge is to better understand the process assessors go through when annotating and judging different types of material. This will be carried out using a mixture of qualitative and quantitative techniques in smaller-scale lab-based studies. By better understanding the process by which individuals annotate and classify material, we hope to provide insights which can be used to make the annotation process more efficient, and to identify an initial set of factors which affect annotation performance, such as degree of domain expertise and time. Based on this initial work, the aim is then to investigate which of these factors most affect assessment, using large-scale crowdsourcing-style methods. The final challenge relates to the classification task: how should annotation be approached to give the best results when used in machine learning? Based on this, the project aims to create a set of guidelines for the creation of annotation and relevance sets.
Planned Impact
We expect the project to have impact both within the research community and outside it. We have identified the following groups as the primary beneficiaries of our work:
- The research community - the project will generate annotation sets which can be used in a range of future research, including the evaluation of novel machine learning systems (e.g. for classification) and information retrieval systems. Outside these technical areas, there is also potential impact in the broader social science and information science areas, by providing guidelines for the creation of annotation sets that are tailored to the new automatic learning systems designed for big data.
- Digital librarians and archivists - the availability of best practices for annotating digital archives will make it easier for librarians and archivists to prepare for automatic "big data" technologies. Automatic annotation is still in its infancy, and is rarely applied in end-user focused repositories, either in industry or in academia. Guidelines which both make the most of automatic techniques and demonstrate what those techniques are capable of, and, perhaps even more importantly, what they are not capable of, are expected to be valuable for information professionals. Ultimately, automated systems will help individuals to maintain and share digital archives, which will be especially pertinent for specialised archives where there is a lack of resources for manual classification, e.g. niche historical archives.
- Industry information professionals - as with digital librarians and archivists, many information professionals are currently under great cost and time pressures, e.g. in health information librarianship or legal research. In such specialised domains, access to experts may be very expensive; in the legal domain, for example, paying experienced lawyers to judge the relevance of material to cases can be very costly. The annotation design guidelines created in this project would allow organisations to focus sparse resources in particular areas, in order to maximise the performance of automated classification techniques and minimise the use of experts (e.g. using experts only for marginally relevant or difficult-to-categorise material).
- Educators - as with the research community, the project will generate data sets which can also be used in education. We would also hope that such open data sets would encourage the use of automatic techniques outside the research community, since the resources will be more easily available to students and teachers alike.
- General public - it is anticipated that through this work digital archives can be annotated more quickly and efficiently. In addition, it is hoped that physical resources and archives can be digitised and annotated more quickly and efficiently. Thus, ultimately, more of these resources will become available to the UK population. Access to these resources will be beneficial in terms of learning and awareness, as well as providing access to information that would otherwise not be available to the public, or might even have been lost.
Publications
Halvey M
(2015)
SIGIR 2014 Workshop on Gathering Efficient Assessments of Relevance (GEAR)
in ACM SIGIR Forum
Hasler L
(2015)
Augmented Test Collections: A Step in the Right Direction
| Description | Making sense of data is a core part of the "big data" area, and one important tool in big data is the use of training sets, which are generally created manually by assessors. This project has investigated the behaviour of assessors when engaged on typical judging tasks. Fundamental to many judgement tasks is the concept of relevance, e.g. in web search the degree of relevance of a document to some search task as judged by a user or assessor. In this project a number of studies were carried out to investigate in detail how assessors judge the relevance of different types of material. The assessments themselves took place in different settings: in the Information School's usability lab, where the behaviour of participants could be recorded in detail using both qualitative and quantitative techniques, and via crowdsourcing, where participants are remote and unknown but data can be gathered from a much larger group. One study focused on the notion of relevance in search. Data collection in this case started by generating "real world" search tasks through the use of a survey. This allowed a focus on the differences between primary and secondary assessors, i.e. the differences between an assessor who created an original search task and assessors judging tasks created by others. Initial findings suggest that while secondary assessors may find the assessment task challenging in various ways, agreement between primary and secondary assessors is high, suggesting that secondary assessors may typically be "good enough" in many situations (a sketch of how such agreement can be quantified is given after this entry). A second example of the work carried out in the project was the use of a video collection to create a graded relevance collection which includes behavioural and perceptual information about the judgement process. This data can be used to augment a training set or test collection (used to judge how well an automatic system performs). The creation of "augmented test collections" was one of the themes discussed at some length at the workshop organised by the project at the ACM SIGIR conference in 2014. Other work has looked at how assessors judge images, which was also published at the SIGIR conference. One of the themes that has arisen is the trade-offs inherent in different collection methods: small-scale but detailed in-lab studies, or gathering less detail but at a larger scale using crowdsourcing. It is likely that both will be required in different situations. While we might directly ask participants to make judgements, results have suggested that such judgements are not always robust. An example from this project is how differently people interpret their 'confidence' in a relevance judgement. Such differences when studying individuals suggest that there are dangers in always working at a large scale and taking participants' judgements as correct. |
| Exploitation Route | A large quantity of data has been collected by the project, and it is hoped that students at Strathclyde and Sheffield will be able to continue preparing this data for distribution, and to use the data to investigate machine learning techniques as envisaged by the project. With the benefit of hindsight the original project aims proved overambitious for the time available, but it is hoped that the data collected will provide a rich source for further investigations by the research community and, in time, more fully address all of the project objectives. |
| Sectors | Culture, Heritage, Museums and Collections |
| URL | https://pure.strath.ac.uk/portal/en/projects/understanding-the-annotation-process-annotation-for-big-data(9353268d-0c6a-4c9e-a492-cc60818474fc).html |
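As flagged in the findings above, agreement between primary and secondary assessors is usually reported with a chance-corrected statistic. The sketch below is illustrative only: the binary judgement lists are hypothetical, not drawn from the project's data, and Cohen's kappa (here via scikit-learn) is just one common choice of agreement measure.

```python
# Minimal sketch: chance-corrected agreement between a primary assessor
# (the person who created the search task) and a secondary assessor.
# The judgement lists below are hypothetical, not taken from the project data.
from sklearn.metrics import cohen_kappa_score

primary_judgements = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]    # 1 = relevant, 0 = not relevant
secondary_judgements = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]  # same documents, different assessor

kappa = cohen_kappa_score(primary_judgements, secondary_judgements)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0.0 = chance-level
```

Simple percentage agreement or overlap could be reported instead, and for graded rather than binary judgements a weighted kappa would be the more natural choice.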
| Title | A Comparison of Primary and Secondary Relevance Judgements for Real-Life Topics |
| Description | This dataset is from a user study that examines in detail the differences between primary and secondary assessors on a set of "real-world" topics which were gathered specifically for this study. By gathering topics which are representative of the staff and students at a major university, at a particular point in time, we aimed to explore differences between primary and secondary relevance judgements for real-life search tasks. Findings from our study suggest that while secondary assessors may find the assessment task challenging in various ways (they generally possess less interest and knowledge in secondary topics and take longer to assess documents), agreement between primary and secondary assessors is high. |
| Type Of Material | Database/Collection of data |
| Year Produced | 2016 |
| Provided To Others? | Yes |
| Impact | Work which developed and used this data set will be presented at the ACM SIGIR Conference on Human Information Interaction and Retrieval 2016. |
| URL | https://pure.strath.ac.uk/portal/en/datasets/a-comparison-of-primary-and-secondary-relevance-judgeme... |
| Title | Evaluating the effort involved in relevance assessments for images |
| Description | This data set contains data from a user evaluation of image relevance assessments conducted by Halvey and Villa and published at SIGIR 2014. This evaluation was part of the UK Arts and Humanities Research Council funded project "Understanding the annotation process: annotation for big data" (grant AH/L010364/1). Full details of the evaluation and the data are available in the attached file data_description.txt. All files attached are text, with files containing data being tab delimited (see the loading sketch after this entry). |
| Type Of Material | Database/Collection of data |
| Year Produced | 2015 |
| Provided To Others? | Yes |
| Impact | This data set was used in work published in: Halvey, M., & Villa, R. (2014). Evaluating the effort involved in relevance assessments for images. In Proceedings of the 37th international ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR '14). (pp. 887-890). New York. 10.1145/2600428.2609466 |
| URL | https://pure.strath.ac.uk/portal/en/datasets/evaluating-the-effort-involved-in-relevance-assessments... |
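Because the released files are described above as plain text with tab-delimited data, a few lines of Python are enough to read them. This is only a sketch: the file name and the assumption of a header row are placeholders, and the actual layout is documented in data_description.txt.

```python
# Minimal sketch for reading one of the tab-delimited data files.
# "assessments.txt" is a placeholder name, not the real file in the dataset.
import csv

with open("assessments.txt", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f, delimiter="\t")  # assumes the first row is a header
    for row in reader:
        print(row)  # e.g. one relevance assessment event per row
```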
| Title | Video Test Collection with Graded Relevance Assessments |
| Description | This dataset contains a video test collection with variable levels of relevance (i.e. graded relevance assessments); to the best of our knowledge this is the first example of such a test collection. We also gathered behavioural and perceptual data from assessors during the assessment process, which is also novel for a test collection (a sketch of how graded assessments support graded evaluation measures follows this entry). |
| Type Of Material | Database/Collection of data |
| Year Produced | 2015 |
| Provided To Others? | Yes |
| Impact | A short paper describing this dataset was published: Qiying, W, Halvey, M & Villa, R 2016, 'Video test collection with graded relevance assessments'. in ACM SIGIR Conference on Human Information Interaction and Retrieval. ACM SIGIR Conference on Human Information Interaction and Retrieval, Chapel Hill, United States, 13-17 March., 10.1145/2854946.2854980 |
| URL | https://pure.strath.ac.uk/portal/en/datasets/video-test-collection-with-graded-relevance-assessments... |
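Graded relevance assessments like those in this collection are what make graded effectiveness measures such as nDCG computable, since a binary collection cannot distinguish a highly relevant video from a marginally relevant one. The sketch below is not code distributed with the collection; it assumes hypothetical relevance grades (0-3) and a made-up system ranking purely to show the calculation.

```python
# Minimal sketch: nDCG over hypothetical graded relevance judgements (grades 0-3).
import math

def dcg(grades):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    return sum((2 ** g - 1) / math.log2(rank + 2) for rank, g in enumerate(grades))

def ndcg(system_grades, all_judged_grades):
    """DCG of the system ranking, normalised by the DCG of the ideal ranking."""
    ideal = sorted(all_judged_grades, reverse=True)[:len(system_grades)]
    ideal_dcg = dcg(ideal)
    return dcg(system_grades) / ideal_dcg if ideal_dcg > 0 else 0.0

# Grades of the top 5 videos returned by a hypothetical system for one topic,
# plus the grades of all videos judged for that topic.
system_grades = [3, 2, 0, 1, 2]
all_judged_grades = [3, 3, 2, 2, 1, 1, 0, 0, 0]
print(f"nDCG@5 = {ndcg(system_grades, all_judged_grades):.3f}")
```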
