BetterCrowd: Human Computation for Big Data

Lead Research Organisation: University of Sheffield
Department Name: Information School

Abstract

In the last few years we have seen a rapid increase in available data. Digitization has become endemic. This has led to a data deluge that has left many unable to cope with such large amounts of messy data. Moreover, because of the large number of content producers and the variety of formats, data is not always easy for machines to process, owing to its varying quality and the presence of bias. In the current data-driven economy, organizations that can effectively analyze data at scale and use it as decision-support infrastructure at the executive level will gain a key competitive advantage. To deal with the current data deluge, in the BetterCrowd project I will define and evaluate Human Computation methods to improve both the effectiveness and efficiency of currently available hybrid Human-Machine systems.

Human Computation (HC) is a game-changing paradigm that systematically exploits human intelligence at scale to improve purely machine-based data management systems (see, for example, CrowdDB [13]). This is often obtained by means of Crowdsourcing, that is, outsourcing certain tasks from the machine to a crowd of human individuals who perform short tasks (also known as Human Intelligence Tasks or HITs) that are simple for humans but still difficult for machines (e.g., understanding the content of a picture or sarcasm in text). Involving humans in the computation process is a fundamental scientific challenge that requires obtaining the best from human abilities and effectively embedding them into traditional computational systems. The challenges involved in the use of HC concern both its efficiency (i.e., humans are naturally slower than machines at processing information) and its effectiveness (i.e., while machines compute deterministically, human behavior may be unpredictable and possibly malicious).

The project is composed of two main parts. In the first part, we will look at how to improve crowdsourcing effectiveness by proposing novel techniques to detect malicious workers in crowdsourcing platforms. In the second part, we will make HC techniques scale so that they can be applied to larger volumes of data, focusing on scheduling tasks to the crowd (WP2).
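To make the idea of detecting unreliable workers concrete, here is a minimal Python sketch of one common baseline: scoring each worker against tasks with known ("gold") answers. It is not the technique developed in this project. The function name, the input layout, and both thresholds are hypothetical choices made for illustration only.

```python
from collections import defaultdict


def flag_suspicious_workers(answers, gold, min_accuracy=0.6, min_gold_answers=3):
    """Return the set of workers whose accuracy on gold tasks is below a threshold.

    answers: iterable of (worker_id, task_id, label) tuples (assumed layout).
    gold: dict mapping task_id -> known correct label.
    Both thresholds are arbitrary illustrative values.
    """
    correct = defaultdict(int)
    seen = defaultdict(int)
    for worker_id, task_id, label in answers:
        if task_id in gold:
            seen[worker_id] += 1
            if label == gold[task_id]:
                correct[worker_id] += 1

    flagged = set()
    for worker_id, n in seen.items():
        # Only judge workers with enough gold answers to be meaningful.
        if n >= min_gold_answers and correct[worker_id] / n < min_accuracy:
            flagged.add(worker_id)
    return flagged


gold = {"t1": "cat", "t2": "dog", "t3": "cat"}
answers = [
    ("w1", "t1", "cat"), ("w1", "t2", "dog"), ("w1", "t3", "cat"),
    ("w2", "t1", "dog"), ("w2", "t2", "cat"), ("w2", "t3", "dog"),
    ("w3", "t1", "dog"),  # too few gold answers to judge
]
flagged = flag_suspicious_workers(answers, gold)
```

Low accuracy on gold tasks does not by itself prove malicious intent; it only marks workers whose answers may deserve closer inspection.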

Planned Impact

Other than academic beneficiaries, this research will impact commercial organizations and people involved in crowdsourcing activities.

Big Data Analytics Market.
The scale of data currently being produced by large organizations requires novel ways of managing and, most importantly, analyzing such enormous amounts of data in order to produce value for data consumers (e.g., company customers, employees, or governmental organization clients). In this context, high data quality is critical. The techniques developed within the scope of this project for efficient and effective human computation can be used to create better Big Data solutions and products in coordination with analytics platforms used within large-scale organizations. An example of industry use of hybrid human-machine techniques by means of crowdsourcing is already in place at Twitter where new trending topics and acronyms are detected in real-time by a mix of machine-based stream processing approaches and crowdsourcing on Amazon Mechanical Turk. The outcome of the BetterCrowd project has potential for impact on Big Data crowd-based solutions by optimizing requests to the crowd and the overall output quality produced.

Enterprise crowdsourcing.
In the enterprise domain, large companies (e.g., IBM, Microsoft, VeriSign) have already started to run in-house crowdsourcing, in which employees of the company are the crowd to which data quality HITs are sent. Crowdsourcing in this context differs in aspects such as worker reputation and incentives. However, research findings on scalability will be directly applicable to this way of leveraging knowledge workers within companies. Scheduling HITs to a crowd is extremely important for in-house crowdsourcing, as the crowd can be considered a single multi-tenant system in which jobs run with different priorities.
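The view of an in-house crowd as a multi-tenant system with prioritised jobs can be sketched in Python as a simple priority queue. This is an illustrative assumption, not the scheduling approach developed in the project. The class name, tenant names, HIT identifiers, and priority values are all hypothetical.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class _Entry:
    priority: int                     # lower number = served earlier
    seq: int                          # submission order, breaks ties (FIFO)
    tenant: str = field(compare=False)
    hit_id: str = field(compare=False)


class HitScheduler:
    """Hypothetical scheduler that hands out HITs by priority, then by arrival."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def submit(self, tenant, hit_id, priority):
        entry = _Entry(priority, next(self._counter), tenant, hit_id)
        heapq.heappush(self._heap, entry)

    def next_hit(self):
        """Return (tenant, hit_id) for the next HIT, or None if the queue is empty."""
        if not self._heap:
            return None
        entry = heapq.heappop(self._heap)
        return entry.tenant, entry.hit_id


def demo_schedule():
    scheduler = HitScheduler()
    scheduler.submit("marketing", "label-images-1", priority=2)
    scheduler.submit("legal", "check-contract-1", priority=1)
    scheduler.submit("marketing", "label-images-2", priority=2)
    order = []
    while (hit := scheduler.next_hit()) is not None:
        order.append(hit)
    return order


order = demo_schedule()
```

A strict priority queue like this can starve low-priority tenants, so a real system would likely need some fairness mechanism on top of it.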

Crowdsourcing as career path.
The crowdsourcing market has grown rapidly over the last few years, doubling in size over the last two years. That said, crowdsourcing is still a highly unregulated market. In the longer term, the more efficient and effective use of crowdsourcing that will result from this project will support the creation of better conditions for crowd workers, including the possible definition of crowd work as a profession, which is already a reality in developing countries such as India.
 
Description The BetterCrowd research goals include the improvement of efficiency and effectiveness of current Human Computation techniques making it possible to deal with high volume and velocity of data.

- What were the most significant achievements from the award?

We successfully achieved the planned goals as demonstrated by the reported scientific publications.
We tackled efficiency challenges of Human Computation by proposing and evaluating novel scheduling approaches for microtasks in a crowdsourcing platform. This work was published as a full paper in the Proceedings of the 25th International Conference on World Wide Web, WWW 2016.
We tackled effectiveness challenges of Human Computation showing how the introduction of time limits to complete crowdsourcing microtasks can significantly improve the quality of the produced data. This work was published as a full paper in the 4th AAAI Conference on Human Computation and Crowdsourcing (HCOMP 2016).
We additionally studied the concept of task complexity in crowdsourcing. Our method makes it possible to measure the complexity of a given task. This can be used, for example, to set appropriate rewards for crowdsourced tasks. This work was published as a full paper in the 4th AAAI Conference on Human Computation and Crowdsourcing (HCOMP 2016).
We also looked at the effect of the work environment on crowd work, observing that device and internet connection speed have a significant impact on work performance. This work was published as a full paper in the Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT) and presented at The ACM International Joint Conference on Pervasive and Ubiquitous Computing (UBICOMP 2017). Maui, Hawaii, September 2017.
We looked at attack schemes against crowdsourcing tasks. This work was presented at the 2017 Workshop on Hybrid Human-Machine Computing (HHMC 2017). Guildford, UK, September 2017.
In the final part of the project, we looked at agreement in crowdsourcing. We first measured the impact of worker agreement in crowdsourcing. This work was published as a full paper in The 3rd ACM International Conference on the Theory of Information Retrieval (ICTIR 2017). Amsterdam, The Netherlands, October 2017. We then proposed a new agreement measure for crowdsourcing. This work was published as a full paper in the 5th AAAI Conference on Human Computation and Crowdsourcing (HCOMP 2017). Quebec City, Canada, October 2017.
We also applied Human Computation techniques for the generation of linguistic datasets that can be used to train and test supervised machine learning models. This work was published in the 20th International Conference on Asian Language Processing (IALP 2016).
Finally, we published a short book providing an overview of and introduction to the field of hybrid human-machine information systems. This work has been published in the collection Foundations and Trends in Web Science Vol. 7: No. 1, pp 1-87. 2017.

- To what extent were the award objectives met? If you can, briefly explain why any key objectives were not met.

The objective of improving the efficiency and effectiveness of current Human Computation techniques has been met.
Exploitation Route Our findings on how to make Human Computation more efficient and more effective can be used by academics who use crowdsourcing as a research method as well as by organisations working in the data industry which need scalable manual annotation of data. As an example, we have been invited by Facebook to explain how to obtain high quality data by means of crowdsourcing and by Accenture to explain how to combine big data processing and crowdsourcing.
Sectors Creative Economy,Digital/Communication/Information Technologies (including Software)

 
Description Other than the academic impact demonstrated by the published research, I was invited to give a talk on crowdsourcing quality at the Facebook London office in the context of an internal summit about the Facebook content monitoring system. I have also been invited by Accenture Latvia to give a talk about my research work in the framework of an ACM Distinguished Speaker lecture. The talk was live streamed with more than 1500 views. This research has led to a Facebook Research grant on using the published techniques in the context of crowdsourcing for online content moderation. This research has led to a Meta AI grant on using the published techniques in the context of behaviour tracking in human annotation tasks. This research has led to a Google AI grant on using the published techniques in the context of an annotation task allocation research project.
Sector Digital/Communication/Information Technologies (including Software),Environment
Impact Types Cultural,Economic

 
Description Invited talk at Facebook about quality assurance in crowdsourcing
Geographic Reach Multiple continents/international 
Policy Influence Type Influenced training of practitioners or researchers
 
Description ELIAS research network programme - Science meetings
Amount € 7,500 (EUR)
Funding ID 5917 
Organisation European Science Foundation (ESF) 
Sector Charity/Non Profit
Country France
Start 08/2016 
End 08/2016
 
Description H2020-ICT-14-2016 topic Big Data PPP: cross-sectorial and cross-lingual data integration and experimentation
Amount € 1,699,324 (EUR)
Funding ID 732328 
Organisation European Commission 
Sector Public
Country European Union (EU)
Start 01/2017 
End 12/2019
 
Title ModOp: A Javascript tool to help crowdsourcing form design. 
Description This JavaScript-based tool highlights potential problems with crowdsourcing task designs so that the designers can fix them before crowdsourcing the task. 
Type Of Technology Webtool/Application 
Year Produced 2016 
Impact
URL https://github.com/AlessandroChecco/ModOp
 
Description Public engagement talk in the context of the British Science Week 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Public/other audiences
Results and Impact 50 people from the Sheffield area attended my public talk "The Gig Economy: Challenges and Opportunities". In the talk I discussed how the rise of human computation can be seen as a new employment opportunity but also comes with risks concerning social security, minimum wages, and other issues. The audience responded with interest in our work and looks forward to the results of our research, which aims to create a better work environment in on-line crowdsourcing platforms.
Year(s) Of Engagement Activity 2017
URL http://www.scienceweeksy.org.uk/event/202
 
Description accenture 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Industry/Business
Results and Impact "The Power of Big Data" - ACM Distinguished Speaker talk at Accenture Latvia, 2017.
Year(s) Of Engagement Activity 2017
 
Description adc school 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact "Crowdsourcing for Data Management", invited talk at the PhD School of the Australasian Database Conference (ADC) 2017, Brisbane, 2017.
Year(s) Of Engagement Activity 2017
URL http://adc-conferences.org.au/adc2017/phdschool.html
 
Description huml iswc 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Industry/Business
Results and Impact Co-Organised workshop at the ISWC 2018 conference titled "the second international workshop on Augmenting Intelligence with Humans-in-the-Loop"
Year(s) Of Engagement Activity 2018
URL https://humlworkshop.github.io/HumL-ISWC2018/
 
Description huml www 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Industry/Business
Results and Impact Co-Organised workshop at the WWW 2018 conference titled "the first international workshop on Augmenting Intelligence with Humans-in-the-Loop"
Year(s) Of Engagement Activity 2018
URL https://humlworkshop.github.io/HumL-WWW2018/
 
Description talk dtgs 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact "The Power of Big Data" - ACM Distinguished Speaker talk as keynote at the Second International "Digital Transformation & Global Society" Conference (DTGS'17), St Petersburg, 2017.
Year(s) Of Engagement Activity 2017
URL http://dtgs-conference.org/