BetterCrowd: Human Computation for Big Data
Lead Research Organisation:
University of Sheffield
Department Name: Information School
Abstract
In the last few years we have seen a rapid increase in available data. Digitization has become pervasive. This has led to a data deluge that leaves many organizations unable to cope with such large amounts of messy data. Moreover, because of the large number of content producers and the variety of formats, data is not always easy for machines to process due to its uneven quality and the presence of bias. Thus, in the current data-driven economy, organizations that can effectively analyze data at scale and use it as decision-support infrastructure at the executive level will gain a key competitive advantage. To deal with the current data deluge, in the BetterCrowd project I will define and evaluate Human Computation methods to improve both the effectiveness and efficiency of currently available hybrid human-machine systems.
Human Computation (HC) is a game-changing paradigm that systematically exploits human intelligence at scale to improve purely machine-based data management systems (see, for example, CrowdDB [13]). This is often achieved by means of crowdsourcing, that is, outsourcing certain tasks from the machine to a crowd of human individuals who perform short tasks (also known as Human Intelligence Tasks, or HITs) that are simple for humans but still difficult for machines (e.g., understanding the content of a picture or sarcasm in text). Involving humans in the computation process is a fundamental scientific challenge that requires obtaining the best from human abilities and effectively embedding them into traditional computational systems. The challenges involved in the use of HC concern both its efficiency (i.e., humans are naturally slower than machines at information processing) and its effectiveness (i.e., while machines compute deterministically, human behavior may be unpredictable and possibly malicious).
The project is composed of two main parts. We will first look at how to improve crowdsourcing effectiveness by proposing novel techniques to detect malicious workers in crowdsourcing platforms. In the second part, we will make HC techniques scale so that they can be applied to larger volumes of data, focusing on scheduling tasks to the crowd (WP2).
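As an illustration of the effectiveness challenge, a common baseline for spotting potentially malicious workers (this is a generic sketch, not the novel technique the project proposes; the function name and threshold are illustrative) is to flag workers whose answers rarely agree with the per-HIT majority vote:

```python
from collections import Counter, defaultdict

def flag_suspicious_workers(answers, threshold=0.5):
    """Flag workers who agree with the majority vote on fewer than
    `threshold` of the HITs they completed.

    `answers` maps (worker_id, hit_id) -> label.
    """
    # Collect all labels per HIT and compute the majority label.
    per_hit = defaultdict(list)
    for (worker, hit), label in answers.items():
        per_hit[hit].append(label)
    majority = {hit: Counter(labels).most_common(1)[0][0]
                for hit, labels in per_hit.items()}

    # Per-worker agreement rate with the majority label.
    agree = defaultdict(lambda: [0, 0])  # worker -> [agreements, total]
    for (worker, hit), label in answers.items():
        agree[worker][1] += 1
        if label == majority[hit]:
            agree[worker][0] += 1
    return {w for w, (a, n) in agree.items() if n and a / n < threshold}
```

Majority-vote agreement is a crude signal (it penalizes honest minority opinions on ambiguous HITs), which is precisely why more refined detection techniques are needed.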
Planned Impact
Other than academic beneficiaries, this research will impact commercial organizations and people involved in crowdsourcing activities.
Big Data Analytics Market.
The scale of data currently being produced by large organizations requires novel ways of managing and, most importantly, analyzing such enormous amounts of data in order to produce value for data consumers (e.g., company customers, employees, or governmental organization clients). In this context, high data quality is critical. The techniques developed within the scope of this project for efficient and effective human computation can be used to create better Big Data solutions and products in coordination with the analytics platforms used within large-scale organizations. An example of industry use of hybrid human-machine techniques by means of crowdsourcing is already in place at Twitter, where new trending topics and acronyms are detected in real time by a mix of machine-based stream processing approaches and crowdsourcing on Amazon Mechanical Turk. The outcome of the BetterCrowd project has the potential to impact crowd-based Big Data solutions by optimizing requests to the crowd and improving the overall output quality.
Enterprise crowdsourcing.
In the enterprise domain, large companies (e.g., IBM, Microsoft, VeriSign) have already started to run in-house crowdsourcing: employees of the company form the crowd to which data quality HITs are sent. Crowdsourcing in this context differs in aspects such as worker reputation and incentives. However, research findings on scalability aspects will be directly applicable to this way of leveraging knowledge workers within companies. Scheduling HITs to a crowd for in-house crowdsourcing is extremely important, as this can be considered a single multi-tenant system with jobs run at different priorities.
Crowdsourcing as career path.
The crowdsourcing market has seen exponential growth over the last few years, with the market doubling over the last two years alone. That said, crowdsourcing remains a largely unregulated market. In the longer term, the more efficient and effective use of crowdsourcing that will result from this project will support the creation of better working conditions for the crowd, up to and including the definition of crowd work as a profession, which is already a reality in developing countries such as India.
Organisations
People |
Gianluca Demartini (Principal Investigator) |
Publications
Catasta M
(2017)
An Introduction to Hybrid Human-Machine Information Systems
in Foundations and Trends® in Web Science
Gadiraju U
(2017)
Modus Operandi of Crowd Workers: The Invisible Role of Microtask Work Environments
in Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies
Maddalena E
(2017)
Considering Assessor Agreement in IR Evaluation
Yang J
(2016)
Modeling Task Complexity in Crowdsourcing
Description | The BetterCrowd research goals include the improvement of the efficiency and effectiveness of current Human Computation techniques, making it possible to deal with high volume and velocity of data. - What were the most significant achievements from the award? We successfully achieved the planned goals, as demonstrated by the reported scientific publications. We tackled the efficiency challenges of Human Computation by proposing and evaluating novel scheduling approaches for microtasks in a crowdsourcing platform. This work was published as a full paper in the Proceedings of the 25th International Conference on World Wide Web (WWW 2016). We tackled the effectiveness challenges of Human Computation by showing how the introduction of time limits to complete crowdsourcing microtasks can significantly improve the quality of the produced data. This work was published as a full paper at the 4th AAAI Conference on Human Computation and Crowdsourcing (HCOMP 2016). We additionally studied the concept of task complexity in crowdsourcing. Our method allows measuring the complexity of a given task, which can be used, for example, to set appropriate rewards for crowdsourced tasks. This work was also published as a full paper at HCOMP 2016. We also looked at the effect of the work environment on crowd work, observing that device and internet connection speed have a significant impact on work performance. This work was published as a full paper in the Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT) and presented at the ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp 2017), Maui, Hawaii, September 2017. We looked at attack schemes against crowdsourcing tasks. This work was presented at the 2017 Workshop on Hybrid Human-Machine Computing (HHMC 2017), Guildford, UK, September 2017. In the final part of the project, we looked at agreement in crowdsourcing.
We first measured the impact of worker agreement in crowdsourcing. This work was published as a full paper at the 3rd ACM International Conference on the Theory of Information Retrieval (ICTIR 2017), Amsterdam, The Netherlands, October 2017. We then proposed a new agreement measure for crowdsourcing. This work was published as a full paper at the 5th AAAI Conference on Human Computation and Crowdsourcing (HCOMP 2017), Quebec City, Canada, October 2017. We also applied Human Computation techniques to the generation of linguistic datasets that can be used to train and test supervised machine learning models. This work was published at the 20th International Conference on Asian Language Processing (IALP 2016). Finally, we published a short book providing an overview of and introduction to the field of hybrid human-machine information systems. This work has been published in the collection Foundations and Trends in Web Science, Vol. 7, No. 1, pp. 1-87, 2017. - To what extent were the award objectives met? If you can, briefly explain why any key objectives were not met. The objective of obtaining improvements in the efficiency and effectiveness of current Human Computation techniques has been met. |
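To make the agreement work concrete, a minimal baseline statistic can be computed as follows (average pairwise agreement per item; this is a standard illustration, not the new measure proposed in the HCOMP 2017 paper, and the function name is made up):

```python
from itertools import combinations

def pairwise_agreement(judgments):
    """Average pairwise agreement over items.

    `judgments` maps item_id -> list of labels from different workers.
    For each item, the fraction of worker pairs giving the same label
    is computed; the function returns the mean over all items.
    """
    scores = []
    for labels in judgments.values():
        pairs = list(combinations(labels, 2))
        if pairs:
            scores.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(scores) / len(scores) if scores else 0.0
```

Raw pairwise agreement does not correct for chance agreement, which is one motivation for proposing dedicated agreement measures for crowdsourced labels.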
Exploitation Route | Our findings on how to make Human Computation more efficient and more effective can be used by academics who use crowdsourcing as a research method, as well as by organisations in the data industry that need scalable manual annotation of data. As an example, we have been invited by Facebook to explain how to obtain high-quality data by means of crowdsourcing, and by Accenture to explain how to combine big data processing and crowdsourcing. |
Sectors | Creative Economy,Digital/Communication/Information Technologies (including Software) |
Description | Beyond the academic impact demonstrated by the published research, I was invited to give a talk on crowdsourcing quality at the Facebook London office in the context of an internal summit about the Facebook content monitoring system. I have also been invited by Accenture Latvia to give a talk about my research work in the framework of an ACM Distinguished Speaker lecture. The talk was live streamed with more than 1,500 views. This research has led to a Facebook Research grant on using the published techniques in the context of crowdsourcing for online content moderation. It has also led to a Meta AI grant on using the published techniques in the context of behaviour tracking in human annotation tasks, and to a Google AI grant on using the published techniques in the context of an annotation task allocation research project. |
Sector | Digital/Communication/Information Technologies (including Software),Environment |
Impact Types | Cultural,Economic |
Description | Invited talk at Facebook about quality assurance in crowdsourcing |
Geographic Reach | Multiple continents/international |
Policy Influence Type | Influenced training of practitioners or researchers |
Description | ELIAS research network programme - Science meetings |
Amount | € 7,500 (EUR) |
Funding ID | 5917 |
Organisation | European Science Foundation (ESF) |
Sector | Charity/Non Profit |
Country | France |
Start | 08/2016 |
End | 08/2016 |
Description | H2020-ICT-14-2016 topic Big Data PPP: cross-sectorial and cross-lingual data integration and experimentation |
Amount | € 1,699,324 (EUR) |
Funding ID | 732328 |
Organisation | European Commission |
Sector | Public |
Country | European Union (EU) |
Start | 01/2017 |
End | 12/2019 |
Title | ModOp: A Javascript tool to help crowdsourcing form design. |
Description | This Javascript-based tool highlights potential problems with crowdsourcing task designs so that designers can fix them before crowdsourcing the task. |
Type Of Technology | Webtool/Application |
Year Produced | 2016 |
Impact | - |
URL | https://github.com/AlessandroChecco/ModOp |
Description | Public engagement talk in the context of the British Science Week |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Regional |
Primary Audience | Public/other audiences |
Results and Impact | 50 people from the Sheffield area attended my public talk "The Gig Economy: Challenges and Opportunities". In the talk I discussed how the rise of human computation can be seen as a new employment opportunity but also comes with risks regarding social security, minimum wages, and other protections. The audience responded with interest in our work and looked forward to the results of our research aiming at creating better working environments in online crowdsourcing platforms. |
Year(s) Of Engagement Activity | 2017 |
URL | http://www.scienceweeksy.org.uk/event/202 |
Description | accenture |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Industry/Business |
Results and Impact | "The Power of Big Data" - ACM Distinguished Speaker talk at Accenture Latvia, 2017. |
Year(s) Of Engagement Activity | 2017 |
Description | adc school |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Postgraduate students |
Results and Impact | "Crowdsourcing for Data Management", invited talk at the PhD School of the Australasian Database Conference (ADC) 2017, Brisbane, 2017. |
Year(s) Of Engagement Activity | 2017 |
URL | http://adc-conferences.org.au/adc2017/phdschool.html |
Description | huml iswc |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Industry/Business |
Results and Impact | Co-organised a workshop at the ISWC 2018 conference titled "The Second International Workshop on Augmenting Intelligence with Humans-in-the-Loop". |
Year(s) Of Engagement Activity | 2018 |
URL | https://humlworkshop.github.io/HumL-ISWC2018/ |
Description | huml www |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Industry/Business |
Results and Impact | Co-organised a workshop at the WWW 2018 conference titled "The First International Workshop on Augmenting Intelligence with Humans-in-the-Loop". |
Year(s) Of Engagement Activity | 2018 |
URL | https://humlworkshop.github.io/HumL-WWW2018/ |
Description | talk dtgs |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | "The Power of Big Data" - ACM Distinguished Speaker talk as keynote at the Second International "Digital Transformation & Global Society" Conference (DTGS'17), St Petersburg, 2017. |
Year(s) Of Engagement Activity | 2017 |
URL | http://dtgs-conference.org/ |