BetterCrowd: Human Computation for Big Data
Lead Research Organisation:
University of Sheffield
Department Name: Information School
Abstract
In the last few years we have seen a rapid increase in available data. Digitization has become pervasive. This has led to a data deluge that leaves many organizations unable to cope with such large amounts of messy data. Moreover, because of the large number of content producers and the variety of formats, data is not always easy for machines to process due to its uneven quality and the presence of bias. Thus, in the current data-driven economy, organizations that can effectively analyze data at scale and use it as decision-support infrastructure at the executive level will gain a key competitive advantage. To deal with the current data deluge, in the BetterCrowd project I will define and evaluate Human Computation methods to improve both the effectiveness and efficiency of currently available hybrid human-machine systems.
Human Computation (HC) is a game-changing paradigm that systematically exploits human intelligence at scale to improve purely machine-based data management systems (see, for example, CrowdDB [13]). This is often achieved by means of crowdsourcing, that is, outsourcing certain tasks from the machine to a crowd of human individuals who perform short tasks (also known as Human Intelligence Tasks, or HITs) that are simple for humans but still difficult for machines (e.g., understanding the content of a picture or sarcasm in text). Involving humans in the computation process is a fundamental scientific challenge that requires obtaining the best from human abilities and effectively embedding them into traditional computational systems. The challenges involved in the use of HC concern both its efficiency (i.e., humans are naturally slower than machines at information processing) and its effectiveness (i.e., while machines compute deterministically, human behavior may be unpredictable and possibly malicious).
The project is composed of two main parts. We will first look at how to improve crowdsourcing effectiveness by proposing novel techniques to detect malicious workers in crowdsourcing platforms. In the second part, we will make HC techniques scale so that they can be applied to larger volumes of data, focusing on scheduling tasks to the crowd (WP2).
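As an illustration of the effectiveness challenge, a common baseline for spotting potentially malicious workers (this is a generic sketch, not the novel technique the project proposes; the function name and threshold are illustrative) is to flag workers whose answers rarely agree with the per-HIT majority vote:

```python
from collections import Counter, defaultdict

def flag_suspicious_workers(answers, threshold=0.5):
    """Flag workers who agree with the majority vote on fewer than
    `threshold` of the HITs they completed.

    `answers` maps (worker_id, hit_id) -> label.
    """
    # Collect all labels per HIT and compute the majority label.
    per_hit = defaultdict(list)
    for (worker, hit), label in answers.items():
        per_hit[hit].append(label)
    majority = {hit: Counter(labels).most_common(1)[0][0]
                for hit, labels in per_hit.items()}

    # Per-worker agreement rate with the majority label.
    agree = defaultdict(lambda: [0, 0])  # worker -> [agreements, total]
    for (worker, hit), label in answers.items():
        agree[worker][1] += 1
        if label == majority[hit]:
            agree[worker][0] += 1
    return {w for w, (a, n) in agree.items() if n and a / n < threshold}
```

Majority-vote agreement is a crude signal (it penalizes honest minority opinions on ambiguous HITs), which is precisely why more refined detection techniques are needed.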
Planned Impact
Other than academic beneficiaries, this research will impact commercial organizations and people involved in crowdsourcing activities.
Big Data Analytics Market.
The scale of data currently being produced by large organizations requires novel ways of managing and, most importantly, analyzing such enormous amounts of data in order to produce value for data consumers (e.g., company customers, employees, or governmental organization clients). In this context, high data quality is critical. The techniques developed within the scope of this project for efficient and effective human computation can be used to create better Big Data solutions and products in coordination with the analytics platforms used within large-scale organizations. An example of industry use of hybrid human-machine techniques by means of crowdsourcing is already in place at Twitter, where new trending topics and acronyms are detected in real time by a mix of machine-based stream processing approaches and crowdsourcing on Amazon Mechanical Turk. The outcome of the BetterCrowd project has the potential to impact crowd-based Big Data solutions by optimizing requests to the crowd and improving the overall output quality.
Enterprise crowdsourcing.
In the enterprise domain, large companies (e.g., IBM, Microsoft, VeriSign) have already started to run in-house crowdsourcing: employees of the company form the crowd to which data quality HITs are sent. Crowdsourcing in this context differs in aspects such as worker reputation and incentives. However, research findings on scalability aspects will be directly applicable to this way of leveraging knowledge workers within companies. Scheduling HITs to a crowd for in-house crowdsourcing is extremely important, as this can be considered a single multi-tenant system with jobs run at different priorities.
Crowdsourcing as career path.
The crowdsourcing market has seen exponential growth over the last few years, with the market doubling over the last two years alone. That said, crowdsourcing remains a largely unregulated market. In the longer term, the more efficient and effective use of crowdsourcing that will result from this project will support the creation of better working conditions for the crowd, up to and including the definition of crowd work as a profession, which is already a reality in developing countries such as India.
Organisations
People |
Gianluca Demartini (Principal Investigator) |
Publications
Catasta M
(2017)
An Introduction to Hybrid Human-Machine Information Systems
in Foundations and Trends® in Web Science
Gadiraju U
(2017)
Modus Operandi of Crowd Workers: The Invisible Role of Microtask Work Environments
in Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies
Maddalena E
(2017)
Considering Assessor Agreement in IR Evaluation
Yang J
(2016)
Modeling Task Complexity in Crowdsourcing
Description | The BetterCrowd research goals include the improvement of the efficiency and effectiveness of current Human Computation techniques, making it possible to deal with high volume and velocity of data. - What were the most significant achievements from the award? We successfully achieved the planned goals, as demonstrated by the reported scientific publications. We tackled the efficiency challenges of Human Computation by proposing and evaluating novel scheduling approaches for microtasks in a crowdsourcing platform. This work was published as a full paper in the Proceedings of the 25th International Conference on World Wide Web (WWW 2016). We tackled the effectiveness challenges of Human Computation by showing how the introduction of time limits to complete crowdsourcing microtasks can significantly improve the quality of the produced data. This work was published as a full paper at the 4th AAAI Conference on Human Computation and Crowdsourcing (HCOMP 2016). We additionally studied the concept of task complexity in crowdsourcing. Our method allows measuring the complexity of a given task, which can be used, for example, to set appropriate rewards for crowdsourced tasks. This work was also published as a full paper at HCOMP 2016. We also looked at the effect of the work environment on crowd work, observing that device and internet connection speed have a significant impact on work performance. This work was published as a full paper in the Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT) and presented at the ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp 2017), Maui, Hawaii, September 2017. We looked at attack schemes against crowdsourcing tasks. This work was presented at the 2017 Workshop on Hybrid Human-Machine Computing (HHMC 2017), Guildford, UK, September 2017. In the final part of the project, we looked at agreement in crowdsourcing.
We first measured the impact of worker agreement in crowdsourcing. This work was published as a full paper at the 3rd ACM International Conference on the Theory of Information Retrieval (ICTIR 2017), Amsterdam, The Netherlands, October 2017. We then proposed a new agreement measure for crowdsourcing. This work was published as a full paper at the 5th AAAI Conference on Human Computation and Crowdsourcing (HCOMP 2017), Quebec City, Canada, October 2017. We also applied Human Computation techniques to the generation of linguistic datasets that can be used to train and test supervised machine learning models. This work was published at the 20th International Conference on Asian Language Processing (IALP 2016). Finally, we published a short book providing an overview of and introduction to the field of hybrid human-machine information systems. This work has been published in the collection Foundations and Trends in Web Science, Vol. 7, No. 1, pp. 1-87, 2017. - To what extent were the award objectives met? If you can, briefly explain why any key objectives were not met. The objective of obtaining improvements in the efficiency and effectiveness of current Human Computation techniques has been met. |
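To make the agreement work concrete, a minimal baseline statistic can be computed as follows (average pairwise agreement per item; this is a standard illustration, not the new measure proposed in the HCOMP 2017 paper, and the function name is made up):

```python
from itertools import combinations

def pairwise_agreement(judgments):
    """Average pairwise agreement over items.

    `judgments` maps item_id -> list of labels from different workers.
    For each item, the fraction of worker pairs giving the same label
    is computed; the function returns the mean over all items.
    """
    scores = []
    for labels in judgments.values():
        pairs = list(combinations(labels, 2))
        if pairs:
            scores.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(scores) / len(scores) if scores else 0.0
```

Raw pairwise agreement does not correct for chance agreement, which is one motivation for proposing dedicated agreement measures for crowdsourced labels.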
Exploitation Route | Our findings on how to make Human Computation more efficient and more effective can be used by academics who use crowdsourcing as a research method, as well as by organisations in the data industry that need scalable manual annotation of data. As an example, we have been invited by Facebook to explain how to obtain high-quality data by means of crowdsourcing, and by Accenture to explain how to combine big data processing and crowdsourcing. |
Sectors | Creative Economy,Digital/Communication/Information Technologies (including Software) |
Description | Beyond the academic impact demonstrated by the published research, I was invited to give a talk on crowdsourcing quality at the Facebook London office in the context of an internal summit about the Facebook content monitoring system. I have also been invited by Accenture Latvia to give a talk about my research work in the framework of an ACM Distinguished Speaker lecture. The talk was live streamed with more than 1,500 views. This research has led to a Facebook Research grant on using the published techniques in the context of crowdsourcing for online content moderation. It has also led to a Meta AI grant on using the published techniques in the context of behaviour tracking in human annotation tasks, and to a Google AI grant on using the published techniques in the context of an annotation task allocation research project. |
Sector | Digital/Communication/Information Technologies (including Software),Environment |
Impact Types | Cultural,Economic |
Description | Invited talk at Facebook about quality assurance in crowdsourcing |
Geographic Reach | Multiple continents/international |
Policy Influence Type | Influenced training of practitioners or researchers |
Description | ELIAS research network programme - Science meetings |
Amount | € 7,500 (EUR) |
Funding ID | 5917 |
Organisation | European Science Foundation (ESF) |
Sector | Charity/Non Profit |
Country | France |
Start | 08/2016 |
End | 08/2016 |
Description | H2020-ICT-14-2016 topic Big Data PPP: cross-sectorial and cross-lingual data integration and experimentation |
Amount | € 1,699,324 (EUR) |
Funding ID | 732328 |
Organisation | European Commission |
Sector | Public |
Country | European Union (EU) |
Start | 01/2017 |
End | 12/2019 |
Title | ModOp: A Javascript tool to help crowdsourcing form design. |
Description | This Javascript-based tool highlights potential problems with crowdsourcing task designs so that designers can fix them before crowdsourcing the task. |
Type Of Technology | Webtool/Application |
Year Produced | 2016 |
Impact | - |
URL | https://github.com/AlessandroChecco/ModOp |
Description | Public engagement talk in the context of the British Science Week |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Regional |
Primary Audience | Public/other audiences |
Results and Impact | 50 people from the Sheffield area attended my public talk "The Gig Economy: Challenges and Opportunities". In the talk I discussed how the rise of human computation can be seen as a new employment opportunity but also comes with risks regarding social security, minimum wages, and other protections. The audience responded with interest in our work and looked forward to the results of our research aiming at creating better working environments in online crowdsourcing platforms. |
Year(s) Of Engagement Activity | 2017 |
URL | http://www.scienceweeksy.org.uk/event/202 |
Description | accenture |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Industry/Business |
Results and Impact | "The Power of Big Data" - ACM Distinguished Speaker talk at Accenture Latvia, 2017. |
Year(s) Of Engagement Activity | 2017 |
Description | adc school |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Postgraduate students |
Results and Impact | "Crowdsourcing for Data Management", invited talk at the PhD School of the Australasian Database Conference (ADC) 2017, Brisbane, 2017. |
Year(s) Of Engagement Activity | 2017 |
URL | http://adc-conferences.org.au/adc2017/phdschool.html |
Description | huml iswc |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Industry/Business |
Results and Impact | Co-organised a workshop at the ISWC 2018 conference titled "The Second International Workshop on Augmenting Intelligence with Humans-in-the-Loop". |
Year(s) Of Engagement Activity | 2018 |
URL | https://humlworkshop.github.io/HumL-ISWC2018/ |
Description | huml www |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Industry/Business |
Results and Impact | Co-organised a workshop at the WWW 2018 conference titled "The First International Workshop on Augmenting Intelligence with Humans-in-the-Loop". |
Year(s) Of Engagement Activity | 2018 |
URL | https://humlworkshop.github.io/HumL-WWW2018/ |
Description | talk dtgs |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | "The Power of Big Data" - ACM Distinguished Speaker talk as keynote at the Second International "Digital Transformation & Global Society" Conference (DTGS'17), St Petersburg, 2017. |
Year(s) Of Engagement Activity | 2017 |
URL | http://dtgs-conference.org/ |