uComp: Embedded Human Computation for Knowledge Extraction and Evaluation

Lead Research Organisation: University of Sheffield
Department Name: Computer Science

Abstract

The rapid growth and fragmented character of social media and publicly available structured data challenge established approaches to knowledge extraction. Many algorithms fail when they encounter noisy, multilingual and often contradictory input. Efforts to increase the reliability and scalability of these algorithms face a lack of suitable training data and gold standards. Given that humans excel at interpreting contradictory and context-dependent language data, the uComp project will address these shortcomings by merging collective human intelligence and automated methods in a symbiotic fashion. The project will build upon the emerging field of Human Computation (HC) in the tradition of games with a purpose and crowdsourcing marketplaces. It will advance the field of Web Science by developing a scalable and generic HC framework for knowledge extraction and evaluation, delegating the most challenging tasks to large communities of users and continuously learning from their input to optimise automated methods as part of an iterative process. A major contribution is the proposed foundational research on Embedded Human Computation (EHC), which will advance and integrate the currently fragmented research on human and machine computation. EHC goes beyond mere data collection and embeds the HC paradigm into adaptive knowledge extraction workflows. An open evaluation campaign will validate the accuracy and scalability of EHC in acquiring factual and affective knowledge. In addition to novel evaluation methods, uComp will also provide shared datasets and benchmark the EHC approach against established knowledge processing algorithms.

While the uComp methods will be kept generic so that they can be evaluated across domains, climate change was chosen as the main use case because of its challenging nature: it is subject to changing and conflicting interpretations. Active collaboration with international organisations (EEA, NOAA, NASA) will increase the project's visibility and promote the adoption of the EHC paradigm among a wide range of stakeholders.

Planned Impact

The uComp project aims to harness the power of Human Computation (HC) for scaling up and improving the accuracy of deep knowledge acquisition from noisy, multilingual data. It spans two novel ICT disciplines: Human Computation and Web Science. The latter is one of CHIST-ERA's target research areas. The 2010 FET Consultation also identifies Web Science as a key emerging discipline, which requires breakthrough foundational research and where academic impact could be significant. As stated in the report, harnessing collective intelligence comes with "large and novel problems as to have genuinely revolutionary potential". uComp aims to tackle these new challenges, in order to deliver the necessary fundamental understanding and computational methods, and to ultimately enable innovative HC-based digital applications. The project will thus directly contribute to the target impacts of FET Objective ICT-2011.9.1: Challenging current Thinking, through an "ambitious proof-of-concept and its supporting scientific foundation, where novelty comes from new, high-risk ideas rather than from the refinement of current ICT approaches".

Another major impact area comes from uComp's application domain, i.e., climate change, which was chosen for its challenging nature in line with the CHIST-ERA call, and the match with the JPI Climate strategic research agenda. Joint Programming Initiatives (JPI) implement the European Research Area (ERA). Twelve member states have taken up this approach and promote a new JPI on "Connecting Climate Knowledge for Europe". JPI Climate identifies climate change as a complex reality that affects European society at large, and calls for knowledge-based information and services to respond to stakeholder needs. The active collaboration with the European Environment Agency (EEA), the British Library, the National Oceanic and Atmospheric Administration (NOAA) and the NASA Ames Research Center will ensure a high international visibility and societal impact, and the widespread adoption of EHC methods among participating scientists. Thereby uComp will support stakeholder collaboration and engage citizens in scientific research (citizen science).

At UK level, the British Library has a Memorandum of Understanding with the Living With Environmental Change (LWEC) programme, a partnership comprising the 22 major public sector funders and producers of environmental information, to work together to improve access to and discovery of environmental information. Through this relationship, results from this work will have a significant, wide impact and may be applied by a wide range of stakeholders well beyond those involved in this project.

The project will also carry out a detailed investigation of the commercial exploitation potential of uComp as an enabler of active business ecosystems around the EHC concept. uComp will devise strategies to attract research and commercial partners by aligning the scientific progress with those partners' specific needs. We will also undertake a continuous 'market watch' to analyse and explore different business opportunities, which will be turned into specific actions in the exploitation plan, with a view to commercialising some of the results after the end of the project. Activities towards standardisation of the project results and collaboration with other projects and relevant initiatives will also be explored and coordinated, so that uComp can have the best possible impact in both the scientific and commercial communities.
 
Description This grant ended long ago and the PI has left the University of Sheffield. There are no new findings to report.
Exploitation Route This grant ended long ago and the PI has left the University of Sheffield. There are no new findings to report.
Sectors Digital/Communication/Information Technologies (including Software)

 
Description
- Collaboration with the National Oceanic and Atmospheric Administration (NOAA) and a follow-up project with the United Nations Environment Programme (UNEP)
- The GATE crowdsourcing plugin has now been taken up for use, maintenance, and, if needed, further development as part of the following research projects:
  - SoBigData - an H2020 research infrastructure project on social media analytics, where human computation will play an important role;
  - DecarboNet - an FP7 project, which is using the uComp GATE crowdsourcing plugin to create gold-standard datasets for evaluation of entity linking, environmental term extraction, and opinion mining;
  - COMRADES - an H2020 CAPS project, which will use the uComp GATE crowdsourcing plugin to create much-needed gold-standard datasets for evaluating algorithms for analysing disaster response social media content.
First Year Of Impact 2015
Sector Digital/Communication/Information Technologies (including Software)
Impact Types Societal,Economic,Policy & public services

 
Description Collaboration with University of Washington 
Organisation University of Washington
Country United States 
Sector Academic/University 
PI Contribution Collaboration around part-of-speech tagging for Twitter content
Start Year 2013
 
Title A Python implementation of spectral association 
Description tbc 
Type Of Technology Webtool/Application 
Year Produced 2014 
Impact tbc URL not working? 
URL http://wwwai.wu-wien.ac.at/~wohlg/spectral_association
 
Title A named entity linking dataset for Twitter 
Description tbc 
Type Of Technology Software 
Year Produced 2014 
Impact tbc 
URL http://www.derczynski.com/sheffield/resources/ipm_nel.tar.gz
 
Title A part-of-speech tagger for user-generated noisy text 
Description tbc 
Type Of Technology Webtool/Application 
Year Produced 2014 
Impact tbc 
URL https://gate.ac.uk/wiki/twitter-postagger.html
 
Title Basic command line Tweet retriever, a simple script to retrieve Tweets by their identifiers 
Description tbc 
Type Of Technology Software 
Year Produced 2014 
Impact tbc 
URL https://deft.limsi.fr/2015/tools/tweet_basic-retriever.zip
 
Title Brown clusters over multiple text types and with various hyperparameter variations 
Description tbc 
Type Of Technology Software 
Year Produced 2014 
Impact tbc 
URL https://s3-eu-west-1.amazonaws.com/downloads.gate.ac.uk/resources/derczynski-chester-boegh-brownpath...
 
Title Climate Challenge 
Description The game combines practical steps to reduce one's carbon footprint with language resource acquisition tasks and questions about future climate-related conditions that we cannot answer today, but will be able to answer in the future. 
Type Of Technology Webtool/Application 
Year Produced 2015 
Impact tbc 
URL http://www.ecoresearch.net/climate-challenge
 
Title Code for generalised Brown clustering, an unsupervised technique for finding similar words 
Description tbc 
Type Of Technology Software 
Year Produced 2014 
Impact tbc 
URL https://github.com/sean-chester/generalised-brown
 
Title DEFT 2015 Training and Tests Corpora with manual annotation of opinions, sentiments and emotions at various granularity levels of Tweets in French about climate change 
Description tbc 
Type Of Technology Software 
Year Produced 2015 
Impact tbc URL needs password? 
URL https://deft.limsi.fr/2015/corpus/train/TRAIN_TWEETS_ID-03042015.zip
 
Title DEFT 2015 evaluation toolkit 
Description A GWAP Plugin for the validation of ontologies for the Protege Ontology Editor 
Type Of Technology Software 
Year Produced 2015 
Impact tbc 
URL https://deft.limsi.fr/2015/tools/evaldeft2015_20150513.tar.gz
 
Title DEFT 2015 evaluation toolkit, set of perl programs to compute the DEFT 2015 evaluation measures 
Description tbc 
Type Of Technology Software 
Year Produced 2015 
Impact tbc 
URL https://deft.limsi.fr/2015/tools/evaldeft2015_20150513.tar.gz
 
Title GATE Crowdsourcing plugin 
Description The open-source GATE Crowdsourcing plugin offers infrastructural support for mapping documents to crowdsourcing units and back, as well as automatically generating reusable crowdsourcing interfaces for NLP classification and sequence annotation tasks. 
Type Of Technology Software 
Year Produced 2014 
Open Source License? Yes  
Impact interest from potential users after dissemination activities. 
URL https://gate.ac.uk/wiki/crowdsourcing.html
 
Title GATE Crowdsourcing plugin, including automatic adjudication tools 
Description tbc 
Type Of Technology Webtool/Application 
Year Produced 2014 
Impact tbc 
URL https://gate.ac.uk/wiki/crowdsourcing.html
 
Title Language Quiz 
Description The game combines various language resource acquisition tasks in multiple languages (German, Spanish, Russian, Chinese, and Czech), some of them based on translated versions of existing English-language resources. Initially, the system collects answers from a group of players. The majority opinion is then determined from their answers; it is treated as the correct answer and serves as the basis for awarding game points. 
Type Of Technology Webtool/Application 
Year Produced 2015 
Impact tbc 
URL http://quiz.ucomp.eu
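
The majority-opinion scoring described for the Language Quiz can be illustrated with a minimal Python sketch. This is not the game's actual implementation: the function names, the tie handling (no consensus when the top answers tie) and the points value are all assumptions made for illustration.

```python
from collections import Counter


def majority_answer(answers):
    """Return (answer, share) for the most frequent answer.

    Returns (None, share) when the top answers tie and (None, 0.0) for an
    empty list. Tie handling is an assumption, not taken from the game.
    """
    if not answers:
        return None, 0.0
    ranked = Counter(answers).most_common()
    top_answer, top_count = ranked[0]
    share = top_count / len(answers)
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None, share  # no clear majority opinion
    return top_answer, share


def award_points(player_answers, points_per_match=10):
    """Give points to every player whose answer matches the majority opinion.

    player_answers maps a player name to that player's answer.
    points_per_match is a hypothetical value chosen for this sketch.
    """
    consensus, _ = majority_answer(list(player_answers.values()))
    return {
        player: points_per_match if consensus is not None and answer == consensus else 0
        for player, answer in player_answers.items()
    }


if __name__ == "__main__":
    round_answers = {"ann": "Hund", "bob": "Hund", "cy": "Katze"}
    print(majority_answer(list(round_answers.values())))
    print(award_points(round_answers))
```

In this sketch a tied round awards no points to anyone; a real game might instead wait for more players before scoring.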
 
Title TwitIE: Information extraction system for Twitter data 
Description NLP on social media data is hard. Content is often brief, contains mistakes, lacks context, and is uncurated - very different from the well-formed news text that tools typically operate over. TwitIE is a GATE pipeline for Information Extraction over tweets, one of the noisiest forms of social media text. 
Type Of Technology Software 
Year Produced 2013 
Open Source License? Yes  
Impact user interest after presentations and tutorials. 
URL http://gate.ac.uk/wiki/twitie.html
 
Title a uComp Plugin for Protégé Ontology Editor: The plugin that facilitates the integration of typical crowdsourcing tasks into ontology engineering from within Protégé 
Description tbc 
Type Of Technology Webtool/Application 
Year Produced 2014 
Impact tbc 
URL https://github.com/UcompWu1/Gwap-Protege-Plugin
 
Title eWRT - easy Web Retrieval Toolkit 
Description tbc 
Type Of Technology Webtool/Application 
Year Produced 2014 
Impact tbc 
URL http://www.semanticlab.net/index.php/eWRT
 
Title the TwitIE GATE-based NLP pre-processing application 
Description tbc 
Type Of Technology Webtool/Application 
Year Produced 2014 
Impact tbc 
URL http://gate.ac.uk/wiki/twitie.html
 
Description Crowdsourcing best practices - ISWC 2014 tutorial presentation 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact Many questions and discussions on crowdsourcing workflow and methodology.

Invigorated interest in the presented methodology; recognition of its contribution to best practice.
Year(s) Of Engagement Activity 2014
URL http://www.slideshare.net/martasabou1/crowdsourcing-bestpractices
 
Description The 6th GATE Training Course (3-7 June 2013, Sheffield, UK) 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact The GATE training course contained a module on mining social media, developed within uComp.
The module prompted many reactions afterwards, given the increasing popularity of social media analysis.


Increased interest in this functionality from new and existing GATE users.
International awareness and take-up.
Year(s) Of Engagement Activity 2013
URL http://bit.ly/XcOS0N
 
Description The 7th GATE Training Course: Mining social media content with GATE, 9 - 13 June 2014, Sheffield, UK 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The GATE training course informed a wide audience on the use of GATE for scientific and commercial purposes.
The tutorial included uComp's GATE module for crowdsourcing in text mining.

Following dissemination, the module has been downloaded and used by tutorial participants and other interested parties.
Year(s) Of Engagement Activity 2014
URL https://gate.ac.uk/conferences/fig/fig7.html
 
Description Tutorial: Natural Language Processing for Social Media. K. Bontcheva and L. Derczynski. 26 April 2014. EACL 2014. 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact The tutorial informed interested parties from the relevant disciplines about the application options for social media analysis within GATE.

The tutorial generated interest within the scientific and commercial communities participating in the conference, and has led to re-use and information exchange.
Year(s) Of Engagement Activity 2014
URL http://eacl2014.org/tutorial-social-media