LODIE: Web Scale Information Extraction via Linked Open Data

Lead Research Organisation: University of Sheffield
Department Name: Computer Science

Abstract

The World Wide Web provides access to tens of billions of pages. These pages contain information that is largely unstructured and intended only for human readers, yet we rely on computers "reading" these pages in order to find the information we need. The proposed research intends to develop technologies that radically improve the billions of searches performed every day by fulfilling Tim Berners-Lee's initial vision of a Web whose content is readable by both humans and machines. Such a vision, set aside during the initial development of the Web, has now returned in the form of the Web of Data, or Linked Open Data (LOD), where billions of pieces of information are linked together and made available for automated processing. There is, however, a lack of interconnection between the information in webpages and that in LOD. A number of initiatives, such as RDFa (supported by the W3C) and Microformats (used by schema.org and supported by major search engines), are trying to enable machines to make sense of the information contained in human-readable pages by providing the ability to annotate webpage content with links into LOD.
While the current state of the art in Web Information Extraction (IE) relies on domain-specific training data or generic extraction patterns, the proposed research aims, by leveraging LOD, to develop IE methodologies and technologies that provide pervasive, user-driven, Web-scale information extraction, where the target of the IE is defined by the user's information needs and aimed at the billions of available Web documents covering an unlimited number of domains.
In this research we aim to develop models and algorithms to create a continuum between LOD and the human-readable Web. The approach will utilise the wealth of facts available from LOD and the limited number of pages annotated with RDFa/Microformats to learn to connect unannotated webpage content to the LOD cloud. This will provide the reciprocal advantages of (i) enabling the search of Web pages via the unambiguous LOD instances and concepts, and (ii) extending LOD with the wealth of information available in webpage content.
The key challenge is the development of efficient, Web-scale, semi-supervised, iterative learning methods able to use the initial "seed" data and annotations, generating models which exploit: (i) local and global information regularities (e.g. structured information in tables, as well as page- and site-wide regularities); (ii) the redundancy (or repetition) of information; (iii) any ontological restrictions available in LOD. As the learning methods iterate from known interconnections to infer new connections, they must cope with the massive amount of noise generated by the number and variety of documents, domains and facts involved.
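The iterative loop described above can be sketched as follows. This is a minimal, hypothetical illustration of seed-driven bootstrapping, not the project's actual code: all function names and parameters are invented, and the redundancy/ontology checks are abstracted into a single scoring function.

```python
# Hypothetical sketch of a seed-driven iterative learning loop; all names
# and parameters are illustrative, not LODIE's actual implementation.

def bootstrap(seeds, pages, extract_candidates, score, threshold=0.8,
              max_iterations=5):
    """Iteratively grow a set of facts from initial seed data.

    seeds              -- initial facts taken from LOD / RDFa annotations
    pages              -- the document collection to mine
    extract_candidates -- proposes new facts given the current known facts
                          (e.g. by matching table columns or page layouts)
    score              -- scores a candidate using, e.g., redundancy across
                          pages and consistency with LOD ontological
                          restrictions
    """
    known = set(seeds)
    for _ in range(max_iterations):
        candidates = extract_candidates(known, pages)
        # Keep only candidates confirmed by enough evidence; this is the
        # main defence against the noise that iteration amplifies.
        accepted = {c for c in candidates if score(c, known) >= threshold}
        new = accepted - known
        if not new:  # fixed point: nothing further was learnt
            break
        known |= new
    return known
```

The key design choice is that the loop only ever admits candidates passing the evidence threshold, so noise cannot accumulate unchecked across iterations.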
In addition to publishing the research and its findings, the IE methods developed will be tested on the task of extracting information relevant to schema.org (a task currently promoted by large search engine companies such as Google and Bing) as well as in international public evaluations. As part of these evaluations, the project will generate at least one publicly available, Web-scale IE task (inclusive of corpora, linked resources, etc.) to enable other researchers to compare results.
The project aims to impact the fields of Natural Language Processing, Machine Learning, Information Retrieval and Web and Semantic Technologies by exploring the extraction of information in Web-scale, user-driven tasks. Success in the project will enable new ways of both creating and using LOD, and will provide a paradigm shift in the way information can be retrieved from the Web: away from a reliance on keywords and towards the search and exploration of the concepts and meaning (semantics) embedded in those words.

Planned Impact

Potential beneficiaries of the project results are technology, data and service providers as well as government and citizens.

IE tools providers
The project will advance the state of the art in Information Extraction from Web documents, making it work at Web scale and portable with minimal user effort. Currently most companies focus on limited scale, on intensive porting effort, or on very generic tasks such as named entity recognition (e.g. www.opencalais.com, www.ontotext.com). The project will enable going beyond these limitations. We will generate both know-how and tools, published under a free licence such as MIT, which allows unlimited scientific and commercial reuse. These companies will benefit in the short term, already during the lifetime of the project and through follow-up knowledge transfer projects (e.g. via the TSB or industrially funded projects).

Providers of information-based services
The use of LOD is one focus of research for the main search engines, as attested by the recent launch (May 2011) of schema.org. Schema.org focuses on their immediate needs by (i) asking users to manually annotate their pages and (ii) limiting the annotations to a set of specific domains at the core of the search engine business (e.g. eCommerce). The proposed project will enable search engines not to depend on users' willingness to annotate their pages, and to go beyond the domain limitations of schema.org to cover the whole of LOD. We will provide input both in terms of know-how and open source software.
Similarly, companies mining the social web (e.g. Twitter) for purposes such as emergency response and homeland security (e.g. k-now.co.uk), as well as companies providing specific services on the Web (e.g. price comparison sites), will be able to go beyond current techniques that mainly require manual development of extraction methods. The technology developed will provide them with tools able to adapt efficiently and effectively to their needs, tasks and domains using the wealth of information available on the LOD. These organisations will benefit both during and after the lifetime of the project and through follow-up knowledge transfer projects.

Data publishers
One of the main bottlenecks in publishing data is consistency analysis. While some publishers simply dump their data as-is and expect others to link and clean it, careful (especially professional) publishers care about the correctness and consistency of their data, as well as its coverage. The project will provide measures of consistency and variability for data analysis, and methods to integrate existing data with new data extracted from the Web. Data cleaning tools implementing the consistency measures will be available for use during the project lifetime. For other, more complex applications, follow-up knowledge transfer projects will be organised.

Government and society
IE has applications in, among others, homeland security, military applications, counter-terrorism and emergency response. The ability to identify events and facts at large scale has shown benefits in all those areas. The Web is a huge, largely untapped source of information. As the proposed project will address the identification of facts and events at large scale, it will contribute to a safer society.

Web users and citizens
Finally, consumers and citizens will benefit from the results of the project indirectly, through the availability of new services based on the wide availability of interlinked data. One of the most prominent will be an improved search experience: with a large quantity of quality data integrated into the LOD, information search on the LOD can benefit in terms of both accuracy and coverage. The time frame for this is 1-5 years after the project end, when products based on the developed technology will have been industrialised and their benefits brought to consumers.

Publications

 
Description The project concluded with three key findings.

First, the project confirmed its hypothesis that the large amount of linked open data can be used to train Information Extraction systems, which can then mine useful information from the Web. IE tasks dealing with unstructured, semi-structured and structured data can all exploit linked data for training weakly-supervised systems.

Second, due to its decentralised nature, linked data is noisy: publishers can create redundant datasets described with inconsistent vocabularies, or generate incorrect data by mistake. This problem can be addressed by mapping heterogeneous vocabularies to reduce inconsistency in the data, or by using task-specific training data selection methods. For the first, we have developed data-driven mapping methods used to create ontology patterns, which are ultimately used to define the user's IE task and retrieve training data from the linked open data. For the second, we have developed different methods to select high-quality training data for structured, semi-structured and unstructured IE tasks, based on the principle of selecting candidates that are less likely to be ambiguous.
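The ambiguity-based selection principle can be illustrated with a toy sketch. This is not the project's code: the index structure, threshold and example data are invented for illustration. The idea is simply that a surface form mapping to many different LOD entities is a risky source of training examples.

```python
# Illustrative sketch of "prefer unambiguous candidates" training data
# selection; the index format and threshold are hypothetical.
from collections import defaultdict

def select_training_data(candidates, lod_index, max_senses=1):
    """Keep candidate (surface_form, entity) pairs whose surface form maps
    to at most `max_senses` entities in the linked data index; ambiguous
    forms are more likely to yield noisy training examples."""
    senses = defaultdict(set)
    for form, entity in lod_index:
        senses[form].add(entity)
    return [(form, entity) for form, entity in candidates
            if len(senses[form]) <= max_senses]
```

For example, "Paris" (a city, a town in Texas, a person's name) would be dropped as a training candidate, while a form with a single sense in the index is kept.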

Third, despite the gigantic size of the linked open data cloud, the distribution of data over domains is very unbalanced. The distribution appears to have a long tail, composed of a very large number of items (e.g. entities) that have very little usage (i.e. are linked to other items by few relations). This has made it difficult to distinguish genuinely correct data from errors in the training data selection phase. As a result, our methods may inevitably fail to learn certain useful extraction patterns due to the incompleteness of the training data. However, this also opens up research questions for the future: how do we identify and quantify the long tail? Is there still much to be learnt from the Web for items sitting in the long tail of the distribution? Does the learning method need to change in order to learn more about long-tail items? In other words, can we train models using 'head' items and then use those models to extract information for the 'tail' items? And if not, what should we do to recover the long tail?
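One crude way to quantify the long tail mentioned above is to count how many relation links each entity participates in and measure what share of entities falls below a usage threshold. The sketch below is purely illustrative; the triple format, threshold and data are invented, not taken from the project.

```python
# Toy quantification of the long tail: the fraction of entities that
# appear in fewer than `threshold` triples. Hypothetical example only.
from collections import Counter

def tail_share(triples, threshold=2):
    """Return the fraction of entities appearing in fewer than `threshold`
    triples -- a crude proxy for the size of the long tail."""
    usage = Counter()
    for subj, _pred, obj in triples:
        usage[subj] += 1
        usage[obj] += 1
    tail = [e for e in usage if usage[e] < threshold]
    return len(tail) / len(usage)
```

In a real LOD setting one would expect this share to be large: a few 'head' entities carry most of the links, while most entities have only one or two.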
Exploitation Route We have published papers and open source software.
Sectors Aerospace, Defence and Marine; Digital/Communication/Information Technologies (including Software); Healthcare; Manufacturing, including Industrial Biotechnology; Pharmaceuticals and Medical Biotechnology

URL http://staffwww.dcs.shef.ac.uk/people/F.Ciravegna/Fabio_Ciravegna/Papers.html
 
Description LODIE has attracted considerable interest from industry and has helped to secure additional funding from companies in the form of knowledge transfer. Two companies have used LODIE's technology to create competitive advantage: 1) JustGiving Ltd. (https://home.justgiving.com/), the world's social platform for giving, has used LODIE's large-scale Information Extraction (IE) technology to mine information from their very large in-house customer datasets, and information about charitable organisations and events on the Web. This information is used to profile customers and charities to enhance match-making. The work was undertaken as a 3-month project between October 2014 and January 2015, and the output has been used to rebuild JustGiving's website. 2) Klood Ltd. (https://www.klood.com/), the internet and social marketing company, has adapted LODIE's IE technology to mine specific events from social media in the football domain. This was undertaken in a one-month trial project during November 2015, followed by an 18-month project through Football Whispers Ltd, a company specialising in information and predictions in the field of football transfers. The output was released to Football Whispers users at the beginning of 2016. The company launched their main product using our technology as the main backbone of their transfer prediction engine. Over the following year and a half, we provided services to the company analysing around 70 million messages a month. To our knowledge the company had around 2.5 million unique monthly users and counted Sky Sports and FourFourTwo among its major customers. The company acquired the IP from the University in 2016 for further internal and external exploitation, with plans to port from football to other sports, starting with the American NFL.
First Year Of Impact 2014
Sector Digital/Communication/Information Technologies (including Software),Leisure Activities, including Sports, Recreation and Tourism
Impact Types Economic

 
Description Football Whispers - Mining the Web for Football Player Transfer News
Amount £100,000 (GBP)
Organisation Klood 
Sector Private
Country United Kingdom
Start 11/2015 
End 12/2017
 
Description JustGiving Charity/Cause Information Extraction system
Amount £90,000 (GBP)
Organisation JustGiving 
Sector Private
Country United Kingdom
Start 10/2014 
End 01/2015
 
Description Football Whispers 
Organisation Football Whispers Ltd
Country United Kingdom 
Sector Private 
PI Contribution Football Whispers is a company providing information on rumours about football to both enthusiasts and professionals (e.g. television networks). It is a new venture that has adopted part of the LODIE technologies (and part of the technologies developed in the Randms and Redites EPSRC projects) to analyse millions of messages from social media (e.g. Twitter). They are now online, with thousands of daily visitors to their website.
Collaborator Contribution They have provided strict requirements and pre-existing knowledge about football, as well as access to large volumes of paid-for data.
Impact The output is their own product, which is largely powered by our social media analysis technology. We are now in the process of IP release discussions, also for fields other than football. The IP rights are likely to be worth hundreds of thousands of pounds plus shares in the company.
Start Year 2015
 
Description Just Giving 
Organisation JustGiving
Country United Kingdom 
Sector Private 
PI Contribution We made the LODIE technologies available to them for the analysis of their data.
Collaborator Contribution They provided data and requirements for our research
Impact They have redesigned their web products based on the experience developed with us and our tools.
Start Year 2014
 
Description ESWC Summer School 2014 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact The summer school sparked questions and interesting discussions afterwards.

not aware of any
Year(s) Of Engagement Activity 2014
URL http://www.slideshare.net/isabelleaugenstein/introduction-to-natural-language-processing-for-the-sem...
 
Description Invited talk: Aligning relations on Linked Data 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact The talk sparked questions and in-depth discussion with a number of audience members. Potential collaborations were also discussed.

not aware of any
Year(s) Of Engagement Activity 2013
 
Description Invited talk: Linked Data for Web Scale Information Extraction 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact The talk sparked questions and in-depth discussions with a number of audience members.

not aware of any
Year(s) Of Engagement Activity 2013
URL http://staffwww.dcs.shef.ac.uk/people/A.L.Gentile/AnnalisaWebSite/annalisaRMIT.pdf
 
Description Linked Data for Web Scale Information Extraction 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact The tutorial sparked interesting discussions afterwards.

After the tutorial, some researchers approached us to discuss possible future collaborations.
Year(s) Of Engagement Activity 2013
URL http://oak.dcs.shef.ac.uk/wsie2013/index.html
 
Description Semantic Technologies Coordinator for ESWC2014 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact The main role of the Semantic Technologies Coordinators is to produce accessible Linked Open Data about a conference. ESWC2014 is the European Semantic Web Conference: http://2014.eswc-conferences.org/organizing-committee

Anna Lisa Gentile has been invited to be part of the Semantic Technologies Coordinators for next year's European Semantic Web Conference, ESWC2015: http://2015.eswc-conferences.org/about-eswc-2015/organizing-committee
Year(s) Of Engagement Activity 2014
URL http://2014.eswc-conferences.org/organizing-committee
 
Description Web Scale Information Extraction 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact The tutorial sparked discussion afterwards

After the tutorial, some researchers approached us about possible future collaborations.
Year(s) Of Engagement Activity 2013
URL http://www.ecmlpkdd2013.org/wp-content/uploads/2013/09/Web-Scale-Information-Extraction.pdf