LODIE: Web Scale Information Extraction via Linked Open Data

Lead Research Organisation: University of Sheffield
Department Name: Computer Science

Abstract

The World Wide Web provides access to tens of billions of pages. These pages contain information that is largely unstructured and intended only for human readers, yet we rely on computers "reading" these pages in order to find the information we need. The proposed research intends to develop technologies that radically improve the billions of searches performed every day by fulfilling Tim Berners-Lee's initial vision of a Web whose content is readable by both humans and machines. Such a vision, set aside during the initial development of the Web, has now returned in the form of the Web of Data, or Linked Open Data (LOD), where billions of pieces of information are linked together and made available for automated processing. There is, however, a lack of interconnection between the information in webpages and that in LOD. A number of initiatives, such as RDFa (supported by the W3C) and Microformats (used by schema.org and supported by major search engines), are trying to enable machines to make sense of the information contained in human-readable pages by providing the ability to annotate webpage content with links into LOD.
While the current state of the art in Web Information Extraction (IE) relies on domain-specific training data or generic extraction patterns, the proposed research aims, by leveraging LOD, to develop IE methodologies and technologies that provide pervasive, user-driven, Web-scale information extraction, where the target of the IE is defined by the user's information needs and aimed at the billions of available Web documents covering an unlimited number of domains.
In this research we aim to develop models and algorithms to create a continuum between LOD and the human-readable Web. The approach will utilise the wealth of facts available from LOD and the limited number of pages annotated with RDFa/Microformats to learn to connect unannotated webpage content to the LOD cloud. This will provide the reciprocal advantages of (i) enabling the search of Web pages via the unambiguous LOD instances and concepts, and (ii) extending LOD with the wealth of information available in webpage content.
The key challenge is the development of efficient, Web-scale, semi-supervised, iterative learning methods able to use the initial "seed" data and annotations, generating models which exploit: (i) local and global information regularities (e.g. structured information in tables, as well as page- and site-wide regularities); (ii) the redundancy (or repetition) of information; (iii) any ontological restrictions available in LOD. As the learning methods iterate from known interconnections to infer new connections, they must cope with the massive amount of noise generated by the number and variety of documents, domains and facts involved.
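The iterative loop described above can be sketched as follows. This is a minimal, hypothetical illustration of seed-driven bootstrapping, not the project's actual code: all function names and parameters are invented, and the redundancy/ontology checks are abstracted into a single scoring function.

```python
# Hypothetical sketch of a seed-driven iterative learning loop; all names
# and parameters are illustrative, not LODIE's actual implementation.

def bootstrap(seeds, pages, extract_candidates, score, threshold=0.8,
              max_iterations=5):
    """Iteratively grow a set of facts from initial seed data.

    seeds              -- initial facts taken from LOD / RDFa annotations
    pages              -- the document collection to mine
    extract_candidates -- proposes new facts given the current known facts
                          (e.g. by matching table columns or page layouts)
    score              -- scores a candidate using, e.g., redundancy across
                          pages and consistency with LOD ontological
                          restrictions
    """
    known = set(seeds)
    for _ in range(max_iterations):
        candidates = extract_candidates(known, pages)
        # Keep only candidates confirmed by enough evidence; this is the
        # main defence against the noise that iteration amplifies.
        accepted = {c for c in candidates if score(c, known) >= threshold}
        new = accepted - known
        if not new:  # fixed point: nothing further was learnt
            break
        known |= new
    return known
```

The key design choice is that the loop only ever admits candidates passing the evidence threshold, so noise cannot accumulate unchecked across iterations.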
In addition to publishing the research and its findings, the IE methods developed will be tested on the task of extracting information relevant to schema.org (a task currently promoted by large search engine companies such as Google and Bing) as well as in international public evaluations. As part of these evaluations, the project will generate at least one publicly available, Web-scale IE task (inclusive of corpora, linked resources, etc.) to enable other researchers to compare results.
The project aims to impact the fields of Natural Language Processing, Machine Learning, Information Retrieval and Web and Semantic Technologies by exploring the extraction of information in Web-scale, user-driven tasks. Success in the project will enable new ways of both creating and using LOD, and will provide a paradigm shift in the way information can be retrieved from the Web: away from a reliance on keywords and towards the search and exploration of the concepts and meaning (semantics) embedded in those words.

Planned Impact

Potential beneficiaries of the project results are technology, data and service providers as well as government and citizens.

IE tools providers
The project will advance the state of the art in Information Extraction from Web documents, making it work at Web scale and portable with minimal user effort. Currently most companies focus on limited scale, on intensive porting effort, or on very generic tasks such as named entity recognition (e.g. www.opencalais.com, www.ontotext.com). The project will enable going beyond these limitations. We will generate both know-how and tools, published under a free licence such as MIT, which allows unlimited scientific and commercial reuse. These companies will benefit in the short term, already during the lifetime of the project and through follow-up knowledge transfer projects (e.g. via the TSB or industrially funded projects).

Providers of information-based services
The use of LOD is one focus of research for the main search engines, as attested by the recent launch (May 2011) of schema.org. Schema.org focuses on their immediate needs by (i) asking users to manually annotate their pages and (ii) limiting the annotations to a set of specific domains at the core of the search engine business (e.g. eCommerce). The proposed project will enable search engines not to depend on users' willingness to annotate their pages, and to go beyond the domain limitations of schema.org to cover the whole of LOD. We will provide input both in terms of know-how and open source software.
Similarly, companies mining the social web (e.g. Twitter) for purposes such as emergency response and homeland security (e.g. k-now.co.uk), as well as companies providing specific services on the Web (e.g. price comparison sites), will be able to go beyond current techniques that mainly require manual development of extraction methods. The technology developed will provide them with tools able to adapt efficiently and effectively to their needs, tasks and domains using the wealth of information available on the LOD. These organisations will benefit both during and after the lifetime of the project and through follow-up knowledge transfer projects.

Data publishers
One of the main bottlenecks in publishing data is consistency analysis. While some publishers simply dump their data as-is and expect others to link and clean it, careful (especially professional) publishers care about the correctness and consistency of their data, as well as its coverage. The project will provide measures of consistency and variability for data analysis, and methods to integrate existing data with new data extracted from the Web. Data cleaning tools implementing the consistency measures will be available for use during the project lifetime. For other, more complex applications, follow-up knowledge transfer projects will be organised.

Government and society
IE has applications in, among others, homeland security, military applications, counter-terrorism and emergency response. The ability to identify events and facts at large scale has shown benefits in all those areas. The Web is a huge, largely untapped source of information. As the proposed project will address the identification of facts and events at large scale, it will contribute to a safer society.

Web users and citizens
Finally, consumers and citizens will benefit from the results of the project indirectly, through the availability of new services based on the wide availability of interlinked data. One of the most prominent will be an improved search experience: with a large quantity of quality data integrated into the LOD, information search on the LOD can benefit in terms of both accuracy and coverage. The time frame for this is 1-5 years after the project end, when products based on the developed technology will have been industrialised and their benefits brought to consumers.

Publications

 
Description The project concluded with three key findings.

First, the project confirmed its hypothesis that the large amount of linked open data can be used to train Information Extraction systems, which can then mine useful information from the Web. IE tasks dealing with unstructured, semi-structured and structured data can all exploit linked data for training weakly-supervised systems.

Second, due to its decentralised nature, linked data is noisy: publishers can create redundant datasets described with inconsistent vocabularies, or generate incorrect data by mistake. This problem can be addressed by mapping heterogeneous vocabularies to reduce inconsistency in the data, or by using task-specific training data selection methods. For the first, we have developed data-driven mapping methods used to create ontology patterns, which are ultimately used to define the user's IE task and retrieve training data from the linked open data. For the second, we have developed different methods to select high-quality training data for structured, semi-structured and unstructured IE tasks, based on the principle of selecting candidates that are less likely to be ambiguous.
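The ambiguity-based selection principle can be illustrated with a toy sketch. This is not the project's code: the index structure, threshold and example data are invented for illustration. The idea is simply that a surface form mapping to many different LOD entities is a risky source of training examples.

```python
# Illustrative sketch of "prefer unambiguous candidates" training data
# selection; the index format and threshold are hypothetical.
from collections import defaultdict

def select_training_data(candidates, lod_index, max_senses=1):
    """Keep candidate (surface_form, entity) pairs whose surface form maps
    to at most `max_senses` entities in the linked data index; ambiguous
    forms are more likely to yield noisy training examples."""
    senses = defaultdict(set)
    for form, entity in lod_index:
        senses[form].add(entity)
    return [(form, entity) for form, entity in candidates
            if len(senses[form]) <= max_senses]
```

For example, "Paris" (a city, a town in Texas, a person's name) would be dropped as a training candidate, while a form with a single sense in the index is kept.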

Third, despite the gigantic size of the linked open data cloud, the distribution of data over domains is very unbalanced. The distribution appears to have a long tail, composed of a very large number of items (e.g. entities) that have very little usage (i.e. are linked to other items by few relations). This has made it difficult to distinguish genuinely correct data from errors in the training data selection phase. As a result, our methods may inevitably fail to learn certain useful extraction patterns due to the incompleteness of the training data. However, this also opens up research questions for the future: how do we identify and quantify the long tail? Is there still much to be learnt from the Web for items sitting in the long tail of the distribution? Does the learning method need to change in order to learn more about long-tail items? In other words, can we train models using 'head' items and then use those models to extract information for the 'tail' items? And if not, what should we do to recover the long tail?
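One crude way to quantify the long tail mentioned above is to count how many relation links each entity participates in and measure what share of entities falls below a usage threshold. The sketch below is purely illustrative; the triple format, threshold and data are invented, not taken from the project.

```python
# Toy quantification of the long tail: the fraction of entities that
# appear in fewer than `threshold` triples. Hypothetical example only.
from collections import Counter

def tail_share(triples, threshold=2):
    """Return the fraction of entities appearing in fewer than `threshold`
    triples -- a crude proxy for the size of the long tail."""
    usage = Counter()
    for subj, _pred, obj in triples:
        usage[subj] += 1
        usage[obj] += 1
    tail = [e for e in usage if usage[e] < threshold]
    return len(tail) / len(usage)
```

In a real LOD setting one would expect this share to be large: a few 'head' entities carry most of the links, while most entities have only one or two.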
Exploitation Route We have published papers and open source software.
Sectors Aerospace, Defence and Marine; Digital/Communication/Information Technologies (including Software); Healthcare; Manufacturing, including Industrial Biotechnology; Pharmaceuticals and Medical Biotechnology

URL http://staffwww.dcs.shef.ac.uk/people/F.Ciravegna/Fabio_Ciravegna/Papers.html
 
Description LODIE has attracted considerable interest from industry and has helped to secure additional funding from companies in the form of knowledge transfer. Two companies have used LODIE's technology to create competitive advantage: 1) JustGiving Ltd. (https://home.justgiving.com/), the world's social platform for giving, has used LODIE's large-scale Information Extraction (IE) technology to mine information from their very large in-house customer datasets, and information about charitable organisations and events on the Web. This information is used to profile customers and charities to enhance match-making. The work was undertaken as a 3-month project between October 2014 and January 2015, and the output has been used to rebuild JustGiving's website. 2) Klood Ltd. (https://www.klood.com/), the internet and social marketing company, has adapted LODIE's IE technology to mine specific events from social media in the football domain. This was undertaken in a one-month trial project during November 2015, followed by an 18-month project through Football Whispers Ltd, a company specialising in information and predictions in the field of football transfers. The output was released to Football Whispers users at the beginning of 2016. The company launched their main product using our technology as the main backbone of their transfer prediction engine. Over the following year and a half, we provided services to the company analysing around 70 million messages a month. To our knowledge the company had around 2.5 million unique monthly users and counted Sky Sports and FourFourTwo among its major customers. The company acquired the IP from the University in 2016 for further internal and external exploitation, with plans to port from football to other sports, starting with the American NFL.
First Year Of Impact 2014
Sector Digital/Communication/Information Technologies (including Software),Leisure Activities, including Sports, Recreation and Tourism
Impact Types Economic

 
Description Football Whispers - Mining the Web for Football Player Transfer News
Amount £100,000 (GBP)
Organisation Klood 
Sector Private
Country United Kingdom
Start 11/2015 
End 12/2017
 
Description JustGiving Charity/Cause Information Extraction system
Amount £90,000 (GBP)
Organisation JustGiving 
Sector Private
Country United Kingdom
Start 10/2014 
End 01/2015
 
Description Football Whispers 
Organisation Football Whispers Ltd
Country United Kingdom 
Sector Private 
PI Contribution Football Whispers is a company providing information on rumours about football to both enthusiasts and professionals (e.g. television networks). It is a new venture that has adopted part of the LODIE technologies (and part of the technologies developed in the Randms and Redites EPSRC projects) to analyse millions of messages from social media (e.g. Twitter). They are now online, with thousands of daily visitors to their website.
Collaborator Contribution They have provided strict requirements and pre-existing knowledge about football, as well as access to large volumes of paid-for data.
Impact The output is their own product, which is largely powered by our social media analysis technology. We are now in the process of IP release discussions, also for fields other than football. The IP rights are likely to be worth hundreds of thousands of pounds plus shares in the company.
Start Year 2015
 
Description Just Giving 
Organisation JustGiving
Country United Kingdom 
Sector Private 
PI Contribution We made the LODIE technologies available to them for the analysis of their data.
Collaborator Contribution They provided data and requirements for our research
Impact They have redesigned their web products based on the experience developed with us and our tools.
Start Year 2014
 
Description ESWC Summer School 2014 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact The summer school sparked questions and interesting discussions afterwards.

not aware of any
Year(s) Of Engagement Activity 2014
URL http://www.slideshare.net/isabelleaugenstein/introduction-to-natural-language-processing-for-the-sem...
 
Description Invited talk: Aligning relations on Linked Data 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact The talk sparked questions and in-depth discussion with a number of audience members. Potential collaborations were also discussed.

not aware of any
Year(s) Of Engagement Activity 2013
 
Description Invited talk: Linked Data for Web Scale Information Extraction 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact The talk sparked questions and in-depth discussions with a number of audience members.

not aware of any
Year(s) Of Engagement Activity 2013
URL http://staffwww.dcs.shef.ac.uk/people/A.L.Gentile/AnnalisaWebSite/annalisaRMIT.pdf
 
Description Linked Data for Web Scale Information Extraction 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact The tutorial sparked interesting discussions afterwards.

After the tutorial, some researchers approached us to discuss possible future collaborations.
Year(s) Of Engagement Activity 2013
URL http://oak.dcs.shef.ac.uk/wsie2013/index.html
 
Description Semantic Technologies Coordinator for ESWC2014 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact The main role of the Semantic Technologies Coordinators is to produce accessible Linked Open Data about a conference. ESWC2014 is the European Semantic Web Conference: http://2014.eswc-conferences.org/organizing-committee

Anna Lisa Gentile has been invited to be part of the Semantic Technologies Coordinators for next year's European Semantic Web Conference, ESWC2015: http://2015.eswc-conferences.org/about-eswc-2015/organizing-committee
Year(s) Of Engagement Activity 2014
URL http://2014.eswc-conferences.org/organizing-committee
 
Description Web Scale Information Extraction 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact The tutorial sparked discussion afterwards

After the tutorial, some researchers approached us about possible future collaborations.
Year(s) Of Engagement Activity 2013
URL http://www.ecmlpkdd2013.org/wp-content/uploads/2013/09/Web-Scale-Information-Extraction.pdf