GATE Cloud Exploratory: Adapting the General Architecture for Text Engineering to Cloud Computing

Lead Research Organisation: University of Sheffield
Department Name: Computer Science

Abstract

When you plug your fridge into the mains electricity supply you don't worry about all the technology sitting behind the wall socket -- it just works. Cloud computing is starting to supply IT in a similar fashion. No more worrying about backups, no more hours spent configuring a new or repaired machine -- just plug into the network, fire up your web browser and away you go.
Researchers have tougher and more specialised IT needs than most, so to realise the same ease of use that the cloud now provides for email or word processing requires work in several areas. One of these areas is to adapt existing established research tools to the cloud, and that is what this project will do. Our tool is called GATE, a General Architecture for Text Engineering. Over the last decade the UK's GATE system has become a world leader for research and development of text mining algorithms.
Text has become a more and more important communication method in recent decades. Our children are now spending over 6 hours in front of screens; our evenings often include sessions on Facebook or writing email to friends and relatives. When we interact with the corporations and governmental organisations whose infrastructure and services underpin our daily lives, we fill in forms or write emails. When we want to publicise our work or share details of our leisure activities we create websites, post Twitter messages or blog entries. Scientists also now use these channels in their work, in addition to publishing in peer-reviewed journals -- a process which has also seen a huge expansion in recent years.
This avalanche of the written word has changed many things, not least the way that scientists gather information. For example, a team at the World Health Organisation's cancer research agency recently found the first evidence of a link between a particular genetic mutation and the risk of lung cancer in smokers. Their experiments require large amounts of costly laboratory time to test hypotheses, based on samples of mutations in gene sequences from their test subjects. Text mining from previous publications makes it possible for them to reduce this lab time by factoring in probabilities based on association strengths between mutations, environmental factors and active chemicals.
A second area that has been revolutionised by new media is customer relations and market research, which are no longer about monitoring the goings-on of the corporate call centre. Keeping up to date with the public image of your products or services now means coping with the Twitter firehose (45 million posts per day), the comment sections of consumer review sites, or the point-and-click 'contact us' forms from the company website. To do this by hand is now impossible in the general case: the data volume long ago outstripped the possibility of cost-effective manual monitoring. Text mining provides alternative, automatic methods for dealing with text.
GATE provides four systems to support scientists experimenting with new text mining algorithms and developers using text mining in their applications:
- GATE Developer: an integrated development environment for language processing components
- GATE Embedded: an object library optimised for inclusion in diverse applications
- GATE Teamware: a collaborative annotation environment for high volume web-based semantic annotation projects built around a workflow engine
- GATE Mímir (Multi-paradigm Information Management Index and Repository): a massively scaleable multi-paradigm index
We have identified a need for a particular type of cloud service in our research field and this project will implement it such that there is close to zero barrier to entry for researchers. Based on our preliminary investigative work, we expect to complete a production quality service within this project.
In simpler terms, this project will work towards making the use of GATE on the cloud more like electric sockets and fridges!

Planned Impact

A recent study of the market for intelligent text processing solutions estimated its size at $350 million, with further significant growth potential. The study also showed that blogs and social media are the most important kind of unstructured information for which automatic solutions are needed, closely followed by news articles, email, and online forums. This project will develop a working cloud adaptation of a widely-used research platform for text processing (GATE). In order to achieve maximum impact, software will be made available as open-source under a commercially-friendly license such as LGPL. The portability, costs and effort evaluation reports will include practical guidance for companies and researchers, to promote the easy adoption of the project results, with the large and active GATE R&D company user base as a specific target. To complement these, we will undertake dissemination and commercial engagement activities aimed at key industry sectors. In terms of economic impact, we will target key growth areas for text processing solutions such as intellectual property protection and patent search; voice-of-the-customer applications; online brand, product, and reputation management; digital archives and eGovernment; and companies providing internet privacy and security services. Major beneficiaries are the Digital Economy sectors identified above. In addition to the project partners, through UK and international research projects the PI has built successful industrial collaborations with other large companies (BT, The Stationery Office, Elsevier, Nokia, Yahoo, Atos, Dassault Aviation, MPS Bank, Creditreform) and SMEs (Fizzback, Garlik, Innovantage, Ontotext, Matrixware, Mondeca, Ontoprise, ISOCO), many of whom are already using GATE-based text processing solutions and can therefore benefit directly from this project.
Further knowledge transfer opportunities arise through the University of Sheffield's connections to the digital and new media industries in the Sheffield city region, which are growing at a faster rate than anywhere else in the UK in terms of specialist companies and new jobs. A unique opportunity also arises from the 100 million pound South Yorkshire Digital Region project, which will pilot the Next Generation Broadband and thus provide the infrastructure businesses need to access cloud-based services efficiently. Last, but not least, this project will advance knowledge across scientific disciplines and improve the career opportunities of all team members by enabling them to build expertise in cloud computing. The project will help the PI and his group to re-affirm their status as world-leading researchers in their field, initiate new cross-disciplinary research and industrial collaborations, and engage with newly emerging technology. The inter-disciplinary nature of our research partners will help researchers from social sciences, humanities, bio-informatics, and information science to gain knowledge of fields complementary to their primary expertise and to make use of the new cloud-based text processing infrastructure.

Publications

Tablan V (2013) GATECloud.net: a platform for large-scale, open-source text processing on the cloud. in Philosophical transactions. Series A, Mathematical, physical, and engineering sciences

 
Description: Establishment of leading open source cloud-based text analytics platform; numerous commercial uptakes.
First Year Of Impact: 2011
Sector: Agriculture, Food and Drink; Digital/Communication/Information Technologies (including Software); Financial Services, and Management Consultancy; Healthcare; Culture, Heritage, Museums and Collections
Impact Types: Societal; Economic; Policy & public services

 
Description: GATECloud.net website
Form Of Engagement Activity: A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach: International
Primary Audience:
Results and Impact: The live demonstration of the results of the project
Year(s) Of Engagement Activity: 2011