DISSP: Dependable Internet-Scale Stream Processing
Lead Research Organisation: Imperial College London
Department Name: Computing
Abstract
Real-time stream data has begun to play an increasingly important role on the Internet. One of the causes for this is the proliferation of geographically-distributed stream data sources such as sensor networks, scientific instruments, pervasive computing environments and web feeds connected to the Internet. Potentially millions of users world-wide want to take advantage of the availability of this data. Therefore they require a convenient way to process real-time stream data at a global scale through applications that perform Internet-scale stream processing (ISSP). Analogous to how search engines make static web data useful to users, an ISSP system reliably collects, filters and processes stream data from potentially thousands of data sources on behalf of many users.
The research focus of this proposal is to address a major challenge facing all ISSP applications: achieving robustness in the presence of failures. Failures are a fact of life in such systems because of their scale, heterogeneity and reliance on best-effort Internet infrastructure. This conflicts with the requirements of many users that demand a dependable ISSP (DISSP) service, that is, a service that should continue functioning even during Internet path outages, processing host failures and resource shortages. For example, consider a scientific study that detects and analyses transient events in the sky by correlating high-bandwidth, real-time image streams from dozens of radio telescopes world-wide. After the failure of the network connection to one of the telescopes, a scientist would expect the system to continue operating, perhaps with reduced resolution. Similarly, the failure of a host that processes the sky images should not cause a service interruption, although it may lead to a decrease in detection confidence of anomalies. Unfortunately, mechanisms for building DISSP systems are lacking. Conventional techniques for reliable data processing cannot be applied to a DISSP system because of its requirements of global scalability, of short recovery times in a real-time setting and of resource efficiency due to a shared infrastructure.
This proposal addresses these challenges so that ISSP systems needed for interconnecting tomorrow's pervasive sensor systems and global scientific experiments can become a reality. We intend to develop new techniques for building dependable ISSP systems, taking their unique features in terms of scale, failure model and data quality into account. In particular, we will devise approaches that gracefully degrade result quality in response to resource shortage after failure. While result quality is reduced, the system will provide constant feedback to users on the achieved level of service. Feedback will be expressed in a domain-specific way, e.g., by notifying a scientific user about the reduction in detection confidence of events of interest. This feedback will also drive an adaptive fault-tolerance mechanism, allowing the DISSP system to strategise about resource allocation in order to minimise the reduction in service quality of a maximum number of users.
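As a rough illustration of graceful degradation with per-user feedback, the sketch below divides a reduced processing capacity among users with a max-min fair (water-filling) split and reports the fraction of each user's demand that is still served. This is a standard textbook technique chosen for illustration, not the mechanism developed in the project; the names `fair_shed` and `quality_feedback`, the user labels and the tuples-per-second units are all hypothetical.

```python
def fair_shed(demands, capacity):
    """Max-min fair split of `capacity` across per-user demands.

    `demands` maps a (hypothetical) user id to the rate it needs,
    e.g. in tuples per second. Returns the rate granted to each user.
    """
    allocation = {user: 0.0 for user in demands}
    remaining = float(capacity)
    unsatisfied = {u for u, d in demands.items() if d > 0}

    # Repeatedly offer every unsatisfied user an equal share of what is
    # left; users needing less than their share free capacity for others.
    while unsatisfied and remaining > 1e-9:
        share = remaining / len(unsatisfied)
        for user in sorted(unsatisfied):
            grant = min(share, demands[user] - allocation[user])
            allocation[user] += grant
            remaining -= grant
        unsatisfied = {
            u for u in unsatisfied if demands[u] - allocation[u] > 1e-9
        }
    return allocation


def quality_feedback(demands, allocation):
    """Fraction of each user's demand that is served (1.0 = no degradation)."""
    return {
        user: (allocation[user] / demand if demand > 0 else 1.0)
        for user, demand in demands.items()
    }


if __name__ == "__main__":
    # Assumed scenario: a host failure leaves 60 units of capacity
    # for three queries that together need 100 units.
    demands = {"alice": 10, "bob": 40, "carol": 50}
    allocation = fair_shed(demands, capacity=60)
    for user, q in sorted(quality_feedback(demands, allocation).items()):
        print(f"{user}: served {q:.1%} of demand")
```

In this split the smallest query keeps full quality while the two larger ones are degraded to the same granted rate, which is one simple way of limiting the number of users whose service is reduced.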
Organisations
People | Peter Pietzuch (Principal Investigator) |
Publications
Castro Fernandez R
(2013)
Integrating scale out and fault tolerance in stream processing using operator state management
Cervino J
(2012)
Adaptive Provisioning of Stream Processing Systems in the Cloud
in 7th International Workshop on Self Managing Database Systems (SMDB'12)
Kalyvianaki E
(2012)
Overload Management in Data Stream Processing Systems with Latency Guarantees
Eyers D
(2012)
Living in the present
Fernandez R
(2016)
Java2SDG: Stateful big data processing for the masses
Fernandez R
(2014)
Scalable stateful stream processing for smart grids
Fernandez R C
(2014)
Making State Explicit for Imperative Big Data Processing
Fiscato M
(2009)
A Quality-Centric Data Model for Distributed Stream Management Systems
in 7th International Workshop on Quality in Database (QDB'09)
Description | The focus of the DISSP project is to investigate new scalable and reliable software platforms for processing large amounts of streaming data in real time. The research resulted in two architectures: (a) the DISSP system, which can process data in resource-constrained environments while guaranteeing the fairness and quality of processing; and (b) the SEEP system, a data-parallel Big Data processing platform that provides a novel stateful dataflow model, which is more expressive than existing Big Data processing models. |
Exploitation Route | Both the DISSP and SEEP systems have been released as open-source software. The SEEP data processing platform is available in a public GitHub repository and has already gained a number of external users, including IBM and several start-ups. The developed stream processing technology has applications in any domain that has to analyse Big Data as part of its operation. |
Sectors | Digital/Communication/Information Technologies (including Software), Financial Services, and Management Consultancy, Retail, Security and Diplomacy, Transport |
URL | http://lsds.doc.ic.ac.uk/projects/DISSP |
Description | The SEEP platform was used by IBM to create a port of the system to mobile phones, with the goal of building sensing applications on mobile devices. In addition, there was considerable commercial interest in the SEEP platform, and we are exploring the exploitation of SEEP with our industrial partners (in particular BAE Systems). A follow-on project with BP explores applications in the oil and gas domain; a project with Morgan Stanley and VMware investigates applications related to the management of large IT infrastructures. We have also extended SEEP to support machine learning workloads, and have received further industrial interest regarding this. |
First Year Of Impact | 2012 |
Sector | Aerospace, Defence and Marine, Digital/Communication/Information Technologies (including Software), Energy, Financial Services, and Management Consultancy, Other |
Impact Types | Economic |
Description | BP Complex Event Processing |
Amount | £295,000 (GBP) |
Organisation | BP (British Petroleum) |
Sector | Private |
Country | United Kingdom |
Start | 11/2013 |
End | 10/2015 |
Description | Custom and Multicore Technologies for Regular Matching Over Streams |
Amount | £94,621 (GBP) |
Funding ID | CASE Award |
Organisation | BAE Systems |
Sector | Academic/University |
Country | United Kingdom |
Start | 09/2011 |
End | 03/2015 |
Description | Data Stream Processing in Hybrid Coalition Networks |
Amount | £128,000 (GBP) |
Organisation | IBM |
Department | IBM UK Ltd |
Sector | Private |
Country | United Kingdom |
Start | 11/2013 |
End | 04/2015 |
Description | HARNESS: Hardware- and Network-Enhanced Software Systems for Cloud Computing |
Amount | €750,000 (EUR) |
Funding ID | 318521 |
Organisation | European Commission |
Sector | Public |
Country | European Union (EU) |
Start | 09/2012 |
End | 09/2015 |
Description | SEEP: Scalable and Elastic Stream Processing |
Amount | £94,621 (GBP) |
Funding ID | CASE Award |
Organisation | BAE Systems |
Sector | Academic/University |
Country | United Kingdom |
Start | 09/2011 |
End | 10/2014 |
Title | SEEP stream processing platform |
Description | SEEP is a scalable stream processing platform that can be used for real-time Big Data processing in cloud environments. |
Type Of Technology | Software |
Year Produced | 2014 |
Open Source License? | Yes |
Impact | The SEEP platform was used for a range of projects by collaborators, e.g. from IBM. |
URL | http://github.com/lsds/Seep/ |