Error-tolerant Stream Processing System Design (ESP-SD)

Lead Research Organisation: UNIVERSITY COLLEGE LONDON

Department Name: Electronic and Electrical Engineering

Abstract

The energy dissipation and fault rates in future CMOS integration are expected to require the abandonment of traditional system reliability in favour of approaches that control errors across the application, runtime support, and system architecture. Commercial stakeholders of stream processing applications, such as multimedia analysis & retrieval and webpage ranking systems, already feel the strain of inadequate system-level scaling and robustness under increasing user demand. While such applications can tolerate certain imprecision (errors) in their calculations, this aspect is currently not used for system scalability and resilience.

The ESP-SD project will derive theory, methods, and a prototype set of tools for scalable adjustment of computation and error-propagation in stream processing applications operating under a fault-generating computing environment. This will be achieved via the following innovations:

* stochastic models of error tolerance in algorithms for multimedia and linked-data analytics (used in audio/video matching, semantic multimedia retrieval, webpage ranking systems, etc.);

* new forms of accelerated error-tolerant computation within numerical stream processing libraries;

* opportunistic designs for compiler and runtime support offering graceful resilience to runtime errors.

ESP-SD aims for up to two orders of magnitude of throughput and energy scaling against conventional processing (under the same platform), with application results that are reliable in a stochastic sense. That is, all mechanisms for acceleration, energy saving and reliability in ESP-SD are geared towards minimizing the "expected" error in applications, and not the worst-case error.

The project will derive practical designs demonstrating its impact, examples of which are provided in the "Impact Summary" and the "Pathways to Impact" on JeS.

Planned Impact

Cross-layer resiliency will most likely become a requirement for future HPC and embedded stream processing systems. ESP-SD will investigate methods for graceful precision degradation and error tolerance at the system level. The derived scientific publications could become trend setters in the evolution of stream-processing systems, while the provided demonstration results could pioneer future deployments within commercial systems, with the potential to influence consumer services used by millions of people.

Indicative impact cases include (but are not limited to):

* Robust realisation of stream processing algorithms on processor hardware that did not pass quality control and may exhibit transient errors due to process variations.

* Use of ageing (low-performance, possibly unreliable) processor hardware at a lower-precision setting instead of decommissioning it.

* Development of entirely new single-instruction multiple-data (SIMD) instructions for graceful numerical computing, i.e., fault-tolerant SIMD extensions (termed "turbo SIMD") with increased throughput and decreased energy consumption when deriving approximate results.

* Derivation of ultra low-power approximate stream processing primitives that could allow for "always on" stream analysis on mobile devices. For example, via the derived primitives, a mobile device could constantly capture and analyse visual, auditory and other sensory inputs (e.g., GPS and acceleration data) with very low overhead in energy consumption. Such analytics can be used to predict user needs and make automated intelligent suggestions based on contextual awareness.

* Creation of an integrated cross-layer framework where energy consumption, throughput increase, precision and fault tolerance are competing cost functions of the design: one can seamlessly be traded for the other.

* Consideration, for the first time, of user-centric questions such as: Do average guarantees of precision and quality-of-service satisfy the end user better than over-provisioned worst-case guarantees? Can a marketplace be created for bidding on prices for higher-precision outputs (e.g., in finance or media) with benefits for developers and more options for end users?

Funded Value:

£469,399

Funded Period:

Nov 14 - Nov 17

Funder:

EPSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

EP/M00113X/1

Principal Investigator:

Yiannis Andreopoulos

Research Subject:

Info. & commun. Technol. (100%)

Research Topic:

Computer Sys. & Architecture (100%)

Organisations

People	ORCID iD
Yiannis Andreopoulos (Principal Investigator)

Publications

Anam M (2014) Precision-Energy-Throughput Scaling of Generic Matrix Multiplication and Convolution Kernels via Linear Projections in IEEE Transactions on Circuits and Systems for Video Technology

Anam M (2016) Reliable Linear, Sesquilinear, and Bijective Operations on Integer Data Streams Via Numerical Entanglement in IEEE Transactions on Signal Processing

Anam M (2016) Reliable Linear, Sesquilinear and Bijective Operations On Integer Data Streams Via Numerical Entanglement

Anam M (2018) Generalized Numerical Entanglement for Reliable Linear, Sesquilinear and Bijective Operations on Integer Data Streams in IEEE Transactions on Emerging Topics in Computing

Anarado I (2017) Mitigating Silent Data Corruptions in Integer Matrix Products: Toward Reliable Multimedia Computing on Unreliable Hardware in IEEE Transactions on Circuits and Systems for Video Technology

Anarado I (2016) Core Failure Mitigation in Integer Sum-of-Product Computations on Cloud Computing Systems in IEEE Transactions on Multimedia

Chadha A (2016) Voronoi-based compact image descriptors: Efficient Region-of-Interest retrieval with VLAD and deep-learning-based descriptors

Chadha A (2017) Voronoi-Based Compact Image Descriptors: Efficient Region-of-Interest Retrieval With VLAD and Deep-Learning-Based Descriptors in IEEE Transactions on Multimedia

Ren S (2014) Dynamic Scheduling for Energy Minimization in Delay-Sensitive Stream Mining in IEEE Transactions on Signal Processing

Renna F (2016) Query Processing for the Internet-of-Things: Coupling of Device Energy Consumption and Cloud Infrastructure Billing

Key Findings
Impact Summary
Further Funding
Intellectual Property


Description	The energy dissipation and fault rates in future CMOS integration are expected to require the abandonment of traditional system reliability in favour of approaches that control errors across the application, runtime support, and system architecture. Commercial stakeholders of stream processing applications, such as multimedia analysis & retrieval and webpage ranking systems, already feel the strain of inadequate system-level scaling and robustness under increasing user demand. While such applications can tolerate certain imprecision (errors) in their calculations, this aspect is currently not used for system scalability and resilience.

The ESP-SD project derives theory, methods, and a prototype set of tools for scalable adjustment of computation and error-propagation in stream processing applications operating under a fault-generating computing environment. This is achieved via the following innovations:

* stochastic models of error tolerance in algorithms for multimedia and linked-data analytics (used in audio/video matching, semantic multimedia retrieval, webpage ranking systems, etc.);

* new forms of accelerated error-tolerant computation within numerical stream processing libraries;

* opportunistic designs for compiler and runtime support offering graceful resilience to runtime errors.

ESP-SD aims for up to two orders of magnitude of throughput and energy scaling against conventional processing (under the same platform), with application results that are reliable in a stochastic sense. That is, all mechanisms for acceleration, energy saving and reliability in ESP-SD are geared towards minimizing the "expected" error in applications, and not the worst-case error.

The project has already led to one patent filing and to initial code that has been released as open source on GitHub: https://github.com/NumericalPacking/Core-Failure-Mitigation-Code
Exploitation Route
* licensing of the technology protected via the patent filing;
* further collaborations for R&D activities on error-tolerant system design;
* influencing the research direction of the error-tolerant systems community.
Sectors	Digital/Communication/Information Technologies (including Software), Electronics
URL	https://github.com/NumericalPacking/Core-Failure-Mitigation-Code


Description	1) Code from our research has been licensed to the industrial partner (IMGTEC).

2) A patent has been filed on a new concept discovered within the project (numerical entanglement). This patent has already been cited by a number of other patent filings, e.g., by Microsoft (US Patent 10,437,868 B2) and Tesla (US Patent 10,606,678 B2), amongst others.

3) Follow-on funding:
i) European Commission: 750254 - Enabling Visual IoT Applications with Advanced Network Coding Algorithms (€195,454; 2018 - 2020)
ii) Royal Society Leverhulme Trust Senior Research Fellowship: LTSRF1617/13/28 - Leverhulme Trust Senior Research Fellowship (£51,380; 2017 - 2018)

4) New memories that are non-volatile (like disks), yet as fast as volatile memory, are on the horizon. In collaboration with Intel, we lead the research in this disruptive new area on how to integrate these new memories into computing systems. In early 2016, our team and collaborators from Intel made a case for making the memory controllers (the hardware interface to memory) persistent and disclosed this to Intel in April 2016. Intel released a public disclosure in September 2016 stating that Intel's memory controllers will in fact become persistent in future Intel architectures (https://software.intel.com/en-us/blogs/2016/09/12/deprecate-pcommit-instruction).
First Year Of Impact	2016
Sector	Digital/Communication/Information Technologies (including Software), Electronics
Impact Types	Economic


Description	Enabling Visual IoT Applications with Advanced Network Coding Algorithms
Amount	€ 195,454 (EUR)
Funding ID	750254
Organisation	European Commission
Sector	Public
Country	Belgium
Start	01/2018
End	01/2020


Description	Leverhulme Trust Senior Research Fellowship
Amount	£51,380 (GBP)
Funding ID	LTSRF1617/13/28
Organisation	The Royal Society
Department	Royal Society Leverhulme Trust Senior Research Fellowship
Sector	Charity/Non Profit
Country	United Kingdom
Start	08/2017
End	09/2018


Title	METHOD AND APPARATUS FOR THE DETECTION OF FAULTS IN DATA COMPUTATIONS
Description	A method and apparatus for detecting and mitigating faults in numerical computations of M input data streams is claimed (embodiments of Figure 1 and Figure 14). Such faults may occur due to circuit or processor malfunctions stemming from (but not limited to): supply voltage or current fluctuation, timing signal errors, hardware device noise, or other signalling, hardware, or software non-idealities. The invented method and apparatus for numerical entanglement linearly superimposes M input data streams to form M numerically-entangled data streams that can optionally be stored in place of the original inputs (as in the example embodiments of: Step 2 of Figure 1 and item 1054 of Figure 14). A series of operations, such as (but not limited to) scaling, additions/subtractions, inner or outer vector or matrix products, and permutations, can then be performed directly on these entangled data streams (as in the example embodiment of Step 3 of Figure 1, operator g of Figure 2, Figures 6-11, item 1053 of Figure 14). The output results are disentangled from the M entangled output streams by additions and arithmetic shifts (example embodiments of Steps 4 and 5 of Figure 1, "disentanglement and fault checking" of Figure 2, item 1056 of Figure 14). A post-computation reliability check detects processing errors affecting disentangled outputs (example embodiments of item 1056 of Figure 14, Figures 15a, 15b, 16a, 16b, 17a, 17b).
IP Reference	WO2016034874
Protection	Patent application published
Year Protection Granted	2016
Licensed	No
Impact
* several ongoing discussions on extending this discovery to applications beyond fault detection and recovery, e.g., privacy-preserving computations, with the privacy offered by the fact that data computations are performed in entangled form;
* potential follow-on research on applying the method to data compaction.
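
The entangle, compute, disentangle, and check sequence in the patent description above can be illustrated with a small sketch. This is a simplified illustration written for this page, not the claimed method or the project's released code: the choice of M = 3 streams, the cyclic pairing of streams, the shift width L_BITS, and all function names are assumptions. Each entangled sample holds one stream in its low bits and another stream shifted left above it, so every output can be recovered from two different entangled results and the two copies compared.

```python
from typing import List, Tuple

L_BITS = 16  # assumed shift width; each output must fit in a signed L_BITS-bit range


def entangle(c0: List[int], c1: List[int], c2: List[int], l: int = L_BITS):
    """Cyclically superimpose three integer streams (hypothetical pairing)."""
    e0 = [(z << l) + x for x, z in zip(c0, c2)]  # c2 above c0
    e1 = [(x << l) + y for x, y in zip(c0, c1)]  # c0 above c1
    e2 = [(y << l) + z for y, z in zip(c1, c2)]  # c1 above c2
    return e0, e1, e2


def inner_product(stream: List[int], weights: List[int]) -> int:
    """Example linear operation applied directly to an entangled stream."""
    return sum(w * s for w, s in zip(weights, stream))


def split(delta: int, l: int = L_BITS) -> Tuple[int, int]:
    """Recover (low, high) signed parts using a mask, a subtraction, and a shift."""
    low = delta & ((1 << l) - 1)
    if low >= 1 << (l - 1):       # reinterpret the low field as a signed value
        low -= 1 << l
    high = (delta - low) >> l     # exact: delta - low is a multiple of 2**l
    return low, high


def disentangle_and_check(d_e0: int, d_e1: int, d_e2: int, l: int = L_BITS):
    """Return the three outputs and whether both copies of each output agree."""
    d0_a, d2_b = split(d_e0, l)
    d1_a, d0_b = split(d_e1, l)
    d2_a, d1_b = split(d_e2, l)
    ok = (d0_a == d0_b) and (d1_a == d1_b) and (d2_a == d2_b)
    return (d0_a, d1_a, d2_a), ok


if __name__ == "__main__":
    c0, c1, c2 = [1, -2, 3], [4, 5, -6], [7, 8, 9]
    w = [2, -1, 3]
    deltas = [inner_product(e, w) for e in entangle(c0, c1, c2)]
    print(disentangle_and_check(*deltas))   # ((13, -15, 33), True)
    deltas[1] ^= 1 << 20                    # simulate a bit flip in one result
    print(disentangle_and_check(*deltas)[1])  # False
```

The sketch relies on Python's arbitrary-precision integers, so it sidesteps the fixed word-width constraints that a hardware or SIMD realisation would have to respect; it only detects a mismatch and does not attempt recovery.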
