Error-tolerant Stream Processing System Design (ESP-SD)

Lead Research Organisation: University College London
Department Name: Electronic and Electrical Engineering

Abstract

The energy dissipation and fault rates in future CMOS integration are expected to require the abandonment of traditional system reliability in favour of approaches that control errors across the application, runtime support, and system architecture. Commercial stakeholders of stream processing applications, such as multimedia analysis & retrieval and webpage ranking systems, already feel the strain of inadequate system-level scaling and robustness under increasing user demand. While such applications can tolerate certain imprecision (errors) in their calculations, this aspect is currently not used for system scalability and resilience.

The ESP-SD project will derive theory, methods, and a prototype set of tools for scalable adjustment of computation and error-propagation in stream processing applications operating under a fault-generating computing environment. This will be achieved via the following innovations:

* stochastic models of error tolerance in algorithms for multimedia and linked-data analytics (used in audio/video matching, semantic multimedia retrieval, webpage ranking systems, etc.);

* new forms of accelerated error-tolerant computation within numerical stream processing libraries;

* opportunistic designs for compiler and runtime support offering graceful resilience to runtime errors.

ESP-SD aims for up to two orders of magnitude of throughput and energy scaling against conventional processing (under the same platform), with application results that are reliable in a stochastic sense. That is, all mechanisms for acceleration, energy saving and reliability in ESP-SD are geared towards minimizing the "expected" error in applications, and not the worst-case error.

The project will derive practical designs demonstrating its impact, examples of which are provided in the "Impact Summary" and the "Pathways to Impact" on JeS.

Planned Impact

Cross-layer resiliency will most likely become a requirement for future HPC and embedded stream processing systems. ESP-SD will investigate methods for graceful precision degradation and error tolerance at the system level. The derived scientific publications could become trend setters in the evolution of stream-processing systems, while the provided demonstration results could pioneer future deployments within commercial systems, with the potential to influence consumer services used by millions of people.

Indicative impact cases include (but are not limited to):

* Robust realisation of stream processing algorithms under processor hardware that did not pass the quality control and may exhibit transient errors due to process variations.

* Use of ageing (low-performance, possibly unreliable) processor hardware at a lower-precision setting instead of being decommissioned.

* Development of entirely new single-instruction multiple-data (SIMD) instructions for graceful numerical computing, i.e., fault-tolerant SIMD extensions (termed as "turbo SIMD") with increased throughput and decreased energy consumption when deriving approximate results.

* Derivation of ultra low-power approximate stream processing primitives that could allow for "always on" stream analysis on mobile devices. For example, via the derived primitives, a mobile device could constantly capture and analyse visual, auditory and other sensory inputs (e.g., GPS and acceleration data) with very low overhead in energy consumption. Such analytics can be used to predict user needs and make automated intelligent suggestions based on contextual awareness.

* Creation of an integrated cross-layer framework where energy consumption, throughput increase, precision and fault tolerance are competing cost functions of the design: one can seamlessly be traded for the other.

* Consideration, for the first time, of user-centric questions such as: Do average guarantees of precision and quality-of-service satisfy the end-user better than over-provisioned worst-case guarantees? Can a marketplace be created for bidding on prices for higher-precision outputs (e.g., in finance or media) with benefits for developers and more options for end users?

Publications

Description The energy dissipation and fault rates in future CMOS integration are expected to require the abandonment of traditional system reliability in favour of approaches that control errors across the application, runtime support, and system architecture. Commercial stakeholders of stream processing applications, such as multimedia analysis & retrieval and webpage ranking systems, already feel the strain of inadequate system-level scaling and robustness under increasing user demand. While such applications can tolerate certain imprecision (errors) in their calculations, this aspect is currently not used for system scalability and resilience.

The ESP-SD project derives theory, methods, and a prototype set of tools for scalable adjustment of computation and error-propagation in stream processing applications operating under a fault-generating computing environment. This is achieved via the following innovations:

* stochastic models of error tolerance in algorithms for multimedia and linked-data analytics (used in audio/video matching, semantic multimedia retrieval, webpage ranking systems, etc.);

* new forms of accelerated error-tolerant computation within numerical stream processing libraries;

* opportunistic designs for compiler and runtime support offering graceful resilience to runtime errors.

ESP-SD aims for up to two orders of magnitude of throughput and energy scaling against conventional processing (under the same platform), with application results that are reliable in a stochastic sense. That is, all mechanisms for acceleration, energy saving and reliability in ESP-SD are geared towards minimizing the "expected" error in applications, and not the worst-case error.

The project has already led to one patent filing and to initial code that has been released as open source on GitHub:
https://github.com/NumericalPacking/Core-Failure-Mitigation-Code
Exploitation Route
* licensing of the technology protected via the patent filing;
* further collaborations for R&D activities on error-tolerant system design;
* influencing the research direction of the error-tolerant systems community.
Sectors Digital/Communication/Information Technologies (including Software), Electronics

URL https://github.com/NumericalPacking/Core-Failure-Mitigation-Code
 
Description 1) Code from our research has been licensed to the industrial partner (IMGTEC).
2) A patent has been filed on a new concept discovered within the project (numerical entanglement). This patent has already been cited by a number of other patent filings, e.g., by Microsoft (US Patent 10,437,868 B2) and Tesla (US Patent 10,606,678 B2), amongst others.
3) Follow-on funding:
i) European Commission: 750254 - Enabling Visual IoT Applications with Advanced Network Coding Algorithms (€ 195,454; 2018 - 2020)
ii) Royal Society Leverhulme Trust Senior Research Fellowship: LTSRF1617/13/28 - Leverhulme Trust Senior Research Fellowship (£51,380; 2017 - 2018)
4) New memories that are non-volatile (like disks), yet as fast as volatile memory, are on the horizon. In collaboration with Intel, we lead the research in this disruptive new area on how to integrate these new memories into computing systems. In early 2016, our team and collaborators from Intel made the case for making the memory controllers (the hardware interface to memory) persistent, and disclosed this to Intel in April 2016. Intel released a public disclosure in September 2016 stating that Intel's memory controllers will in fact become persistent in future Intel architectures (https://software.intel.com/en-us/blogs/2016/09/12/deprecate-pcommit-instruction).
First Year Of Impact 2016
Sector Digital/Communication/Information Technologies (including Software), Electronics
Impact Types Economic

 
Description Enabling Visual IoT Applications with Advanced Network Coding Algorithms
Amount € 195,454 (EUR)
Funding ID 750254 
Organisation European Commission 
Sector Public
Country European Union (EU)
Start 01/2018 
End 01/2020
 
Description Leverhulme Trust Senior Research Fellowship
Amount £51,380 (GBP)
Funding ID LTSRF1617/13/28 
Organisation The Royal Society 
Department Royal Society Leverhulme Trust Senior Research Fellowship
Sector Charity/Non Profit
Country United Kingdom
Start 09/2017 
End 09/2018
 
Title METHOD AND APPARATUS FOR THE DETECTION OF FAULTS IN DATA COMPUTATIONS 
Description A method and apparatus for detecting and mitigating faults in numerical computations of M input data streams is claimed (embodiments of Figure 1 and Figure 14). Such faults may occur due to circuit or processor malfunctions stemming from (but not limited to): supply voltage or current fluctuation, timing signal errors, hardware device noise, or other signalling, hardware, or software non-idealities. The invented method and apparatus for numerical entanglement linearly superimposes M input data streams to form M numerically-entangled data streams that can optionally be stored in-place of the original inputs (as in the example embodiments of: Step 2 of Figure 1 and item 1054 of Figure 14). A series of operations, such as (but not limited to): scaling, additions/subtractions, inner or outer vector or matrix products and permutations, can then be performed directly using these entangled data streams (as in the example embodiment of Step 3 of Figure 1, operator g of Figure 2, Figures 6-11, item 1053 of Figure 14). The output results are disentangled from the M entangled output streams by additions and arithmetic shifts (example embodiments of Steps 4 and 5 of Figure 1, "disentanglement and fault checking" of Figure 2, item 1056 of Figure 14). A post-computation reliability check detects processing errors affecting disentangled outputs (example embodiments of item 1056 of Figure 14, Figures 15a, 15b, 16a, 16b, 17a, 17b). 
IP Reference WO2016034874 
Protection Patent application published
Year Protection Granted 2016
Licensed No
Impact
* several ongoing discussions on extending this discovery to applications beyond fault detection and recovery, e.g., privacy-preserving computations, with the privacy offered by the fact that data computations are performed in entangled form;
* potential follow-on research on applying the method to data compaction.
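
The flow summarised in the patent description (superimpose M input streams, compute directly on the entangled streams, disentangle with additions and arithmetic shifts, then check for faults) can be illustrated with a toy sketch. This is a simplified illustration and not the procedure claimed in the patent: the shift width `L`, the pairing of each stream with its neighbour, the helper names and the consistency check are all assumptions made for this example. It also assumes integer data, a linear operation (an inner product), and outputs small enough to satisfy |d| < 2**(L-1).

```python
L = 16  # hypothetical shift width; assumes every output d satisfies |d| < 2**(L-1)


def entangle(streams):
    """Superimpose M integer streams: e[m][n] = (c[m-1][n] << L) + c[m][n]."""
    M = len(streams)
    return [
        [(prev << L) + cur for prev, cur in zip(streams[(m - 1) % M], streams[m])]
        for m in range(M)
    ]


def inner_product(stream, kernel):
    """A linear operation applied directly to one (entangled) stream."""
    return sum(x * k for x, k in zip(stream, kernel))


def split(delta):
    """Split delta = (hi << L) + lo, with lo read as a signed L-bit value."""
    half = 1 << (L - 1)
    lo = ((delta + half) & ((1 << L) - 1)) - half
    hi = (delta - lo) >> L
    return hi, lo


def disentangle_and_check(deltas):
    """Recover the M outputs and compare the two copies of each one."""
    M = len(deltas)
    parts = [split(d) for d in deltas]
    outputs = [lo for _, lo in parts]
    # Output m also appears as the high part of entangled result m+1.
    consistent = all(parts[(m + 1) % M][0] == outputs[m] for m in range(M))
    return outputs, consistent


streams = [[1, 2, 3], [4, -5, 6], [7, 8, -9]]  # M = 3 toy input streams
kernel = [2, -1, 3]
deltas = [inner_product(e, kernel) for e in entangle(streams)]
outputs, consistent = disentangle_and_check(deltas)

faulty = list(deltas)
faulty[1] += 1 << 3  # simulated bit flip in one entangled result
_, consistent_after_fault = disentangle_and_check(faulty)
```

Because the operation is linear, each entangled result carries two outputs, so every output is available twice across the M results. In this sketch a fault that corrupts a single entangled result makes the two copies disagree, which is what the final check detects; it makes no attempt at recovery.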