Error-tolerant Stream Processing System Design (ESP-SD)
Lead Research Organisation:
University College London
Department Name: Electronic and Electrical Engineering
Abstract
The energy dissipation and fault rates in future CMOS integration are expected to require the abandonment of traditional system reliability in favour of approaches that control errors across the application, runtime support, and system architecture. Commercial stakeholders of stream processing applications, such as multimedia analysis & retrieval and webpage ranking systems, already feel the strain of inadequate system-level scaling and robustness under increasing user demand. While such applications can tolerate certain imprecision (errors) in their calculations, this aspect is currently not used for system scalability and resilience.
The ESP-SD project will derive theory, methods, and a prototype set of tools for scalable adjustment of computation and error-propagation in stream processing applications operating under a fault-generating computing environment. This will be achieved via the following innovations:
* stochastic models of error tolerance in algorithms for multimedia and linked-data analytics (used in audio/video matching, semantic multimedia retrieval, webpage ranking systems, etc.);
* new forms of accelerated error-tolerant computation within numerical stream processing libraries;
* opportunistic designs for compiler and runtime support offering graceful resilience to runtime errors.
ESP-SD aims for up to two orders of magnitude of throughput and energy scaling against conventional processing (under the same platform), with application results that are reliable in a stochastic sense. That is, all mechanisms for acceleration, energy saving and reliability in ESP-SD are geared towards minimizing the "expected" error in applications, and not the worst-case error.
The project will derive practical designs demonstrating its impact, examples of which are provided in the "Impact Summary" and the "Pathways to Impact" on JeS.
Planned Impact
Cross-layer resiliency will most likely become a requirement for future HPC and embedded stream processing systems. ESP-SD will investigate methods for graceful precision degradation and error tolerance at the system level. The derived scientific publications could become trend setters in the evolution of stream-processing systems, while the provided demonstration results could pioneer future deployments within commercial systems, with the potential to influence consumer services used by millions of people.
Indicative impact cases include (but are not limited to):
* Robust realisation of stream processing algorithms on processor hardware that did not pass quality control and may exhibit transient errors due to process variations.
* Use of ageing (low-performance, possibly unreliable) processor hardware at a lower-precision setting instead of being decommissioned.
* Development of entirely new single-instruction multiple-data (SIMD) instructions for graceful numerical computing, i.e., fault-tolerant SIMD extensions (termed "turbo SIMD") with increased throughput and decreased energy consumption when deriving approximate results.
* Derivation of ultra low-power approximate stream processing primitives that could allow for "always on" stream analysis on mobile devices. For example, via the derived primitives, a mobile device could constantly capture and analyse visual, auditory and other sensory inputs (e.g., GPS and acceleration data) with very low overhead in energy consumption. Such analytics can be used to predict user needs and make automated intelligent suggestions based on contextual awareness.
* Creation of an integrated cross-layer framework where energy consumption, throughput increase, precision and fault tolerance are competing cost functions of the design: one can seamlessly be traded for the other.
* Consideration, for the first time, of user-centric questions such as: Do average guarantees of precision and quality-of-service satisfy the end user better than over-provisioned worst-case guarantees? Can a marketplace be created for bidding on prices for higher-precision outputs (e.g., in finance or media) with benefits for developers and more options for end users?
Organisations
People |
ORCID iD |
Yiannis Andreopoulos (Principal Investigator) |
Publications
Anam M (2016) Reliable Linear, Sesquilinear, and Bijective Operations on Integer Data Streams Via Numerical Entanglement, in IEEE Transactions on Signal Processing
Anam M (2018) Generalized Numerical Entanglement for Reliable Linear, Sesquilinear and Bijective Operations on Integer Data Streams, in IEEE Transactions on Emerging Topics in Computing
Anam M (2014) Precision-Energy-Throughput Scaling of Generic Matrix Multiplication and Convolution Kernels via Linear Projections, in IEEE Transactions on Circuits and Systems for Video Technology
Anarado I (2017) Mitigating Silent Data Corruptions in Integer Matrix Products: Toward Reliable Multimedia Computing on Unreliable Hardware, in IEEE Transactions on Circuits and Systems for Video Technology
Anarado I (2016) Core Failure Mitigation in Integer Sum-of-Product Computations on Cloud Computing Systems, in IEEE Transactions on Multimedia
Chadha A (2017) Voronoi-Based Compact Image Descriptors: Efficient Region-of-Interest Retrieval With VLAD and Deep-Learning-Based Descriptors, in IEEE Transactions on Multimedia
Ren S (2014) Dynamic Scheduling for Energy Minimization in Delay-Sensitive Stream Mining, in IEEE Transactions on Signal Processing
Description | The energy dissipation and fault rates in future CMOS integration are expected to require the abandonment of traditional system reliability in favour of approaches that control errors across the application, runtime support, and system architecture. Commercial stakeholders of stream processing applications, such as multimedia analysis & retrieval and webpage ranking systems, already feel the strain of inadequate system-level scaling and robustness under increasing user demand. While such applications can tolerate certain imprecision (errors) in their calculations, this aspect is currently not used for system scalability and resilience. The ESP-SD project derives theory, methods, and a prototype set of tools for scalable adjustment of computation and error-propagation in stream processing applications operating under a fault-generating computing environment. This is achieved via the following innovations: * stochastic models of error tolerance in algorithms for multimedia and linked-data analytics (used in audio/video matching, semantic multimedia retrieval, webpage ranking systems, etc.); * new forms of accelerated error-tolerant computation within numerical stream processing libraries; * opportunistic designs for compiler and runtime support offering graceful resilience to runtime errors. ESP-SD aims for up to two orders of magnitude of throughput and energy scaling against conventional processing (under the same platform), with application results that are reliable in a stochastic sense. That is, all mechanisms for acceleration, energy saving and reliability in ESP-SD are geared towards minimizing the "expected" error in applications, and not the worst-case error. The project has already led to one patent filing and to initial code that has been released as open source on GitHub: https://github.com/NumericalPacking/Core-Failure-Mitigation-Code |
Exploitation Route | * licensing of the technology protected via the patent filing * further collaborations for R&D activities on error-tolerant system design * influencing the research direction of the error-tolerant systems community |
Sectors | Digital/Communication/Information Technologies (including Software), Electronics |
URL | https://github.com/NumericalPacking/Core-Failure-Mitigation-Code |
Description | 1) Code from our research has been licensed to the industrial partner (IMGTEC). 2) A patent has been filed on a new concept discovered within the project (numerical entanglement). This patent has already been cited by a number of other patent filings, e.g., by Microsoft (US Patent 10,437,868 B2) and Tesla (US Patent 10,606,678 B2), amongst others. 3) Follow-on funding: i) European Commission: 750254 - Enabling Visual IoT Applications with Advanced Network Coding Algorithms (€195,454; 2018 - 2020) ii) Royal Society Leverhulme Trust Senior Research Fellowship: LTSRF1617/13/28 - Leverhulme Trust Senior Research Fellowship (£51,380; 2017 - 2018) 4) New memories that are non-volatile (like disks), yet as fast as volatile memory, are on the horizon. In collaboration with Intel, we lead the research in this disruptive new area, on how to integrate these new memories into computing systems. In early 2016, our team and collaborators from Intel made a case for making the memory controllers (the hardware interface to memory) persistent and disclosed this to Intel in April 2016. Intel released a public disclosure in September 2016 stating that Intel's memory controllers will in fact become persistent in future Intel architectures (https://software.intel.com/en-us/blogs/2016/09/12/deprecate-pcommit-instruction). |
First Year Of Impact | 2016 |
Sector | Digital/Communication/Information Technologies (including Software), Electronics |
Impact Types | Economic |
Description | Enabling Visual IoT Applications with Advanced Network Coding Algorithms |
Amount | €195,454 (EUR) |
Funding ID | 750254 |
Organisation | European Commission |
Sector | Public |
Country | European Union (EU) |
Start | 01/2018 |
End | 01/2020 |
Description | Leverhulme Trust Senior Research Fellowship |
Amount | £51,380 (GBP) |
Funding ID | LTSRF1617/13/28 |
Organisation | The Royal Society |
Department | Royal Society Leverhulme Trust Senior Research Fellowship |
Sector | Charity/Non Profit |
Country | United Kingdom |
Start | 08/2017 |
End | 09/2018 |
Title | METHOD AND APPARATUS FOR THE DETECTION OF FAULTS IN DATA COMPUTATIONS |
Description | A method and apparatus for detecting and mitigating faults in numerical computations of M input data streams is claimed (embodiments of Figure 1 and Figure 14). Such faults may occur due to circuit or processor malfunctions stemming from (but not limited to): supply voltage or current fluctuation, timing signal errors, hardware device noise, or other signalling, hardware, or software non-idealities. The invented method and apparatus for numerical entanglement linearly superimposes M input data streams to form M numerically-entangled data streams that can optionally be stored in place of the original inputs (as in the example embodiments of: Step 2 of Figure 1 and item 1054 of Figure 14). A series of operations, such as (but not limited to) scaling, additions/subtractions, inner or outer vector or matrix products, and permutations, can then be performed directly on these entangled data streams (as in the example embodiment of Step 3 of Figure 1, operator g of Figure 2, Figures 6-11, item 1053 of Figure 14). The output results are disentangled from the M entangled output streams by additions and arithmetic shifts (example embodiments of Steps 4 and 5 of Figure 1, "disentanglement and fault checking" of Figure 2, item 1056 of Figure 14). A post-computation reliability check detects processing errors affecting disentangled outputs (example embodiments of item 1056 of Figure 14, Figures 15a, 15b, 16a, 16b, 17a, 17b). |
IP Reference | WO2016034874 |
Protection | Patent application published |
Year Protection Granted | 2016 |
Licensed | No |
Impact | * several ongoing discussions on extending this discovery to applications beyond fault detection and recovery, e.g., privacy-preserving computation, where privacy stems from the fact that data computations are performed in entangled form. * potential follow-on research on applying the method to data compaction. |
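
The entangle, compute, disentangle, and check sequence described in the patent entry above can be illustrated with a small Python sketch. This is a deliberately simplified toy and not the patented construction: the shift-and-add packing, the guard width `L`, and all function names are assumptions made for illustration, and the sketch relies on Python's unbounded integers rather than fixed-width machine words. It assumes every output of the linear operation fits in `L` signed bits.

```python
# Toy illustration of entangled computation on M integer streams.
# NOT the patented scheme: the packing and all names are hypothetical.

L = 16  # assumed guard width in bits; each output must fit in L signed bits


def entangle(streams):
    """Superimpose streams: e[m][n] = (c[m-1][n] << L) + c[m][n]."""
    M = len(streams)
    return [
        [(streams[(m - 1) % M][n] << L) + streams[m][n]
         for n in range(len(streams[m]))]
        for m in range(M)
    ]


def inner_product(kernel, stream):
    """Example linear operation applied directly to an entangled stream."""
    return sum(k * x for k, x in zip(kernel, stream))


def split(word):
    """Split word = (high << L) + low, where low is a signed L-bit value."""
    half = 1 << (L - 1)
    low = ((word + half) & ((1 << L) - 1)) - half
    high = (word - low) >> L  # exact arithmetic shift
    return high, low


def disentangle_and_check(results):
    """Recover outputs and compare the two copies of each output."""
    M = len(results)
    parts = [split(r) for r in results]
    outputs = [low for _, low in parts]
    # Output m also appears as the high part of entangled result m+1.
    ok = all(parts[(m + 1) % M][0] == outputs[m] for m in range(M))
    return outputs, ok


streams = [[3, -1, 4], [1, 5, -9], [2, 6, 5]]
kernel = [2, -3, 7]
results = [inner_product(kernel, e) for e in entangle(streams)]
print(disentangle_and_check(results))      # ([37, -76, 21], True)

results[1] ^= 1 << 3                       # inject a single bit flip
print(disentangle_and_check(results)[1])   # False
```

Because the operation is linear, each entangled result carries two outputs at once, so every output is available from two different results and a mismatch between the copies flags a fault. The toy spends twice the word width per stream to achieve this; it makes no attempt to reproduce the efficiency of the published method.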