OptoCloud: Ultra-fast optically interconnected heterogeneous Data Centers

Lead Research Organisation: University College London
Department Name: Electronic and Electrical Engineering

Abstract

The majority of human activities, including transport, Internet, banking, public health and entertainment, depend on Data Centers. Cloud traffic is forecast to grow exponentially and account for 95% of global traffic. In 2015, the total power consumption of data centers worldwide was higher than the national power consumption of the UK, and it is predicted to increase up to 15-fold by 2030.

Currently, all data center networks are built on hierarchical electronic packet-switched networks; however, these cannot keep up with demand, creating an ever-increasing gap between data growth and Moore's Law. While compute node power, measured in flop/s, has increased by 65 times in the last 18 years, the node communication bandwidth has only increased by 4.8 times and the bytes communicated per flop have decreased 8 times. This creates a computation-to-communication wall, forcing applications to minimize data movement and operate locally. In addition, these systems suffer from very high median latencies, on the order of 100 microseconds, and 99.9th-percentile tail latencies, on the order of 100 ms, to the detriment of system and application performance.

The OptoCloud fellowship aims to design and build an energy-efficient, cost-effective, scalable, single-hop, nanosecond-speed optical circuit-switched network. This will interconnect heterogeneous systems made of servers, CPUs, accelerators, neuromorphic processors, memory elements and storage, supporting different parts (rack, end-of-row) and sizes of data centers (from small-to-medium size, ~10-100,000 servers, to ~1,000,000-server farms). Crucially, the network aims to offer zero data loss without in-network a) buffering, b) active switching and routing, and c) network header addressing and processing, in order to minimize complexity and consume very low power. Furthermore, the system will inherently support 1-to-1, 1-to-N, N-to-N and N-to-1 connectivity in a synchronous manner without the need for data replication for multicasting or broadcasting, which is currently not possible. This is key to supporting diverse workloads such as storage caching, large-scale database lookups, training of distributed deep neural networks, and parallel computing, which use communication primitives such as allreduce, broadcast, reduce, gather, scatter and all-to-all, among others.

To achieve this, OptoCloud will explore the fundamental challenges of sub-nanosecond optical switching, near receiver-less low-power transceivers, and nanosecond scheduling able to reconfigure circuits and shape IT and network topologies every 10s-100s of nanoseconds. It aims to offer orders-of-magnitude improvements in a) switching, b) scheduling and network topology reconfiguration, c) power consumption, d) median and tail latency, and e) throughput with zero data loss.

The PI will work with the PDRAs, PhD students, industrial partners (Microsoft, Finisar, Xilinx, Sumitomo Electric), as well as universities (Columbia and National Technical University of Athens) and form a unique compute and optical network ecosystem to methodologically answer fundamental questions while reflecting all necessary requirements on the proposed concepts, and rigorously evaluating developed technologies using industrial driven use case scenarios.

Planned Impact

The technologies proposed will provide the means for the design and implementation of a new form of computer and network architecture. This is the heterogeneous and disaggregated Data Center system where the network technologies proposed at its core will unlock its potential to deliver unparalleled modularity and performance. The proposed technologies can support the increasing data volume, diversity and unpredictability of connected computing systems while reducing power consumption and CO2 footprint. It will enable accelerated creation and innovation of new services and applications as well as solve scientific problems currently not possible due to the rigid data centre and high performance computer architecture. This will benefit everyone who uses and relies on networked technologies.

All major communication and computing stakeholders will benefit from the fellowship results.

*Technology manufacturers and vendors: The results of the fellowship will be invaluable in designing the equipment of the future to maximize performance, flexibility and programmability by benefiting from the fusion of photonic and electronic systems. The project partners Finisar, Xilinx, and Sumitomo Electric will be the most immediate beneficiaries but others will benefit.

*Data Center and High Performance Computing operators: Using the fellowship's results, heterogeneous data center architectures can be reconfigured millions of times per second to deliver maximum utilization, best serve diverse workloads and unlock the ability to perform distributed parallel computation and distributed deep neural network tasks at scale. Microsoft, a partner of this fellowship, will be a direct beneficiary, and others will follow.

*Creation of SMEs: The fellowship will stimulate the generation of future business opportunities by creating a new sector of disaggregated and heterogeneous computer and network architecture technologies. This will allow the creation of a range of SMEs that can create revenues either through the development and licensing of software/hardware function modules or delivering complete hardware solutions. The resulting business opportunities will contribute to job creation and economic prosperity.

*Impact beyond ICT: The principles, concepts and techniques developed are directly transferable to other sectors within and beyond ICT that use and benefit from highly modular computing systems. This includes, but is not limited to, 5G and beyond networks, Internet of Things, smart cities, satellite, High Performance Computing, consumer electronics, embedded and distributed systems, robotics, fundamental engineering, manufacturing, energy, automotive and health.

*Academic and research community: The fellowship can play a key role towards the realization of a new philosophy of using optical and scheduling technologies to pool together functional modules to form complete computing and network systems. This inherently creates a new research field that will stimulate fundamental rethinking on the design and operation of systems and networks as well as the creation of new programming models and application design. Columbia University and NTUA are partners, so involved directly, but the wider community will also benefit through widespread dissemination.
In collaboration with UCL's Business and industrial partners I will take advantage of existing expertise for the communication, protection and exploitation of the results.

Publications

 
Description We have developed a disruptive way to replace all electronic packet switches in Cloud Data Centers, High Performance Systems and Machine Learning systems with fast optical circuit switches. This increases network performance by 20 times and reduces power consumption by 40 times.
Exploitation Route We have filed two patents and are exploring the possibility of spinning out a company to commercialise the research.
Sectors Digital/Communication/Information Technologies (including Software)

 
Description We have filed two patents and we are currently exploring the possibility of spinning out a company to exploit and commercialise the research.
First Year Of Impact 2023
Sector Digital/Communication/Information Technologies (including Software)
 
Description Distributed Quantum Computing and Applications
Amount £3,049,365 (GBP)
Funding ID EP/W032643/1 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 04/2022 
End 03/2026
 
Description Dynamos - DYNAMIC AND RECONFIGURABLE DATA CENTRE NETWORKS WITH MODULAR OPTICAL SUBSYSTEMS
Amount £5,400,000 (GBP)
Funding ID 10038802 
Organisation Innovate UK 
Sector Public
Country United Kingdom
Start 08/2022 
End 07/2026
 
Description The quantum data centre of the future
Amount £8,918,816 (GBP)
Funding ID 10004793 
Organisation Innovate UK 
Sector Public
Country United Kingdom
Start 03/2022 
End 02/2025
 
Title TrafPy 
Description Tool to generate network traffic data for reproducibility purposes. 
Type Of Material Data analysis technique 
Year Produced 2021 
Provided To Others? Yes  
Impact Used to conduct research across a team of people. 
URL https://github.com/cwfparsonson/trafpy
 
Description Huber+Suhner Polatis 
Organisation Polatis
Country United Kingdom 
Sector Private 
PI Contribution Designed multi-core fibre switch in 2019-2020. Designed and demonstrated Data Center network using optical switches during the same period. Polatis is a project partner.
Collaborator Contribution Provided optical switches and some parameters of the switch constraints.
Impact Numerous papers in top conferences and journals.
Start Year 2007
 
Description Microsoft Collaboration on Distributed Deep Learning 
Organisation Microsoft Research
Department Microsoft Research Cambridge
Country United Kingdom 
Sector Private 
PI Contribution Work on developing optical switched interconnects and analytical models to design and operate AI-based computing systems.
Collaborator Contribution Information on the Cloud provider requirements and processor profiler.
Impact No outputs yet. We are working on a potential patent and research paper.
Start Year 2019
 
Description Optical switching and networking for Quantum and Classical Data Centres 
Organisation BT Group
Department BT Research
Country United Kingdom 
Sector Private 
PI Contribution This is EPSRC iCASE funding under which we develop optical switching technologies for quantum computing.
Collaborator Contribution Input of industrial requirements and specifications.
Impact None yet.
Start Year 2021
 
Description Sumitomo Electric on Multi-Core Fibre networks 
Organisation Sumitomo Corporation
Country Japan 
Sector Private 
PI Contribution We extensively characterized a multi-core fibre and modelled its behaviour.
Collaborator Contribution They provided 4 spools of multi-core fibre.
Impact Published joint paper.
Start Year 2018
 
Description Xilinx 
Organisation Xilinx Research
Country United States 
Sector Private 
PI Contribution Provided insight on technologies pioneered by my researchers.
Collaborator Contribution Xilinx Labs hosted two of my researchers in San Jose for one week and provided in depth training and knowledge on their latest solutions under NDA. They provided access to latest software tools and aim also to donate up to 2 high performance development platforms.
Impact Expanded the collaboration to other Xilinx research labs that are geographically closer and have strong interests in our research. The Xilinx labs in Dublin, Ireland were keen to support an H2020 ITN research proposal in which I participated.
Start Year 2014
 
Title MPI operations 
Description Collective operations (scatter-reduce, all-gather, all-reduce, broadcast, all-to-all, etc.) among computing nodes that minimize the number of communication steps to just four. Inventors: Georgios Zervas, Alessandro Ottino, Joshua Benjamin 
IP Reference 2217578.0 
Protection Patent / Patent application
Year Protection Granted 2022
Licensed No
Impact The collective operations can speed up parallel and distributed tasks by 10 times. The network overhead can be reduced from 95% to less than 1%, increasing the computational efficiency from 10-20% up to 95%.
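
For readers unfamiliar with the collective operations named in this record, the following is a minimal sketch of their semantics only. It assumes the textbook decomposition of all-reduce into a scatter-reduce step followed by an all-gather step, simulated in a single process with plain Python lists. The function names and the one-element-per-chunk layout are illustrative assumptions; this is not the patented four-step method and it says nothing about the optical network that would carry the traffic.

```python
from typing import List


def reduce_scatter(buffers: List[List[float]]) -> List[float]:
    """Scatter-reduce: node i ends up owning the sum of element i from every node.

    Assumption: each of the n nodes holds exactly n elements (one per chunk).
    """
    n = len(buffers)
    if any(len(buf) != n for buf in buffers):
        raise ValueError("each node must hold exactly one element per node")
    return [sum(buffers[src][i] for src in range(n)) for i in range(n)]


def all_gather(chunks: List[float]) -> List[List[float]]:
    """All-gather: every node receives the chunk owned by every other node."""
    return [list(chunks) for _ in range(len(chunks))]


def all_reduce(buffers: List[List[float]]) -> List[List[float]]:
    """All-reduce expressed as scatter-reduce followed by all-gather."""
    return all_gather(reduce_scatter(buffers))


if __name__ == "__main__":
    # Three hypothetical nodes, each holding a three-element vector.
    nodes = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
    print(all_reduce(nodes))
```

After the all-reduce, every node holds the same element-wise sum of all input vectors, which is the property that distributed training and other parallel workloads rely on.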
 
Title Methods and apparatus for optical fibre design and production 
Description The present technique relates to the field of design and production of multi-core optical fibres. Multi-core fibres can provide significantly improved capacity relative to single-core fibres. However, the design parameters (for example the composition, number and geometry of the fibre cores) and corresponding transmission properties (for example signal-to-noise ratio and level of crosstalk between cores) relate to each other in many nonlinear ways, both directly and indirectly. The design of a multi-core fibre is thus complex. Some methods for fibre design utilise a "brute force" approach, for example by modelling a large number of combinations of design parameters. However, this is inefficient, and can lead to optical fibres with suboptimal transmission properties. There is thus a desire for improved methods and apparatus for designing and producing multi-core optical fibres. 
IP Reference  
Protection Patent application published
Year Protection Granted 2021
Licensed No
Impact Led to early commercialization funding and collaboration with fibre manufacturers on improving the design tool.
 
Title Network Architecture 
Description Modular optical network architectures for data center networks for higher performance and very low power networking. 
IP Reference GB2217579.8 
Protection Patent / Patent application
Year Protection Granted 2022
Licensed No
Impact It can eliminate the use of electronic switching. It leads to a 40-times reduction in power consumption and a 20-times increase in network performance.
 
Title PID tuning 
Description A one-shot, offline, reinforcement learning method to identify optimal PID parameters of N^2 piezo-electric actuators of a beam steering free space optical switch. The inventors are Georgios Zervas and Zacharaya Zhabka. 
IP Reference GB2210433.5 
Protection Patent / Patent application
Year Protection Granted 2022
Licensed Yes
Impact We have licensed the IP to Huber Suhner Polatis. Polatis has already used it to improve the performance of their switches in terms of a) increased switching speed, b) increased resilience to thermal effects, c) lower insertion loss as well as significantly increased fabrication yield and manufacturing efficiency.
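
As background on what the "PID parameters" in this record are, the following is a minimal sketch of a discrete PID controller driving a single simulated actuator. Everything in it is an illustrative assumption: the first-order plant model, the gain values and the names `PID` and `simulate` are hypothetical, and the sketch does not implement the reinforcement learning tuning method or model a real piezo-electric actuator.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PID:
    """Discrete PID controller; kp, ki and kd are the parameters a tuner would choose."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: Optional[float] = None

    def step(self, setpoint: float, measured: float, dt: float) -> float:
        error = setpoint - measured
        self.integral += error * dt
        # No derivative term on the first sample, to avoid an artificial kick.
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(pid: PID, setpoint: float = 1.0, dt: float = 0.01, steps: int = 2000) -> float:
    """Drive a hypothetical first-order actuator (dx/dt = drive - x) and return its final position."""
    position = 0.0
    for _ in range(steps):
        drive = pid.step(setpoint, position, dt)
        position += dt * (drive - position)
    return position


if __name__ == "__main__":
    # Hypothetical gains, chosen by hand for this toy plant only.
    print(simulate(PID(kp=2.0, ki=5.0, kd=0.05)))
```

In a switch with N^2 actuators, each actuator would need its own set of three gains, which is why an automated tuning method is attractive compared with manual adjustment.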
 
Description Invited talk at STW2021 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Industry/Business
Results and Impact Invited talk at STW2021, a Huawei-organized conference. I delivered a talk on published work on sub-nanosecond optical switching for data centers and high performance computing.
Year(s) Of Engagement Activity 2021
 
Description Invited talk at TOP Conference 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Industry/Business
Results and Impact I was invited to present the OptoCloud Fellowship program. Over 100 people from the UK and abroad attended, spanning the telecom/datacom industries as well as academic and research institutions. There was a lot of interest in my work, and numerous meetings were arranged to discuss collaboration and potential exploitation paths for the work.
Year(s) Of Engagement Activity 2022
URL https://topconference.com/
 
Description Poster presentation on multi-core fiber design using artificial intelligence and machine learning 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Industry/Business
Results and Impact The presentation covered our work on AI/ML methods and the design of novel multi-core fibres that can increase the bandwidth density and capacity of optical fiber interconnects in cloud data center networks.
Year(s) Of Engagement Activity 2022
URL https://topconference.com/
 
Description Poster presentation on optical networks for distributed machine learning systems. 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Industry/Business
Results and Impact Poster presentation of work related to optical networking and collective operations for parallel and distributed computing including machine learning systems.
Year(s) Of Engagement Activity 2022
URL https://topconference.com/
 
Description Poster presentation on ultra-fast hardware based control of optical circuit switching for cloud data centers 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Industry/Business
Results and Impact Poster presentation on TDM/WDM scheduling using hardware-based methods for large-scale cloud data center systems.
Year(s) Of Engagement Activity 2022
URL https://topconference.com/
 
Description Talk/seminar on fast optical switching for intra-satellite communications. 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Industry/Business
Results and Impact I delivered a seminar on optical switching technologies we developed for cloud data centers and how these can be used to support the networking requirements of satellites for low Earth orbit internet applications.
Year(s) Of Engagement Activity 2021