# Sublinear Algorithms for Big Graphs

Lead Research Organisation:
University of Warwick

Department Name: Computer Science

### Abstract

A fundamental task in the study of large networks is to efficiently analyze their structural properties. For example, we may want to know if a network is well-connected, is well-clusterable, has many copies (instances) of some specific sub-structures, etc. Given that modern networks are large, often consisting of millions and billions of nodes (web graph, social networks, etc.), the task of analyzing their structure has become recently increasingly challenging, and the running-time efficiency of this task is becoming of critical importance. Indeed, being able to quickly analyze important features in the gigantic amount of information already is a key technology of a multi-billion dollar industry (see, e.g., Google, Yahoo, Facebook, etc.) and its significance will likely increase further in the near future. To efficiently manage and analyze large networks, in recent years we have seen a rise of the importance of sublinear time algorithms, that is, algorithms that use significantly less resources than the input size. Since sublinear algorithms can process only a small fraction of the input, they are not suitable for many applications, especially if exact solutions are sought; but recently we have seen a number of sublinear algorithms that compute approximate solutions for a variety of optimization and decision problems arising in such diverse areas as algebraic computations, networks, geometry, and computer graphics.

To cope with these modern challenges of large networks, this proposal will exploit the expertise of the PI in the area of randomized algorithms to develop new algorithmic techniques for the analysis of big graphs by making significant advances in the area of sublinear algorithms for combinatorial problems. The central goal is to push forward the barriers of our knowledge in the area of sublinear-time algorithms for graph problems by enlarging the class of problems for which sublinear-time are known and by characterizing problems for which sublinear-time algorithms are impossible to exist. The main technical goal is to develop algorithmic technology for the analysis of large graphs in the context of two central models: property testing algorithms and sublinear-time approximation algorithms. We plan also to apply the techniques developed to further models related to sublinear algorithms: data streaming, dynamic and online algorithms. Our objective is to attack grand challenges in the area of sublinear algorithms and we aim to make major advances.

The focus of this project is on fundamental research in this area, aiming at advances in the area of theoretical aspects of the design and analysis of algorithms.

To cope with these modern challenges of large networks, this proposal will exploit the expertise of the PI in the area of randomized algorithms to develop new algorithmic techniques for the analysis of big graphs by making significant advances in the area of sublinear algorithms for combinatorial problems. The central goal is to push forward the barriers of our knowledge in the area of sublinear-time algorithms for graph problems by enlarging the class of problems for which sublinear-time are known and by characterizing problems for which sublinear-time algorithms are impossible to exist. The main technical goal is to develop algorithmic technology for the analysis of large graphs in the context of two central models: property testing algorithms and sublinear-time approximation algorithms. We plan also to apply the techniques developed to further models related to sublinear algorithms: data streaming, dynamic and online algorithms. Our objective is to attack grand challenges in the area of sublinear algorithms and we aim to make major advances.

The focus of this project is on fundamental research in this area, aiming at advances in the area of theoretical aspects of the design and analysis of algorithms.

### Planned Impact

The research proposed in this project, while of fundamental and theoretical character has a clear potential to impact the area of algorithms, and with this it has a long term potential to impact many areas of modern technologies.

To cope with modern requirements of big data it is of critical importance to advance the foundational research in algorithms, especially for algorithms that use computational resources significantly less than the input size - sublinear algorithms. By developing new algorithmic techniques for the analysis of big graphs and making significant advances in the area of sublinear algorithms for graph problems, the project is addressing one of the central challenges of modern computing - of computations and algorithms with limited resources - and is advancing the area of algorithms whose importance for modern technologies has been distinctly emphasized in various venues.

Interaction with industry:

The area of the proposal-design and analysis of sublinear algorithms for big graphs and networks-is of significant interest in several industrial research laboratories in North America, e.g., Microsoft, Google, Yahoo, IBM. While many of most important ideas in the area of sublinear algorithms and their big data applications originated in academia, more recently we have seen an increasing number of algorithmic developments in big high-tech companies. The PI has direct contacts with many leading researchers from main research laboratories in North America and he plans to enhance the collaboration with leading research labs with the goal of making further scientific advances in the field and also streamlining the impact of this research.

A significant part of the interaction with industry, especially in the context of the UK industry, is planned to be facilitated through interdisciplinary research centers at the University of Warwick. The PI plans to be involved in knowledge transfer activities in Warwick Data Science Institute, Warwick Institute for the Science of Cities, and Maths for Real-World Systems CDT. These Warwick research and training centers have close links to industry in the areas related to the project and the findings of the project will enhance this interaction.

The PI plans also to actively contribute to the interdisciplinary activities in the recently established Alan Turing Institute for Data Science in London that should facilitate close interaction of researchers in big data with leading industry.

Dissemination:

The results will be disseminated by publications in high quality international journals and conferences, and all papers will be published in open access venues. The PI will also maintain a web page with main activities of the project.

Besides publishing, the PI will give presentations about the results of this project in leading industrial research laboratories. The PI plans to visit, make presentations, and discuss further knowledge transfer activities in Microsoft Research Labs, IBM Research Center, Yahoo Labs, Google Research, BT Labs, etc.

Further impact will be achieved through national and international conferences. The PI has organized in the past several international workshops in the areas related to this project and those have been very well attended by industrial participants. The PI plans to organize further 2-3 workshops on sublinear algorithms and their applications, possibly going beyond this topic to a more general study of foundations of big data. These events would bring together leading academics and industrial researchers, system architects working in this area. To achieve a greater impact of these activities, the PI requests from EPSRC funding to organize one of these events, where he will invite leading academics, industrial researchers, system architects, and practitioners to discuss the advances of the design of efficient (sublinear) algorithms for big data problems with the special focus on the challenges of incorporating these advances in industry.

To cope with modern requirements of big data it is of critical importance to advance the foundational research in algorithms, especially for algorithms that use computational resources significantly less than the input size - sublinear algorithms. By developing new algorithmic techniques for the analysis of big graphs and making significant advances in the area of sublinear algorithms for graph problems, the project is addressing one of the central challenges of modern computing - of computations and algorithms with limited resources - and is advancing the area of algorithms whose importance for modern technologies has been distinctly emphasized in various venues.

Interaction with industry:

The area of the proposal-design and analysis of sublinear algorithms for big graphs and networks-is of significant interest in several industrial research laboratories in North America, e.g., Microsoft, Google, Yahoo, IBM. While many of most important ideas in the area of sublinear algorithms and their big data applications originated in academia, more recently we have seen an increasing number of algorithmic developments in big high-tech companies. The PI has direct contacts with many leading researchers from main research laboratories in North America and he plans to enhance the collaboration with leading research labs with the goal of making further scientific advances in the field and also streamlining the impact of this research.

A significant part of the interaction with industry, especially in the context of the UK industry, is planned to be facilitated through interdisciplinary research centers at the University of Warwick. The PI plans to be involved in knowledge transfer activities in Warwick Data Science Institute, Warwick Institute for the Science of Cities, and Maths for Real-World Systems CDT. These Warwick research and training centers have close links to industry in the areas related to the project and the findings of the project will enhance this interaction.

The PI plans also to actively contribute to the interdisciplinary activities in the recently established Alan Turing Institute for Data Science in London that should facilitate close interaction of researchers in big data with leading industry.

Dissemination:

The results will be disseminated by publications in high quality international journals and conferences, and all papers will be published in open access venues. The PI will also maintain a web page with main activities of the project.

Besides publishing, the PI will give presentations about the results of this project in leading industrial research laboratories. The PI plans to visit, make presentations, and discuss further knowledge transfer activities in Microsoft Research Labs, IBM Research Center, Yahoo Labs, Google Research, BT Labs, etc.

Further impact will be achieved through national and international conferences. The PI has organized in the past several international workshops in the areas related to this project and those have been very well attended by industrial participants. The PI plans to organize further 2-3 workshops on sublinear algorithms and their applications, possibly going beyond this topic to a more general study of foundations of big data. These events would bring together leading academics and industrial researchers, system architects working in this area. To achieve a greater impact of these activities, the PI requests from EPSRC funding to organize one of these events, where he will invite leading academics, industrial researchers, system architects, and practitioners to discuss the advances of the design of efficient (sublinear) algorithms for big data problems with the special focus on the challenges of incorporating these advances in industry.

## People |
## ORCID iD |

Artur Czumaj (Principal Investigator) |

### Publications

Czumaj A
(2016)

*Faster Deterministic Communication in Radio Networks*
Czumaj A
(2016)

*Communicating with Beeps*
Czumaj A
(2017)

*Encyclopedia of Machine Learning and Data Mining*
HalldÃ³rsson M
(2017)

*Computing large independent sets in a single round*in Distributed Computing
Cormode G
(2017)

*Independent Set Size Approximation in Graph Streams*
Konrad C
(2017)

*Preemptive Online Partitioning of Sequences*
HalldÃ³rsson M
(2017)

*Structural Information and Communication Complexity*Description | The project has been focusing on the theoretical study of sublinear algorithms for graphs. While the project is still in the progress, the project has already made some important contributions to the area of sublinear algorithms. (There are further a few papers in preparations that we hope will address several central questions posed in the original proposal; they will we reported in the next submission round.) In STOC'2016 paper (A. Czumaj, P. Peng, and C. Sohler. Relating Two Property Testing Models for Bounded Degree Directed Graphs. In Proceedings of the 48th Annual ACM Symposium on Theory of Computing (STOC 2016), pages 1033 - 1045, 2016), following one of the core topics in the project, we have studied the complexity of testing properties in directed graphs (digraphs) with bounded maximum indegree and maximum outdegree. For directed graphs with bounded degree, there are two different models in property testing introduced by Bender and Ron (2002). In the bidirectional model, one can access both incoming and outgoing edges while in the unidirectional model one can only access outgoing edges (think about a web graph, where one can only follow the links, but one cannot go back to the page pointing to a given page). In the STOC'2016 paper, we provide a new relation between the two models: we prove that if a property can be tested with constant query complexity in the bidirectional model, then it can be tested with sublinear query complexity in the unidirectional model. A corollary of this result is that in the unidirectional model (the model allowing only queries to the outgoing neighbors), every property in planar digraphs (or even in hyperfinite digraphs) is testable with sublinear query complexity. Not only no such results have been known before, but in fact, essentially almost no sublinear-time results have been known for the very natural unidirectional model. Some extensions of this work are work in progress, where we expect to publish them within the next year. I'm very happy with the outcome of the FOCS'2019 paper (A. Czumaj and C. Sohler. A Characterization of Graph Properties Testable for General Planar Graphs with One-sided Error (It's all about forbidden subgraphs). In Proceedings of the 60th IEEE Symposium on Foundations of Computer Science (FOCS 2019), pages 1513 - 1536, Baltimore, MD, November 9 - 12, 2019. For a full version, see CoRR abs/1909.10647.), which provides a new characterization result for constant-time result for testing graph properties in planar graphs without any degree bounds (arbitrary planar graphs). Further, the very recent SODA'2020 paper (A. Czumaj and C. Sohler. Sublinear Time Approximation of the Cost of a Metric k-nearest Neighbor Graph. In Proceedings of the 31st ACM-SIAM Symposium on Discrete Algorithms (SODA 2020), pages 2973 - 2992, Salt Lake City, UT, January 5 - 8, 2020; cf. https://epubs.siam.org/doi/abs/10.1137/1.9781611975994.180) studies the problem of approximating the cost of the k-nearest neighbor (k-NN) graph for a given metric space (X, d) on n points. The task of analysing the structure of k_NN has received a lot of attention in the community, and the SODA'2020 paper presents two sublinear-time algorithms that compute an estimate of k-NN with a small approximation error. Further, we complement these results with near matching lower bounds. Several other papers have been studying the impact and relationship between sublinear algorithms and distributed and parallel computations. In STOC'2018 paper (A. Czumaj, J. Lacki, A. Madry, S. Mitrovic, K. Onak, and P. Sankowski. Round Compression for Parallel Matching Algorithms), we have been studying the impact of speeding up computations in parallel environment where individual machines have low local memory but they can run quickly their own computations, and the main challenge is to ensure good coordination between the machines. In particular, this work shows that one can approximate the size of maximum matching in parallel environment (for example, in MapReduce setting) in o(log n) round (in fact, in O((loglog n)^2)) even if each parallel machine has sublinear memory available. This work is considered a major breakthrough in the area (and have already several follow up works). In a series of papers, we have been also studying several fundamental distributed computing problems for basic communication primitives, leading to optimal speed up (papers with Davies, DISC'2018, PODC'2017, PODC'2016, ICALP'2016 and wih Konrad at DISC'2018). Further works include studies the fundamental problem of estimating the size of independent sets in a graph defined by a stream of edges, using only limited space (G. Cormode, J. Dark, and C. Konrad). An interesting setting of sublinear computations for texts has been also studied in a STOC'2019 paper (Dominik Kempa, Tomasz Kociumaka. String synchronizing sets: sublinear-time BWT construction and optimal LCE data structure. STOC 2019: 756-767) which presents the first algorithm that breaks the O(n)-time barrier for the Burrows-Wheeler Transform (BWT) construction. An important activity of the project is a workshop organised in Warwick in March 2018 - Workshop on Data Summarization (co-organized with Graham Cormode) - sponsored by this EPSRC project - that brought together a number of UK and international researchers and practitioners working in various areas of algorithms for data summarization (sublinear algorithms, property testing, streaming, sampling, sketching, and more). URL: https://warwick.ac.uk/dcs/research/focs/conf2017/. Further, in Autumn 2018 the PI spent 1 months attending the Simons Institute Special Semester on Foundations of Data Science at Simons Institute for the Theory of Computing, UC Berkeley (https://simons.berkeley.edu/programs/datascience2018). During this stay the PI has been actively working with world leading scientists in the area of the project. Among the central activities of the Semester at the Simons Institute, I have been involved in organising Workshop on Sublinear Algorithms and Nearest-Neighbor Search, November 27 - 30, 2018 (co-organized with Robert Krauthgamer (Weizmann Institute), Aarti Singh (Carnegie Mellon University), and Rachel A. Ward (University of Texas at Austin)). Regarding unpublished works: I have completed (now in submission) a very nice work on the connection between property testing and low-space streaming algorithms (joint work with Hendrik Fichtenberger, Pan Peng, and Christian Sohler), available at https://arxiv.org/abs/1905.01644. |

Exploitation Route | Sublinear algorithms, property testing and streaming algorithms provide a potentially interesting framework to design super-fast algorithms. The results in this area, including some of the results studied in this project, demonstrated that some fundamental combinatorial problems on graphs and networks can be solved very quickly. These ideas are frequently taken on by research laboratories, including leading research labs in Google, Yahoo, Microsoft, to transfer these findings into more applied products. In particular, the Workshop on Data Summarization sponsored by this project, has brought together several renowned experts from some research labs to discuss about the funding associated with the project, and to discuss further advances in this area. |

Sectors | Digital/Communication/Information Technologies (including Software),Other |