Sublinear Algorithms for Big Graphs

Lead Research Organisation: University of Warwick
Department Name: Computer Science

Abstract

A fundamental task in the study of large networks is to efficiently analyze their structural properties. For example, we may want to know whether a network is well-connected, is well-clusterable, contains many copies (instances) of some specific sub-structures, etc. Since modern networks are large, often consisting of millions or even billions of nodes (web graphs, social networks, etc.), the task of analyzing their structure has recently become increasingly challenging, and the running-time efficiency of this task is of critical importance. Indeed, the ability to quickly analyze important features in gigantic amounts of information is already a key technology of a multi-billion dollar industry (see, e.g., Google, Yahoo, Facebook), and its significance is likely to increase further in the near future. To efficiently manage and analyze large networks, recent years have seen a rise in the importance of sublinear time algorithms, that is, algorithms that use significantly fewer resources than the input size. Since sublinear algorithms can process only a small fraction of the input, they are not suitable for many applications, especially if exact solutions are sought; but recently we have seen a number of sublinear algorithms that compute approximate solutions for a variety of optimization and decision problems arising in such diverse areas as algebraic computations, networks, geometry, and computer graphics.

To cope with these modern challenges of large networks, this proposal will exploit the PI's expertise in the area of randomized algorithms to develop new algorithmic techniques for the analysis of big graphs, making significant advances in the area of sublinear algorithms for combinatorial problems. The central goal is to push forward the barriers of our knowledge in the area of sublinear-time algorithms for graph problems by enlarging the class of problems for which sublinear-time algorithms are known and by characterizing problems for which no sublinear-time algorithms can exist. The main technical goal is to develop algorithmic technology for the analysis of large graphs in the context of two central models: property testing algorithms and sublinear-time approximation algorithms. We also plan to apply the techniques developed to further models related to sublinear algorithms: data streaming, dynamic, and online algorithms. Our objective is to attack grand challenges in the area of sublinear algorithms, and we aim to make major advances.

The focus of this project is on fundamental research in this area, aiming at advances in the area of theoretical aspects of the design and analysis of algorithms.

Planned Impact

The research proposed in this project, while of fundamental and theoretical character, has a clear potential to impact the area of algorithms, and with this a long-term potential to impact many areas of modern technologies.
To cope with the modern requirements of big data, it is of critical importance to advance foundational research in algorithms, especially sublinear algorithms - algorithms that use computational resources significantly smaller than the input size. By developing new algorithmic techniques for the analysis of big graphs and making significant advances in the area of sublinear algorithms for graph problems, the project addresses one of the central challenges of modern computing - computation with limited resources - and advances an area of algorithms whose importance for modern technologies has been distinctly emphasized in various venues.

Interaction with industry:
The area of the proposal - the design and analysis of sublinear algorithms for big graphs and networks - is of significant interest to several industrial research laboratories in North America, e.g., Microsoft, Google, Yahoo, and IBM. While many of the most important ideas in the area of sublinear algorithms and their big data applications originated in academia, more recently we have seen an increasing number of algorithmic developments in big high-tech companies. The PI has direct contacts with many leading researchers from the main research laboratories in North America, and he plans to enhance the collaboration with leading research labs with the goal of making further scientific advances in the field and also streamlining the impact of this research.
A significant part of the interaction with industry, especially in the context of UK industry, is planned to be facilitated through interdisciplinary research centers at the University of Warwick. The PI plans to be involved in knowledge transfer activities in the Warwick Data Science Institute, the Warwick Institute for the Science of Cities, and the Maths for Real-World Systems CDT. These Warwick research and training centers have close links to industry in areas related to the project, and the findings of the project will enhance this interaction.
The PI also plans to actively contribute to the interdisciplinary activities of the recently established Alan Turing Institute for Data Science in London, which should facilitate close interaction between researchers in big data and leading industry.

Dissemination:
The results will be disseminated by publications in high quality international journals and conferences, and all papers will be published in open access venues. The PI will also maintain a web page with main activities of the project.
Besides publishing, the PI will give presentations about the results of this project in leading industrial research laboratories. The PI plans to visit, make presentations, and discuss further knowledge transfer activities in Microsoft Research Labs, IBM Research Center, Yahoo Labs, Google Research, BT Labs, etc.
Further impact will be achieved through national and international conferences. The PI has organized several international workshops in the past in areas related to this project, and those have been very well attended by industrial participants. The PI plans to organize a further 2-3 workshops on sublinear algorithms and their applications, possibly going beyond this topic to a more general study of the foundations of big data. These events would bring together leading academics, industrial researchers, and system architects working in this area. To achieve a greater impact of these activities, the PI requests funding from EPSRC to organize one of these events, where he will invite leading academics, industrial researchers, system architects, and practitioners to discuss advances in the design of efficient (sublinear) algorithms for big data problems, with a special focus on the challenges of incorporating these advances in industry.

Publications

Chlamtác E (2018) The Densest $k$-Subhypergraph Problem in SIAM Journal on Discrete Mathematics


Czumaj A (2019) Communicating with beeps in Journal of Parallel and Distributed Computing

 
Description The project has focused on the theoretical study of sublinear algorithms for graphs. The project has made important contributions to the area of sublinear algorithms, and also to applications of this area to other related topics.

The FOCS'2019 paper (A. Czumaj and C. Sohler. A Characterization of Graph Properties Testable for General Planar Graphs with One-sided Error (It's all about forbidden subgraphs). In Proceedings of the 60th FOCS 2019, pages 1513-1536, 2019. For a full version, see CoRR abs/1909.10647.) studies the central topic of the proposal: the complexity of testing graph properties in graphs without any bounds on the degrees. In property testing, after many years of study, we have a good understanding of the complexity of testing various graph properties either for "dense" graphs or for bounded-degree graphs. However, our understanding of the complexity of testing graph properties was almost non-existent for sparse graphs without any upper bounds on vertex degrees. The FOCS'2019 paper makes an important advance on this topic and provides a characterization of constant-time algorithms for testing graph properties in planar graphs without any degree bounds (arbitrary planar graphs). The result gives a sufficient and necessary condition; informally, every property that can be tested in arbitrary planar graphs with a constant number of queries can be described as the property of determining whether the input graph is H-free for some fixed constant-size graph H.
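As a toy illustration of one-sided-error subgraph-freeness testing (not the FOCS'2019 algorithm itself), the following sketch tests triangle-freeness (H = K3) by sampling edges: it rejects only when it actually finds a forbidden subgraph, so it never errs on graphs that have the property. The function name and sample size are illustrative assumptions.

```python
import random

def is_triangle_free_sample(adj, samples=500, rng=None):
    """One-sided-error sketch: repeatedly sample a random edge (u, v)
    and check whether its endpoints share a common neighbour, which
    would close a triangle. A triangle-free graph is never rejected;
    a graph far from triangle-free is rejected with good probability
    for a suitable number of samples."""
    rng = rng or random.Random(0)
    nodes = [v for v in adj if adj[v]]   # vertices with at least one edge
    if not nodes:
        return True
    for _ in range(samples):
        u = rng.choice(nodes)
        v = rng.choice(adj[u])
        if set(adj[u]) & set(adj[v]):    # common neighbour => triangle found
            return False                  # reject: a copy of H = K3 exists
    return True                           # accept (possibly erroneously)
```

The one-sided-error guarantee comes for free here: the tester rejects only on a witness, mirroring the "forbidden subgraph" characterization described above.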

In the STOC'2016 paper (A. Czumaj, P. Peng, and C. Sohler. Relating Two Property Testing Models for Bounded Degree Directed Graphs. In Proceedings of the 48th STOC 2016, pages 1033-1045, 2016), following one of the core topics of the project, we study the complexity of testing properties of directed graphs (digraphs) with bounded maximum indegree and maximum outdegree. For directed graphs with bounded degree, there are two different property testing models, introduced by Bender and Ron (2002). In the bidirectional model one can access both incoming and outgoing edges, while in the unidirectional model one can only access outgoing edges (think of a web graph, where one can follow the links, but one cannot go back to the pages pointing to a given page). In the STOC'2016 paper, we provide a new relation between the two models: we prove that if a property can be tested with constant query complexity in the bidirectional model, then it can be tested with sublinear query complexity in the unidirectional model. A corollary of this result is that in the unidirectional model (the model allowing only queries to the outgoing neighbors), every property of planar digraphs (or even of hyperfinite digraphs) is testable with sublinear query complexity. Not only were no such results known before, but in fact, essentially no sublinear-time results had been known for the very natural unidirectional model.

The SODA'2020 paper (A. Czumaj and C. Sohler. Sublinear Time Approximation of the Cost of a Metric k-nearest Neighbor Graph. In Proceedings of the 31st ACM-SIAM SODA 2020, pages 2973-2992, 2020; cf. https://epubs.siam.org/doi/abs/10.1137/1.9781611975994.180) advances our understanding of sublinear-time algorithms for important optimization problems. It studies the fundamental problem of approximating the cost of the k-nearest neighbor (k-NN) graph of a given metric space (X, d) on n points. The task of analysing the structure of the k-NN graph has received a lot of attention in the community, and the SODA'2020 paper presents two sublinear-time algorithms that compute an estimate of the k-NN cost with a small approximation error. Further, we complement these results with nearly matching lower bounds.
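To illustrate the flavour of the estimation task, here is a naive Monte Carlo baseline (not the sublinear-time SODA'2020 algorithm: each sampled point still inspects all n distances): sample a few points, compute each one's exact sum of distances to its k nearest neighbours, and rescale. All names and parameters are illustrative assumptions.

```python
import random

def knn_cost_estimate(points, k, dist, sample_size=50, rng=None):
    """Estimate the total cost of the k-NN graph of a metric space:
    the sum, over all points, of distances to each point's k nearest
    neighbours. Samples `sample_size` points uniformly with
    replacement and rescales the empirical mean by n."""
    rng = rng or random.Random(0)
    n = len(points)
    total = 0.0
    for _ in range(sample_size):
        i = rng.randrange(n)
        # Exact k-NN cost of the sampled point (O(n) distance queries).
        ds = sorted(dist(points[i], points[j]) for j in range(n) if j != i)
        total += sum(ds[:k])
    return n * total / sample_size
```

The variance of such an estimator depends on how unevenly the per-point k-NN costs are distributed; circumventing this with sublinear query complexity is precisely the difficulty the paper addresses.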

A central topic of this project, establishing close links between fast property testing algorithms and more conventional algorithms, is studied in the RANDOM'2020 paper (Artur Czumaj, Hendrik Fichtenberger, Pan Peng, Christian Sohler. Testable Properties in General Graphs and Random Order Streaming. In Proceedings of the 24th RANDOM 2020, pages 16:1-16:20, 2020. Also in CoRR abs/1905.01644.). The paper presents a novel framework closely linking property testing and data streaming in the setting of general graphs, in the context of constant-query-complexity testing and constant-space streaming. The main result is a generic transformation of a one-sided error property tester in the random-neighbor model with constant query complexity into a one-sided error property tester in the streaming model with constant space complexity. Previously such a generic transformation was only known for bounded-degree graphs.

Several other papers have studied the impact of, and relationship between, sublinear algorithms and distributed and parallel computations. In the STOC'2018 paper (A. Czumaj, J. Lacki, A. Madry, S. Mitrovic, K. Onak, and P. Sankowski. Round Compression for Parallel Matching Algorithms, pages 471-484, 2018; also in SIAM Journal on Computing, 49(5), 2020), we study how to speed up computations in a parallel environment where individual machines have low local memory but can quickly run their own computations, and where the main challenge is to ensure good coordination between the machines. In particular, this work shows that one can approximate the size of a maximum matching in a parallel environment (for example, in the MapReduce setting) in o(log n) rounds (in fact, in O((loglog n)^2) rounds), even if each parallel machine has only sublinear memory available. This work is considered a major breakthrough in the area and has already inspired several follow-up works.
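For context on what is being approximated, here is the classical sequential greedy baseline for maximal matching (not the MPC round-compression algorithm; names are illustrative). A maximal matching has at least half the size of a maximum matching, so its size is a 2-approximation.

```python
def greedy_maximal_matching(edges):
    """Scan the edges once and keep any edge whose endpoints are both
    still unmatched. The result is maximal: no remaining edge can be
    added without sharing an endpoint with a chosen edge."""
    matched = set()
    matching = []
    for u, v in edges:
        if u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    return matching
```

The challenge addressed by the STOC'2018 paper is emulating this kind of global sequential scan in few parallel rounds when no single machine can hold the whole graph.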

Our next papers, at SPAA'2020 (A. Czumaj, P. Davies, and M. Parter. Graph Sparsification for Derandomizing Massively Parallel Computation with Low Space. In Proceedings of the 32nd ACM SPAA, pages 175-185, 2020. Also in CoRR abs/1912.05390.) and PODC'2020 (A. Czumaj, P. Davies, and M. Parter. Simple, Deterministic, Constant-round Coloring in the CONGESTED CLIQUE. In Proceedings of the 39th ACM PODC, pages 309-318, 2020. Also in CoRR abs/2009.06043.), extend this line of research and develop new techniques to design very fast deterministic algorithms for the model of Massive Parallel Computations and for the classical distributed model of CONGESTED CLIQUE. The highlights of the SPAA'2020 paper are O(log n)-time deterministic parallel algorithms for several fundamental optimization problems, e.g., maximal matching and maximal independent set. The highlight of the PODC'2020 paper is the first constant-time deterministic CONGESTED CLIQUE algorithm for (Delta+1)-coloring, another fundamental graph problem in distributed computing.
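The (Delta+1)-coloring problem itself is easy sequentially; the difficulty addressed by the PODC'2020 paper is solving it deterministically in a constant number of distributed communication rounds. A minimal sequential greedy sketch (illustrative, not the paper's algorithm) shows why Delta+1 colors always suffice:

```python
def delta_plus_one_coloring(adj):
    """Greedy coloring: each vertex takes the smallest color not used
    by its already-colored neighbors. Since a vertex has at most Delta
    neighbors, some color in {0, ..., Delta} is always free, so at most
    Delta+1 colors are used overall."""
    color = {}
    for v in adj:
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color
```

The sequential scan hides the distributed difficulty: each greedy step depends on all earlier ones, which is exactly the dependence that constant-round algorithms must break.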

In a series of papers, we also studied several fundamental distributed computing problems for basic communication primitives, obtaining optimal speed-ups (papers with Davies at DISC'2018, PODC'2017, PODC'2016, and ICALP'2016, and with Konrad at DISC'2018). Further work includes a study of the fundamental problem of estimating the size of independent sets in a graph defined by a stream of edges, using only limited space (G. Cormode, J. Dark, and C. Konrad). An interesting setting of sublinear computations for texts was also studied in a STOC'2019 paper (Dominik Kempa, Tomasz Kociumaka. String synchronizing sets: sublinear-time BWT construction and optimal LCE data structure. STOC 2019: 756-767), which presents the first algorithm that breaks the O(n)-time barrier for Burrows-Wheeler Transform (BWT) construction.

An important activity of the project was a workshop organised in Warwick in March 2018 - the Workshop on Data Summarization (co-organised with Graham Cormode), sponsored by this EPSRC project - which brought together a number of UK and international researchers and practitioners working in various areas of algorithms for data summarization (sublinear algorithms, property testing, streaming, sampling, sketching, and more). URL: https://warwick.ac.uk/dcs/research/focs/conf2017/.

In Autumn 2018 the PI spent a month attending the Special Semester on Foundations of Data Science at the Simons Institute for the Theory of Computing, UC Berkeley (https://simons.berkeley.edu/programs/datascience2018). During this stay the PI worked actively with world-leading scientists in the area of the project. Among the central activities of the semester at the Simons Institute, the PI was a co-organiser of the Workshop on Sublinear Algorithms and Nearest-Neighbor Search, November 27-30, 2018 (co-organised with Robert Krauthgamer (Weizmann Institute), Aarti Singh (Carnegie Mellon University), and Rachel A. Ward (University of Texas at Austin)).
Exploitation Route Sublinear algorithms, property testing, and streaming algorithms provide a potentially interesting framework for designing super-fast algorithms. The results in this area, including some of the results obtained in this project, demonstrate that some fundamental combinatorial problems on graphs and networks can be solved very quickly. These ideas are frequently taken up by research laboratories, including leading research labs at Google, Yahoo, and Microsoft, to transfer these findings into more applied products. In particular, the Workshop on Data Summarization sponsored by this project brought together several renowned experts from research labs to discuss the findings associated with the project and further advances in this area.
Sectors Digital/Communication/Information Technologies (including Software), Other