Developing Efficient Numerical Algorithms Using Fast Bayesian Random Forests

Lead Research Organisation: University of Liverpool
Department Name: Electrical Engineering and Electronics

Abstract

Is working with Random Forests a key interest of yours? Would you like to become adept at using Sequential Monte Carlo Samplers and parallel computing techniques? GCHQ and the CDT are looking for a keen PhD candidate who is conversant in Random Forest algorithms to assist them in developing new divide-and-conquer approaches that can help generate more accurate predictions. Does this sound like the challenging project you are looking for?

Random Forests (e.g. as implemented in scikit-learn) are in pervasive use in data science and machine learning. In such algorithms, each tree describes succinct rules that relate the inputs to the outputs (which can be either continuous values, in the context of regression, or discrete labels, in the context of classification). Random Forests then use the diversity of the set of trees to convey uncertainty about which rules apply to any datum. This combination of succinctness and diversity is perhaps why the training algorithms for Random Forests are both fast and often remarkably effective.
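To make this concrete, the following is a minimal sketch, using synthetic data and scikit-learn's RandomForestRegressor, of how the spread of the individual trees' predictions can be read as uncertainty about a new datum:

```python
# A Random Forest on synthetic data: the forest averages succinct per-tree
# rules, and the spread across the trees conveys uncertainty.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

x_new = np.array([[0.5]])
per_tree = np.array([tree.predict(x_new)[0] for tree in forest.estimators_])
print("forest prediction:", forest.predict(x_new)[0])  # mean over the trees
print("per-tree std:", per_tree.std())  # diversity read as uncertainty
```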

When seen through the lens of statistics, the training algorithms for Random Forests are ad hoc. The belief that a more principled approach could lead to improved performance has motivated historical attempts to use numerical Bayesian techniques to estimate the parameters of tree-based data science algorithms. The resulting approaches, e.g. Bayesian Additive Regression Trees (BART) and Bayesian Classification And Regression Trees (CART), despite using models that are arguably less sophisticated than those used in Random Forests, are typically slow. This lack of speed comes about because algorithms like BART and CART use a specific numerical Bayesian technique, Markov Chain Monte Carlo (MCMC). While efficient general-purpose variants of MCMC (e.g. the No-U-Turn Sampler (NUTS)) exist and can be applied to many problems, these MCMC variants are not applicable in contexts where the number of parameters varies. Since trees can have different numbers of nodes, and so different numbers of parameters, algorithms like NUTS cannot be used to improve the run-time of tree-based algorithms like BART and CART.
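For illustration, the following is a minimal sketch of random-walk Metropolis (the simplest form of MCMC) on a toy one-dimensional target, not of BART or CART themselves. The deliberately poor starting point makes the sequential burn-in visible; gradient-based variants such as NUTS speed up this exploration, but only over a fixed-dimension, differentiable parameter space, which a tree with a varying number of nodes does not provide.

```python
# A toy random-walk Metropolis sampler on a 1-D target.
import numpy as np

def log_target(theta):
    """Log-density of a standard normal target, up to an additive constant."""
    return -0.5 * theta**2

rng = np.random.default_rng(1)
theta, samples = 5.0, []  # deliberately poor start, so burn-in is visible
for _ in range(5000):
    proposal = theta + 0.5 * rng.standard_normal()
    # Accept with probability min(1, target(proposal) / target(theta)).
    if np.log(rng.uniform()) < log_target(proposal) - log_target(theta):
        theta = proposal
    samples.append(theta)

print("mean after discarding burn-in:", np.mean(samples[1000:]))  # approx. 0
```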

NUTS achieves its efficiency by using gradient information to guide its exploration of the model's parameter space. While one cannot calculate gradients in the context of trees, one can emulate such gradient information using a hierarchy of what are known as mini-batches in the neural network literature. An algorithm that exploits this idea, Hierarchical Importance with Nested Training Samples (HINTS), was developed in 2004, but it is now obscure and its potential largely untapped. Furthermore, there is potential to use more recent advances in the context of Sequential Monte Carlo (SMC) samplers, a family of algorithms that can exploit parallel processing resources (e.g. GPUs) and that can remove the need for the sequential burn-in phase that is often responsible for MCMC algorithms' slow run-time.
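To illustrate the SMC-sampler idea (this is a sketch of a generic SMC sampler on a toy problem, not of HINTS), the example below moves a population of particles from a prior to a posterior through a sequence of tempered intermediate targets. The reweighting, resampling and jittering steps operate on the particles independently, so the work parallelises naturally across cores or GPU threads, and no burn-in phase is needed.

```python
# A minimal SMC sampler with likelihood tempering on a 1-D toy model.
import numpy as np

rng = np.random.default_rng(2)
N = 1000
particles = 3.0 * rng.standard_normal(N)  # draws from the N(0, 9) prior

def log_likelihood(theta):
    return -0.5 * (theta - 2.0) ** 2  # toy Gaussian likelihood centred on 2

betas = np.linspace(0.0, 1.0, 11)  # temperatures bridging prior -> posterior
for beta0, beta1 in zip(betas[:-1], betas[1:]):
    # Reweight particles for the next tempered target (parallel over particles).
    logw = (beta1 - beta0) * log_likelihood(particles)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    # Resample to discard low-weight particles.
    particles = particles[rng.choice(N, size=N, p=w)]
    # One Metropolis jitter move per particle to restore diversity.
    proposals = particles + 0.3 * rng.standard_normal(N)
    log_accept = beta1 * (log_likelihood(proposals) - log_likelihood(particles)) \
        + 0.5 * (particles**2 - proposals**2) / 9.0  # prior N(0, 9) ratio
    accept = np.log(rng.uniform(size=N)) < log_accept
    particles = np.where(accept, proposals, particles)

print("posterior mean estimate:", particles.mean())  # approx. 1.8 analytically
```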

This PhD will seek to develop efficient numerical Bayesian algorithms (likely to be based on a combination of HINTS and SMC samplers) that can be used to develop a variant of the Random Forest algorithm. The intent is that the new algorithms would be drop-in replacements for the Random Forest algorithms used in scikit-learn: they would offer the same interfaces, but exploit parallel computational resources so that, for a given training dataset and the same elapsed time, they provide GCHQ with more accurate predictions.
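Hypothetically, such a drop-in replacement might present an interface like the sketch below. The class name, constructor parameters and placeholder bodies are illustrative assumptions only; the algorithm that would fill them in is the research target of this PhD.

```python
# A hypothetical interface sketch: conforming to scikit-learn's estimator
# contract (fit/predict, plus get_params/set_params inherited from
# BaseEstimator) is what makes an estimator a drop-in replacement.
from sklearn.base import BaseEstimator, RegressorMixin

class BayesianForestRegressor(BaseEstimator, RegressorMixin):
    def __init__(self, n_particles=100, n_jobs=-1):
        self.n_particles = n_particles  # trees retained by the SMC sampler
        self.n_jobs = n_jobs            # parallel resources (cores/GPUs)

    def fit(self, X, y):
        # Placeholder: a HINTS/SMC sampler over tree space would go here.
        raise NotImplementedError("research target of this PhD")

    def predict(self, X):
        # Placeholder: average predictions over the posterior tree ensemble.
        raise NotImplementedError("research target of this PhD")
```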

Planned Impact

This CDT's focus on using "Future Computing Systems" to move "Towards a Data-driven Future" resonates strongly with two themes of non-academic organisations. In both themes, albeit for slightly different reasons, commodity data science is insufficient and there is a hunger both for the future leaders that this CDT will produce and for the high-performance solutions that its students will develop.

The first theme is associated with defence and security. In this context, operational performance is of paramount importance. Government organisations (e.g., Dstl, GCHQ and the NCA) will benefit from our graduates' ability to configure many-core hardware to maximise the ability to extract value from the available data. The CDT's projects and graduates will achieve societal impact by enabling these government organisations to better protect the world's population from threats posed by, for example, international terrorism and organised crime.

There is then a supply chain of industrial organisations that deliver to government organisations (both in the UK and overseas). These industrial organisations (e.g., Cubica, Denbridge Marine, FeatureSpace, Leonardo, MBDA, Ordnance Survey, QinetiQ, RiskAware, Sintela, THALES (Aveillant) and Vision4ce) operate in a globally competitive marketplace where operational performance is a key driver. The skilled graduates that this CDT will provide (and the projects that will comprise the students' PhDs) are critical to these organisations' ability to develop and deliver high-performance products and services. We therefore anticipate economic impact to result from this CDT.

The second theme is associated with high-value and high-volume manufacturing. In these contexts, profit margins are very sensitive to operational costs. For example, a change to the configuration of a production line for an aerosol manufactured by Unilever might "only" cut costs by 1p for each aerosol, but when multiplied by half a billion aerosols each year, the impact on profit can be significant. In this context, industry (e.g., Renishaw, Rolls Royce, Schlumberger, ShopDirect and Unilever) is therefore motivated to optimise operational costs by learning from historic data. This CDT's graduates (and their projects) will help these organisations to perform such data-driven optimisation and thereby enable the CDT to achieve further economic impact.

Other organisations (e.g., IBM) provide hardware, software and advice to those operating in these themes. The CDT's graduates will ensure these organisations can be globally competitive.

The specific organisations mentioned above are the CDT's current partners. These organisations have all agreed to co-fund studentships. That commitment indicates that, in the short term, they are likely to be the focus for the CDT's impact. However, other organisations are likely to benefit in the future. While two (Lockheed Martin and Arup) have articulated their support in letters that are attached to this proposal, we anticipate impact via a larger portfolio of organisations (e.g. via studentships, but also via those organisations recruiting the CDT's graduates either immediately after the CDT or later in their careers). Those organisations are likely to include those inhabiting the two themes described above, but also others. For example, an entrepreneurial CDT student might identify a niche in another market sector where Distributed Algorithms can deliver substantial commercial or societal gains. Predicting where such niches might be is challenging, though it seems likely that the sectors with the most to gain are those that are yet to fully embrace Data Science while also involving significant turnover: we hypothesise that niches might be identified in health and actuarial science, for example.

As well as training the CDT students to be the leaders of tomorrow in Distributed Algorithms, we will also achieve impact by training the CDT's industrial supervisors.

Studentship Projects

Project Reference: EP/S023445/1 (Start: 01/04/2019, End: 30/09/2027)
Studentship 2748743 (Related To: EP/S023445/1, Start: 01/10/2022, End: 30/09/2026, Student Name: Harvinder Lehal)