Burst the filter bubble: Bayesian nonparametrics for recommender systems

Lead Research Organisation: University of Oxford
Department Name: Statistics

Abstract

Recommender systems aim to provide automated, targeted recommendations to individuals based on the items or people they like. They have gained a lot of attention over the past few years thanks to the famous Netflix Prize, and are now ubiquitous, used by companies such as Amazon, Apple and YouTube. Recommender systems are of particular economic interest in business. In this case, given an observed set of items purchased by customers, we aim to provide relevant recommendations to potential buyers of a given product. One may also be interested in obtaining a market segmentation of the customers and/or products, and in identifying trends in the evolution of the popularity of products. Recommender systems arise in several application domains, including news, music, books, web searches and restaurants. When looking at the popularity of the items, the datasets often exhibit heavy-tailed behavior: most purchases concern only a small number of very popular items, while the majority of items are bought very rarely. The ability of recommender systems to provide personalized recommendations tailored to users' tastes is particularly attractive, but this personalization has raised a number of concerns. One of them has been popularized under the term "filter bubble", coined by Eli Pariser in a recent popular book: the fear that personalized recommendations act as an echo chamber, suggesting only the items that are most popular and closest to the user's tastes, and not exposing users to contradictory or iconoclastic opinions or to exotic or unusual products. Regarding the recommendation of products, this may have the negative effect of recommending only popular items that users already know about. A significant amount of research in the computer science literature has recently been devoted to deriving recommender systems that favor diversity and serendipity.
Regarding news and web searches, this effect is sometimes called the "information cocoon", and some see it as a threat to democracy. A few days after the Brexit vote, Katharine Viner, the Editor-in-Chief of Guardian News & Media, wrote a long article on this issue. She illustrated her point with a blog post by Tom Steinberg, a British internet activist and mySociety founder:

"I am actively searching through Facebook for people celebrating the Brexit leave victory, but the filter bubble is SO strong, and extends SO far into things like Facebook's custom search that I can't find anyone who is happy *despite the fact that over half the country is clearly jubilant today* and despite the fact that I'm *actively* looking to hear what they are saying."
There is currently a debate on whether this algorithmic filter bubble is actually stronger than the typical "real-life" bubble. Whether or not this is currently true, it is essential to derive recommendation algorithms that can provide a fair representation of the diverse set of items/opinions an individual may be exposed to, or that potentially allow one to choose metrics that favor diversity or serendipity instead of accuracy. Any algorithm with such objectives has to adequately handle the rare products and the heavy-tail properties of the datasets. The objective of this project is to provide such a method, in a theoretically grounded and interpretable statistical framework.

Planned Impact

This project in statistical machine learning is at the interface between mathematical sciences and information and communication technologies. A large body of the literature on recommender systems is concerned with non-model-based, black-box methods that cannot capture the salient properties of the datasets considered. We expect this proposal to demonstrate that the use of more advanced Bayesian nonparametric tools can yield decisive advances on this problem.

We expect this project to have an economic impact by designing recommender systems able to take fully into account the heavy tail behavior of most datasets, and make predictions taking into account diversity and serendipity. These properties are expected to be of interest to businesses in order to make recommendations more appealing to customers. The algorithms will also be more transparent to the users, by building on model-based approaches with interpretable parameters able to provide an explanation of the recommendations.

Regarding the societal impacts, some of the current "black-box" recommender systems are sometimes considered a threat to democracy, because they create a "filter bubble". This project will be able to provide interpretable recommendations, fully taking into account the wide diversity of opinions an individual may be exposed to. The development of such a system would demonstrate the feasibility of a transparent, balanced approach, beneficial to the general public.

Finally, the publication of the research outcomes in top journals and conferences will contribute to the UK's international reputation for excellence in computational statistics and machine learning.

Publications

Description: The work funded by this award has led to the development of more realistic statistical models for the analysis of complex, structured network and text data. A key aspect of this line of work is to take into consideration the long tail of the data, whether in networks (few nodes have many connections, many nodes have few connections) or in text data (most words appear very rarely, a few words appear very frequently). The models developed capture these salient features of real-world datasets, and associated algorithms have been proposed to estimate their parameters and make predictions. For network data, besides these long-tail properties, the developed models also allow latent communities in the network to be discovered.
Exploitation Route: The proposed models may be used to analyse large real-world communication or social networks that exhibit this long-tail behaviour.
Sectors: Digital/Communication/Information Technologies (including Software)