"Large-scale modelling of transport and energy choices using emerging big data sources"

Lead Research Organisation: University of Leeds
Department Name: Institute for Transport Studies


Panagiotis' proposed project aims to make effective use of big data sources for modelling transport
choices (e.g. departure time, activity, mode, destination choices) by developing and deploying
novel modelling techniques.
Handling, processing and analysing the proposed big data sources will require
quantitative skills above and beyond those required for normal doctoral research. The datasets
used will not only be large, but will also have complex interdependencies amongst individual
components. Additionally, the analysis of such data relies on making links with secondary datasets,
a process which is itself highly complex.
Panagiotis' proposed research consists of three methodological components:
1. Combination of different big data sources (social media, smart card, credit card, mobile phone
and floating car data)
2. Combination of big data sources with traditional data sources (e.g. surveys, counts)
3. Development of a framework for real-time updating of the developed models with newly available
data
All three components pose significant challenges and demand mastery of advanced quantitative
techniques. For example, for data combination (components 1 and 2), he proposes to investigate
econometric and simulation-based methods. On the econometric side, he proposes to investigate
novel techniques like Broad Choice Models (Brownstone et al. 2017), Latent demographic models
(Bwambale et al. 2017), etc. to address the problems associated with coarse data resolution and
missing data. He also proposes applying Bayesian Inference techniques in this regard. Mastering
these methodologies will require advanced mathematics and statistics. Furthermore, it is likely that
more methodological innovations will be required to address other issues of the big data sources
like bias and data gaps.
On the simulation side, Panagiotis proposes to use agent-based modelling techniques, where
agents are generated from household survey data and their behaviour is calibrated with big
data sources. These are interesting ideas which can lead to novel methodological advancements in
the field of data fusion. Again, the links between individual agents will require advanced quantitative
Developing a framework for real-time updating of the models (component 3) can also be a very
challenging problem which may require borrowing techniques from other disciplines (e.g. weather
forecasting) where data assimilation techniques are more common. Bringing these techniques into
transport and mobility research (with potential modifications) can lead to interesting
cross-fertilisation of ideas between disciplines.
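
As a rough illustration of the kind of data assimilation step referred to for component 3, the sketch below updates a single model quantity when a new observation arrives, using a scalar Kalman-style update. The function name, the Gaussian assumptions and the numbers in the usage note are illustrative assumptions only; the proposal does not specify which assimilation technique would be adopted.

```python
def assimilate(prior_mean, prior_var, obs, obs_var):
    """Scalar Kalman-style update (illustrative assumption, not the
    project's method): blend a prior estimate with one new observation,
    weighting each by the inverse of its variance."""
    gain = prior_var / (prior_var + obs_var)      # weight given to the new data
    post_mean = prior_mean + gain * (obs - prior_mean)
    post_var = (1.0 - gain) * prior_var           # uncertainty shrinks after the update
    return post_mean, post_var
```

For example, calling `assimilate(0.0, 1.0, 2.0, 1.0)` with equally uncertain prior and observation moves the estimate halfway towards the observation and halves the variance.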



Studentship Projects

Project Reference | Relationship | Related To   | Start      | End        | Student Name
ES/P000746/1      |              |              | 30/09/2017 | 29/09/2027 |
2114183           | Studentship  | ES/P000746/1 | 30/09/2018 | 31/12/2021 | Panagiotis Tsoleridis
Description: The first study of my research focused on reducing the high computational cost of estimating choice models with large numbers of alternatives, as is the case for spatial choice models, e.g. destination choice models for discretionary activities such as leisure and shopping. Sampling of alternatives is a method proposed in the literature that reduces this cost by providing a smaller, sampled choice set for estimation. More sophisticated sampling protocols have the potential to achieve unbiased estimates using even fewer alternatives than pure random sampling. The study proposes a sampling protocol utilising the geography-based concept of activity spaces, which provide proxy measures of spatial awareness and time-space constraints. A dataset captured with GPS tracking is used in the practical application, providing more information on individuals' usual areas of travel at a high spatial resolution, and a range of Multinomial Logit models for joint mode and destination choices of shopping activities are estimated. Compared with random sampling, the proposed protocol achieves more accurate parameter estimates, higher sampling stability and greater statistical efficiency. The corresponding paper, titled "Utilising activity space concepts to sampling of alternatives for mode and destination choice modelling of discretionary activities", has been published in the Journal of Choice Modelling, 2021, 42(1).
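
The baseline that the proposed protocol is compared against, pure random sampling of alternatives, can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the optional `correction` argument stands for the additive sampling term used with non-uniform protocols (ln pi(D | j) in McFadden's formulation), which is constant and cancels under pure random sampling. It does not reproduce the activity-space protocol of the paper.

```python
import numpy as np

def draw_choice_set(chosen, n_alts, n_sample, rng):
    """Chosen alternative first, then n_sample - 1 others drawn uniformly
    without replacement (pure random sampling)."""
    others = np.delete(np.arange(n_alts), chosen)
    drawn = rng.choice(others, size=n_sample - 1, replace=False)
    return np.concatenate(([chosen], drawn))

def sampled_choice_prob(v, choice_set, correction=None):
    """Logit probability of choice_set[0] within the sampled set.
    correction[j] is an assumed additive sampling term for alternative j;
    None means the term is constant and can be dropped."""
    u = np.asarray(v, dtype=float)[choice_set]
    if correction is not None:
        u = u + np.asarray(correction, dtype=float)[choice_set]
    u = u - u.max()                 # numerical stability
    expu = np.exp(u)
    return expu[0] / expu.sum()
```

In estimation, the log of `sampled_choice_prob` would be summed over observations and maximised over the utility parameters, with the sampled set replacing the full set of destinations.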

The second study focused on specifying a new correlation structure among the alternative locations of a destination choice model. A Cross-Nested Logit (CNL) model is specified, allowing every alternative to belong to every nest with a certain allocation probability. The allocation probabilities of the alternatives to each nest are specified based on the distances between the alternative locations. The methodology is empirically tested on models of destination choice and of joint mode and destination choice for shopping activities. The proposed nesting structure is able to capture significant unobserved correlation among the alternatives and to provide behaviourally accurate estimates, as well as more accurate forecasts and demand elasticities. It also provides empirical support for Tobler's first law of geography, which suggests that objects located closer together are more similar than distant ones. That paper has been submitted to Transportation Research Part B: Methodological and is currently under review.
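
A minimal sketch of a CNL probability calculation with distance-based allocations is given below. The exponential distance-decay form of the allocation, the parameter `gamma`, the choice of anchoring one nest on each alternative and all function names are assumptions made for illustration; the paper's actual specification may differ.

```python
import numpy as np

def distance_allocations(dist, gamma):
    """alpha[i, m]: share of alternative i allocated to nest m, decaying with
    distance dist[i, m] (assumed exponential form); each row sums to one."""
    w = np.exp(-gamma * np.asarray(dist, dtype=float))
    return w / w.sum(axis=1, keepdims=True)

def cnl_probabilities(v, alpha, lam):
    """Cross-Nested Logit probabilities for utilities v (J,), allocations
    alpha (J, M) and nest parameters lam (M,), each in (0, 1]."""
    v = np.asarray(v, dtype=float)
    lam = np.asarray(lam, dtype=float)
    ev = np.exp(v - v.max())                      # shift for numerical stability
    inv_lam = 1.0 / lam
    y = (alpha * ev[:, None]) ** inv_lam[None, :] # (alpha_im * e^{V_i})^(1/lam_m)
    nest_sum = y.sum(axis=0)                      # sum over alternatives per nest
    numer = (y * nest_sum[None, :] ** (lam[None, :] - 1.0)).sum(axis=1)
    denom = (nest_sum ** lam).sum()
    return numer / denom
```

With all nest parameters equal to one the expression collapses to the Multinomial Logit, which gives a quick sanity check; values below one let nearby locations share unobserved correlation.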

The third study aims to highlight the practical applicability of new emerging data sources, in this case GPS data, for deriving accurate transport appraisal measures, i.e. Value of Travel Time (VTT) estimates, similar to the ones suggested by the UK's Transport Analysis Guidance (WebTAG). Historically, Stated Preference (SP) surveys have been used to derive VTT estimates, capturing in a controlled manner the trade-offs decision makers are willing to make between travel time and cost. SP surveys, however, can only present hypothetical scenarios to individuals. New emerging data sources that are steadily gaining popularity in the field are able to provide information on real-life choices. The present study utilises such a dataset, captured through a smartphone GPS application that tracked the trips of individuals for a period of 2 weeks and offers a large panel of observations per person. Mixed Multinomial Logit models of mode choice are estimated and the VTT estimates are weighted based on the trip distances of the National Travel Survey dataset to ensure the representativeness of the outcomes. The GPS-based weighted VTT estimates show no statistically significant differences from the official SP-based VTTs, suggesting that new emerging data sources and methods of data collection can be safely used for VTT estimation, or as a complement to SP studies for updating the official values used in appraisal. That paper has been submitted to Transportation Research Part A: Policy and Practice and is currently under review, having passed the first round of revision.
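
The two calculations involved, the VTT as a ratio of marginal utilities and its reweighting to a target trip-distance distribution, can be sketched as follows. The coefficients are treated as fixed for simplicity (in a Mixed Multinomial Logit they would be random), and the function names, distance bands and all numbers are hypothetical, not results from the study.

```python
import numpy as np

def vtt_per_hour(beta_time_per_min, beta_cost):
    """Value of travel time as the ratio of the time and cost coefficients,
    converted from money per minute to money per hour."""
    return 60.0 * beta_time_per_min / beta_cost

def reweighted_vtt(vtt_by_band, target_shares):
    """Average band-level VTTs using target trip-distance shares
    (e.g. shares taken from a national travel survey)."""
    return float(np.average(vtt_by_band, weights=target_shares))

# Hypothetical inputs for illustration only.
vtt = vtt_per_hour(beta_time_per_min=-0.05, beta_cost=-0.30)
overall = reweighted_vtt([6.0, 10.0, 14.0], [0.5, 0.3, 0.2])
```

Reweighting matters because a GPS sample may over-represent short or long trips relative to the population, and VTT typically varies with trip distance.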

The fourth, ongoing study aims to incorporate a Machine Learning (ML) clustering algorithm into an existing state-of-the-art behavioural specification, namely a Latent Class Choice Model (LCCM). An LCCM captures heterogeneity by allocating the individuals of the sample into a finite number of classes, initially latent, based on their socio-demographic characteristics and their observed choice behaviour. The LCCM is composed of two parts: a class allocation model at the higher level, which allocates individuals into classes, and a class-specific choice model at the lower level. In this study, K-Means clustering is used to segment the sample into clusters, thus taking the role of the class allocation model of a traditional LCCM. K-Means was selected due to its simplicity and widespread use, although the same principles can be applied with more advanced clustering algorithms as well. The proposed approach is tested on four different applications, two mode choice and two shopping destination choice models. The results highlight improvements in model fit and, in many cases, in the behavioural interpretability of the estimated classes/clusters when an ML algorithm is used. The proposed approach performs even better on larger samples with more individuals and more trips, showing that it can be more effective in uncovering patterns in the data. The plan is to submit that paper to Transportation Research Part C: Emerging Technologies.
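
The two-level structure described above can be sketched as a hard cluster allocation followed by a cluster-specific Multinomial Logit log-likelihood. This is a minimal sketch under stated assumptions: the plain NumPy K-Means, the function names and the array layouts are illustrative, parameter estimation is left out, and the study's actual implementation is not known from this record.

```python
import numpy as np

def kmeans(x, k, n_iter=50, seed=0):
    """Plain Lloyd's K-Means on features x (N, F); returns labels and centres."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    centres = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        d = ((x[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new = np.array([x[labels == c].mean(axis=0) if np.any(labels == c)
                        else centres[c] for c in range(k)])
        if np.allclose(new, centres):
            break
        centres = new
    # Final assignment against the returned centres.
    d = ((x[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1), centres

def clustered_mnl_loglik(attrs, chosen, labels, betas):
    """Log-likelihood of cluster-specific MNL models.
    attrs (N, J, K) alternative attributes, chosen (N,) chosen indices,
    labels (N,) cluster of each observation, betas (C, K) one vector per cluster."""
    v = np.einsum("njk,nk->nj", attrs, betas[labels])
    v = v - v.max(axis=1, keepdims=True)
    logp = v - np.log(np.exp(v).sum(axis=1, keepdims=True))
    return float(logp[np.arange(len(chosen)), chosen].sum())
```

In use, `labels` would come from clustering individuals on their socio-demographic characteristics, and `clustered_mnl_loglik` would be maximised over `betas` with a numerical optimiser.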

The fifth, ongoing study aims to capture the latent spatial constraints individuals face during their decision-making process in the context of shopping destination choices. A probabilistic choice set formation model is specified within an LCCM framework. Three classes are specified, each with a different choice set from which its members consider their potential choices. Utility spaces are incorporated again for the purpose of delineating those choice sets for each class, aiming to capture individuals who are subject to time-space constraints and/or a lack of spatial awareness. The results so far indicate that allocating individuals into classes of latent constraints improves model fit and also provides interesting insights into the most likely characteristics of individuals who face constraints during their decision-making process. Those insights could be used for more efficient policy making aimed at combating social exclusion.
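
A probabilistic choice set formation model of this kind mixes class-specific logit probabilities, each computed over that class's own choice set. The sketch below assumes fixed class shares and a boolean availability matrix; the function name, the three example choice sets and the shares are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def latent_choice_set_probs(v, class_shares, availability):
    """Unconditional choice probabilities when the true choice set is latent.
    v (J,) utilities, class_shares (C,) summing to one, availability (C, J)
    booleans marking which alternatives each class considers (at least one)."""
    v = np.asarray(v, dtype=float)
    ev = np.exp(v - v.max())
    ev_c = np.asarray(availability, dtype=float) * ev[None, :]
    p_c = ev_c / ev_c.sum(axis=1, keepdims=True)  # logit within each class's set
    return np.asarray(class_shares, dtype=float) @ p_c

# Hypothetical example: full set, a constrained set, and a single nearby store.
avail = np.array([[1, 1, 1, 1],
                  [1, 1, 0, 0],
                  [1, 0, 0, 0]], dtype=bool)
probs = latent_choice_set_probs(np.zeros(4), [0.5, 0.3, 0.2], avail)
```

In a full model the class shares would themselves depend on individual characteristics, which is what allows the constrained individuals to be profiled.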
Exploitation Route: The outcomes of my research so far can be of high importance for transport practitioners working on spatial choice models with new emerging data sources. The first study proposed a new sampling protocol that takes advantage of the high spatial resolution provided by GPS data to reduce the computational cost of estimation while still obtaining unbiased parameters, in a fraction of the estimation time of the full choice set model.

The second study proposed a new nesting structure to capture the correlation among the alternatives based on spatial proximity. Capturing the correlation among alternatives provides more accurate estimates, resulting in more accurate forecasts of future demand. The finding can be of importance to researchers and practitioners working on spatial choice models, where correlated alternatives are expected but the correlation is in many cases very difficult to capture in an effective and behaviourally accurate manner.

The outcomes of the third study have the potential to establish the use of emerging data sources in the field of transport. VTT estimates are arguably the most important outcomes derived from a choice model, since they are used in the subsequent cost-benefit analysis of transport projects. Providing evidence that new emerging data sources can be as effective as SP data for yielding accurate VTT estimates, if not better, can be of high importance for policy makers.

The fourth study aims to add further evidence to the growing literature combining ML with Discrete Choice Models (DCM) on the benefits that can be achieved by taking the best of both worlds. In that framework, an ML algorithm is used to provide more effective pattern recognition and to segment individuals into clusters, while a DCM is used at the lower level to understand their choices, thus providing estimates that could be further used for policy making, such as VTT estimates.

The fifth study aims to highlight the need to capture the spatial constraints on decision making in a spatial context, e.g. for shopping location. Incorporating spatial constraints in a probabilistic manner not only improves model fit but, more importantly, identifies individuals who are more likely not to take full advantage of all the opportunities existing across the urban environment.
Sectors: Environment, Retail, Transport