Enriching public transportation data using Bayesian methods

Philip Stefan Puge Lemaitre

Research output: Book / Anthology / Report / Ph.D. thesis - Ph.D. thesis


Public transportation has gone from data-poor to data-rich through the widespread use of Automatic Data Collection (ADC) systems such as the Automatic Fare Collection (AFC), Automatic Vehicle Location (AVL) and Automatic Passenger Count (APC) systems. These systems are designed for specific purposes: the AVL system monitors the public transit agencies' fleets, the AFC system collects revenue with financial accountability in mind, and the APC system counts the number of passengers in the vehicle at the stop level. Separately and together, these systems have yielded a wealth of valuable insights in public transportation research.
Nevertheless, some information remains unavailable to transit agencies, either because the systems do not collect the information of interest or because agencies lack access to the relevant systems. When the data contain limited information, transit agencies face the challenge of utilising the full potential of these systems. This thesis focuses on the smart card data generated by the AFC system and on how a Bayesian framework can infer the missing information of interest. The Bayesian framework has been found useful for handling this challenge since it makes it possible to model the complete data-generating process, even with sparse data, and even when only parts of the data-generating process can be observed. The use of the Bayesian framework is demonstrated in the thesis by two published papers and an exploratory study.
The first paper investigates the case where the recorded timetable information from the AVL system cannot be accessed, and how using scheduled timetable information instead can affect train-to-passenger assignments. The paper presents a hierarchical Bayesian mixture model to infer the latent arrival times.
The second paper focusses on the challenge of the information of interest not being stored by the system. In this case, that information is the activity of travellers transferring from bus to train. When it is not available, it is difficult for transit agencies to evaluate whether scheduled transfer times between vehicles are reasonable, since travellers could have engaged in some activity such as shopping or buying coffee, affecting the observed distribution of walking times. The paper proposes a hierarchical Bayesian mixture model to infer this latent behaviour, making it possible to infer separate walking time distributions for travellers who walk directly and for those who conduct an activity during the transfer.
Finally, this thesis contains an exploratory study. The study investigates the possibility of combining smart card data with journey planner search data to identify areas of interest, that is, areas where people want to go but which are not served by public transportation.
This PhD thesis presents new methods for using a Bayesian framework to infer missing data whose absence originates in a priori system design.
Original language: English
Publisher: IT-Universitetet i København
Number of pages: 135
ISBN (Print): 978-87-7949-390-2
Publication status: Published - 2022

