Bayesian model-based clustering of multi-source data

Coleman, Stephen

Repository URI

https://www.repository.cam.ac.uk/handle/1810/349554

Repository DOI

https://doi.org/10.17863/CAM.96556

Files

Thesis (25.49 MB)

Type

Thesis

Authors

Coleman, Stephen

Abstract

Inferring a partition of a dataset can help in downstream analyses and decision making. However, many feasible partitions often exist, which makes inferring clusters challenging. The analysis of data generated across multiple sources is particularly demanding. Bayesian mixture models and their extensions are effective tools for partition inference in this setting, as they can be used to describe and infer the relationship between different sources. I consider applying such methods to two cases of multi-source data: multi-view, where data on the same items are generated across different contexts, and multi-batch, where the same measurements are taken on separate sets of items.

I develop and explore a consensus clustering approach to navigate the problem of poor mixing, a failure of Markov chain Monte Carlo methods in which the sampler becomes trapped in local modes of high posterior density. This problem is commonly encountered when seeking to infer latent structure in high-dimensional data. I propose running many short Markov chains in parallel and using the final sample from each chain. My results suggest that performing inference this way often describes model uncertainty better than individual long chains do. I use the method in a multi-omics analysis of the cell cycle of Saccharomyces cerevisiae and identify biologically meaningful structure.
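As an illustration only, and not code taken from the thesis, the idea of combining the final sample from each of many short chains can be sketched as a consensus matrix: for every pair of items, the proportion of chains whose final partition places them in the same cluster. The function name `consensus_matrix` and the toy label vectors below are hypothetical, and the chains themselves are assumed to have been run elsewhere.

```python
def consensus_matrix(final_partitions):
    """Return the proportion of chains in which each pair of items co-clusters.

    final_partitions: one label vector per chain (the final sample of that
    chain), each of the same length. Labels need not match across chains,
    because only within-chain equality of labels is used.
    """
    n_chains = len(final_partitions)
    n_items = len(final_partitions[0])
    counts = [[0] * n_items for _ in range(n_items)]
    for labels in final_partitions:
        for i in range(n_items):
            for j in range(n_items):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    # Convert co-clustering counts to proportions across chains.
    return [[c / n_chains for c in row] for row in counts]


# Hypothetical final samples from four short chains over five items.
final_partitions = [
    [0, 0, 1, 1, 2],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
    [2, 2, 2, 0, 1],
]
cm = consensus_matrix(final_partitions)
```

Entries near 1 or 0 indicate pairs on which the chains agree, while intermediate values indicate uncertainty about whether two items belong together. A point estimate of the partition could then be derived from this matrix, although the choice of summary is left open here.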

I subsequently implement Multiple Dataset Integration (MDI), a Bayesian integrative clustering method, in C++ with a wrapper in R, correcting an error that was present in previous implementations and extending MDI to be semi-supervised. My implementation supports a range of models for a variety of data types, such as t-augmented mixtures of Gaussians and Gaussian processes. I then consider a semi-supervised multi-omics analysis of the model apicomplexan, Toxoplasma gondii.

In my final content chapter, I consider the problem of analysing data generated across multiple batches. Such data can have structural differences that should be accounted for when inferring a partition. I propose a mixture model that includes both cluster/class and batch parameters, so that batch effects upon location and scale are modelled simultaneously with the partition. I validate my method in a simulation study and on held-out seroprevalence data, and compare it with existing methods.

Finally, I discuss the state of the field of Bayesian mixture models and some potential future research directions.

Date

2022-11-28

Advisors

Wallace, Chris
Kirk, Paul

Keywords

Batch effects, Bayesian statistics, Classification, Clustering, Computational statistics, Systems biology

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge

Rights

Attribution 4.0 International (CC BY 4.0)

Sponsorship

MRC (2266954)

Collections

Theses - MRC Biostatistics Unit