Semiparametric Methods for Two Problems in Causal Inference using Machine Learning

Klyne, Harvey

Repository URI

https://www.repository.cam.ac.uk/handle/1810/363809

Repository DOI

https://doi.org/10.17863/CAM.105716

Files

Primary Thesis (2.74 MB)

Type

Thesis

Authors

Klyne, Harvey

Abstract

Scientific applications such as personalised (precision) medicine require statistical guarantees on causal mechanisms; however, in many settings only observational data with complex underlying interactions are available. Recent advances in machine learning have made it possible to model such systems, but their inherent biases and black-box nature pose an inferential challenge. Semiparametric methods can nonetheless leverage these powerful nonparametric regression procedures to provide valid statistical analysis of interesting parametric components of the data generating process.

This thesis consists of three chapters. The first chapter summarises the semiparametric and causal inference literatures, paying particular attention to doubly-robust methods and conditional independence testing. In the second chapter, we explore the doubly-robust estimation of the average partial effect, which is a generalisation of the linear coefficient in a (partially) linear model and a local measure of causal effect. This framework involves two plug-in nuisance function estimates, and trades their errors off against each other. The first nuisance function is the conditional expectation function, whose estimate is required to be differentiable. We propose convolving an arbitrary plug-in machine learning regression (which need not be differentiable) with a Gaussian kernel, and demonstrate that for a range of kernel bandwidths we can achieve the semiparametric efficiency bound at no asymptotic cost to the regression mean-squared error. The second nuisance function is the derivative of the log-density of the predictors, termed the score function. This score function does not depend on the conditional distribution of the response given the predictors. Score estimation is well studied only in the univariate case. We propose using a location-scale model to reduce the problem of multivariate score estimation to conditional mean and variance estimation plus univariate score estimation. This enables the use of an arbitrary machine learning regression. Simulations confirm the desirable properties of our approaches, and code is provided in the R package drape (Doubly-Robust Average Partial Effects), available from https://github.com/harveyklyne/drape.
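As a rough illustration of the second chapter's idea, the following Python sketch convolves a non-differentiable fitted regression with a Gaussian kernel and plugs the result into the standard doubly-robust form for the average partial effect, mean of f'(X) - rho(X)(Y - f(X)), where rho is the score function. It is not the drape package or its API. The function names, the Gauss-Hermite quadrature, the bandwidth value, the absence of sample splitting, and the assumption of a single standard Gaussian predictor (so that the score is known to be -x) are all simplifications made here for illustration.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss


def smooth_and_differentiate(f_hat, x, bandwidth, n_nodes=40):
    """Convolve f_hat with a N(0, bandwidth^2) kernel at each point of x.

    Returns the smoothed values E[f_hat(x + h W)] and their derivatives
    E[W f_hat(x + h W)] / h, with W ~ N(0, 1), via Gauss-Hermite quadrature.
    """
    nodes, weights = hermegauss(n_nodes)
    weights = weights / weights.sum()  # normalise to a discrete N(0, 1)
    shifted = x[:, None] + bandwidth * nodes[None, :]
    values = f_hat(shifted.ravel()).reshape(shifted.shape)
    smoothed = values @ weights
    derivative = (values * nodes) @ weights / bandwidth
    return smoothed, derivative


def dr_average_partial_effect(x, y, f_hat, score, bandwidth):
    """Doubly-robust average partial effect estimate with a standard error."""
    smoothed, derivative = smooth_and_differentiate(f_hat, x, bandwidth)
    influence = derivative - score(x) * (y - smoothed)
    estimate = influence.mean()
    std_error = influence.std(ddof=1) / np.sqrt(len(x))
    return estimate, std_error


# Hypothetical demo: a step function stands in for a tree-based fit.
rng = np.random.default_rng(0)
x = rng.standard_normal(2000)
y = np.sin(x) + x + 0.5 * rng.standard_normal(2000)
step_fit = lambda t: np.round(2 * (np.sin(t) + t)) / 2  # not differentiable
known_score = lambda t: -t  # score of N(0, 1); assumed known in this sketch
estimate, std_error = dr_average_partial_effect(
    x, y, step_fit, known_score, bandwidth=0.3
)
print(f"APE estimate {estimate:.3f} (s.e. {std_error:.3f})")
```

In practice the regression and the score would both be estimated from data, typically on separate folds; the sketch fixes them in advance only to keep the example short.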

In the third chapter, we consider testing for conditional independence of two discrete random variables X and Y given a third continuous variable Z. Conditional independence testing forms the basis for constraint-based causal structure learning, but it has been shown that any test which controls size for all null distributions has no power against any alternative. For this reason it is necessary to restrict the null space, and it is convenient to do so in terms of the performance of machine learning methods. Previous work has additionally made strong structural assumptions on both X and Y. A doubly-robust approach which does not make such assumptions is to compute a generalised covariance measure using an arbitrary machine learning method, reducing the test for conditional correlation to testing whether an asymptotically Gaussian vector has mean zero. This vector is often high-dimensional and naive tests suffer from a lack of power. We propose greedily merging the labels of the underlying discrete variables so as to maximise the observed conditional correlation. By doing so we uncover additional structure in an adaptive fashion. Our test is calibrated using a novel double bootstrap. We demonstrate an algorithm to perform this procedure in a computationally efficient manner. Simulations confirm that we are able to improve power in high-dimensional settings with low-dimensional structure, whilst maintaining the desired size control. Code is provided in the R package catci (CATegorical Conditional Independence), available from https://github.com/harveyklyne/catci.
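For orientation, the baseline that the third chapter improves upon can be sketched as follows: one-hot encode X and Y, regress each indicator on Z, and test whether the vector of averaged residual products has mean zero using a max-type statistic with a Gaussian multiplier bootstrap. This is a sketch of a generic generalised covariance measure test, not the label-merging procedure or double bootstrap proposed in the thesis, and not the catci API. The function names, the Nadaraya-Watson smoother (a stand-in for an arbitrary machine learning regression), the bandwidth, and the in-sample fitting are assumptions made here.

```python
import numpy as np


def one_hot(labels):
    """Indicator matrix with one column per observed level."""
    levels = np.unique(labels)
    return (labels[:, None] == levels[None, :]).astype(float)


def kernel_regression(z, targets, bandwidth=0.3):
    """Nadaraya-Watson estimate of E[targets | Z] at the sample points."""
    diffs = (z[:, None] - z[None, :]) / bandwidth
    w = np.exp(-0.5 * diffs**2)
    w /= w.sum(axis=1, keepdims=True)
    return w @ targets


def gcm_max_test(x, y, z, n_boot=2000, seed=0):
    """Max-type GCM statistic and multiplier-bootstrap p-value."""
    rng = np.random.default_rng(seed)
    x_ind, y_ind = one_hot(x), one_hot(y)
    x_res = x_ind - kernel_regression(z, x_ind)
    y_res = y_ind - kernel_regression(z, y_ind)
    # Indicators sum to one, so the last level of each is redundant.
    x_res, y_res = x_res[:, :-1], y_res[:, :-1]
    n = len(z)
    products = (x_res[:, :, None] * y_res[:, None, :]).reshape(n, -1)
    centred = products - products.mean(axis=0)
    scale = centred.std(axis=0)
    stat = np.max(np.abs(np.sqrt(n) * products.mean(axis=0) / scale))
    # Gaussian multipliers mimic the null distribution of the whole vector.
    multipliers = rng.standard_normal((n_boot, n))
    boot_max = np.abs(multipliers @ (centred / scale) / np.sqrt(n)).max(axis=1)
    p_value = (1 + np.sum(boot_max >= stat)) / (1 + n_boot)
    return stat, p_value


# Hypothetical demo: X and Y each depend on Z but not on each other.
rng = np.random.default_rng(0)
z = rng.uniform(-1.0, 1.0, 400)
x = np.digitize(z + rng.standard_normal(400), [-0.5, 0.5])
y = np.digitize(z + rng.standard_normal(400), [0.0])
stat, p_value = gcm_max_test(x, y, z)
print(f"max statistic {stat:.2f}, p-value {p_value:.3f}")
```

The length of the tested vector grows with the product of the numbers of levels of X and Y, which is the source of the power loss that the abstract attributes to naive tests.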

Date

2023-06-16

Advisors

Shah, Rajen

Keywords

causal inference, machine learning, statistics

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge

Sponsorship

Engineering and Physical Sciences Research Council (2261074)

Collections

Theses - Pure Mathematics and Mathematical Statistics