Understanding and predicting pulmonary exacerbations in Cystic Fibrosis using Machine Learning


Type

Thesis

Authors

Sutcliffe, Damian 

Abstract

Cystic Fibrosis (CF) is the most prevalent life-limiting, multi-system genetic disorder, affecting over 100,000 individuals globally. The condition is caused by mutations in the cystic fibrosis transmembrane conductance regulator (CFTR) gene, which corrupt the instructions for making the corresponding CFTR protein. In the lungs, the defective protein leads to a shallower and more acidic airway surface liquid, a sticky and viscous outer mucous layer, and impaired mucociliary clearance. This in turn leads to airway obstruction and chronic bacterial infection.

Recurring, sudden clinical deteriorations in respiratory symptoms - termed acute pulmonary exacerbations (APEs) - cause cumulative damage to the lungs, and are the most significant driver of mortality and morbidity in CF. Despite this, little is currently known about their pathophysiology or what triggers them. The ability to accurately predict impending APEs would permit earlier treatment, reduce inflammatory lung damage, and directly benefit life expectancy. The average time to treatment varies from a few days to around a month, so there is a significant opportunity to reduce the delay from current levels. Achieving this outcome was the driving force behind my research.

First, I developed an unsupervised machine learning model (the Alignment Model) that, for the first time, was able to generate a characteristic profile of the changes in physiology and symptoms during an APE, and to define an accurate start point for exacerbations. Of particular interest was the existence of a partial interim recovery approximately 10 days after the start of the APE. By extending the model I was also able to identify three distinct classes of APE: one closely resembled the global profile, another showed declines in symptoms before FEV1, and a third showed signs of repeated exacerbation and a steeper decline.

Second, I used the inferred exacerbation start dates from the Alignment Model to categorise the full set of study days into stable vs unstable (APE episodes). Using this training data, I developed a supervised ML model (the Predictive Classifier) that was able to predict the onset of APEs with 83.6% reliability, and on average 9.5 days earlier than current clinical practice. For this research, I used the SmartCareCF study home monitoring data-set. Importantly, I was also able to quantify that including physiological measures in the data collection process resulted in a 51% improvement over self-reported symptom scores alone. Given the already time-consuming treatment regimen for people with CF, this is a key justification of the value of providing these additional measurements.
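The labelling step described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each episode is an inclusive date window running from the inferred start date to a known end date; the function name `label_study_days`, the window definition and the example dates are hypothetical and are not taken from the thesis.

```python
from datetime import date, timedelta


def label_study_days(study_days, episodes):
    """Label each study day 'unstable' if it falls inside any episode
    window (inclusive), otherwise 'stable'.

    episodes: list of (inferred_start, episode_end) date pairs.
    The window definition is an assumption made for this sketch.
    """
    labels = {}
    for day in study_days:
        in_episode = any(start <= day <= end for start, end in episodes)
        labels[day] = "unstable" if in_episode else "stable"
    return labels


# Hypothetical example: a 10-day monitoring period with one episode.
first_day = date(2020, 1, 1)
study_days = [first_day + timedelta(days=i) for i in range(10)]
episodes = [(date(2020, 1, 4), date(2020, 1, 6))]

labels = label_study_days(study_days, episodes)
unstable_days = [d for d, lab in labels.items() if lab == "unstable"]
```

The resulting per-day labels are the kind of target a supervised classifier could then be trained on.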

I observed that the performance of the Predictive Classifier was negatively affected by increasing amounts of missing input data. In order to use the predictive algorithm in a clinical setting, it is critical to understand the extent to which any missing input data might have affected the predictions, and, therefore, whether any given prediction can be trusted (is safe) or not. Consequently, in my next area of research, I generated a synthetic data-set that represented the sensitivity of performance of the Predictive Classifier to the amount and pattern of missing data points. I then used this to train a separate ML model (the Safety Classifier) that was able to determine whether the APE predictions were safe or not. It achieved a PR-AUC of 89.7% and an ROC-AUC of 91.2%. Additionally, I was able to use the Safety Classifier iteratively to determine the minimum amount of data that was required to guarantee a safe APE prediction, which could be used in future studies to optimise the data collection requirement.
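For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how they are commonly computed from binary labels and classifier scores. It is not the evaluation code used in the thesis: the abstract does not state which PR-AUC estimator was used, so average precision is assumed here, and the example labels and scores are invented. The average-precision function also assumes there are no tied scores.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a randomly chosen positive
    scores higher than a randomly chosen negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Average precision: the mean of the precision measured at the rank
    of each positive example. One common estimate of PR-AUC (assumed
    here). Assumes no tied scores."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Invented example: 1 = prediction was safe, 0 = prediction was unsafe.
example_labels = [1, 0, 1, 1, 0, 0]
example_scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]

auc = roc_auc(example_labels, example_scores)
ap = average_precision(example_labels, example_scores)
```

PR-AUC is usually the more informative of the two when the positive class is rare, which is why both are often reported together.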

Finally, I was able to apply my research to a new ongoing adult CF home-monitoring study (Project Breathe). The results for both the Alignment Model and the Predictive Classifier were consistent with my earlier findings. However, there were three significant environmental factors that should be noted: i) the study period coincided with the broader rollout of triple modulator therapy, which I was able to show reduces the frequency of APEs by 75%; ii) there was also an extended period of COVID-19 enforced isolation, which reduced cross-infection risk; iii) the data collection compliance was significantly lower - for the five most important measures, it was less than half that of SmartCareCF. Despite the overall data-set being nearly five times the size, there were relatively fewer usable APE events - both in absolute count (55) and in overall proportion (1%) - and so this somewhat reduces my confidence in the generalisability of the results. Additionally, the Safety Classifier showed that only 10% of the overall Project Breathe study days would be determined as safe for making an APE prediction, approximately five times lower than for SmartCareCF. A material increase in data collection would be required to be able to use the algorithms in a clinical setting.

Overall, I was able to meet my research objectives: demonstrating that there is a signal in the home measurement data that can be used to identify APEs, building an ML model that could reliably and accurately predict the onset of APEs over a week earlier than current clinical practice, and developing a framework to ensure these APE predictions are safe in the context of missing input data. I am excited to be taking the results of my research into a clinical trial later this year.

Date

2022-09-13

Advisors

Floto, Andres
Winn, John

Keywords

Adaptive Boost Classifier, Characteristic Profile of Exacerbation, Classification, Cough Frequency, Cystic Fibrosis, Data Quality Assurance, Different kinds of exacerbation, Ensemble Learning, Exacerbation, Expectation Maximization, FEV1, Home Monitoring, Linear Logistic Classifier, Machine Learning, Missing Data, O2 Saturation, Prediction, Project Breathe, Pulse Rate, SmartCareCF, Supervised Machine Learning, Tele-monitoring, Unsupervised Machine Learning, Visualisation, Wellness

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge

Sponsorship

EPSRC (RCAG/909)
Microsoft Research (RCAG/914)