Deep learning of regulatory sequence variation in Pulmonary Arterial Hypertension


Type

Thesis

Authors

Abstract

Pulmonary arterial hypertension (PAH) is a rare and fatal lung disease. To date, the cause can be attributed to rare genetic variation in the protein-coding space in only a third of idiopathic patients. The sequencing of 13,343 whole genomes by the NIHR BioResource for Translational Research – Rare Diseases (NBR), including 1,216 PAH samples, provides an unprecedented opportunity to estimate the contribution of regulatory genome variation to the development of PAH. This work aims to determine whether sequence-based predictions of epigenetic features can be used to narrow down the possible regions of interest and allow aggregation of variants into functional groups for association testing. A convolutional neural network (CNN) was trained on publicly available data sets to predict epigenetic features from DNA sequences. The model was tested against known enhancer regions, which verified its accuracy. Two approaches were then developed for evaluating the predicted epigenetic features. Firstly, an epigenetic importance score supplies general information about the availability of epigenetic profiles within a region, to explore the non-coding space. Secondly, a regulation score combines the predicted features into activating and repressing subsets for more detailed analyses that gauge the regulatory impact of variants. Based on the regulatory impact and other common variant annotations, variants were filtered and aggregated for over-representation analysis, comparing cases with controls. These scores were used in an outlier analysis using Fisher's exact test and in a sweep across the landscape of 1,135 PAH gene-associated enhancers using SKAT-O. After p-value adjustment, over 80 regions were found to be significant. The statistical analysis revealed likely disease-causing sequence variation in ENG enhancers, as well as strong associations in ACVRL1 and KLK1 enhancers.
Here, I present an extended search into enhancer networks associated with PAH, unlocking the non-coding space for genomic medicine.
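
As a rough illustration of the over-representation step, the sketch below applies a one-sided Fisher's exact test to the number of variant carriers among cases and controls for a single region. It is a minimal example under stated assumptions, not the procedure used in the thesis: the function name `fisher_exact_greater`, the one-sided alternative, and all counts are hypothetical, and the abstract does not specify how carriers were defined or which p-value adjustment was applied.

```python
from math import comb


def fisher_exact_greater(case_carriers, case_total, control_carriers, control_total):
    """One-sided Fisher's exact test for over-representation in cases.

    Returns P(X >= case_carriers), where X follows the hypergeometric
    distribution implied by fixing the margins of the 2x2 table.
    """
    carriers = case_carriers + control_carriers   # all carriers of qualifying variants
    total = case_total + control_total            # all samples
    denom = comb(total, case_total)               # ways to choose the case group
    upper = min(carriers, case_total)             # largest possible carrier count in cases
    tail = sum(
        comb(carriers, k) * comb(total - carriers, case_total - k)
        for k in range(case_carriers, upper + 1)
    )
    return min(tail / denom, 1.0)


# Hypothetical counts for one enhancer region (illustrative only):
# 6 of 1,000 cases and 10 of 10,000 controls carry a qualifying variant.
p_value = fisher_exact_greater(6, 1000, 10, 10000)
print(f"one-sided p-value: {p_value:.3g}")
```

In a scan over many regions, the resulting p-values would still need a multiple-testing adjustment before any region is called significant.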

Description

Date

2023-03-01

Advisors

Gräf, Stefan
Morrell, Nicholas
Lio, Pietro

Keywords

Bioinformatics, Deep Learning, Genomics, Medicine, PAH, Pulmonary Arterial Hypertension, WGS

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge