Theses - Computer Science and Technology


Recent Submissions

Now showing 1 - 20 of 187
  • Item (Open Access)
    Distributional and relational inductive biases for graph representation learning in biomedicine
    Scherer, Paul [0000-0002-2240-7501]
    The immense complexity with which DNAs, RNAs, proteins and other biomolecules interact with one another and with their environment to bring about life processes motivates the mass collection of biomolecular data and data-driven modelling to gain insights into physiological phenomena. Recent predictive modelling efforts have focused on deep representation learning methods, which offer a flexible modelling paradigm for handling high-dimensional data at scale and incorporating inductive biases. The emerging field of representation learning on graph-structured data opens opportunities to leverage the abundance of structured biomedical knowledge and data to improve model performance. Grand international initiatives have been coordinated to organise and structure our growing knowledge about the interactions and putative functions of biomolecular entities using graphs and networks. This dissertation considers how we may use the inductive biases within recent graph representation learning methods to leverage these structures and incorporate biologically relevant relational priors into machine learning methods for biomedicine. We present contributions in two parts, aiming to foster research in this multidisciplinary domain and to present novel methods that achieve strong performance through the use of distributional and relational inductive biases operating on graph-structured biomedical knowledge and data. The first part is concerned with consolidating and expanding the current ecosystem of practical frameworks dedicated to graph representation learning. Our first contribution presents Geo2DR, the first practical framework and software library for constructing methods capable of learning distributed representations of graphs.
Our second contribution, PyTorch Geometric Temporal, is the first open-source representation learning library for dynamic graphs, expanding the scope of research software on graph neural networks, which was previously limited to static graphs. The second part presents three methods, each tackling an active biomedical research problem using relational structures that exist within different aspects of the data. First, we present a methodology for learning distributed representations of molecular graphs in the context of drug pair scoring. Next, we present a method for leveraging structured knowledge about the variables of gene expression profiles to automatically construct sparse neural models for cancer subtyping. Finally, we present a state-of-the-art cell deconvolution model for spatial transcriptomics data using the positional relationships between observations in the dataset.
  • Item (Open Access)
    Computational criminology: at-scale quantitative analysis of the evolution of cybercrime forums
    Hughes, Jack [0000-0002-0730-1055]
    Cybercrime forums and marketplaces are used by members to share hacking techniques, hold general community-building discussions, and trade hacking tools. While there is a large corpus of literature studying these platforms, ranging from cross-forum ecosystem comparisons to smaller qualitative analyses of specific crime types within a single forum, there has been little research studying them over time. Using the CrimeBB dataset from the Cambridge Cybercrime Centre, the first contribution of the thesis explores the evolution of a large cybercrime forum, from growth to gradual decline from peak activity, with research questions grounded in the digital drift framework from criminological theory. This finds a trend towards financially-driven cybercrime over time, by users and the forum as a whole. The second contribution of the thesis presents a method for detecting trending terms, using a lightweight natural language processing method to handle queries, given the size of the dataset. Evaluation using manual annotations showed that it detected more relevant salient terms than TF-IDF. Finally, the third contribution of the thesis applies signalling theory to analyse the usage of argot (jargon and slang) on the forum, finding a negative correlation with reputation usage, and using clustering to find a decreasing use of argot over time. Part of this contribution includes a lightweight argot detection pipeline with word embeddings aligned with manual annotations. Overall, the combination of different approaches, with criminological theory driving research directions, natural language processing to analyse forum text data, machine learning for classification, and data science techniques, provides a unique interdisciplinary perspective within the field of cybercrime community research, both drawing insights into these communities and contributing novel tools for measuring large, noisy text data.
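The trending-term idea, comparing how often a term appears in a recent window against its historical rate, can be sketched in a few lines (a hypothetical illustration of the general technique, not the thesis's actual pipeline):

```python
from collections import Counter

def trending_terms(past_posts, recent_posts, min_count=2, smoothing=1.0):
    """Rank terms by the ratio of their frequency in a recent window
    to their (smoothed) frequency in a past window."""
    past = Counter(w for post in past_posts for w in post.lower().split())
    recent = Counter(w for post in recent_posts for w in post.lower().split())
    past_total = sum(past.values()) or 1
    recent_total = sum(recent.values()) or 1
    scores = {}
    for term, count in recent.items():
        if count < min_count:
            continue  # ignore rare terms to reduce noise
        recent_rate = count / recent_total
        past_rate = (past[term] + smoothing) / (past_total + smoothing)
        scores[term] = recent_rate / past_rate
    return sorted(scores, key=scores.get, reverse=True)
```

A term that is common now but was rare before scores highest; unlike plain TF-IDF, the score is explicitly relative to an earlier time window.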
  • Item (Open Access)
    Strong metadata privacy for mobile devices and applications
    Hugenroth, Daniel [0000-0003-3413-1722]
    Smartphones have become the primary computing devices for many. Living inconspicuously in our pockets, they store our most intimate personal messages and pictures as well as sensitive corporate information and government secrets. This has already motivated widespread adoption of end-to-end encryption for mobile messaging applications, such as WhatsApp and Signal, which protect the confidentiality of messages. However, metadata, such as who has been messaging whom and when, can still be observed by platform operators, local internet providers, and other adversaries tapping into network traffic. This dissertation presents protocols and applications for mobile devices that protect not only the content of messages but also communication patterns. Anonymity networks provide metadata privacy, but the most popular ones, like Tor, remain vulnerable to traffic analysis, while strong alternatives, like Loopix, use cover traffic at the expense of higher bandwidth and latency. In this context, smartphones raise two important challenges: battery constraints dictate conservative power usage, and connectivity is often intermittent. In order to better understand power consumption on modern smartphones, we run experiments on real hardware and find that cryptographic operations are cheap while radio transmission can be costly. In particular, popular solutions such as VPNs and Tor are practical, with negligible impact on battery life. However, more secure designs using cover traffic are impractical and highlight the need for protocol design that takes energy limitations into account. The latency and bandwidth requirements of protocols with strong metadata privacy are particularly challenging when sending messages to many recipients---especially on mobile devices where users are often offline. We design Rollercoaster, a multicast scheme for mix networks which incorporates these constraints and allows better utilisation of the underlying network for sporadic group communication.
This enables decentralised applications such as group messaging and collaborative text editing while retaining efficient mix parameters. Finally, we present CoverDrop, a practical system for initial contact between whistleblowers and journalists. CoverDrop integrates into a standard news reader app such that all its users contribute cover traffic to achieve unobservable communication for sources while having negligible impact on battery life. In addition, we implement plausibly-deniable storage to keep previous usage of CoverDrop secret even if the phone is captured by an adversary. To achieve this, our key stretching scheme, called Sloth, uses the Secure Element found in many modern smartphones, preventing the adversary from parallelising brute-force attacks and therefore allowing for shorter, more memorable passphrases.
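Sloth itself binds each passphrase guess to a hardware Secure Element operation, which cannot be reproduced in portable code, but the general principle of key stretching, making each guess expensive so that shorter passphrases resist brute force longer, can be illustrated with a standard PBKDF2 derivation (illustrative parameters, not Sloth's construction):

```python
import hashlib

def stretch_passphrase(passphrase: str, salt: bytes, iterations: int = 100_000) -> bytes:
    """Derive a 256-bit key from a passphrase; the iteration count makes each
    brute-force guess proportionally more expensive. (Sloth instead ties each
    guess to a Secure Element call, which also defeats parallel guessing.)"""
    return hashlib.pbkdf2_hmac("sha256", passphrase.encode(), salt, iterations)
```

The same passphrase and salt always yield the same key, so the derived key can protect local storage without ever being written to disk.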
  • Item (Open Access)
    Eavesdropping risks of the DisplayPort video interface
    Erdeljan, Dimitrije [0009-0000-1863-5221]
    The switching activity of digital circuits unintentionally generates electromagnetic signals, which may leak sensitive information processed by the device to nearby radio receivers. This problem, known as *compromising emanations* or *TEMPEST*, has been demonstrated for computer displays using analog video interfaces (VGA) and older digital interfaces (LVDS, HDMI, DVI). DisplayPort is a newer interface with a significantly more complex signal structure, and in particular uses a linear-feedback shift register to scramble the transmitted pixel data. Due to scrambling, images produced by applying previously published eavesdropping techniques to DisplayPort appear as random noise, and the interface is thought to be a far more difficult target. I start by showing that DisplayPort is vulnerable to electromagnetic eavesdropping, assuming that the displayed image mainly consists of a small set of colours. The attack starts by recovering scrambler timing parameters and synthesising a replica of the scrambler synchronised with the target. This replica is then used to build templates for each of the expected colours, and to identify pixel colours from short-term cross-correlation between the received signal and templates. The two main limitations of this initial attack are limited accuracy of the reset-timing model and a requirement that the attacker already knows which colours are present in the image. I address the former by designing a scrambler tracking algorithm based on a phase-locked loop that keeps the local replica closely synchronised with the target. For the latter, I exploit several properties of the 8b/10b encoding used together with this accurate scrambler alignment to efficiently enumerate colours and produce a list of candidate colours likely to be present in the image. Finally, I extend the tracking algorithm to also align signal phase across frames, which enables coherent periodic averaging of template correlations. 
This averaging technique further improves the signal-to-noise ratio in the reconstructed image and thus increases eavesdropping range. Accurate time alignment additionally improves horizontal resolution over that achieved using the simpler timing model. I demonstrate that the algorithms developed in this thesis can be used to recover clearly readable text from 8 m distance in realistic circumstances, even using a software-defined radio receiver with a bandwidth that is an order of magnitude lower than the bitrate used in the DisplayPort video link.
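The core of the attack, cancelling the scrambler by XOR-ing the received signal against a synchronised replica, can be illustrated with a toy linear-feedback shift register (the taps and reset value below are illustrative placeholders, not the exact DisplayPort parameters):

```python
def lfsr16_stream(state=0xFFFF, taps=(16, 5, 4, 3), nbytes=8):
    """Byte stream from a Fibonacci LFSR; taps are 1-indexed bit positions."""
    out = bytearray()
    for _ in range(nbytes):
        byte = 0
        for _ in range(8):
            bit = 0
            for t in taps:
                bit ^= (state >> (t - 1)) & 1
            state = ((state << 1) | bit) & 0xFFFF
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)

def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

# The link scrambles pixel data by XOR-ing it with the LFSR output; an
# eavesdropper who synthesises a synchronised replica cancels it the same way.
pixels = bytes([0x00, 0xFF, 0x00, 0xFF, 0x12, 0x34])
scrambled = xor_bytes(pixels, lfsr16_stream(nbytes=6))
recovered = xor_bytes(scrambled, lfsr16_stream(nbytes=6))
assert recovered == pixels
```

Without the replica the scrambled bytes look like noise, which is why naive image-reconstruction techniques fail on DisplayPort; with it, the XOR layer drops out entirely.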
  • Item (Open Access)
    Argument mining with informal text
    Ye, Yuxiao
    The rapid growth of online discussions has led to a wealth of user-generated text, rich in arguments and diverse in nature. However, the complexity of these informal arguments presents a challenge for argument mining. Argument mining is the task of automatically analysing arguments, such that the unstructured information contained in them is converted into structured representations. Current practice in argument mining largely focuses on well-structured and edited formal text, but the annotation schemes and models developed for these simpler texts cannot account well for the phenomena found in informal text. To capture the characteristics of informal arguments, I designed an annotation scheme which includes undercuts, a counterargument device that challenges the relationship between a premise and a claim. Other computational approaches conflate undercuts with direct attacks, a device where the truth of the claim or the premise itself is challenged. I also presented the resultant large-scale Quora dataset featuring informal arguments, complemented by a layer of annotation detailing complete argument structures. I then proposed an end-to-end approach to argument mining based on dependency parsing. My approach uses new dependency representations for arguments and two new neural dependency parsers, one based on biaffine parsing and the other on GNNs. It comfortably beats a strong baseline on the Quora dataset. When applied to an existing benchmark dataset of formal arguments, my approach establishes a new state of the art. It is also the first automatic argument mining approach that is able to recognise undercuts. Furthermore, I conducted a study on external knowledge integration for end-to-end argument mining, such as information from syntax, discourse, knowledge graphs, and large language models. I found that feature-based integration using GPT-3.5 is the most effective method among those I have surveyed. 
Overall, I hope that my work, by providing automatic analyses of arguments in online discussions, will eventually foster better understanding among people with different opinions.
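The distinction between an undercut and a direct attack can be made concrete with a toy argument graph (a hypothetical example for illustration, not the thesis's annotation scheme):

```python
# Units are propositions; each relation links a source unit to a target.
# A direct attack targets a unit (its truth is challenged); an undercut
# targets another *relation*, challenging the premise-claim link itself.
units = {
    0: "Claim: remote work improves productivity",
    1: "Premise: commuting time is saved",
    2: "Premise: saved time is often absorbed by extra meetings",
}
relations = [
    {"id": "r0", "label": "support", "source": 1, "target": ("unit", 0)},
    {"id": "r1", "label": "undercut", "source": 2, "target": ("relation", "r0")},
]

def undercut_ids(rels):
    """Relations whose target is another relation rather than a unit."""
    return [r["id"] for r in rels if r["target"][0] == "relation"]
```

Here unit 2 does not deny that commuting time is saved; it denies that the saving supports the claim, which is exactly what schemes conflating undercuts with direct attacks cannot express.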
  • Item (Open Access)
    Coding for emerging archival storage media
    Sella, Omer [0000-0002-2795-8580]
    The race between generating digital data and storing it has prompted a search for new media to hold our data for centuries, with fused silica and DNA in the lead. These media are undergoing rapid research and development. Error Correcting Codes and coding schemes must be designed for these emerging media's constraints and noise characteristics, much as in the large body of work on coding for communication applications. Unlike communication standards, digital data storage, primarily archival, can and should capitalise on longer block sizes and more complex coding. Longer blocks have the potential to reduce coding overhead and therefore cost, while longer retrieval latency allows for more complex algorithms. This cycle of noise characterisation and code design for storage media could be made more efficient by automation and generalisation. In this work, we present the use of Reinforcement Learning to construct long Error Correcting Codes. We show that Reinforcement Learning is effective when targeting the end goal of reducing Bit Error Rate rather than the proxy metrics used in state-of-the-art heuristics. In addition, we present a unified approach to handling constraints in coding data into DNA. Together these provide a practical toolbox that would allow the co-design of a storage medium and its accompanying coding scheme. Finally, we show that our toolbox requires little human expert intervention, which facilitates designing coding schemes in lockstep with the media's rapid development.
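One well-known constraint when coding data into DNA is avoiding homopolymer runs (repeated identical bases), which raise synthesis and sequencing error rates. A minimal sketch of a run-free transcoding, shown only to illustrate the kind of constraint involved, not the unified approach of the thesis:

```python
BASES = "ACGT"

def trits_to_dna(trits, start="A"):
    """Encode base-3 digits so no base ever repeats consecutively:
    each trit picks one of the three bases differing from the previous one."""
    seq, prev = [], start
    for t in trits:
        choices = [b for b in BASES if b != prev]
        prev = choices[t]
        seq.append(prev)
    return "".join(seq)

def dna_to_trits(seq, start="A"):
    """Invert trits_to_dna by recomputing the choice list at each step."""
    trits, prev = [], start
    for b in seq:
        choices = [c for c in BASES if c != prev]
        trits.append(choices.index(b))
        prev = b
    return trits
```

Because each position excludes the previous base, homopolymer runs are impossible by construction, at the cost of storing log2(3) rather than 2 bits per base.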
  • Item (Open Access)
    Securing encrypted communication
    Vasile, Diana-Alexandra [0000-0002-3476-3060]
    Secure messaging has led to the mass adoption of strong security principles such as end-to-end encryption and perfect forward secrecy, which had previously failed to gain traction. While this marks tremendous progress in the effort to enhance the security of communication, there are still many open challenges. This dissertation looks at two open problems: providing key transparency in secure messaging apps in an attempt to detect targeted wiretaps, and securing the initial contact for journalists. We begin by formalising the different combinations of key-to-name bindings seen in popular secure messaging apps into key-name-graphs, which we then use to verify that the key server provides the same snapshot of a key-name-graph for a user to all of their friends. This approach is proposed as a baseline gossip protocol between friends who physically co-locate; however, when coupled with some enhancements, it has broader applicability, both to different underlying network technologies and to expanding verification beyond the friendship connection. We analyse the deployability of the baseline gossip protocol using secondary data from two datasets: Wi-Fi usage and call-detail records. We also implement the protocol as a discrete event simulator and use it to perform model checking to analyse the protocol's security. Secure messaging is not enough for everyone, though. There are certain cases in which further enhancements such as anonymity and metadata privacy are needed to protect those communicating. As such, we analysed the options available to journalists to communicate with sources who may later become whistleblowers. Through insights from two workshops organised with large British news organisations, we show that the options available to journalists are inadequate. We identify several open problems, such as low-latency secure and anonymous communication and secure collaboration.
We focus our efforts on initial contact, a problem that the workshops suggested has a significant detrimental effect on the security of sources. We discovered that sources often do not place significant emphasis on secure communication from the start, and that retrospectively applying security is non-trivial by the time they are ready to share sensitive information. We thus propose a new and secure initial contact system, called CoverDrop, which is integrated as a secure library directly inside newsreader apps.
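The gossip idea, friends comparing the key-name snapshot the server gave each of them, can be sketched by exchanging digests (a hypothetical illustration of the principle, not the protocol itself):

```python
import hashlib
import json

def snapshot_digest(key_name_graph: dict) -> str:
    """Canonical digest of the key-to-name bindings a server showed one client."""
    canonical = json.dumps(key_name_graph, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def detect_split_view(digests_from_friends) -> bool:
    """A targeted wiretap shows up as friends holding different snapshots."""
    return len(set(digests_from_friends)) > 1
```

A malicious key server that serves a substituted key to one friend must present a different snapshot to that friend, so comparing short digests during physical co-location is enough to detect the split view.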
  • Item (Open Access)
    Large-scale inference and imputation for multi-tissue gene expression
    Viñas Torné, Ramon [0000-0003-2411-4478]
    Integrating molecular information across tissues and cell types is essential for understanding the coordinated biological mechanisms that drive disease and characterise homoeostasis. Effective multi-tissue omics integration promises a system-wide view of human physiology, with potential to shed light on intra- and multi-tissue molecular phenomena, but faces many complexities arising from the intricacies of biomedical data. This integration problem challenges single-tissue and conventional techniques for omics analysis, which are often unable to model a variable number of tissues with sufficient statistical strength, and necessitates the development of scalable, non-linear, and flexible methods. This dissertation develops inference and imputation methods for the analysis of gene expression data, an immensely rich and complex biomedical data modality, enabling integration across multiple tissues. The imputation task can strongly influence downstream applications, including performing differential expression analysis, determining co-expression networks, and characterising cross-tissue associations. Inferring tissue-specific gene expression may also play a fundamental role in clinical settings, where gene expression is often profiled in accessible tissues such as whole blood. Because gene expression is highly context-specific, imputation methods may facilitate the prediction of gene expression in inaccessible tissues, with applications in diagnosing and monitoring pathophysiological conditions. The modelling approaches presented throughout the thesis address four important methodological problems. The first work introduces a flexible generative model for the in-silico generation of realistic gene expression data across multiple tissues and conditions, which may reveal tissue- and disease-specific differential expression patterns and may be useful for data augmentation.
The second study proposes two deep learning methods to study whether the complete transcriptome of a tissue can be inferred from the expression of a minimal subset of genes, with potential application in the selection of tissue-specific biomarkers and the integration of large-scale biorepositories. The third work presents a novel method, hypergraph factorisation, for the joint imputation of multi-tissue and cell-type gene expression, providing a system-wide view of human physiology. The fourth study proposes a graph representation learning approach that leverages spatial information to improve the reconstruction of tissue architectures from spatial transcriptomic data. Collectively, this thesis develops flexible and powerful computational approaches for the analysis of tissue-specific gene expression data.
  • Item (Open Access)
    Transient execution vulnerabilities in the security context of server hardware
    Randal, Allison
    The thesis of this work is that eliminating speculation is a feasible approach to mitigating the transient execution vulnerabilities on large-scale server hardware. Many mitigations have been proposed and implemented for the numerous variants of the transient execution vulnerabilities, and while the Meltdown-type, exception-based transient execution vulnerabilities have proven to be tractable, Spectre-type vulnerabilities and other speculation-based transient execution vulnerabilities have been far more resistant to countermeasures. After years of research and development by academia and industry, eliminating speculation is still the only reliable countermeasure against Spectre. For smaller-scale embedded systems or security-focused hardware such as a cryptographic system or a root-of-trust (RoT), eliminating speculation is widely accepted as a reasonable approach to improving security. But for larger-scale and general-purpose hardware, eliminating speculation is often rapidly dismissed as inconceivable, though the claim that speculation is required for adequate performance is rarely supported by concrete performance results. The performance results we do have, from several independent strands of research over the past few decades, have shown that speculation features on large-scale server hardware do not offer the same performance advantages as on smaller-scale hardware, so eliminating speculation on large-scale server hardware does not harm performance as much as we might expect. And selective speculation techniques have shown that speculation-based transient execution vulnerabilities can be mitigated by a partial elimination of speculation, so we can preserve some of the performance of speculation while subduing the security risk. In order to demonstrate the feasibility of eliminating speculation from modern server hardware microarchitectures, I consider three alternative approaches that partially or completely eliminate speculative execution.
Heterogeneous multicore systems that combine speculative and non-speculative cores make it possible to entirely disable speculation for security-critical or untrusted sections of code, by running that code on a non-speculative core. Code running on a speculative core performs as well as it would on a fully speculative hardware architecture. The systems software developer has the power to choose which code runs with the performance advantage of speculation, and which code runs with the security advantage of no speculation. However, heterogeneous multicores only offer the ability to disable speculation at the process or thread level. A finer-grained approach is desirable, to limit the performance penalty of disabled speculation to the smallest possible region of code. Non-speculative cores keep the performance advantages of most common features in modern hardware architectures---such as dynamic multiple issue, dynamic pipeline scheduling, out-of-order execution, and register renaming---while avoiding the risk of speculative execution. Such processors do not perform as well as equivalent speculative processors, but the results of this work indicate that they can perform as well or better than equivalent speculative processors with all relevant mitigations for the transient execution vulnerabilities applied. The performance penalty of eliminating speculation can also be partially offset by increasing the size of fetch and issue stage components in the pipeline. Non-speculative cores do not give systems software developers the option to choose between performance and security. However, these cores may be desirable for large-scale server deployments that exclusively serve privacy-centered workloads, such as processing hospital patient data. Selective speculation combines speculative and non-speculative features on a single core. 
The performance of selective speculation cores is proportional to the use of speculative and non-speculative features, so only regions of code that disable speculation pay a performance penalty. Out of the three approaches considered in this dissertation, selective speculation cores are best for large-scale general-purpose server deployments, because they simplify resource allocation by keeping all cores identical, have no performance penalty for code run as entirely speculative, and give systems software developers the most precise control over speculation.
  • Item (Open Access)
    Context-conscious fairness throughout the machine learning lifecycle
    Lee, Seng Ah
    As machine learning (ML) algorithms are increasingly used to inform decisions across domains, there has been a proliferation of literature seeking to define “fairness” narrowly as an error to be “fixed” and to quantify it as an algorithm’s deviation from a formalised metric of equality. Dozens of notions of fairness have been proposed, many of which are both mathematically incompatible and morally irreconcilable with one another. There is little consensus on how to define, test for, and mitigate unfair algorithmic bias. One key obstacle is the disparity between academic theory and practical and contextual applicability. The unambiguous formalisation of fairness in a technical solution is at odds with the contextualised needs in practice. The notion of algorithmic fairness lies at the intersection of multiple domains, including non-discrimination law, statistics, welfare economics, philosophical ethics, and computer science. Literature on algorithmic fairness has predominantly been published in computer science, and while it has been shifting to consider contextual implications, many approaches crystallised into open source toolkits are tackling a narrowly defined technical challenge. The objective of my PhD thesis is to address this gap between theory and practice in computer science by presenting context-conscious methodologies throughout the ML development lifecycle. The core chapters are organised by phase: design, test, deploy, and monitor. In the design phase, we propose a systematic way of defining fairness by understanding the key ethical and practical trade-offs. In the test phase, we introduce methods to identify and measure risks of unintended biases. In the deploy phase, we identify appropriate mitigation strategies depending on the source of unfairness. Finally, in the monitor phase, we formalise methods for monitoring fairness and adjusting the ML model appropriately to any changes in assumptions and input data.
The primary contribution of my thesis is methodological: it improves our understanding of the limitations of current approaches and proposes new tools and interventions. It shifts the conversation in academia away from axiomatic, unambiguous formalisations of fairness towards a more context-conscious, holistic approach that covers the end-to-end ML development lifecycle. This thesis aims to provide end-to-end coverage in guidance for industry practitioners, regulators, and academics on how fairness can be considered and enforced in practice.
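As one example of the formalised metrics whose trade-offs the design phase must weigh, demographic parity compares positive-prediction rates across groups. A minimal sketch (hypothetical helper, not code from the thesis):

```python
def demographic_parity_gap(preds, groups):
    """Largest difference in positive-prediction rates between any two groups.
    preds: 0/1 model decisions; groups: group label per individual."""
    rates = {}
    for g in set(groups):
        members = [p for p, gg in zip(preds, groups) if gg == g]
        rates[g] = sum(members) / len(members)
    vals = sorted(rates.values())
    return vals[-1] - vals[0]
```

A gap of 0 means every group receives positive decisions at the same rate; the thesis's point is that whether this is the *right* metric depends on context, since enforcing it can conflict with other fairness notions.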
  • Item (Open Access)
    Game comonads and beyond: compositional constructions for logic and algorithms
    Connolly, Adam [0000-0002-3032-5514]
    Game comonads represent a rare application of category-theoretic methods to the fields of finite model theory and descriptive complexity. First introduced by Abramsky, Dawar and Wang in 2017, these new constructions exposed connections between Spoiler-Duplicator games used in logic, related algorithms for constraint satisfaction and structure isomorphism, and well-known parameters such as treewidth and treedepth. The compositional framework for logical resources emerging from these comonads has proved an important tool in generalising results from finite model theory, and new game comonads have been invented for a range of different logics and algorithms. However, this framework has previously been limited by its inability to express logics which are strictly stronger than those captured by Abramsky, Dawar and Wang's pebbling comonad, $\mathbb{P}_k$. In this thesis, we show for the first time how to overcome these limitations by extending the reach of compositional techniques for logic and algorithms in a number of directions. Firstly, we deepen our understanding of the comonad $\mathbb{P}_k$, which previously captured the strongest logic of any game comonad. Doing so, we reveal new connections between the Kleisli category of $\mathbb{P}_k$ and $k$-variable logics extended with different forms of quantification, including limited counting quantifiers and unary generalised quantifiers. Secondly, we show how to construct a new family of game comonads $\mathbb{H}_{n,k}$ which capture logics extended by generalised quantifiers of all arities. This construction leads to new variants of Hella's $k$-pebble $n$-bijective game, new structural parameters generalising treewidth, and new techniques for constructing game comonads. Finally, we expand the realm of compositional methods in finite model theory beyond comonads, introducing new constructions on relational structures based on other aspects of category theory.
In the first instance, we show that lifting well-known linear-algebraic monads on Set to the category of relational structures gives a compositional semantics to linear programming approximations of homomorphism and an elegant framework for studying these techniques. Furthermore, we use presheaves to give a new semantics for pebble games and algorithms for constraint satisfaction and structure isomorphism. Building on analogous work in quantum contextuality, we use a common invariant based on cohomology to invent efficient algorithms for approximating homomorphism and isomorphism and prove that these are far more powerful than those currently captured by game comonads.
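As background for the compositional framework above: a game comonad equips each relational structure $A$ with a structure $\mathbb{G}A$ of plays, together with a counit and comultiplication obeying the standard comonad laws (recalled here for orientation; the generic symbol $\mathbb{G}$ is ours, not the thesis's notation):

```latex
% Counit and comultiplication
\varepsilon_A : \mathbb{G}A \to A, \qquad
\delta_A : \mathbb{G}A \to \mathbb{G}\mathbb{G}A
% Comonad laws
\varepsilon_{\mathbb{G}A} \circ \delta_A \;=\; \mathrm{id}_{\mathbb{G}A} \;=\; \mathbb{G}\varepsilon_A \circ \delta_A,
\qquad
\delta_{\mathbb{G}A} \circ \delta_A \;=\; \mathbb{G}\delta_A \circ \delta_A
```

Kleisli morphisms $\mathbb{G}A \to B$ then correspond, roughly, to Duplicator strategies in the associated Spoiler-Duplicator game, which is what makes logical resources compose.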
  • Item (Open Access)
    Balanced allocations under incomplete information: New settings and techniques
    Los, Dimitrios [0000-0002-1574-5166]
    In the balanced allocations framework, there are 𝑚 balls to be allocated into 𝑛 bins with the aim of minimising the maximum load of any of the bins, or equivalently minimising the 𝑔𝑎𝑝, i.e., the difference between the maximum load and the average load. In this dissertation, we focus on the ℎ𝑒𝑎𝑣𝑖𝑙𝑦-𝑙𝑜𝑎𝑑𝑒𝑑 𝑐𝑎𝑠𝑒 where 𝑚 ≫ 𝑛, which tends to be more challenging to analyse. In a decentralised setting, the simplest process is One-Choice, which allocates each ball to a bin sampled uniformly at random. It is well-known that w.h.p. Gap(𝑚) = Θ( sqrt( 𝑚/𝑛 · log 𝑛 ) ) for any 𝑚 ≫ 𝑛. A great improvement over this is the Two-Choice process [ABKU99, KLM96], which allocates each ball to the least loaded of 𝑡𝑤𝑜 bins sampled uniformly at random. Berenbrink, Czumaj, Steger, and Vöcking (2006) showed that w.h.p. Gap(𝑚) = log₂ log 𝑛 + Θ(1) for any 𝑚 ⩾ 𝑛. This improvement is known as the "power of two choices". It has found several applications in hashing, load balancing and routing; and its importance was recently recognised in the 2020 ACM Theory and Practice Award. In this dissertation, we introduce a set of techniques based on 𝑝𝑜𝑡𝑒𝑛𝑡𝑖𝑎𝑙 𝑓𝑢𝑛𝑐𝑡𝑖𝑜𝑛𝑠. These enable us to analyse (both in terms of gap and load distribution) a wide range of processes and settings in the heavily-loaded case and to establish interesting insights in the balanced allocations framework: - We analyse variants of the Two-Choice process which trade sample efficiency, completeness of information and gap guarantees. For the (1+β)-process which mixes One-Choice and Two-Choice with probability β in (0, 1], we prove tight bounds for small and large β, extending the results of Peres, Talwar and Wieder (2015). Another sample efficient family is that of Two-Thinning processes, which allocate to the two sampled bins in an online manner. 
For Two-Thinning processes that use as a decision function thresholds relative to the average load or thresholds in the rank domain, we establish tight bounds and also resolve a conjecture by Feldheim and Gurel-Gurevich (2021). We also quantify trade-offs for two-sample processes between the number of queries and the gap bound, establishing a "power of two queries" phenomenon. - We analyse the Two-Choice process with random, adversarial and delay noise, proving tight bounds for various settings. In the adversarial setting, the adversary can decide in which of the two sampled bins the ball is allocated to, only when the two loads differ by at most 𝑔. The analysis of this setting implies bounds for settings with random noise and delay. For the setting where load information is updated periodically every 𝑏 steps, for 𝑏 = 𝑛 we tighten the bound of [BCEFN12] to Θ( log 𝑛 / log log 𝑛 ) and prove that Two-Choice is optimal in this setting for any in [𝑛 · exp(-logᶜ 𝑛), 𝑛 log 𝑛] for any constant 𝑐 > 0. For 𝑏 in [𝑛 log 𝑛, 𝑛³], we show that Two-Choice achieves w.h.p. a Θ(𝑏/𝑛) gap, while surprisingly the (1+β)-process with appropriately chosen β achieves w.h.p. a Θ( sqrt( 𝑏/𝑛 · log 𝑛) ) gap, which is optimal over a large family of processes. This proves that in the presence of outdated information, less aggressive strategies can outperform the greedy processes (such as Two-Choice), which has been empirically observed in the queuing setting [D00, M00] for centralised processes since 2000, but to the best of our knowledge has not been formally proven. - Next we analyse Two-Choice in the graphical setting, where bins are vertices of a graph and each ball is allocated to the lesser loaded of the vertices adjacent to a randomly sampled edge. We extend the results of Kenthapadi and Panigrahy (2006) proving that for dense expanders in the heavily-loaded case the gap is w.h.p. O(log log 𝑛). 
In the presence of weights, we make progress towards [Open Problem 1, PTW15] by proving that for graphs with conductance φ, the gap is w.h.p. O(log 𝑛 / φ). - Further, we introduce and analyse processes which can allocate more than one ball to a sampled bin. We prove that these processes achieve w.h.p. an O(log 𝑛) gap (which also applies to any 𝑑-regular graph), while still being more sample-efficient than One-Choice ("power of filling"). - For the Memory process, which can store bins in a cache, we generalise the O(log log 𝑛) gap bound of Mitzenmacher, Prabhakar and Shah (2002) to the heavily-loaded case and prove a matching lower bound. Further, in the presence of heterogeneous sampling distributions, we establish a striking difference between Two-Choice (or even 𝑑-Choice with 𝑑 = O(1)) and Memory, showing that for the latter the gap is bounded, while for the former it is known to diverge [W07] ("power of memory").
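The difference between One-Choice and Two-Choice described above is easy to observe empirically. The following simulation sketch is illustrative only (it is not code from the dissertation, and the parameter choices are arbitrary):

```python
import random

def simulate_gap(m, n, d, seed=0):
    """Allocate m balls into n bins; each ball goes to the least
    loaded of d bins sampled uniformly at random (d=1 is One-Choice,
    d=2 is Two-Choice). Returns the gap: max load minus average load."""
    rng = random.Random(seed)
    loads = [0] * n
    for _ in range(m):
        best = min((rng.randrange(n) for _ in range(d)),
                   key=loads.__getitem__)
        loads[best] += 1
    return max(loads) - m / n

# Heavily loaded case: m >> n. Two-Choice keeps the gap dramatically
# smaller than One-Choice, matching the w.h.p. bounds quoted above.
gap_one = simulate_gap(m=100_000, n=100, d=1)
gap_two = simulate_gap(m=100_000, n=100, d=2)
```

With these parameters One-Choice typically shows a gap on the order of sqrt((𝑚/𝑛) · log 𝑛), i.e. dozens of balls, while Two-Choice typically stays within a few balls of the average.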
  • ItemOpen Access
    Deep concept reasoning: beyond the accuracy-interpretability trade-off
    Barbiero, Pietro; Barbiero, Pietro [0000-0003-3155-2564]
Deep learning researchers stockpile ground-breaking achievements almost as fast as they find flaws in their models. Although deep learning models can achieve superhuman performance, explaining deep learning decisions and mistakes is often impossible even for "explainable AI" specialists, causing lawmakers to question the ethical and legal ramifications of deploying deep learning systems. For this reason, the key open problem in the field is to increase the transparency and trustworthiness of deep neural networks to enable a safe deployment of such technologies. The lack of human trust in deep learning is affected by three key factors. Firstly, the absence of a formal and comprehensive theory undermines the field of explainable AI. This leads to ill-posed questions, induces re-discovery of similar ideas, and deters researchers from approaching the domain. Secondly, the explainable AI literature is mostly dominated by methods providing post-hoc, qualitative, and local explanations, which are often inaccurate and misleading. Finally, machine learning systems, including deep neural networks, struggle to strike a balance between task accuracy and interpretability. Existing solutions either sacrifice model transparency for task accuracy or vice versa, making it difficult to optimize both objectives simultaneously. This thesis comprises four research works that contribute to addressing these challenges. The first work addresses the lack of a formal theory of explainable AI. This work proposes the first-ever theory of explainable AI and concept learning, which formalizes some of the fundamental ideas used in this field. The key innovation of this chapter is the use of categorical structures to formalize explainable AI notions and processes. The use of category theory is particularly noteworthy as it provides a sound and abstract formalism to examine general structures and systems of structures, avoiding contingent details and focusing on their fundamental essence. 
This theoretical foundation serves as a solid basis for the other chapters in the thesis. The second work aims to overcome the limitations of current explainable AI techniques, which provide post-hoc, qualitative, and local explanations. To this end, this work proposes Logic Explained Networks, a novel class of concept-based models that can solve and explain classification problems simultaneously. The key innovation of Logic Explained Networks is a sparse attention layer that selects the most relevant concepts in neural concept-based models. This way, the model learns to generate simple logic explanations. The third work tackles the accuracy-explainability trade-off, a major limitation of concept-based models. To address this issue, this work proposes Concept Embedding Models. The key innovation of Concept Embedding Models is a fully supervised high-dimensional concept representation. The high-dimensional representation allows Concept Embedding Models to overcome the information bottleneck, enabling them to achieve state-of-the-art accuracy without sacrificing model transparency. The fourth work addresses a limitation of Concept Embedding Models: they are unable to provide concept-based logic explanations for their predictions. To fill this gap, this work presents the Deep Concept Reasoner, the first interpretable concept-based model using concept embeddings. The key innovation of the Deep Concept Reasoner is the use of neural networks to generate interpretable rules which are executed symbolically to make task predictions. This enables the Deep Concept Reasoner to attain state-of-the-art performance in complex tasks and to provide human-understandable and formal explanations for its predictions. Overall, this thesis makes significant contributions by introducing the first formal theory of explainable AI and presenting novel deep learning techniques going beyond the current accuracy-interpretability trade-off. 
The results of the experiments demonstrate how these innovations lead to a new generation of deep learning architectures that are both transparent and accurate. The introduction of these new techniques lays the groundwork to increase deep learning transparency and trustworthiness, enabling a safe deployment of robust and controllable machine learning agents.
  • ItemEmbargo
    Towards a psychological science of neural network behaviour
    Bell, Samuel
    The pace of progress in machine learning is astounding. As a community, we have made great leaps forward across a variety of tasks, from complex vision challenges such as scene segmentation and object recognition, to striking language understanding capability and remarkably fluent text generation. As a result, machine learning finds itself increasingly deployed in high-stakes domains such as healthcare, justice, hiring, and driverless vehicles. Without diminishing such achievements, these remarkable claims often obscure a more nuanced reality in which machine learning performs well under narrow and specific conditions. This thesis is motivated by the desire for a machine learning that works equally well for all people, for reliable systems we can trust, and for a fuller understanding of the way our models perform. The typical model analysis toolkit, comprising standardised evaluation datasets, aggregated performance metrics, and off-the-shelf methods for interpretability and explainability, often proves useful in the appropriate circumstances. But, in situations where we seek precise and detailed understanding of exactly what our models actually do—i.e., a systematic account of their *behaviour* in response to various inputs—we may stand to benefit from new approaches. In search of complementary ideas to support the scientific study of machine learning behaviour, this thesis turns for inspiration to the prototypical science of behaviour: experimental psychology. Drawing on experimental psychology along epistemological, methodological and metascientific lines, this thesis will explore ideas including the differing nature of explanatory practices, the role of experimental design and a focus on observable behaviour, and lessons from psychology’s most recent replication crisis. In doing so, we’ll apply psychological principles and techniques to a range of contemporary machine learning settings, with a particular focus on neural networks. 
First, we will introduce simple behavioural experiments to shed light on the phenomenon of catastrophic forgetting. In particular, through the use of synthetic parameterisable stimuli, we’ll investigate the role of task similarity and task ordering in a continual learning setup. Our analysis leads us to an easy-to-compute heuristic that is predictive of catastrophic forgetting. Second, we apply our approach to the important context of machine learning fairness, and develop novel behavioural experiments designed to isolate a new form of algorithmic bias. Specifically, we demonstrate that the particular aspects of a dataset that a model finds challenging (be they classes, demographic groups, or otherwise) will vary from model to model, and it is hard to identify what is “difficult” without first training a model. When difficulty varies along demographic lines, we show that this can lead to amplified performance disparities when using popular deep learning architectures. Third, following a brief survey of psychology’s own crisis of confidence, we then introduce the multiverse analysis to machine learning research. While the multiverse analysis was first developed in psychology, we make significant adaptations to render it tractable in the machine learning setting, through the use of a Gaussian Process surrogate and Bayesian experimental design. We demonstrate the utility of the machine learning multiverse through two case studies, one on the relative merit of adaptive versus non-adaptive gradient-based optimisers, and the other on the large-batch generalisation gap. The common thread that runs throughout this thesis is the application of ideas from experimental psychology to the machine learning setting. If we hope to maintain the tangible sense of rapid progress in machine learning, yet also to build confidence and trust that our systems consistently work *as we expect*, it is essential we develop thorough accounts of exactly what our models do in key scenarios. 
We hope to contribute to this endeavour in our work, by way of an exploration of approaches informed by the discipline of experimental psychology, and their application to the scientific study of machine learning behaviour.
  • ItemOpen Access
    Out-of-distribution generalisation in machine learning
    Słowik, Agnieszka; Słowik, Agnieszka [0000-0001-7113-4098]
Machine learning has proven extremely useful in many applications in recent years. However, a lot of these success stories stem from evaluating the algorithms on data very similar to the data they were trained on. When applied to a new data distribution, machine learning algorithms have been shown to fail. Given the non-stationary and heterogeneous nature of real-world data, a better grasp of out-of-distribution generalisation is needed for algorithms to be widely deployed and trusted. My thesis presents three research studies that aim to investigate and develop the field of out-of-distribution generalisation. The central goal of these research efforts is to produce new tools, such as algorithms, theoretical results, experimental results and datasets, to improve the understanding and performance of machine learning methods in the face of distribution shift. The high-level idea that drives these research efforts across three machine learning scenarios is *modularity* -- the quality of consisting of separate parts that form a whole when combined. Modular approaches are hypothesised to steer machine learning methods away from rigid memorisation of examples and towards more flexible and `more intelligent' learning that supports generalisation. In my first contribution, I approach the thesis goal from the perspective of learning from multiple training distributions. The contribution to this line of research is twofold. First, I present a new standardised suite of tasks for evaluation and comparison of out-of-distribution generalisation algorithms. Second, I state a set of new theoretical results that fill an existing gap between data-centric and algorithmic approaches to out-of-distribution generalisation. These theoretical findings guide a new set of practical recommendations on how to employ the algorithmic approach. In the second contribution, I tackle generalisation in the common learning setup of supervised image recognition. 
In this context, I first investigate the effect of multi-level feature aggregation on generalisation, and demonstrate that augmentation with one of the considered methods consistently improves the performance. Second, I propose a set of simple image datasets that can be used as a stepping stone for evaluation and comparison of image classification methods in terms of out-of-distribution generalisation. Finally, I delve into the learning scenarios where multiple neural networks communicate to solve a shared task. This work supports the thesis goal in two ways. First, I propose a new environment, *graph referential games*, and present results on the influence of data representation and the corresponding data representation learning methods on out-of-distribution generalisation. These results connect the previously disjoint fields of graph representation learning and emergent communication. Second, I tackle the challenging domain of population-based communication grounded in realistic images. The datasets, algorithms, theorems and experimental results in this thesis represent a few steps towards understanding and improving out-of-distribution generalisation in machine learning. They provide researchers with new tools and results that aim to foster research in this field, some of which have already proved useful to the research community. Finally, this work suggests important future directions in the machine learning subfields of learning from multiple distributions, image classification and multi-agent communication.
  • ItemOpen Access
    Deep neural networks for medical image super-resolution
    Zhu, Jin
Super-resolution plays an essential role in medical imaging because it provides an alternative way to achieve high spatial resolutions with no extra acquisition cost. In the past decades, the rapid development of deep neural networks has enabled high reconstruction fidelity and photo-realistic super-resolution image generation. However, challenges still exist in the medical domain, requiring novel network architectures, training tricks, and super-resolution (SR) image evaluation techniques. This dissertation concentrates on backbone networks for supervised single-image super-resolution tasks on various medical images with challenging magnification scales. Besides incorporating widespread methods designed for natural images, I explore progressive learning, adversarial learning and meta-learning in end-to-end frameworks based on convolutional neural networks, generative adversarial networks and vision transformers for robust medical image super-resolution. In addition to general image quality assessments, task-specific objective and subjective evaluation metrics are implemented for comprehensive comparisons. Specifically, the proposed approaches span three directions, achieving state-of-the-art performance on diverse medical image modalities. First, I implement progressive and adversarial learning for perceptually realistic texture generation in super-resolution tasks with challenging magnification scales (i.e. ×4). I present a CNN-based multi-scale super-resolution image generator that decomposes the complex mapping problem into simpler sub-problems to avoid over-smoothing the structural information and introducing non-realistic high-frequency textures in super-resolved images. Moreover, it involves a lesion-focused training strategy and an advanced adversarial loss based on the Wasserstein distance for more efficient and stabilised training. 
This proposed method dramatically improves the perceptual quality of generated images: in experiments on brain and cardiac magnetic resonance images, its outputs received subjective scores from experienced radiologists comparable to those of ground-truth high-resolution images. It achieved state-of-the-art perceptual quality in medical image super-resolution in 2019 and became a pioneering work in GAN-based medical image research with lasting influence. Second, I introduce meta-learning and transfer learning to GANs for efficient and robust medical image super-resolution with arbitrary scales (e.g. (1, 4]). In the post-upsampling framework, I implement a lightweight network based on EDSR for efficient low-resolution feature extraction and a weight prediction module for scale-free feature map upsampling. Compared with existing SISR networks, this framework supports non-integer magnification without the adverse effects of pre-/post-processing. Specifically, this approach achieves reconstruction accuracy and objective perceptual quality comparable to SOTA methods with far fewer parameters. Additionally, I robustly transfer the pre-trained SR model of one medical image dataset (i.e. brain MRI) to various new medical modalities (e.g. chest CT and cardiac MR) with a few fine-tuning steps. Moreover, exhaustive ablation studies are conducted to discuss the perception-distortion tradeoff and to illustrate the impacts of residual block connections, hyper-parameters, loss components and adversarial loss variants on medical image super-resolution performance. Finally, I propose an efficient vision transformer with residual dense connections and local feature fusion to achieve superior single-image super-resolution performance on medical modalities. Due to the improved information flow, this CNN-transformer hybrid model has advanced representation capability with fewer training computational requirements. 
Meanwhile, I implement a general-purpose perceptual loss with manual control for desired image quality improvements by incorporating prior knowledge of medical image segmentation. Compared with state-of-the-art methods on four public medical image datasets, the proposed method achieves the best PSNR scores on six of seven modalities with only 38% of the parameters of SwinIR (the most recent SOTA method). On the other hand, the segmentation-based perceptual loss improves PSNR by 0.14 dB on average for prevalent super-resolution networks without extra training costs. Additionally, I discuss potential factors behind the superior performance of vision transformers over CNNs and GANs, and the impacts of network and loss function components, in a comprehensive ablation study. In conclusion, this dissertation presents my research contributions in applying deep neural networks to robust medical image super-resolution tasks, including efficient network architectures, broadly applicable training techniques, and clinically meaningful image quality evaluation. At the time of publication, the proposed approaches achieved state-of-the-art performance on various public and private medical image datasets in simulation experiments. These algorithms could potentially be applied in hospitals to support advanced clinical processes, given proper case-specific modifications and supplementary techniques. Moreover, the novel methods and findings on super-resolution may also benefit other low-level image processing tasks, while the discussions and ablation studies suggest exciting future research directions.
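As a point of reference for the PSNR figures quoted above, PSNR is a simple function of the mean squared error between a ground-truth image and its reconstruction. A minimal sketch of the standard definition (not code from the dissertation; the test images below are synthetic):

```python
import numpy as np

def psnr(reference, reconstruction, data_range=255.0):
    """Peak signal-to-noise ratio in dB; higher means the
    reconstruction is closer to the reference image."""
    err = reference.astype(np.float64) - reconstruction.astype(np.float64)
    mse = np.mean(err ** 2)
    if mse == 0.0:
        return float("inf")
    return 10.0 * np.log10(data_range ** 2 / mse)

# Synthetic example: a random "high-resolution" image plus Gaussian noise.
rng = np.random.default_rng(0)
hr = rng.integers(0, 256, size=(64, 64)).astype(np.float64)
degraded = np.clip(hr + rng.normal(0.0, 5.0, size=hr.shape), 0, 255)
score = psnr(hr, degraded)
```

Noise with σ = 5 yields an MSE near 25 and hence a PSNR in the mid-30s dB; against such magnitudes, a 0.14 dB average improvement is a small but consistent gain.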
  • ItemOpen Access
    Single-trace template attacks on permutation-based cryptography
    You, Shih-Chun; You, Shih-Chun [0000-0002-6359-7866]
    The Template Attack introduced by Chari, Rao, and Rohatgi has been widely used in Side-Channel Attacks on cryptographic algorithms running on microcontrollers. In 2014, Choudary and Kuhn successfully optimized a variant of this technique, based on Linear Discriminant Analysis (LDA), to reconstruct the actual values of a byte handled by a single microcontroller machine instruction, instead of only its Hamming weight. While their attack targeted single LOAD instructions, I believe this method can be even more powerful when attackers target intermediate values inside a cryptographic algorithm, for such values can be related to more than single instructions, and further mathematical tools can be applied for value enumeration or error correction when multiple target values can be checked against one another. In my dissertation, I first describe how I successfully built LDA-based templates for full-state recovery on target intermediate bytes in the SHA3-512 hash function implemented on an 8-bit device, which I combined with a three-layer enumeration technique for error correction to recover all the input values of this hash function from a single trace recording. To demonstrate an alternative technique, I also combined these template recovery results with a modified belief-propagation procedure for error recovery, adapting a 2020 design by Kannwischer et al. In combination, these techniques reached success rates near 100% in recovering all SHA3-512 input bytes. Secondly, I introduce the fragment template attack to make this technique feasible for targeting 32-bit microcontrollers. It cuts a 32-bit intermediate value into smaller pieces, applying the LDA-based template attack by independently building templates for these pieces. 
For a SHA-3 implementation on a 32-bit device, the quality of these fragment templates is good enough that their predictions can reconstruct the full arbitrary-length SHA-3 or SHAKE inputs with a very high success rate when combined with belief propagation. Thirdly, I also show that a combination of fragment template attack, belief propagation, and key enumeration can recover the key used in an Ascon-128 implementation. My experiments show how LDA-based templates can pose a threat to cryptographic algorithms once they are combined with belief propagation and key enumeration, even when the algorithms are implemented on a 32-bit device and in applications where keys are only used once. Therefore, we should not underestimate these risks, and it is important to analyze the resilience against template attacks, in addition to DPA-style correlation attacks, when designing or implementing cryptographic algorithms and evaluating their security level.
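The splitting step at the heart of the fragment template attack, cutting a 32-bit intermediate value into independently profiled pieces, can be sketched as follows (an illustration of the splitting and recombination only, not the thesis's implementation):

```python
def fragments(value, width=8):
    """Split a 32-bit intermediate value into 32/width pieces of
    `width` bits each, least-significant fragment first. Templates
    are then built and matched for each fragment independently."""
    assert 32 % width == 0 and 0 <= value < (1 << 32)
    mask = (1 << width) - 1
    return [(value >> (i * width)) & mask for i in range(32 // width)]

def reassemble(pieces, width=8):
    """Recombine per-fragment predictions into the full 32-bit value."""
    out = 0
    for i, piece in enumerate(pieces):
        out |= piece << (i * width)
    return out

word = 0xDEADBEEF
parts = fragments(word)  # byte-sized fragments, least significant first
```

In the actual attack, each fragment's template yields a probability distribution rather than a single value, and belief propagation resolves the fragments jointly.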
  • ItemOpen Access
    Graph Neural Networks for Multi-Robot Coordination
    Li, Qingbiao
In this thesis, we are particularly interested in investigating machine learning (especially graph neural network) based approaches to find the trade-off between optimality and complexity by offloading online computation into an offline training process. Yet, learning-based methods also create the need for sim-to-real systems and for solutions that minimize the gap and provide interpretability and guarantees for the generated solutions. Hence, we first developed a framework that learns to communicate between robots based on Graph Neural Networks (GNNs), enabling better individual decision-making from local information in a decentralized manner. This framework is composed of an encoder (a Convolutional Neural Network) that extracts adequate features from local observations, a GNN that learns to explicitly communicate these features among robots, and a Multilayer Perceptron for action selection. By jointly training these components, the system learns to determine what information is most relevant for the team as a whole and to share it to facilitate efficient path planning. Following up on that, we propose the Message-Aware Graph Attention neTwork (MAGAT), which combines a GNN with a key-query-like attention mechanism to improve the effectiveness of inter-robot communication. We demonstrate the generalizability of our model by training it on small problem instances and testing it on increasing robot density, varying map size, and much larger problem instances (up to 100× the number of robots). To port our solution into the real world, we developed a ROS-based system that allows for the fully decentralized execution of GNN-based policies. We demonstrated our framework on a case study that requires tight coordination between robots, and presented first-of-their-kind results showing successful real-world deployment of GNN-based policies on a decentralized multi-robot system relying on ad-hoc communication. 
Extending this system, we proposed a vision-only learning approach that leverages a GNN to encode and communicate relevant viewpoint information to the mobile robots. During navigation, the robot is guided by a model that we train through imitation learning to approximate optimal motion primitives, thereby predicting the effective cost-to-go (to the target). Our experiments demonstrated its generalizability in guiding robots in previously unseen environments with various sensor layouts. Vanilla GNN-based decentralized path planning has demonstrated its performance empirically via an end-to-end learning approach. However, such black-box approaches face challenges for direct deployment in real workplaces, as it is hard to obtain guaranteed and interpretable solutions from them. Therefore, we designed a Graph Transformer, used as a heuristic function, to accelerate the focal search within Conflict-Based Search (CBS) in a non-grid setting, especially on dense graphs. Our framework guarantees both the completeness and the bounded suboptimality of the solution. For explainability and interpretability in RL, we introduced a global path planning algorithm (for example, A*) to generate a globally optimal path, which acts as part of the reward function, encouraging the robot to explore all potential solutions `weakly supervised' by the optimal path. As our reward function is independent of the environment, our trained framework generalizes to arbitrary environments and can be used to solve the multi-robot path planning problem in a fully distributed reactive manner. Throughout my Ph.D. research, I first proposed communication-aware motion planning for multi-robot coordination, where GNNs are introduced to build communication channels for multi-robot teams so that they can learn how to communicate with each other explicitly. The feasibility of this novel research idea has been validated by various simulation experiments based on an end-to-end imitation learning pipeline. 
To port them into reality, we built a ROS2-based system with ad-hoc communication to demonstrate our idea in a multi-robot passage scenario and in single-robot navigation assisted by randomly sampled camera-based sensors in an unknown environment. Finally, we developed methods that provide interpretability and performance guarantees for the previous black-box approaches, by introducing a heuristic function into the focal search of CBS and by designing a novel reward mechanism called G2RL.
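The communication step that the framework learns can be pictured as rounds of neighbour aggregation over the robots' communication graph. A toy sketch of one such round is shown below (mean aggregation with a shared linear transform and ReLU; the thesis's actual architecture, attention mechanism, and training procedure differ):

```python
import numpy as np

def gnn_round(features, adjacency, weight):
    """One message-passing round: each robot averages the features of
    its communication neighbours, combines them with its own, and
    applies a shared linear transform followed by a ReLU."""
    degree = adjacency.sum(axis=1, keepdims=True)
    degree[degree == 0] = 1.0          # isolated robots receive no messages
    messages = (adjacency @ features) / degree
    return np.maximum(0.0, (features + messages) @ weight)

rng = np.random.default_rng(0)
n_robots, dim = 4, 8
features = rng.normal(size=(n_robots, dim))   # per-robot encoder outputs
# Robots in a line: 0-1-2-3 (symmetric communication links).
adjacency = np.array([[0, 1, 0, 0],
                      [1, 0, 1, 0],
                      [0, 1, 0, 1],
                      [0, 0, 1, 0]], dtype=float)
weight = rng.normal(size=(dim, dim))          # shared, learned in practice
out = gnn_round(features, adjacency, weight)
```

Because the transform is shared across robots, the same policy runs on any number of robots, which is what enables training on small instances and deploying on much larger ones.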
  • ItemOpen Access
    Electronic Long-Term Archiving of Complex Textual Artefacts
    Bruder, Daniel
    Digital long-term archiving and data curation, whether in the Digital Humanities (DH) or elsewhere, depends on a suitable data model and must fulfill many requirements. One prerequisite for long-term archiving is interoperability of the data, across machines, computer architectures, and operating systems. But there are many use cases in philology where the requirements are even more stringent, for example the philological reconstruction of textual artefacts and their gestation. Such reconstruction depends on a format which natively supports non-linear text. It should also provide native support for multiple hierarchies over the data. In addition, the format should ideally enable different teams of philologists to work together successfully and sustainably on the same project, and over long stretches of time. Documents in this format should therefore be easily readable for humans, while still being machine-readable. In this work, I show that the two document models in common use today fall short of these requirements. I then set out to provide my solution to the problem: a topological document format in which symbols gain their meaning through their topological arrangement. Annotation is expressed in a stand-off manner and therefore able to support multiple hierarchies and concurrent text. My design includes operations that can programmatically support the format: how to create data in the format, how to access and mutate the data by systematic means, how to check whether the data is consistent, and how to print out the data after work in the DH project is concluded, possibly keeping the data in that format for centuries. One part of the solution is to extend the classic diff model of editorial operations by adding an open variant. Structural data access and mutation in my document model relies on Region Algebra, which was invented in 2002 by Miller. 
String search over the non-linear data uses an object called a variant graph, which can be systematically derived from my topological notation. After showing that my solution does not share the problems of its predecessors, I will lay out the implementation of the model. I will also show how import from existing formats works, by using the Wiener Ausgabe as my showcase. The design is based on a combination of insights from philology with techniques from computer science, hopefully enabling philologists to systematicise the editorial operations they use, while exposing computer scientists to interaction techniques from a century-old endeavour.
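The stand-off principle described above, an immutable base text with annotations that reference it externally so that multiple hierarchies can coexist, can be illustrated with a toy sketch (offset-based for simplicity, rather than the thesis's topological arrangement; the layers and tags are invented):

```python
# Immutable base text; annotations never mutate it.
text = "In the beginning"

# Two concurrent annotation layers over the same data. Overlapping or
# crossing spans are unproblematic because each layer is stand-off.
annotations = [
    {"layer": "syntax",   "start": 0, "end": 6,  "tag": "PP"},
    {"layer": "revision", "start": 7, "end": 16, "tag": "insertion"},
    {"layer": "syntax",   "start": 7, "end": 16, "tag": "NN"},
]

def spans(layer):
    """Materialise one hierarchy: the (substring, tag) pairs of a layer."""
    return [(text[a["start"]:a["end"]], a["tag"])
            for a in annotations if a["layer"] == layer]
```

An inline markup scheme such as XML would force these two hierarchies into a single tree; keeping them stand-off is what allows philological and linguistic structure to coexist over one text.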
  • ItemOpen Access
    Transformations for linguistic steganography
    Chang, Ching-Yun
Linguistic steganography is a form of covert communication using natural language to conceal the existence of the hidden message. It is usually achieved by systematically making changes to a cover text such that the manipulations, and thus the very act of communication, are undetectable to an outside observer (human or computer). In this thesis, we explore three possible linguistic transformations — lexical substitution, adjective deletion and word ordering — which are able to generate alternatives for a cover text. For each transformation, we propose different transformation checkers in order to certify the naturalness of a modified sentence. Our lexical substitution checkers are based on contextual n-gram counts and the α-skew divergence of those counts derived from the Google n-gram corpus. For adjective deletion, we propose an n-gram count method similar to the substitution n-gram checker and a support vector machine classifier using n-gram counts and other measures to classify deletable and undeletable adjectives in context. As for word ordering, we train a maximum entropy classifier using syntactic features to determine the naturalness of a sentence permutation. The proposed transformation checkers were evaluated against human-judged data, and the evaluation results are presented using precision and recall curves. The precision and recall of a transformation checker can be interpreted as the security level and the embedding capacity of the stegosystem, respectively. The results show that the proposed transformation checkers can provide a high level of security and a reasonable embedding capacity for the steganography application. In addition to the transformation checkers, we demonstrate possible data encoding methods for each of the linguistic transformations. For lexical substitution, we propose a novel encoding method based on vertex colouring. 
For adjective deletion, we not only illustrate its usage in the steganography application, but also show that the adjective deletion technique can be applied to a secret sharing scheme, where the secret message is encoded in two different versions of the carrier text, with different adjectives deleted in each version. For word ordering, we propose a ranking-based encoding method and also show how the technique can be integrated into existing translation based embedding methods.
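The data-encoding idea behind lexical substitution can be illustrated with a deliberately simplified sketch: each substitutable word carries a "colour" (here a single bit per word; the thesis's vertex-colouring method assigns codes over a synonym graph so that overlapping synonym sets remain consistent). The synonym sets below are hypothetical:

```python
# Hypothetical synonym sets with pre-assigned one-bit colours.
CODEBOOK = {
    "big":   {"big": 0, "large": 1},
    "quick": {"quick": 0, "fast": 1},
}
COLOURS = {word: bit for group in CODEBOOK.values()
           for word, bit in group.items()}

def embed(cover_words, bits):
    """Hide one bit at each substitutable position of the cover text
    by choosing the synonym whose colour matches the next bit."""
    it = iter(bits)
    stego = []
    for word in cover_words:
        if word in CODEBOOK:
            bit = next(it)
            word = next(w for w, c in CODEBOOK[word].items() if c == bit)
        stego.append(word)
    return stego

def extract(stego_words):
    """Recover the hidden bits from the colours of the chosen words."""
    return [COLOURS[w] for w in stego_words if w in COLOURS]

stego = embed(["the", "big", "quick", "dog"], [1, 0])
```

In practice the substitutions must also pass the naturalness checkers described above, which is what bounds the embedding capacity of the stegosystem.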