Coding for emerging archival storage media

Sella, Omer

Repository URI

https://www.repository.cam.ac.uk/handle/1810/363956

Repository DOI

https://doi.org/10.17863/CAM.105802

Files

Thesis (32.66 MB)

Type

Thesis

Authors

Sella, Omer

https://orcid.org/0000-0002-2795-8580

Abstract

The race between generating digital data and storing it has prompted a search for new media that can hold our data for centuries, with fused silica and DNA in the lead. These media are at a stage of rapid research and development. Error-correcting codes and coding schemes must be designed for the constraints and noise characteristics of these emerging media, much as a large body of work has done for communication applications. Unlike communication standards, digital data storage, and archival storage in particular, can and should capitalise on longer block sizes and more complex coding. Longer blocks have the potential to reduce coding overhead and therefore cost, while the longer retrieval latency that archival storage tolerates allows for more complex algorithms. This cycle of noise characterisation and code design for storage media could be made more efficient by automation and generalisation. In this work, we present the use of reinforcement learning to construct long error-correcting codes. We show that reinforcement learning is effective when it directly targets the end goal of reducing bit error rate, rather than the proxy metrics used in state-of-the-art heuristics. In addition, we present a unified approach to handling constraints when coding data into DNA. Together, these provide a practical toolbox that would allow the co-design of a storage medium and its accompanying coding scheme. Finally, we show that our toolbox requires little intervention from human experts, which facilitates designing coding schemes in lockstep with the rapid development of these media.

Date

2022-08-04

Advisors

Moore, Andrew

Keywords

archival data storage, coding, DNA data storage, error correcting code, LDPC

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge

Sponsorship

This work was supported by Microsoft Research through its PhD Scholarship Programme.

Collections

Theses - Computer Science and Technology