Electronic Long-Term Archiving of Complex Textual Artefacts

Bruder, Daniel

Repository URI

https://www.repository.cam.ac.uk/handle/1810/354453

Repository DOI

https://doi.org/10.17863/CAM.100239

Files

Thesis (5.24 MB)

Type

Thesis

Authors

Bruder, Daniel

Abstract

Digital long-term archiving and data curation, whether in the Digital Humanities (DH) or elsewhere, depend on a suitable data model and must fulfil many requirements. One prerequisite for long-term archiving is interoperability of the data across machines, computer architectures, and operating systems. But there are many use cases in philology where the requirements are even more stringent, for example the philological reconstruction of textual artefacts and their gestation. Such reconstruction depends on a format that natively supports non-linear text and multiple hierarchies over the data. In addition, the format should ideally enable different teams of philologists to work together successfully and sustainably on the same project over long stretches of time. Documents in this format should therefore be easily readable for humans while remaining machine-readable. In this work, I show that the two document models in common use today fall short of these requirements. I then present my solution to the problem: a topological document format in which symbols gain their meaning through their topological arrangement. Annotation is expressed in a stand-off manner and can therefore support multiple hierarchies and concurrent text. My design includes operations that support the format programmatically: how to create data in the format, how to access and mutate the data by systematic means, how to check whether the data is consistent, and how to print out the data after work in the DH project has concluded, possibly keeping the data in that format for centuries. One part of the solution is to extend the classic diff model of editorial operations by adding an open variant. Structural data access and mutation in my document model rely on Region Algebra, which was invented in 2002 by Miller.
String search over the non-linear data uses an object called a variant graph, which can be systematically derived from my topological notation. After showing that my solution does not share the problems of its predecessors, I lay out the implementation of the model. I also show how import from existing formats works, using the Wiener Ausgabe as my showcase. The design combines insights from philology with techniques from computer science, hopefully enabling philologists to systematise the editorial operations they use, while exposing computer scientists to interaction techniques from a century-old endeavour.
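
The stand-off idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's topological format or its Region Algebra implementation. The `Region` class, the `within` and `straddling` helpers, and the sample text are all hypothetical names chosen for illustration. The sketch only shows why keeping annotations outside the text lets two hierarchies (here, sentences and manuscript lines) overlap without forcing either into the other's tree, and how a containment query in the spirit of region algebra might look.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A stand-off annotation: a labelled character span over an untouched text."""
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    label: str


def within(regions, containers):
    """Regions lying entirely inside at least one container (a containment query)."""
    return [r for r in regions
            if any(c.start <= r.start and r.end <= c.end for c in containers)]


def straddling(regions, others):
    """Regions that overlap another region without either nesting in the other."""
    def crosses(r, o):
        overlap = r.start < o.end and o.start < r.end
        nested = ((o.start <= r.start and r.end <= o.end)
                  or (r.start <= o.start and o.end <= r.end))
        return overlap and not nested
    return [r for r in regions if any(crosses(r, o) for o in others)]


# The text itself carries no markup; both hierarchies point into it by offset.
TEXT = "The quick brown fox jumps. Over the lazy dog."

# Hierarchy 1 (assumed for illustration): syntactic sentences.
sentences = [Region(0, 26, "s1"), Region(27, 45, "s2")]

# Hierarchy 2 (assumed for illustration): physical lines; line2 crosses a sentence boundary.
lines = [Region(0, 19, "line1"), Region(20, 35, "line2"), Region(36, 45, "line3")]

nested_lines = [r.label for r in within(lines, sentences)]
crossing_lines = [r.label for r in straddling(lines, sentences)]
```

Here `line2` covers "jumps. Over the" and so crosses the boundary between the two sentences. A single embedded tree could not represent both hierarchies without splitting one of them, whereas the two stand-off lists coexist unchanged.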

Date

2023-05-04

Advisors

Teufel, Simone

Keywords

digital humanities, archiving, digital scholarly editions

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge

Sponsorship

Funding from the Wittgenstein Trust, Cambridge

Collections

Theses - Computer Science and Technology