Analysis of Environmental Treaty Design: A Data Science Approach
Abstract
There are hundreds if not thousands of international agreements governing all sorts of environmental problems, from endangered species and pollution to stratospheric ozone depletion and climate change. Analysing and describing the provisions of all these treaties using the traditional 'reading and writing' approach has become all but impossible. The main proposals for solving this epistemic challenge involve either time-consuming manual approaches to building datasets, or statistical natural language processing (NLP) for a different kind of content analysis. This thesis proposes an intermediate approach, leveraging rule-based NLP for dataset construction and employing statistics and machine learning only for downstream analysis. Traditional legal research can thus be supported and complemented while taking advantage of data science and automation. The approach is developed with a set of about 120 open multilateral environmental agreements and about 50 treaty design variables. Regular expression pattern matching is found to be well suited for accurate and precise extraction of information from common treaty provisions such as those on entry into force, amendment, supplementary agreements, treaty organs, withdrawal, termination and dispute settlement. Implementation-related provisions, including national reporting, international verification of compliance, treaty progress review, non-compliance procedures and sanctions, are more difficult to capture and compare across treaties, but this difficulty is itself of interest for the analysis of treaty design. The variables, their distributions and associations are described, and the speed of entry into force is predicted using various techniques, including linear regression and neural networks.
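As a minimal sketch of the rule-based extraction idea, the snippet below matches a typical entry-into-force clause with a regular expression and captures its two design parameters, the waiting period and the ratification threshold. The pattern and the example clause are illustrative stand-ins, not the patterns actually used in the thesis.

```python
import re

# Hypothetical pattern for a common entry-into-force formula:
# "... enter into force on the Nth day after ... deposit of the Mth instrument ..."
ENTRY_INTO_FORCE = re.compile(
    r"enter into force on the (?P<delay>\w+) day"
    r".*?deposit of the (?P<threshold>\w+) instrument",
    re.IGNORECASE | re.DOTALL,
)

clause = (
    "This Convention shall enter into force on the ninetieth day after "
    "the date of deposit of the fiftieth instrument of ratification, "
    "acceptance, approval or accession."
)

match = ENTRY_INTO_FORCE.search(clause)
if match:
    # Extracted design variables, still as ordinal words at this stage.
    print(match.group("delay"), match.group("threshold"))
```

A full pipeline would then normalise the captured ordinals ("ninetieth" → 90) and record them as structured variables per treaty.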
Regarding the larger epistemic challenge, the scalability of the approach is assessed and limitations of existing treaty databases and research practices are identified. Drawing on the achievements of the bioinformatics and linked open data communities, I argue that a collaborative, incrementally expanding database, or findable, accessible, interoperable and reusable (FAIR) datasets, would make the approach scalable. Both rely on a standardised vocabulary or formal ontology for data integration. Accordingly, the thesis builds a proof-of-concept Public International Law Ontology and an NLP pipeline to populate the ontology with data gathered from treaty texts and participation records. Output formats and interfaces are designed for wide accessibility, without requiring programming skills. All software and data accompanying this thesis are available under a free and open source licence.
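To illustrate what populating such an ontology might look like, the sketch below serialises extracted treaty variables as RDF Turtle under a hypothetical `pilo:` namespace. The namespace URI, class and property names, and the example treaty are all illustrative placeholders, not the thesis's actual schema.

```python
# Assumed prefixes for the illustrative "pilo:" namespace and XML Schema datatypes.
PREFIXES = (
    "@prefix pilo: <https://example.org/pilo#> .\n"
    "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .\n"
)


def treaty_to_turtle(treaty_id: str, variables: dict) -> str:
    """Render one treaty's extracted design variables as Turtle triples."""
    lines = [f"pilo:{treaty_id} a pilo:Treaty ;"]
    for name, value in variables.items():
        lines.append(f'    pilo:{name} "{value}"^^xsd:integer ;')
    # Terminate the final statement with "." instead of ";".
    lines[-1] = lines[-1].rstrip(" ;") + " ."
    return "\n".join(lines)


print(PREFIXES + treaty_to_turtle(
    "ExampleConvention",
    {"ratificationThreshold": 50, "entryIntoForceDelayDays": 90},
))
```

Emitting a standard serialisation such as Turtle is one way to keep the extracted data interoperable with other linked open data tooling without tying consumers to any particular programming language.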