Concept Lab: Precomputed Associations for Shared Lexis Tool and Associated Files (Public)

Published: 2022-04-13T09:24:06Z
Keywords: text and data mining, digital humanities

Recchia, Gabriel; Jones, Ewan; Nulty, Paul; de Bolla, Peter; Regan, John

Repository URI

https://www.repository.cam.ac.uk/handle/1810/336054

Repository DOI

https://doi.org/10.17863/CAM.43499

Files

ca-grapher-source.zip (159.18 MB)

Grapher_Documentation.docx (423.18 KB)

sharedlexis-source.zip (170.05 MB)

Shared_Lexis_Documentation.docx (1.24 MB)

README.txt (29.65 KB)

Type

Dataset

Authors

Recchia, Gabriel

Jones, Ewan

Nulty, Paul

de Bolla, Peter

Regan, John

Description

This dataset consists of:

I. Source code and documentation for the "Shared Lexis Tool", a Windows desktop application that provides a means of exploring all of the words that are statistically associated with a word provided by the user, in a given corpus of text (for certain predefined corpora), over a given date range.

II. Source code and documentation for the "Coassociation Grapher", a Windows desktop application. Given a particular word of interest (a “focal token”) in a particular corpus of text, the Coassociation Grapher allows you to view the relative probability of observing other terms (“bound tokens”) before or after the focal token.

III. Numerous precomputed files that need to be hosted on a webserver in order for the Shared Lexis Tool to function properly.

IV. Files that were created in the course of conducting the research described in "Tracing shifting conceptual vocabularies through time" and "The idea of liberty" (full citations in the 'SHARING/ACCESS INFORMATION' section of README.txt), including "cliques" (https://en.wikipedia.org/wiki/Clique_(graph_theory)) of words that frequently appear together.

V. Source code of text-processing scripts developed by the Concept Lab, primarily for the purpose of generating precomputed files described in section III, and associated data.

The Shared Lexis Tool and Coassociation Grapher (and the required precomputed files) are also being hosted at https://concept-lab.lib.cam.ac.uk/ from 2018 to 2023, and therefore those who are merely interested in using the tools within this time frame will have no use for the present dataset. However, these files may be useful for individuals who wish to host the files on their own webserver, for example, in order to use the Shared Lexis Tool past 2023. See README.txt for more information.

Software / Usage instructions

The Shared Lexis Tool and Coassociation Grapher (and the required precomputed files) are being hosted at https://concept-lab.lib.cam.ac.uk/ from 2018 to 2023. Currently, when either of these Windows applications requires precomputed files, it retrieves them from https://concept-lab.lib.cam.ac.uk/. If these files are no longer hosted there, they will need to be hosted elsewhere in order to make use of either tool.

The majority of the precomputed files described in Section III above need to be hosted in a directory called "counts" in order for the Shared Lexis Tool to be able to find them. This directory should be an immediate subfolder of the primary domain (e.g. https://concept-lab.lib.cam.ac.uk/counts).

Each file with a filename of the form 5-counts-[CORPUSNAME]..clean-dist[DISTANCE].tar contains many smaller files with the extension .gz. For the Shared Lexis Tool to find these files, each archive must be untarred and its contents placed at the following location within the primary domain:

counts/[CORPUSNAME]..clean/dist_[DISTANCE]/FILENAME

For example, 5-counts-eebo..clean-dist10.tar includes the file 000000.the.csv.gz, among others, which needs to be placed at counts/eebo..clean/dist_10/000000.the.csv.gz. To accomplish this, untar 5-counts-eebo..clean-dist10.tar and place its contents in the folder counts/eebo..clean/dist_10 on your online server. It is only necessary to do this for the corpora you plan to use.

Furthermore, if you plan to access metadata for ECCO using the Shared Lexis Tool, the accompanying files ecco-donut..docs.tar.gz, ecco-donut..gini.tar.gz, ecco-donut..spec.tar.gz, and ecco-donut..stdev.tar.gz need to be untarred and placed in directories named ecco-donut..docs, ecco-donut..gini, ecco-donut..spec, and ecco-donut..stdev, respectively. These directories should be immediate subfolders of the primary domain (e.g. https://concept-lab.lib.cam.ac.uk/ecco-donut..docs).
See README.txt for more information.
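
The placement rules above can be sketched in Python. This is a minimal illustration and not part of the dataset: the function names, the local web-root path, and the assumption that [DISTANCE] is always a run of digits are hypothetical, and README.txt remains the authoritative guide. The sketch derives the target directory from an archive's filename and copies each member into it without decompressing the .gz files.

```python
import re
import tarfile
from pathlib import Path

# Filename pattern for count archives, e.g. 5-counts-eebo..clean-dist10.tar.
# Assumption: [DISTANCE] consists only of digits.
COUNTS_TAR = re.compile(
    r"^5-counts-(?P<corpus>.+)\.\.clean-dist(?P<distance>\d+)\.tar$"
)


def counts_target_dir(tar_name: str) -> str:
    """Return the directory, relative to the web root, for a count archive."""
    match = COUNTS_TAR.match(tar_name)
    if match is None:
        raise ValueError(f"not a count archive name: {tar_name}")
    return f"counts/{match['corpus']}..clean/dist_{match['distance']}"


def metadata_target_dir(tar_name: str) -> str:
    """Return the directory for an ECCO metadata archive such as ecco-donut..docs.tar.gz."""
    suffix = ".tar.gz"
    if not tar_name.endswith(suffix):
        raise ValueError(f"not a .tar.gz archive name: {tar_name}")
    return tar_name[: -len(suffix)]


def place_counts(tar_path: Path, web_root: Path) -> Path:
    """Copy every regular file in a count archive into its target directory."""
    target = web_root / counts_target_dir(tar_path.name)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tar_path) as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue
            # Assumption: members may sit inside a folder in the archive,
            # so only the base name is kept. The .gz files stay compressed.
            source = archive.extractfile(member)
            (target / Path(member.name).name).write_bytes(source.read())
    return target
```

For instance, calling place_counts(Path("5-counts-eebo..clean-dist10.tar"), Path("/var/www/html")) would write the archive's files to /var/www/html/counts/eebo..clean/dist_10, where /var/www/html stands in for whatever directory your webserver serves as the primary domain.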

Keywords

text and data mining, digital humanities

Rights

CC BY (except for hist-words-eng-all.zip, which is made available under Public Domain Dedication and License v1.0)

Sponsorship

Foundation for the Future

Foundation for the Future, DIGITAL KNOWLEDGE, RG74515

Collections

Research Data - Pure Mathematics and Mathematical Statistics (DPMMS)
