-------------------
GENERAL INFORMATION
-------------------

Title of Dataset: "Concept Lab: Precomputed Associations for Shared Lexis Tool and Associated Files"

Author Information:

	The Concept Lab at the University of Cambridge
	
	Principal Investigator: 
	Professor Peter de Bolla, University of Cambridge, pld20@hermes.cam.ac.uk
	
	Other Members of the Concept Lab:
	Dr John Regan, Royal Holloway University of London, John.Regan@rhul.ac.uk
	Dr Ewan Jones, University of Cambridge, ejj25@cam.ac.uk
	Dr Gabriel Recchia, University of Cambridge, glr29@cam.ac.uk
	Dr Paul Nulty, University College Dublin, paul.nulty@gmail.com

	This data was generated over a period of four years from October 2014 to October 2018.


--------------------------
SHARING/ACCESS INFORMATION
-------------------------- 

License:

The file histwords-eng-all.zip is a modified (unpickled) version of the HistWords English vectors that were available at
https://nlp.stanford.edu/projects/histwords/ as of 29 Aug 2016, and as such is available under the Public Domain Dedication and License v1.0: https://opendatacommons.org/licenses/pddl/

The file code-corrector.tar contains a modified version of Ted Underwood's OCR Normalizer https://github.com/tedunderwood/DataMunging/tree/master/OCRnormalizer circa 6 July 2017, and is made available under CC BY 3.0: https://creativecommons.org/licenses/by/3.0/

All other files are made available under CC BY 3.0: https://creativecommons.org/licenses/by/3.0/

Recommended citation for the dataset:

de Bolla, P., et al. (2019). Distributional concept analysis: A computational model for parsing conceptual forms. Contributions to the History of Concepts. https://doi.org/10.17863/CAM.35748

Other works associated with this data (see DATA & FILE OVERVIEW for details): 

Recchia, G., et al. (2016). Tracing shifting conceptual vocabularies through time. In Ciancarini, P. et al. (Eds.): Knowledge Engineering and Knowledge Management: EKAW 2016 Satellite Events, EKM and Drift-an-LOD, Bologna, Italy, November 19–23, 2016, Revised Selected Papers (pp. 19-28). Cham, Switzerland: Springer International AG.

Jones, E., et al. (2019). The Idea of Liberty 1600-1800: a distributional concept analysis. Journal of the History of Ideas, https://doi.org/10.17863/CAM.35230


--------------------
DATA & FILE OVERVIEW
--------------------

This dataset consists of:

I. Source code and documentation for the "Shared Lexis Tool", a Windows desktop application that provides a means of exploring all of the words that are statistically associated with a word provided by the user, in a given corpus of text (for certain predefined corpora), over a given date range.

II. Source code and documentation for the "Coassociation Grapher", a Windows desktop application. Given a particular word of interest (a “focal token”) in a particular corpus of text, the Coassociation Grapher allows you to view the relative probability of observing other terms (“bound tokens”) before or after the focal token.

III. Numerous precomputed files that need to be hosted on a webserver in order for the Shared Lexis Tool to function properly;

IV. Files that were created in the course of conducting the research described in "Tracing shifting conceptual vocabularies through time" and "The idea of liberty" (full citations in above section 'SHARING/ACCESS INFORMATION'), including "cliques" (https://en.wikipedia.org/wiki/Clique_(graph_theory)) of words that frequently appear together;

V. Source code of text-processing scripts developed by the Concept Lab, primarily for the purpose of generating precomputed files described in section III, and associated data.



The Shared Lexis Tool and Coassociation Grapher (and the required precomputed files) are also being hosted at https://concept-lab.lib.cam.ac.uk/ from 2018 to 2023, and therefore those who are merely interested in using the tools within this time frame will have no use for the present dataset. However, these files may be useful for individuals who wish to host the files on their own webserver, for example, in order to use the Shared Lexis Tool past 2023. See the "Instructions for Use" section towards the end of this README for the directory structure in which these files must be placed to make this possible.



I. THE SHARED LEXIS TOOL

The Shared Lexis Tool is a Windows desktop application that provides a means of exploring all of the words that are statistically associated with a word provided by the user, at a given distance, in a given corpus of text, using a given measure of coassociation, over a given date range. In contrast to the Coassociation Grapher, which requires all words of interest to be provided explicitly by the user, the Shared Lexis Tool allows the user to discover what terms are highly statistically associated with a word of interest, and to explore how these differ across time/corpora/distance/measures. It is restricted to a set of predefined corpora described in section III (Precomputed Files).

The output of the Shared Lexis Tool, given a particular query word, distance, corpus, etc., is sometimes referred to in the documentation as that query word's "list" (since the output is a list of words, ordered by their degree of association with the query word).

More information about the Shared Lexis Tool, along with detailed documentation and source code, can be found at https://concept-lab.lib.cam.ac.uk/ from 2018 to 2023, or within the following files in this dataset:

sharedlexis-source.zip:				Source code for the Shared Lexis Tool, including local dependencies (last updated January 7, 2019).
code-sharedlexis-12-Aug-2018.tar:	Source code for the Shared Lexis Tool, including local dependencies (last updated August 12, 2018).
Shared_Lexis_Documentation.docx:	Detailed documentation for using the Shared Lexis Tool.



II. THE COASSOCIATION GRAPHER

The Coassociation Grapher is another tool developed by the Concept Lab. Given a particular word of interest (a “focal token”) in a particular corpus of text, the Coassociation Grapher allows you to view the relative probability of observing other terms (“bound tokens”) before or after the focal token (anywhere from 999 words prior to the focal token to 999 words after it). Graphs can be generated that illustrate how much more or less likely the focal token is to appear immediately after the bound token than it is to appear at a randomly selected point in the corpus. Such graphs can yield information about the overall patterns with which two words appear relative to each other in a body of texts. Furthermore, because this tool allows for particular timeslices to be selected from a diachronic corpus, it allows for comparisons of the ways words conjoin with each other (or fail to) across time. Like the Shared Lexis Tool, it is restricted to the set of predefined corpora described in section III (Precomputed Files).
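
The ratio described above can be illustrated with a minimal Python sketch. This is not the Coassociation Grapher's own code (see ca-grapher-source.zip for that); the function name, the in-memory token list, and the handling of corpus edges are assumptions made purely for illustration.

```python
def relative_likelihood(tokens, focal, bound, offset=1):
    """Ratio of P(focal appears `offset` tokens after bound) to P(focal at a random position).

    A value above 1 means the focal token is more likely than chance at that
    position; a value below 1 means it is less likely. Returns None when the
    ratio is undefined (bound or focal token never occurs).
    """
    n = len(tokens)
    if n == 0:
        return None
    # Probability of seeing the focal token at a randomly selected position.
    base_rate = tokens.count(focal) / n
    # Positions that lie `offset` tokens after each occurrence of the bound token.
    positions = [i + offset for i, tok in enumerate(tokens)
                 if tok == bound and 0 <= i + offset < n]
    if not positions or base_rate == 0:
        return None
    hits = sum(1 for p in positions if tokens[p] == focal)
    return (hits / len(positions)) / base_rate
```

A negative offset looks before the bound token rather than after it. The real tool works from precomputed index files rather than a token list held in memory.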

More information about the Coassociation Grapher, along with detailed documentation and source code, can be found at https://concept-lab.lib.cam.ac.uk/ from 2018 to 2023, or within the following files in this dataset:

ca-grapher-source.zip:				Source code for the Coassociation Grapher, including local dependencies (last updated January 7, 2019).
code-grapher-12-Aug-2018.tar:		Source code for the Coassociation Grapher, including local dependencies (last updated August 12, 2018).
Grapher_Documentation.docx:			Detailed documentation for using the Coassociation Grapher.



III. PRECOMPUTED FILES

As mentioned, the Shared Lexis Tool allows the user to query a specific corpus of text with a word, date range, etc. (e.g. "In the 'Early English Books Online' corpus, what words are the most statistically associated with 'science' from 1670 to 1690, at a distance of approximately 100 words away from 'science'?") Because the co-occurrence statistics required to conduct such queries have been precomputed, such queries are only possible to conduct for specific corpora, distances, and words. Furthermore, precomputed co-occurrence files containing the requisite statistics need to be accessible to the tool.

Specifically, for each searchable word, the tool needs to be able to access a co-occurrence file with a title of the form corpusname.dD.word.txt, where the capital “D” should be replaced by the “distance” away from the word.

Each file within this dataset whose filename begins with "5-counts" is a tar archive consisting of many such files. The dataset contains 199 files with names of the form 
5-counts-[CORPUSNAME]..clean-dist[DISTANCE].tar

The value of DISTANCE corresponds to the distance at which the co-occurrence computation was conducted. More details about this computation are available in the "Data-Specific Information" section at the end of this README file. We computed these files for distances 5, 10, 20, 30, 40, 50, 60, 70, 80, 90 & 100.

CORPUSNAME refers to the corpus (textual dataset) that the counts were computed from, and can take on the following values. Please note that we can only distribute counts and cannot distribute the original corpora due to license restrictions:

* eebo: Early English Books Online [restricted to items published 1600-1700]. ProQuest.

* ecco-donut: Eighteenth Century Collections Online [restricted to items published 1701-1800]. Gale Cengage.

* msbooks: The Microsoft Books/BL 19th Century collection [restricted to items published 1800-1899]. The British Library.

* nant: The North American News Text Corpus [items published 1994-1997]. Linguistic Data Consortium.

* ft: Text Research Collection Volume 5 [restricted to material from the Financial Times Limited, 1991-1994]. The National Institute of Standards and Technology.

* nyt: The New York Times Annotated Corpus [items published 1987-2007]. Linguistic Data Consortium.

* times: The Times Digital Archive [restricted to items published 1785-2010]. Gale Cengage.

* adamsmith: The Adam Smith corpus consists of three documents:  Theory of Moral Sentiments (1759), Lectures on Jurisprudence (1763), and The Wealth of Nations (1776). Obtained from https://oll.libertyfund.org/.

* hume: The Hume corpus consists of the following documents: A Treatise of Human Nature [1739]; Enquiries Concerning the Human Understanding and Concerning the Principles of Morals (posthumous edition); The Natural History of Religion (1889 ed.) [1757]; Essays Moral, Political, Literary (LF ed.) (Essays and Treatises on Several Subjects. In Two Volumes.); The History of England [1778], all volumes (excluding front material: the preface & the 'LETTER FROM ADAM SMITH, LL.D. TO WILLIAM STRAHAN, ESQ.'). Obtained from https://oll.libertyfund.org/.

* ecco-religion: Same dataset as "ecco-donut", but restricted to documents annotated by Gale Cengage as having a topic of "Religion".

* aberdeen: Same dataset as "ecco-donut", but restricted to documents published in Aberdeen.

* edinburgh: Same dataset as "ecco-donut", but restricted to documents published in Edinburgh.

* glasgow: Same dataset as "ecco-donut", but restricted to documents published in Glasgow.

* aberdeen-filtered: Same dataset as "aberdeen", but restricted to documents published by the following publishers: Angus & Son; Brown; Alexander Brown

* edinburgh-filtered: Same dataset as "edinburgh", but restricted to documents published by the following publishers: Apollo Press; Balfour; John Balfour; J & E Balfour; Hamilton and Balfour; Balfour & Neil; Balfour, Auld and Smellie; Elphinston; Elphingstone; William Auld; John Bell; Bell & Bradfute; John Bradfute; Bell & Macfarquhar; Creech; William Creech; James Dickson; Donaldson; Alexander Donaldson; Drummond; William Drummond; Elliot; Charles Elliot; William Gordon; Gordon & Elliot; George Gray; Gray & Peter; Guthrie; Alexander Guthrie; Hamilton Balfour; Hill; Peter Hill; Jack; Robert Jack; Kincaid; Alexander Kincaid; Kincaid & Bell; Kincaid & Creech; Kincaid & Donaldson; Laing; William Laing; Lawrie; Alexander Lawrie; Manners & Miller; Mudie; George Mudie; Mudie & Sons; Neill; Adam Neill; James Neill; Ogle; Ruddiman; William Ruddiman; Sands; William Sands; Simpson; James Simpson; Smellie; Alexander Smellie; Ruthven & Co; Watson; James Watson

* glasgow-filtered: Same dataset as "glasgow", but restricted to documents published by the following publishers: Barry; John Barry; Chapman & Duncan; Dunlop & Wilson; Foulis; Robert Foulis

* allpublishers: Same dataset as "ecco-donut", but restricted to documents published by the publishers mentioned in the above publisher lists (see: aberdeen-filtered, edinburgh-filtered, glasgow-filtered)

* all3cities: Same dataset as "ecco-donut", but restricted to documents published in Aberdeen, Edinburgh, or Glasgow

* ecco-directional: Same dataset as "ecco-donut", but using the "directional" rather than the "donut" method of co-occurrence counting. See Shared_Lexis_Documentation.docx, p. 6, subsection "Distance" for details. ecco-directional is the only dataset that uses the "directional" method of co-occurrence counting.

Other precomputed files included in this dataset are as follows:

* ecco-donut..docs.tar.gz: A collection of files containing document counts for each searchable word in Eighteenth Century Collections Online. Each such file contains one line for every year in the corpus, each of which has a number indicating the number of documents that word appeared in that year.

* ecco-donut..gini.tar.gz: For each searchable word in Eighteenth Century Collections Online, the "relative frequency" at which this word appeared was computed for each document (relative to the overall number of words in the document), and the Gini coefficient of this distribution of frequencies was computed separately for documents in each year (1701 through 1800). The Gini Coefficient can be viewed as a measure of “inequality” with respect to how the term is distributed across documents. Higher numbers correspond to less equitable distributions across documents. Each such file contains one line for every year in the corpus, each of which has a number indicating the number of documents that word appeared in that year.

* ecco-donut..spec.tar.gz: A collection of files containing the frequency-per-million-words with which each searchable word in Eighteenth Century Collections Online appears in different categories of documents within Eighteenth Century Collections Online (Fine Arts, General Reference, History and Geography, Law, Literature and Language, Religion and Philosophy, Science Medicine and Technology, and Social Sciences). Each such file contains one line for every year in the corpus, with these statistics stratified by year.

* ecco-donut..stdev.tar.gz: For each searchable word in Eighteenth Century Collections Online, the "relative frequency" at which this word appeared was computed for each document (relative to the overall number of words in the document), and the standard deviation of this distribution of frequencies was computed separately for documents in each year (1701 through 1800). The standard deviations can be viewed as a measure of “inequality” with respect to how the term is distributed across documents. Higher numbers correspond to less equitable distributions across documents. Each such file contains one line for every year in the corpus, each of which has a number indicating the number of documents that word appeared in that year.

* john-nontruncated-freqs-allcities-allpublishers.zip: Files listing the frequency of each searchable word in the Shared Lexis Tool in each year 1701-1800, for all documents in Eighteenth Century Collections Online that were published in Aberdeen, Edinburgh, or Glasgow.
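
As an illustration of the Gini coefficient mentioned in the description of ecco-donut..gini.tar.gz above, the following Python sketch computes it from a list of per-document relative frequencies for one word in one year. It uses the standard sorted-values formula and is not taken from code-ginicoefficient.tar; whether documents in which the word never appears were included as zeros in the original computation is not stated here, so that choice is left to the caller.

```python
def gini(values):
    """Gini coefficient of a list of non-negative numbers.

    0.0 means the values are perfectly equal; values approaching 1.0 mean
    that nearly all of the total is concentrated in very few items.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0  # convention chosen for this sketch
    # Standard formula: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n,
    # with x sorted in ascending order and i running from 1 to n.
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n
```

For example, a word that appears at the same relative frequency in every document of a year gives 0.0, while a word confined to a single document out of four gives 0.75.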



IV. FILES ASSOCIATED WITH RECCHIA, G. ET AL. (2016) AND JONES, E. ET AL. (2019)


code-vector_similarity.zip: Code for computing the cosine similarity between two vectors generated by the Shared Lexis Tool. Includes differences between vectors referring to different forms of government, discussed in Jones et al. (2019).

code-word2vec_similarities.tar: Code for computing similarities between vectors generated by word2vec. Used to compute similarities discussed in Recchia et al. (2016).

driftalod-code_and_data.zip: Source code and resulting data corresponding to sections 1-3 of Recchia et al. (2016)

driftalod-code_only.zip: Source code corresponding to sections 1-3 of Recchia et al. (2016) (no data, code only)

driftalod-future_work.zip: Code and data corresponding to section 4 of Recchia et al. (2016)

code-driftalod_sort.tar: Code for sorting cliques by size and summarizing other properties of cliques in various ways

histwords-eng-all.zip: A modified (unpickled) version of the HistWords English vectors available at
https://nlp.stanford.edu/projects/histwords/ as of 29 Aug 2016 (available under the Public Domain Dedication and License v1.0: https://opendatacommons.org/licenses/pddl/ )

MutualDependencySets.zip: Lists of cliques (https://en.wikipedia.org/wiki/Clique_(graph_theory)) of words that frequently appear together within various corpora, calculated for various clique sizes. These at one time were referred to by the Concept Lab as "mutual dependency sets".

MutualDependencySets-old-data.zip: Lists of cliques of words that frequently appear together within Eighteenth Century Collections Online, calculated for various clique sizes. These at one time were referred to by the Concept Lab as "mutual dependency sets". NOTE: Some of these lists of cliques were generated with an older version of 'code-clique_computer.tar' that was less reliable, and may be incomplete.

code-clique_computer.tar: Code for identifying cliques of particular sizes. A clique of size N in this context refers to a set of N words, each of which is highly statistically associated with every other word in the clique.



V. SOURCE CODE (OTHER THAN CODE ALREADY DESCRIBED IN SECTIONS I-IV) AND ASSOCIATED DATA

code-autosearch_analyser.tar: Code for sorting words based on the percentage of stopwords that appear in their 'lists' (see Section I)

code-bytepack.tar: Code for computing the 'list' corresponding to a particular query word in Eighteenth Century Collections Online without requiring the manual use of the Shared Lexis Tool; useful for computing very large numbers of lists

code-clique_counter.tar: Code that takes as input files generated by the code in 'code-clique_computer.tar', and generates as output a summary of the number of cliques of specific sizes.

code-clique_prediction.tar: Code that tests whether cliques with empty "common sets" (i.e., cliques such that no word x in the clique has a word y that is (1) strongly associated with x, and (2) strongly associated with at least one other word in the clique, and (3) y is not one of the words in the clique) are better predictors of what words are likely to appear in a document than cliques that do not have empty common sets

code-conceptualcore-cleanup.tar: Code for processing a large number of automatically generated outputs of the shared lexis tool to determine what happens to the "conceptual core" as distance from the query token increases.

code-concreteness_annotator.tar: Code for annotating nouns that count as 'concrete' (concreteness rating > 3) or 'abstract' (concreteness rating < 3) according to the following set of concreteness norms: Brysbaert, M., Warriner, A.B., & Kuperman, V. (2014). Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46, 904-911.

code-corpus_integrity_verifier.tar: Code for verifying that no files had gone missing after transferring a very large number of files from a failing hard drive to a new one.

code-corpus-formatter.tar: The code in this file is for preparing corpora for integration into the Concept Lab tools. Every corpus is different in terms of the original format of the files, and some amount of custom code nearly always has to be written. However, the comments and example code in this project will provide guidance that can hopefully be adapted to match the needs of any specific case.

code-corrector.tar: Code for correcting optical character recognition errors. Modified from Ted Underwood's OCR Normalizer https://github.com/tedunderwood/DataMunging/tree/master/OCRnormalizer circa 6 July 2017.

code-count.tar: Main code for counting co-occurrences and creating the precomputed files.

code-curve_investigator.tar: Code for investigating differences between the "best fit" to a word's DPF curve and the actual data. 

code-decrypter.tar: Code for decrypting files that had been encrypted by code-encrypter.tar.

code-encrypter.tar: Code for encrypting private files.

code-doc_dissimilarity.tar: This code was never completed and is nonfunctional.

code-doc-counter.tar: Code for computing document counts, Kullback-Leibler divergences, standard deviations, and Gini coefficients for words in ECCO (see descriptions of ecco-donut..docs.tar.gz, ecco-donut..gini.tar.gz, ecco-donut..stdev.tar.gz earlier in this file for more info.)

code-dpfquery.tar: Code to hold all co-occurrences for all ECCO lexicon words in main memory for a single year (1701), enabling extremely fast queries for one year only

code-ecco_xmlstripper.tar: Code for stripping tags from the XML of Eighteenth Century Collections Online, among other corpora.

code-ecco-sliding-cleanup.tar: Code for appending the frequencies of "extra" words (infrequent words that are not part of the main ECCO lexicon, but for which co-occurrences were desired nonetheless) to the main "frequency file" for ECCO, so that the Shared Lexis Tool would be aware of the global frequencies of these words in the corpus as a whole, and would be able to find them. This code has since been integrated into code-count.tar.

code-ecco-subsetter.tar: Each document within ECCO is classified in the metadata as belonging to one of 8 genres: Fine Arts; General Reference; History and Geography; Law; Literature and Language; Religion and Philosophy; Science, Medicine and Technology; and Social Sciences. This code subsets ECCO to generate a subcorpus for each genre.

code-eccotcp-formatter.tar: Code for formatting ECCO-TCP so that it is appropriately formatted to serve as input to word2vec.

code-eebo-cleaner.tar: Code for 'cleaning' EEBO input files (removing tags, fixing HTML entities, etc.)

code-format_clique_csvs.tar: Code that takes as input files of the form clique-common-sets-size-[X].csv, which are generated by the code in code-clique_computer.tar, and postprocesses them to create new csvs that contain subsets of the cliques in the original csvs filtered by various properties.

code-ginicoefficient.tar: Code for computing Gini coefficients described in section III.

code-grepper.tar: Code for searching plain text files for simple strings, producing KWIC files as output

code-hiddenlexis.tar: Experimental tool that permits an individual to paste in an eighteenth-century document, highlight a word, and generate alternative words that one might expect to appear in that position, based on the surrounding words and the statistics of Eighteenth Century Collections Online

code-indexer.tar: Code for generating "index files", which the grapher requires for each corpus

code-index_validator.tar: Code to identify and rectify discrepancies in "index files" caused by the indexer skipping over documents that were blank (i.e., which corresponded to empty lines) after cleaning

code-john_publishers.tar: Code for extracting the previously mentioned 'Scottish publishers' subcorpora of Eighteenth Century Collections Online: edinburgh, aberdeen, glasgow, edinburgh-filtered, aberdeen-filtered, glasgow-filtered, all3cities, and allpublishers

code-john_publishers_descriptives.tar: Code for summarizing various descriptives of the 'Scottish publishers' subcorpora of Eighteenth Century Collections Online: total word count, total document count, word counts by decade, document counts by decade, word counts by year, document counts by year

code-line_counter.tar: Code for counting the number of lines in a file.

code-punctuation_attacher.tar: Code for 'reattaching' punctuation to files from which punctuation was stripped, while maintaining the property that words are separated by single spaces.

code-word2vec_vocab_generator.tar: Code for converting a word list (vector) outputted by the Shared Lexis Tool into a vocabulary file suitable for inputting into word2vec.

code-old-stuff.tar: Code from analyses in the early Concept Lab which did not result in publications or other outputs. Includes earlier versions of several of the scripts mentioned above.

OneDrive-Code-May-2017.zip: Code from analyses in the early Concept Lab which did not result in publications or other outputs. Includes earlier versions of several of the scripts mentioned above.


--------------------
INSTRUCTIONS FOR USE
--------------------

The Shared Lexis Tool and Coassociation Grapher (and the required precomputed files) are being hosted at https://concept-lab.lib.cam.ac.uk/ from 2018 to 2023. Currently, when either of these Windows applications requires precomputed files, it retrieves them from https://concept-lab.lib.cam.ac.uk/. If these files are no longer hosted at https://concept-lab.lib.cam.ac.uk/, they will need to be hosted elsewhere in order to make use of either of these tools.

Furthermore, the constant "ConceptLabWebRoot" in the file Corpus.cs within the Shared Lexis Tool source code will need to be altered accordingly, replacing "https://concept-lab.lib.cam.ac.uk/" with the web domain that hosts the precomputed files. The relevant line of code to change is as follows:

public static string ConceptLabWebRoot = "https://concept-lab.lib.cam.ac.uk/";

Recompile the code (it was originally compiled in Microsoft Visual Studio Community Edition 2017, but newer editions should also work) and it should now point to the new domain.

The majority of the precomputed files described in Section III above need to be hosted in a directory called "counts" in order for the Shared Lexis Tool to be able to find them. This directory should be an immediate subfolder of the primary domain - e.g. https://concept-lab.lib.cam.ac.uk/counts . 

Each file with a filename of the form 5-counts-[CORPUSNAME]..clean-dist[DISTANCE].tar contains many smaller files with the extension .gz. In order to allow the Shared Lexis Tool to find these files, they must each be untarred and placed at the following directory within the primary domain:

counts/[CORPUSNAME]..clean/dist_[DISTANCE]/FILENAME

For example, 5-counts-eebo..clean-dist10.tar includes the file 000000.the.csv.gz, among others. It needs to be placed at counts/eebo..clean/dist_10/000000.the.csv.gz. To accomplish this, the contents of 5-counts-eebo..clean-dist10.tar would need to be untarred and the contents placed in the folder counts/eebo..clean/dist_10 on your online server. It is only necessary to do this with the corpora you plan to use.
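
The mapping from archive name to server directory can be sketched in Python as follows. The function names, the regular expression, and the decision to drop any directory prefixes stored inside the archive are assumptions for illustration (the internal layout of the archives is not described here), so inspect an archive before relying on this.

```python
import re
import tarfile
from pathlib import Path

# Matches names such as 5-counts-eebo..clean-dist10.tar
ARCHIVE_RE = re.compile(r"^5-counts-(?P<corpus>.+)\.\.clean-dist(?P<distance>\d+)\.tar$")


def destination_for(archive_name):
    """Return the relative directory counts/[CORPUSNAME]..clean/dist_[DISTANCE]."""
    m = ARCHIVE_RE.match(archive_name)
    if m is None:
        raise ValueError(f"not a 5-counts archive name: {archive_name!r}")
    return Path("counts") / f"{m['corpus']}..clean" / f"dist_{m['distance']}"


def unpack(archive_path, web_root):
    """Copy every regular file in the archive into the matching directory under web_root."""
    archive_path = Path(archive_path)
    target = Path(web_root) / destination_for(archive_path.name)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path) as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            # Assumption: keep only the base filename, discarding any
            # directory prefix recorded inside the archive.
            data = tar.extractfile(member).read()
            (target / Path(member.name).name).write_bytes(data)
    return target
```

Here web_root stands for the local directory that your webserver serves as the primary domain. Calling destination_for("5-counts-eebo..clean-dist10.tar") gives counts/eebo..clean/dist_10, matching the example above.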
	
Furthermore, if you plan to access metadata for ECCO using the Shared Lexis Tool, the accompanying files ecco-donut..docs.tar.gz, ecco-donut..gini.tar.gz, ecco-donut..spec.tar.gz, and ecco-donut..stdev.tar.gz need to be untarred and placed in directories named ecco-donut..docs, ecco-donut..gini, ecco-donut..spec, and ecco-donut..stdev, respectively. These directories should be immediate subfolders of the primary domain - e.g. https://concept-lab.lib.cam.ac.uk/ecco-donut..docs.

To use the Coassociation Grapher, there are precomputed files that need to be hosted in a directory called "indices". This "indices" folder is private and is not included in this collection; please contact Peter de Bolla for access. This directory should be located in an immediate subfolder of the primary domain - e.g. https://concept-lab.lib.cam.ac.uk/indices . Its internal structure should already be correct.

To use the Concept Lab-only features of the Shared Lexis Tool described in pp. 13-14 of Shared_Lexis_Documentation.docx, there are precomputed files that need to be hosted in directories called "indices", "_corpora", and "metadata". These folders are private and are not included in this collection; please contact Peter de Bolla for access. These directories should be immediate subfolders of the primary domain - e.g. https://concept-lab.lib.cam.ac.uk/indices .

-------------------------
DATA-SPECIFIC INFORMATION
-------------------------

This section provides details about the file format of the precomputed files available in this dataset.

For each corpus, the Shared Lexis Tool contains "frequency files" with lists of the searchable words in reverse frequency order, where each line is of the format 

word + tab character + frequency

These are referred to as 'whole-corpus frequency files', and the order that words appear in this file is treated as their “canonical ordering”. See the documentation that accompanies the Shared Lexis Tool for more information.
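
A minimal Python sketch of reading such a frequency file into a canonical ordering is given below. The function name is hypothetical, and the sketch assumes that the frequency column holds an integer count.

```python
def read_canonical_ordering(lines):
    """Parse 'word<TAB>frequency' lines.

    Returns (words, freqs): words is the list of words in file order (the
    canonical ordering) and freqs maps each word to its frequency.
    """
    words = []
    freqs = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue  # ignore blank lines, e.g. at the end of the file
        # Split on the last tab so that only the final field is the count.
        word, freq = line.rsplit("\t", 1)
        words.append(word)
        freqs[word] = int(freq)  # assumption: integer counts
    return words, freqs
```

With a file opened as f, read_canonical_ordering(f) returns the word list whose positions are what the co-occurrence files refer to.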

For our analyses, the 'window size' for computing co-occurrences has been fixed at 5, so the words that are counted as ‘co-associates’ of a word such as 'science' at “distance 100” are those words that appear exactly 98, 99, 100, 101, and 102 words before & after any instance of 'science'.
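
The window arithmetic above can be written out in a short Python sketch. It is not the counting code in code-count.tar; the function names are hypothetical, and the sketch treats the corpus as one flat token list, ignoring document boundaries and any other details of the original method.

```python
from collections import Counter


def window_offsets(distance, window_size=5):
    """Token offsets covered by a window of `window_size` centred on `distance`."""
    half = window_size // 2
    return list(range(distance - half, distance + half + 1))


def count_coassociates(tokens, focal, distance, window_size=5):
    """Count tokens found at the window offsets before and after each focal token."""
    offsets = window_offsets(distance, window_size)
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != focal:
            continue
        for k in offsets:
            for j in (i - k, i + k):  # k tokens before and k tokens after
                if 0 <= j < len(tokens):
                    counts[tokens[j]] += 1
    return counts
```

For distance 100, window_offsets(100) returns [98, 99, 100, 101, 102], the positions listed above.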

The easiest way to explain how the co-occurrence files are laid out is by example. Suppose the initial 8 lines of some corpusname.dD.word.txt were as follows:

!1701
104
60

33
!1702
170
110

This would mean that at distance D, in documents published in 1701, "word" (the word named in the filename) appeared with the first word in the canonical ordering 104 times, appeared with the second word in the canonical ordering 60 times, did not appear with the third word in the canonical ordering, appeared with the fourth word in the canonical ordering 33 times, and did not appear with any other word in the lexicon. For documents published in 1702, it appeared with the first word in the canonical ordering 170 times, the second word in the canonical ordering 110 times, etc.
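
The layout in the example above can be read with a short Python sketch such as the following. The function name and the returned structure are illustrative assumptions, and positions in the canonical ordering are numbered from zero here (so the "first word" is position 0).

```python
def parse_cooccurrence(lines):
    """Parse a co-occurrence file into {year: {position: count}}.

    A line starting with '!' opens a new year. Each following line gives the
    count for the next position in the canonical ordering; a blank line means
    a count of zero, and positions past the last line are also zero.
    """
    by_year = {}
    year = None
    position = 0
    for raw in lines:
        line = raw.strip()
        if line.startswith("!"):
            year = int(line[1:])
            by_year[year] = {}
            position = 0
            continue
        if year is None:
            raise ValueError("count line found before any !YEAR header")
        if line:
            by_year[year][position] = int(line)  # only non-zero entries are stored
        position += 1
    return by_year
```

Applied to the eight example lines, this yields counts of 104, 60 and 33 at positions 0, 1 and 3 for 1701, and 170 and 110 at positions 0 and 1 for 1702. Positions can be turned back into words by indexing into the canonical ordering taken from the corresponding whole-corpus frequency file.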