
1. enKORE corpus processor code corpora

Andutta edited this page May 15, 2023 · 24 revisions

The main directory contains a number of files and three sub-directories (corpus, lib, and logs). The sections below describe each file and directory in turn.

1.1. README.md

It contains basic instructions to download and run the code locally.

1.2. LICENSE.txt

It contains information about the GNU GENERAL PUBLIC LICENSE.

1.3. deno.jsonc

This file is the configuration file used by Deno. It points to the import map (import_map.json), contains the information about the run call (see below), and denotes where the Wikidata entries are written (entries.json or any other name given in the script). Note: Deno is a runtime for JavaScript, not a package manager; it was designed to have a better package management system and a more secure runtime than Node (see, e.g., deno.lock, explained further below).

Note: Deno must be installed (see the dependencies for installation instructions).

Note: For debugging with deno, see also

https://deno.land/[email protected]/basics/debugging_your_code.

Note: For debugging with node, see also

https://www.youtube.com/watch?v=i9hOCvBDMMg.

Note: The code does not read the entries from (entries.json) unless you request it by setting pull to false (-p false) and giving the file to read (-f).

For example:


"process-read-entries_14867_23Feb2023": "deno run -A --unstable process.js -p false -f ./corpus/entries_14867_23Feb2023.json"


Details about the main arguments controlling the process are available in (process.js).


async function processArgs(args) {
  const parsedArgs = parse(args, {
    string: ["entries", "filename"],
    alias: {
      pull: "p",
      url: "u",
      ……
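To illustrate how the alias table in the excerpt above relates to a call such as `-p false -f …`, here is a minimal, self-contained sketch. It does not reproduce the `parse` call used in process.js: `resolveFlags` is a hypothetical helper, the `filename: "f"` alias is an assumption inferred from the example task, and converting the string "false" into a boolean is also assumed for illustration only.

```javascript
// Hypothetical sketch: NOT the project's processArgs. It only shows how short
// aliases such as -p and -u could be mapped back to long option names.
function resolveFlags(args, aliases) {
  // Build a reverse lookup: short letter -> long name (e.g. "p" -> "pull").
  const shortToLong = {};
  for (const [longName, shortName] of Object.entries(aliases)) {
    shortToLong[shortName] = longName;
  }

  const options = {};
  for (let i = 0; i < args.length; i++) {
    const token = args[i];
    if (!token.startsWith("-")) continue; // ignore stray positional values
    const key = token.replace(/^-+/, "");
    const name = shortToLong[key] ?? key;
    const next = args[i + 1];
    // A flag followed by another flag (or by nothing) is treated as `true`.
    if (next === undefined || next.startsWith("-")) {
      options[name] = true;
    } else {
      // Assumption: the literal strings "true"/"false" become booleans.
      options[name] = next === "true" ? true : next === "false" ? false : next;
      i++; // skip the value that was just consumed
    }
  }
  return options;
}

// pull/url come from the excerpt above; filename -> "f" is an assumption.
const aliases = { pull: "p", url: "u", filename: "f" };
const options = resolveFlags(
  ["-p", "false", "-f", "./corpus/entries_14867_23Feb2023.json"],
  aliases,
);
console.log(options);
```

With the arguments of the example task, the sketch yields `pull` set to `false` and `filename` set to the given path, which matches the behaviour described in the note on reading entries from a file.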


Note: The location of (entries.json) given in (deno.jsonc) takes precedence over the one in (config.js). Whenever (deno.jsonc) provides this information, the (config.js) entry is effectively overridden.
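The precedence described in this note can be pictured as a simple merge in which a value supplied on the command line (through the task in deno.jsonc) wins over the default from config.js. The sketch below shows only that general pattern and is not the project's code: `resolveSettings`, `defaults`, and the `entriesFile` key are hypothetical names, and the default path is a placeholder.

```javascript
// Hypothetical sketch of "command line overrides config.js".
// The default below stands in for whatever config.js actually holds.
const defaults = { entriesFile: "./corpus/entries.json" };

function resolveSettings(configDefaults, cliOptions) {
  const merged = { ...configDefaults };
  for (const [key, value] of Object.entries(cliOptions)) {
    // Only override when the command line actually supplied a value.
    if (value !== undefined) merged[key] = value;
  }
  return merged;
}

// A task that names a file: the command-line value wins.
const fromTask = resolveSettings(defaults, {
  entriesFile: "./corpus/entries_14867_23Feb2023.json",
});
// No file given on the command line: the config.js default is kept.
const noOverride = resolveSettings(defaults, { entriesFile: undefined });

console.log(fromTask.entriesFile);
console.log(noOverride.entriesFile);
```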

Note: The file (entries.json) is created automatically inside the directory (…sor-main/corpus). It holds the list of all Wikidata elements to be downloaded.

1.4. deno.lock

Deno uses this file for integrity checking of cached dependencies (i.e. caching and lock files) during processing.

Note: This file is automatically created by deno.

1.5. import_map.json

This file maps the imports of the required plugins and functions (e.g. for deno, for xmlbuilder, for wikidata).

One of the most important plugins is citation.js, because it is used to extract the information for the XML files from Wikidata. Note, however, that the code uses an adapted version of citation.js (enKORE_citation_js_plugin).

1.6. process.js

This file is the main conductor of this coding orchestra. It draws the information it needs from the files it calls (xmlexporter.js and config.js), which are explained below.

1.7. xmlexporter.js

This file is located inside the directory (lib). It lists the various open licenses available and defines the information to be written to the XML files saved inside the directory (.../corpus/processed).

1.8. config.js

This file provides the following URLs:

  • URL at Invasion-Biology-Page
  • URL displaying query
  • URL-json-request

These are used to extract entries from Wikidata. In this file one can also adjust the sleeping time (ProcessDelay) between batches (batchSize) to avoid stressing the servers.

config.js indicates where the log must be saved: [filename: "./logs/process.log"].
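The idea of pausing between batches can be sketched as follows. This is a minimal illustration under assumptions, not the code in process.js: `toBatches`, `sleep`, `processInBatches`, and `handleItem` are hypothetical helpers, the identifiers are placeholders, and the values chosen for `batchSize` and `processDelay` are not the project's settings.

```javascript
// Hypothetical sketch of delayed batch processing.

// Split a list into consecutive groups of at most `batchSize` items.
function toBatches(items, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Resolve after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Handle the items of one batch concurrently, then pause before the next
// batch so that the remote servers are not hit continuously.
async function processInBatches(items, { batchSize, processDelay }, handleItem) {
  const results = [];
  const batches = toBatches(items, batchSize);
  for (let b = 0; b < batches.length; b++) {
    const done = await Promise.all(batches[b].map((item) => handleItem(item)));
    results.push(...done);
    if (b < batches.length - 1) await sleep(processDelay); // no pause after the last batch
  }
  return results;
}

// Usage with placeholder identifiers and a stand-in handler.
const ids = ["Q1", "Q2", "Q3", "Q4", "Q5"];
processInBatches(ids, { batchSize: 2, processDelay: 10 }, async (id) => id.toLowerCase())
  .then((done) => console.log(done));
```

A larger delay or a smaller batch size lowers the request rate at the cost of a longer total run time.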

config.js also provides the information about APIs to be considered, which currently are:

  • CrossRef (name: "crossref")
  • Pubmed (name: "pubmed")
  • PubMedCentral (name: "pmc")

Note: Even when it is empty, do not delete the directory (/logs), since the code requires this location to exist.

1.9. /corpus

This directory contains the sub-directory (/processed) and hosts the file (entries.json), which is generated during preliminary processing. As mentioned before, entries.json contains the list of Wikidata elements for extraction.

1.10. /corpus/processed

This directory stores the XML files, which contain the metadata of each Wikidata publication. Do not delete this directory, because it must exist.

1.11. /lib

This directory contains the file (xmlexporter.js) described above.

1.12. /logs

This directory hosts the file (process.log) described above.