1. enKORE corpus processor code corpora
The main directory contains a number of files and three sub-directories (corpus, lib, and logs). The files are described first, followed by the contents of the three sub-directories.
The README file contains basic instructions to download and run the code locally.
The LICENSE file contains the GNU General Public License.
This file (deno.jsonc) is the configuration file used by Deno. It references the import map (import_map.json), defines the run tasks (see below), and sets where the Wikidata entries are written (entries.json, or any other name given in the task). Note: Deno is a runtime for JavaScript, designed to offer better package management and a more secure runtime than Node (e.g. through deno.lock, explained further below).
Note: Deno must be installed (see the dependencies for how to install it).
Note: For debugging with Deno, see also
https://deno.land/manual/basics/debugging_your_code.
Note: For debugging with node, see also
https://www.youtube.com/watch?v=i9hOCvBDMMg.
Note: The code does not read entries from an existing (entries.json) unless you request it by setting pull to false (-p false) and giving the file to read (-f).
For example:
"process-read-entries_14867_23Feb2023": "deno run -A --unstable process.js -p false -f ./corpus/entries_14867_23Feb2023.json"
Details about the main arguments controlling the process are available at (process.js).
async function processArgs(args) {
  const parsedArgs = parse(args, {
    string: ["entries", "filename"],
    alias: { pull: "p", url: "u", ……
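To illustrate how the short flags used in the task above can be mapped to long option names, here is a minimal sketch in plain JavaScript. It is not the project's processArgs, which relies on a parse helper whose full option list is truncated in the excerpt. The function name parseFlags and the alias "f" for "filename" are assumptions made for illustration only.

```javascript
// Hypothetical helper, NOT the project's processArgs: a hand-rolled sketch
// of how short flags such as -p and -f can be mapped to long names.
const ALIASES = { p: "pull", u: "url", f: "filename" }; // "f" is an assumption

function parseFlags(args) {
  const parsed = {};
  for (let i = 0; i < args.length; i++) {
    const arg = args[i];
    if (!arg.startsWith("-")) continue; // skip stray positional values
    const key = arg.replace(/^-+/, ""); // strip leading dashes
    const name = ALIASES[key] ?? key; // expand a short alias if known
    const next = args[i + 1];
    if (next !== undefined && !next.startsWith("-")) {
      // Convert the literal strings "true"/"false" to booleans.
      parsed[name] = next === "true" ? true : next === "false" ? false : next;
      i++; // the value has been consumed
    } else {
      parsed[name] = true; // bare flag without a value
    }
  }
  return parsed;
}

const flags = parseFlags(["-p", "false", "-f", "./corpus/entries_14867_23Feb2023.json"]);
console.log(flags);
```

With the arguments from the example task, the sketch yields an object in which pull is false and filename holds the path of the entries file to read.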
Note: The location given in (deno.jsonc) for saving (entries.json) takes precedence over the one in (config.js). Whenever a task in (deno.jsonc) specifies where to save (entries.json), the (config.js) value is ignored.
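The precedence rule described in this note can be sketched as follows. The names defaults, entriesFile and resolveEntriesFile are hypothetical, and the default path is an assumption; the real (config.js) and (process.js) may be structured differently.

```javascript
// Hypothetical names and default path, for illustration only.
const defaults = { entriesFile: "./corpus/entries.json" }; // assumed config.js value

function resolveEntriesFile(cliArgs, cfg) {
  // A filename passed by the task wins; otherwise fall back to the config default.
  return cliArgs.filename ?? cfg.entriesFile;
}

// A task that passes a filename overrides the default.
console.log(resolveEntriesFile({ filename: "./corpus/entries_14867_23Feb2023.json" }, defaults));
// A task without a filename falls back to the config value.
console.log(resolveEntriesFile({}, defaults));
```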
Note: The file (entries.json) is created automatically inside the directory (…sor-main/corpus). It is the list of all Wikidata elements to be downloaded later.
This file (deno.lock) is the lock file used for integrity checking of cached dependencies during processing.
Note: This file is automatically created by Deno.
This file (import_map.json) brings in the plugins and functions the code needs (e.g. for Deno, for xmlbuilder, for Wikidata).
One of the most important plugins is citation.js, because it is used to extract the XML information from Wikidata. However, we are using an adapted version of citation.js (enKORE_citation_js_plugin).
This file (process.js) is the main conductor of this coding orchestra. It draws the information it needs from two files explained further below (xmlexporter.js and config.js).
This file (xmlexporter.js) is located inside the folder (lib). It contains indications of the various open licenses available, and it defines the information to be written to the XML files saved inside the directory (.../corpus/processed).
This file (config.js) provides the URLs used to extract entries from Wikidata. In this file one can also adjust the sleeping time (ProcessDelay) between batches (batchSize) to avoid stressing the servers.
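The idea of pausing between batches can be sketched as below. This is a minimal illustration, not the project's code: the option names batchSize and processDelay mirror the settings named above, while the helper names, the values, and the stand-in fetcher are assumptions.

```javascript
// Sketch only: helper names, values and the fetchOne callback are assumptions.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Split a list into consecutive chunks of at most batchSize items.
function toBatches(items, batchSize) {
  const batches = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

async function processInBatches(items, { batchSize, processDelay }, fetchOne) {
  const results = [];
  const batches = toBatches(items, batchSize);
  for (let b = 0; b < batches.length; b++) {
    // Items within one batch are requested concurrently.
    results.push(...(await Promise.all(batches[b].map(fetchOne))));
    // Pause between batches (not after the last one) to spare the servers.
    if (b < batches.length - 1) await sleep(processDelay);
  }
  return results;
}

// Usage with a stand-in fetcher; the IDs are placeholders.
processInBatches(["Q1", "Q2", "Q3"], { batchSize: 2, processDelay: 10 }, async (id) => id.toLowerCase())
  .then((out) => console.log(out));
```

A larger processDelay or a smaller batchSize lowers the request rate at the cost of a longer total run time.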
config.js also indicates where the log is saved (filename: "./logs/process.log").
config.js also lists the APIs to be queried, which currently are:
- CrossRef (name: "crossref")
- Pubmed (name: "pubmed")
- PubMedCentral (name: "pmc")
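As a rough picture of how these settings might sit together, the sketch below uses a hypothetical object shape. Only the log filename and the three API names come from the text above; the property names log and apis, and the helper isApiEnabled, are assumptions and may not match the real (config.js).

```javascript
// Hypothetical shape: the property names "log" and "apis" are assumptions.
const exampleConfig = {
  log: { filename: "./logs/process.log" },
  apis: [{ name: "crossref" }, { name: "pubmed" }, { name: "pmc" }],
};

// Check whether a given API name is listed in the configuration.
function isApiEnabled(cfg, name) {
  return cfg.apis.some((api) => api.name === name);
}

console.log(isApiEnabled(exampleConfig, "pmc"));
```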
Note: Do not delete the directory (/logs), even when it is empty, since the code requires this location to exist.
The directory (corpus) contains the sub-directory (/processed) and hosts the file (entries.json) that is generated during preliminary processing. As mentioned before, entries.json contains the list of Wikidata elements for extraction.
The directory (processed) stores the XML files, which contain the metadata of each Wikidata publication. Do not delete this directory because it must exist.
The directory (lib) contains the above-described file (xmlexporter.js).
The directory (logs) hosts the above-described log file (process.log).