Construction Management Systems (CMS) Domain Corpora

This repository contains the first CMS domain corpus, built from academic publications on CMS. It accompanies the paper "Developing First Corpus in Construction Management Systems (CMS) Domain and Pre-trained Domain-specific Language Models".

Data Collection Procedure

Source: The corpus comprises scholarly journal papers, conference papers, articles, whitepapers, and select books related to CMS.
Collection: Academic publications were retrieved from the first 99 pages of Google Scholar search results using the keyword "construction management".
Volume: Of 732 earmarked papers, 60 were discarded because of text recognition issues in older PDFs, leaving 672. The corpus statistics are:
- 5.7 million words in total.
- 4.5 million words excluding references.
- 7.7 million tokens.
- 5.8 million tokens excluding references (tokenized with the BERT tokenizer).
- Over 90% of sentences are shorter than 40 tokens.
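
The statistics above can be approximated on any list of sentences. The following is a minimal sketch, not the script used to produce the reported figures: `corpus_stats` is a hypothetical helper, and it splits on whitespace as a stand-in for the BERT tokenizer, so its counts correspond to words rather than BERT tokens.

```python
def corpus_stats(sentences, max_len=40):
    """Summarise a list of sentences.

    Assumption: whitespace splitting stands in for the BERT tokenizer,
    so the numbers will differ from the token counts reported above.
    """
    token_counts = [len(s.split()) for s in sentences]
    short = sum(1 for n in token_counts if n < max_len)
    return {
        "sentences": len(sentences),
        "words": sum(token_counts),
        # Share of sentences strictly shorter than max_len tokens.
        "share_short": short / len(sentences) if sentences else 0.0,
    }
```

Calling `corpus_stats(sentences)` with the default `max_len=40` gives the share of sentences shorter than 40 whitespace-separated tokens.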

Data Cleaning and Pre-processing

The CMS domain corpus was cleaned and pre-processed in the following steps:

PDF to Text: Automatic conversion of PDF files to plain text using Adobe Automation.
URL Removal: Removal of website links, which may not provide substantive content.
Language Filter: Retention of English text only, at the paragraph level; unrecognizable characters and non-English content were discarded.
Sentence Segmentation: Division of paragraphs into individual sentences, and removal of paragraphs without terminal punctuation marks.
Short Sentence Filter: Removal of extremely short sentences to eliminate traces of formulas and tables.
Reference Filtering: Exclusion of references within papers to focus on context-rich content. Two datasets, with and without references, were generated.
Duplicate Removal: Removal of duplicate sentences so that each sentence is unique.

Cleaning removed approximately 5% of the words in the original dataset. Reference filtering reduced the word count by approximately a further 20%.
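
Several of the steps above (URL removal, dropping paragraphs without terminal punctuation, sentence segmentation, the short sentence filter, and duplicate removal) can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used to build the corpus: the regular expressions, the `min_words=5` threshold, and the name `clean_paragraphs` are all hypothetical, and the language filter and reference filtering are left out.

```python
import re

# Assumption: a simple pattern for links; a URL directly followed by
# punctuation would take that punctuation with it.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
# Assumption: naive segmentation on whitespace after ., ! or ?
SENT_SPLIT_RE = re.compile(r"(?<=[.!?])\s+")


def clean_paragraphs(paragraphs, min_words=5):
    """Return unique sentences from paragraphs, in first-seen order."""
    seen = set()
    sentences = []
    for para in paragraphs:
        text = URL_RE.sub("", para)
        text = " ".join(text.split())  # collapse leftover whitespace
        # Drop paragraphs that do not end in terminal punctuation.
        if not text or text[-1] not in ".!?":
            continue
        for sent in SENT_SPLIT_RE.split(text):
            # Short sentence filter (threshold is an assumption).
            if len(sent.split()) < min_words:
                continue
            # Duplicate removal.
            if sent in seen:
                continue
            seen.add(sent)
            sentences.append(sent)
    return sentences
```

A dedicated sentence segmenter would handle abbreviations and decimal numbers better than the regular expression used here.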

Note on the Released Data

This repository provides both the raw (uncleaned, unprocessed) data and the cleaned data. Because no pre-processing procedure is optimal for all tasks, we provide the unprocessed data so that researchers can adopt or design pre-processing methods tailored to their specific requirements.
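
To apply custom pre-processing, the text can be read straight from a downloaded archive. The sketch below assumes the archive holds plain-text `.txt` members encoded as UTF-8; the internal layout of the archives is not documented here, and `iter_text_files` is a hypothetical helper.

```python
import zipfile


def iter_text_files(archive):
    """Yield (member_name, text) for each .txt member of a zip archive.

    `archive` may be a path or a binary file-like object.
    Assumption: members are UTF-8; undecodable bytes are replaced.
    """
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if name.lower().endswith(".txt"):
                with zf.open(name) as fh:
                    yield name, fh.read().decode("utf-8", errors="replace")
```

For example, `for name, text in iter_text_files("cleaned_papers.zip"): ...` would iterate over the cleaned papers if that archive follows the assumed layout.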

Repository files:
- LICENSE
- README.md
- all_papers_in_one_file.txt.zip
- all_txt.zip
- cleaned_papers.zip

Repository: zhongyunshun/domain-corpora