This package requires only a standard computer with enough RAM to support the in-memory operations. Models were trained on an NVIDIA Tesla V100 from the BWUniCluster 2.0.
This package is supported for Linux. The package has been tested on the following systems:
- Linux: Red Hat Enterprise Linux release 9.4 (Plow)
This repo's code depends mainly on:
- pandas
- nltk
- rake_nltk
- fire
- langdetect
- scipy
- networkx
- pyarrow
- tqdm
- jupyter
- torch
- sklearn
- transformers
Create a new virtual environment using venv:
python -m venv .venv
source .venv/bin/activate
Install the required dependencies from requirements.txt:
pip install -r requirements.txt
Install the local package in editable mode:
python -m pip install --no-deps --disable-pip-version-check -e .
Installation should take 5-10 min.
Notes on how to reproduce the AUROC values present in the paper can be found in Reproduce.md.
This is a demo of the best-performing model in our paper (the mixture), utilizing both input modalities (topological features and semantic information).
Refer to materials_concepts/predict/README.md, which guides you through running materials_concepts/predict/main.py. After you specify where the model and feature files can be found, the "CLI App" lets you enter your concepts. It outputs a report containing your own concepts, enriched with top suggestions for new concepts that might be interesting to combine.
If you don't find the proposed concepts relevant, feel free to increase the number of proposed concepts and to play around with the use_min_depth_of_threshold setting.
Note: You either need to download the figshare data, or you can run the whole process on a small test set stored in this repo.
Create a top-level data/ folder with a table/ subfolder to store the dataset.
$ python materials_concepts/dataset/downloader/download_sources.py --query 'materials science' --out data/table/test_3810.csv
This will create a data/materials-science.sources.csv file with all the sources.
Fetch works from a single source:
$ python materials_concepts/dataset/downloader/download_works.py fetchsingle --source S82336448 --out data/table/S82336448.works.csv
This will create a S82336448.csv file with all the works belonging to that source.
Fetch works from all sources:
$ python materials_concepts/dataset/downloader/download_works.py fetchall --sources data/materials-science.sources.csv --out data/table/materials-science.works.csv
During fetching, this will create a {source}.csv file for each source in the cache folder, listing all the works which belong to that source. After downloading, these are merged automatically into the single file given by --out. If the download gets interrupted, the downloaded files serve as a cache: re-run the script and it will automatically skip sources whose data was already fetched.
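The cache-then-merge behaviour described above can be sketched as follows. This is a minimal illustration, not the code of download_works.py: fetch_works is a hypothetical stand-in for the real download call, and the column names and cache layout are assumptions.

```python
import csv
import tempfile
from pathlib import Path

FIELDS = ["source", "work"]  # assumed columns, for illustration only


def fetch_works(source_id):
    """Hypothetical stand-in for the real download; returns rows for one source."""
    return [{"source": source_id, "work": f"{source_id}-W{i}"} for i in range(2)]


def fetch_all(source_ids, cache_dir, out_path):
    """Download each source once, cache it as {source}.csv, then merge all files."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    newly_fetched = []
    for source_id in source_ids:
        cache_file = cache_dir / f"{source_id}.csv"
        if cache_file.exists():
            continue  # fetched in an earlier (possibly interrupted) run
        rows = fetch_works(source_id)
        with cache_file.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=FIELDS)
            writer.writeheader()
            writer.writerows(rows)
        newly_fetched.append(source_id)

    # Merge every cached per-source file into the single output file.
    with out_path.open("w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=FIELDS)
        writer.writeheader()
        for source_id in source_ids:
            with (cache_dir / f"{source_id}.csv").open(newline="") as f:
                writer.writerows(csv.DictReader(f))
    return newly_fetched


tmp = Path(tempfile.mkdtemp())
first_run = fetch_all(["S1", "S2"], tmp / "cache", tmp / "works.csv")
second_run = fetch_all(["S1", "S2", "S3"], tmp / "cache", tmp / "works.csv")
```

On the second call only S3 is downloaded, because S1 and S2 are already present in the cache folder. A more robust variant would write each cache file under a temporary name and rename it once complete, so that an interrupted write is not mistaken for a finished download.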
Filter the data to improve its quality:
$ python materials_concepts/dataset/filtering/filter_works.py --source data/table/materials-science.works.csv --out data/table/materials-science.filtered.works.csv --njobs 8 --min-abstract-length 250 --max-abstract-length 3000 --topic "Materials science"
This will output a file materials-science.filtered.works.csv in data/table/ containing all works that satisfied the conditions.
In the end, the data folder should be structured like this. File names can be varied if the corresponding CLI arguments of the individual scripts are adapted.
data
├── graph
│ └── edges.M.pkl
├── model
│ ├── baseline
│ │ ├── features.2016.binary.M.pkl.gz
│ │ ├── features.2019.binary.M.pkl.gz
│ │ ├── features.2022.binary.M.pkl.gz
│ │ └── model.pt
│ ├── combi
│ │ └── model.pt
│ ├── pure_embs
│ │ ├── features.concept-embs.2016.M.pkl.gz
│ │ ├── features.concept-embs.2019.M.pkl.gz
│ │ ├── features.concept-embs.2022.M.pkl.gz
│ │ └── model.pt
│ ├── test.data.M.pkl
│ └── train_val.M.pkl
└── table
  ├── lookup
  │ └── lookup.M.2.csv
  ├── materials-science.llama2.works.csv
  └── materials-science.sources.csv
Clean the abstracts:
$ python materials_concepts/dataset/preparation/clean_abstracts.py materials-science.filtered.works.csv --folder data/
This will output a file materials-science.cleaned.works.csv in the specified folder containing all works with cleaned abstracts.
Note: As these operations are very time consuming, the scripts make use of parallelization.
Extract 'chemical elements' from abstracts:
$ python materials_concepts/dataset/preparation/extract_elements.py materials-science.cleaned.works.csv --folder data/
This will output a file materials-science.elements.works.csv in the specified folder containing all works, with the extracted chemical elements in a separate column elements.
Extract 'concepts' from abstracts using several methods (RAKE, keyBERT, OpenAlex concept list, Searching 'keywords' in abstracts):
$ python materials_concepts/dataset/preparation/extract_concepts.py materials-science.elements.works.csv {method} {colname} --folder data/
e.g.:
$ python materials_concepts/dataset/preparation/extract_concepts.py materials-science.elements.works.csv rake rake_concepts --folder data/
This will output a file materials-science.rake.works.csv in the specified folder containing all works, with the concepts extracted by rake ({method}) in a separate column rake_concepts ({colname}).
The concepts used are generated by an LLM (LLaMa-2-13B) that is fine-tuned for this downstream task.
The processed file materials-science.llama2.works.csv, containing all the works, can be downloaded from figshare.
To see how the concepts are generated, check out this repository.
If you were to replicate the process, you would have to copy the materials-science.elements.works.csv file to the concept extraction repository. After extraction there, you would have to copy the resulting materials-science.llamaX.works.csv file back to this repository.
Build concepts graph by executing the following command:
python materials_concepts/graph/build.py \
--input_path data/table/materials-science.llama2.works.csv \
--output_path data/graph/edges.M.pkl \
--output_lookup_path data/table/lookup/lookup.M.csv \
--colname llama_concepts \
--min_occurence 3 \
--min_words 2 \
--max_words 20 \
--min_occurence_elements 3 \
--min_amount_elements 2
# for the small test set, this will yield:
# nodes: 4,078
# edges: 142,040
This produces a pickled file (data/graph/edges.M.pkl in the command above) containing the graph:
{
"num_of_vertices": 123456,
"edges": [(v1, v2, timestamp), (v1, v2, timestamp), ...],
}
Because of the sparse nature of the graph, it is stored as an edge list. The timestamp is the number of days passed since 1 January 1970.
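As a small illustration of the day-based timestamps, the sketch below converts between dates and day counts and selects the edges known up to a given year. The helper names are hypothetical and the toy graph is made up; treating "until a year" as "on or before 31 December of that year" is an assumption, not something confirmed by the build script.

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def days_to_date(days):
    """Convert a day count since 1970-01-01 back to a calendar date."""
    return EPOCH + timedelta(days=days)


def date_to_days(d):
    """Convert a calendar date to the number of days since 1970-01-01."""
    return (d - EPOCH).days


# Toy graph in the same layout as the pickled file.
graph = {
    "num_of_vertices": 3,
    "edges": [
        (0, 1, date_to_days(date(2015, 6, 1))),
        (1, 2, date_to_days(date(2020, 1, 15))),
    ],
}


def edges_until(graph, year):
    """Return the vertex pairs whose edge appeared on or before 31 Dec of `year`."""
    cutoff = date_to_days(date(year, 12, 31))
    return [(v1, v2) for v1, v2, timestamp in graph["edges"] if timestamp <= cutoff]


known_2016 = edges_until(graph, 2016)
```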
Note: If you want to use rake concepts, you first have to extract the rake concepts and then replace llama_concepts with rake_concepts in the command above.
Note: The concepts are run against a filter mechanism to remove concepts which are not relevant for the domain. The filters are stored in the same file and can be extended or modified as needed.
Generate training and test data for classification task: Given {n} vertex pairs, decide whether they will be connected or not in {delta} years.
# for the small dataset, use --edges_used_train 20_000 and --edges_used_test 2_000
python materials_concepts/model/create_data.py \
--graph_path data/graph/edges.M.pkl \
--data_path data/model/data.pkl \
--year_start_train 2016 \
--year_start_test 2019 \
--year_delta 3 \
--edges_used_train 5_000_000 \
--edges_used_test 2_000_000 \
--train_val_split 0.8 \
--min_links 1 \
--max_v_degree=None \
--verbose=True
Output:
{
"year_train": 2016,
"year_test": 2019,
"year_delta": 3,
"min_links": 1,
"max_v_degree": None,
"X_train": [(v1, v2), ...] unconnected vertex pairs until 2016, (80%)
"y_train": [0, 1, 1, 0, ...] indicating whether the vertex pairs will be connected in 2019 (2016 + 3) (80%)
"X_val": (20%) unconnected vertex pairs until 2016,
"y_val": (20%) whether the vertex pairs will be connected in 2019,
"X_test": [(v1, v2), ...] unconnected vertex pairs until 2019,
"y_test": whether the vertex pairs will be connected in 2022,
}
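To make the labelling scheme concrete, here is a minimal sketch of how such pairs and labels could be derived from an edge list. It is not the logic of create_data.py: the function name is hypothetical, edges carry a plain year instead of a day count, and all unconnected pairs are enumerated, whereas the real script works with a limited number of pairs (--edges_used_train, --edges_used_test).

```python
from itertools import combinations


def make_pairs_and_labels(edges, num_vertices, year_start, year_delta):
    """Return pairs unconnected until `year_start` and whether they connect
    by `year_start + year_delta`. `edges` holds (v1, v2, year) tuples."""
    known = {frozenset((v1, v2)) for v1, v2, year in edges if year <= year_start}
    future = {
        frozenset((v1, v2))
        for v1, v2, year in edges
        if year <= year_start + year_delta
    }
    pairs, labels = [], []
    for v1, v2 in combinations(range(num_vertices), 2):
        pair = frozenset((v1, v2))
        if pair in known:
            continue  # already connected, so nothing to predict
        pairs.append((v1, v2))
        labels.append(1 if pair in future else 0)
    return pairs, labels


# Toy graph with four vertices and three edges.
edges = [(0, 1, 2015), (1, 2, 2018), (2, 3, 2021)]
X_train, y_train = make_pairs_and_labels(edges, 4, year_start=2016, year_delta=3)
X_test, y_test = make_pairs_and_labels(edges, 4, year_start=2019, year_delta=3)
```

In the toy data, the pair (1, 2) is unconnected in 2016 and connected by 2019, so it becomes a positive training example. By 2019 it is already connected and therefore no longer appears in the test pairs.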
The classification process can typically be divided into two steps:
- Generate embeddings for nodes
- Train a (binary) classifier on the (concatenated) embeddings
- Generate the embeddings
Embeddings for training:
python -u materials_concepts/model/combi/pre_compute.py \
--graph_path data/graph/edges.M.pkl \
--output_path data/model/baseline/features.2016.binary.M.pkl.gz \
--binary True \
--years "[2012, 2013, 2014, 2015, 2016]"
Embeddings for validation:
python -u materials_concepts/model/combi/pre_compute.py \
--graph_path data/graph/edges.M.pkl \
--output_path data/model/baseline/features.2019.binary.M.pkl.gz \
--binary True \
--years "[2015, 2016, 2017, 2018, 2019]"
- Train the model
python -u materials_concepts/model/combi/train.py \
--data_path data/model/data.pkl \
--emb_f_train_path data/model/baseline/features.2016.binary.M.pkl.gz \
--emb_f_test_path data/model/baseline/features.2019.binary.M.pkl.gz \
--emb_c_train_path False \
--emb_c_test_path False \
--lr 0.0005 \
--batch_size 1000 \
--num_epochs 10000 \
--pos_ratio 0.3 \
--layers "[20, 300, 180, 108, 64, 10, 1]" \
--step_size 200 \
--gamma 0.95 \
--dropout 0.1 \
--sliding_window 5 \
--log_interval 200 \
--log_file logs/07_train-baseline.log \
--save_model "data/model/baseline/model.pt"
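The --lr, --step_size and --gamma flags suggest a step-wise learning-rate decay. The sketch below assumes they behave like PyTorch's StepLR scheduler (multiply the learning rate by gamma every step_size epochs); this is an assumption about train.py, not something stated in this README, and stepped_lr is a hypothetical helper.

```python
def stepped_lr(base_lr, gamma, step_size, epoch):
    """Learning rate at `epoch` under a StepLR-style decay (assumed behaviour)."""
    return base_lr * gamma ** (epoch // step_size)


# Values taken from the baseline training command above.
schedule = [
    stepped_lr(base_lr=0.0005, gamma=0.95, step_size=200, epoch=epoch)
    for epoch in range(0, 10000, 200)
]
```

Under this assumption the learning rate stays constant within each block of 200 epochs and shrinks by 5% from one block to the next.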
- Generate the embeddings (see Word Embeddings below)
- Train the model
python -u materials_concepts/model/combi/train.py \
--data_path data/model/data.pkl \
--emb_f_train_path "" \
--emb_f_test_path "" \
--emb_c_train_path data/model/combi/word-embs.2016.M.pkl.gz \
--emb_c_test_path data/model/combi/word-embs.2019.M.pkl.gz \
--lr 0.001 \
--batch_size 1000 \
--num_epochs 15000 \
--pos_ratio 0.3 \
--layers "[1536, 1024, 819, 10, 1]" \
--step_size 200 \
--gamma 0.9 \
--dropout 0.1 \
--sliding_window 5 \
--log_interval 200 \
--log_file logs/07_train-wembs.log \
--save_model "data/model/pure_embs/model.pt"
- Use the concatenation of baseline features and word embeddings as input. See the Word Embeddings chapter for how to generate word embeddings.
- Train the model
python -u materials_concepts/model/combi/train.py \
--data_path data/model/data.M.pkl \
--emb_f_train_path data/model/baseline/features.2016.binary.M.pkl.gz \
--emb_f_test_path data/model/baseline/features.2019.binary.M.pkl.gz \
--emb_c_train_path data/model/combi/word-embs.2016.M.pkl.gz \
--emb_c_test_path data/model/combi/word-embs.2019.M.pkl.gz \
--lr 0.001 \
--batch_size 1000 \
--num_epochs 15000 \
--pos_ratio 0.3 \
--layers "[1556, 1556, 933, 559, 335, 10, 1]" \
--step_size 200 \
--gamma 0.95 \
--dropout 0.1 \
--sliding_window 5 \
--log_interval 200 \
--log_file logs/06_train-combi.log \
--save_model "data/model/combi/model.pt"
No need to train anything, as we just combine a baseline with a pure embeddings model.
To evaluate the models and reproduce the results, take a look at Reproduce.md.
Word embeddings are generated using BERT or a fine-tuned version of BERT, e.g. MatSciBERT.
To extract embeddings for all concepts (all embedded tokens comprising a concept are averaged), run:
python -u materials_concepts/word_embeddings/generate.py \
--concepts_path data/table/materials-science.llama2.works.csv \
--lookup_path data/table/lookup/lookup.M.csv \
--output_path data/embeddings/ \
--log_to_stdout False \
--step_size 50 \
--start 0 \
--end 5000
Currently, if a concept is not exactly contained in the abstract (this can happen because LLMs can apply some "normalization" during extraction), the embedding vector is set to the average of all tokens in the abstract. On GPU4_A100 generating embeddings for 80k abstracts takes about 6h.
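The averaging and the fallback described above can be sketched as follows. This is a simplified illustration with made-up two-dimensional vectors and whitespace-level tokens, not the code of generate.py; the function names are hypothetical, and real BERT tokenization into word pieces is ignored.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def find_span(tokens, concept_tokens):
    """Return (start, end) of the first exact occurrence of the concept, else None."""
    width = len(concept_tokens)
    if width == 0:
        return None
    for start in range(len(tokens) - width + 1):
        if tokens[start:start + width] == concept_tokens:
            return start, start + width
    return None


def concept_embedding(tokens, token_vectors, concept_tokens):
    """Average the concept's token vectors; fall back to the whole abstract."""
    span = find_span(tokens, concept_tokens)
    if span is None:
        # Concept not found verbatim (e.g. normalized by the LLM):
        # use the average over all tokens of the abstract instead.
        return mean_vector(token_vectors)
    start, end = span
    return mean_vector(token_vectors[start:end])


tokens = ["thin", "film", "solar", "cell"]
vectors = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
found = concept_embedding(tokens, vectors, ["solar", "cell"])
fallback = concept_embedding(tokens, vectors, ["photovoltaic"])
```

Here found averages only the vectors of "solar" and "cell", while fallback averages all four token vectors because "photovoltaic" does not occur in the abstract.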
Average the word (concept) embeddings so that they can be used as feature vectors for classification:
python materials_concepts/word_embeddings/average_embs.py \
--concepts_path data/table/materials-science.llama2.works.csv \
--lookup_path data/table/lookup/lookup.M.csv \
--filter_path data/table/lookup/lookup.M.csv \
--embeddings_dir data/embeddings/ \
--output_path data/model/combi/word-embs.2016.M.pkl.gz \
--store_concepts_plain False \
--until_year 2016
Generate distilled version of reports:
python materials_concepts/report/pdf/generation/hack_llm_ready_report.py
Generate the LLM report (prompt engineering + some report sections => LLM APIs) from the "distilled" version of the reports:
export RESEARCHER="...";
python materials_concepts/report/generate_llm_selection.py --txt_path materials_concepts/report/prompt_sec3.txt --tex_path materials_concepts/report/pdf/generation/${RESEARCHER}/distilled/plain_suggestions.tex --output_path materials_concepts/report/pdf/generation/${RESEARCHER}/llm_report_sec3.txt
python materials_concepts/report/generate_llm_selection.py --txt_path materials_concepts/report/prompt_sec5.txt --tex_path materials_concepts/report/pdf/generation/${RESEARCHER}/distilled/exotic_suggestions.tex --output_path materials_concepts/report/pdf/generation/${RESEARCHER}/llm_report_sec5.txt