Merge pull request #21 from lfoppiano/update-material-parser
Add material parser models and extend the material parsing
lfoppiano authored Dec 25, 2023
2 parents de30f6c + 1785375 commit 58e2714
Showing 354 changed files with 1,596 additions and 547 deletions.
37 changes: 32 additions & 5 deletions .github/workflows/ci-build-unstable.yml
@@ -12,17 +12,20 @@ on:
jobs:
build:
runs-on: ubuntu-latest

strategy:
matrix:
python-version: [ 3.8, 3.9 ]
steps:
- uses: actions/checkout@v2
- name: Set up Python 3.10
uses: actions/setup-python@v2
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
with:
python-version: "3.10"
python-version: ${{ matrix.python-version }}
cache: "pip"
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install flake8 pytest
pip install flake8 pytest pycodestyle
if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
python -m spacy download en_core_web_sm
- name: Lint with flake8
@@ -36,6 +39,7 @@ jobs:
pytest
build-docker:
needs: [build]
runs-on: self-hosted

steps:
@@ -44,3 +48,26 @@ jobs:
run: docker build . --file Dockerfile --tag lfoppiano/grobid-superconductors-tools:latest
- name: Cleanup older than 24h images and containers
run: docker system prune --filter "until=24h" --force


build-docker-public:
needs: [build-docker]
runs-on: ubuntu-latest

steps:
- name: Create more disk space
run: sudo rm -rf /usr/share/dotnet && sudo rm -rf /opt/ghc && sudo rm -rf "/usr/local/share/boost" && sudo rm -rf "$AGENT_TOOLSDIRECTORY"
- uses: actions/checkout@v2
- name: Build and push
id: docker_build
uses: mr-smithers-excellent/docker-build-push@v5
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_TOKEN }}
image: lfoppiano/grobid-superconductors-tools
registry: docker.io
# pushImage: ${{ github.event_name != 'pull_request' }}
pushImage: true
tags: latest-develop
- name: Image digest
run: echo ${{ steps.docker_build.outputs.digest }}
82 changes: 82 additions & 0 deletions .github/workflows/ci-release.yml
@@ -0,0 +1,82 @@
name: Build release

on:
workflow_dispatch:
push:
tags:
- 'v*'

concurrency:
group: docker
cancel-in-progress: true

jobs:
build:
runs-on: ubuntu-latest
steps:
- name: Cleanup more disk space
run: sudo rm -rf /usr/share/dotnet && sudo rm -rf /opt/ghc && sudo rm -rf "/usr/local/share/boost" && sudo rm -rf "$AGENT_TOOLSDIRECTORY"
- uses: actions/checkout@v2
- name: Set up Python 3.9
uses: actions/setup-python@v4
with:
python-version: 3.9
cache: 'pip'
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install --upgrade flake8 pytest pycodestyle
pip install --only-binary :all: scikit-learn==1.0.2
if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
python -m spacy download en_core_web_sm
- name: Lint with flake8
run: |
# stop the build if there are Python syntax errors or undefined names
flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
# exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
# - name: Test with pytest
# run: |
# pytest

- name: Build and Publish to PyPI
uses: conchylicultor/pypi-build-publish@v1
with:
pypi-token: ${{ secrets.PYPI_API_TOKEN }}


docker-build:
needs: [build]
runs-on: ubuntu-latest

steps:
- name: Set tags
id: set_tags
run: |
DOCKER_IMAGE=lfoppiano/material-parsers
VERSION=""
if [[ $GITHUB_REF == refs/tags/v* ]]; then
VERSION=${GITHUB_REF#refs/tags/v}
fi
if [[ $VERSION =~ ^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$ ]]; then
TAGS="${VERSION}"
else
TAGS="latest"
fi
echo "TAGS=${TAGS}"
echo ::set-output name=tags::${TAGS}
- name: Create more disk space
run: sudo rm -rf /usr/share/dotnet && sudo rm -rf /opt/ghc && sudo rm -rf "/usr/local/share/boost" && sudo rm -rf "$AGENT_TOOLSDIRECTORY"
- uses: actions/checkout@v2
- name: Build and push
id: docker_build
uses: mr-smithers-excellent/docker-build-push@v5
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_TOKEN }}
image: lfoppiano/material-parsers
registry: docker.io
pushImage: ${{ github.event_name != 'pull_request' }}
tags: ${{ steps.set_tags.outputs.tags }}
- name: Image digest
run: echo ${{ steps.docker_build.outputs.digest }}
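The `Set tags` step in `ci-release.yml` derives the Docker tag from the pushed Git ref: a ref such as `refs/tags/v1.2.3` yields `1.2.3`, and anything else falls back to `latest`. A minimal Python sketch of the same decision follows, for readers who want to check which tag a given ref would produce. The function name `docker_tag_for_ref` is hypothetical and is not part of the workflow.

```python
import re

# Same pattern as the workflow's bash test:
# three dot-separated groups of one to three digits.
VERSION_RE = re.compile(r"[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}")


def docker_tag_for_ref(github_ref: str) -> str:
    """Return the version for refs/tags/vX.Y.Z, otherwise "latest"."""
    prefix = "refs/tags/v"
    # Equivalent of ${GITHUB_REF#refs/tags/v}, applied only to tag refs.
    version = github_ref[len(prefix):] if github_ref.startswith(prefix) else ""
    return version if VERSION_RE.fullmatch(version) else "latest"


if __name__ == "__main__":
    for ref in ("refs/tags/v0.3.1", "refs/tags/v1.0.0-rc1", "refs/heads/master"):
        print(ref, "->", docker_tag_for_ref(ref))
```

Note that a pre-release tag such as `v1.0.0-rc1` does not match the pattern, so under this logic it would be published as `latest`.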
4 changes: 3 additions & 1 deletion .gitignore
@@ -16,4 +16,6 @@ build/

glove*

**/test
**/test

src
9 changes: 5 additions & 4 deletions Dockerfile
@@ -1,4 +1,4 @@
FROM python:3.10.13-slim-bullseye
FROM python:3.9-slim-bullseye

ENV LANG C.UTF-8

@@ -19,7 +19,7 @@ WORKDIR /opt/service
COPY requirements.txt .
COPY resources/config.json resources
COPY resources/data /opt/service/resources/data

COPY delft /opt/service/delft

ENV VIRTUAL_ENV=/opt/service/venv
RUN python3 -m venv $VIRTUAL_ENV
@@ -28,16 +28,17 @@ ENV PATH="$VIRTUAL_ENV/bin:$PATH"
RUN python3 -m pip install pip --upgrade
RUN python3 -m pip install -r ./requirements.txt
RUN python3 -m spacy download en_core_web_sm
RUN python3 delft/preload_embeddings.py --registry delft/resources-registry.json

# extract version
COPY .git ./.git
RUN git rev-parse --short HEAD > /opt/service/resources/version.txt
RUN rm -rf ./.git

# Copy code
COPY grobid_superconductors /opt/service/grobid_superconductors
COPY material_parsers /opt/service/material_parsers
#COPY __main__.py /opt/service

EXPOSE 8080

CMD ["python3", "-m", "grobid_superconductors", "--config", "resources/config.json"]
CMD ["python3", "-m", "material_parsers", "--config", "resources/config.json"]
130 changes: 91 additions & 39 deletions README.md
@@ -1,21 +1,22 @@
[![Python CI](https://github.com/lfoppiano/grobid-superconductors-tools/actions/workflows/python-app.yml/badge.svg)](https://github.com/lfoppiano/grobid-superconductors-tools/actions/workflows/python-app.yml)

# Material Parsers (and other tools)

# Grobid-superconductors material name tools
Sister project of [grobid-superconductors](https://github.com/lfoppiano/grobid-superconductors) containing a web service that interfaces with the python libraries (e.g. Spacy).

Sister project of [grobid-superconductors](https://github.com/lfoppiano/grobid-superconductors) containing a webservice that interfaces with the python libraries (e.g. Spacy).
The service provides the following functionalities:

The service provides the following functionalities:
- [Convert material name to formula](#convert-material-name-to-formula) (e.g. Lead -> Pb, Hydrogen -> H): `/convert/name/formula`
- [Decompose formula into structured dict of elements](#decompose-formula-into-structured-dict-of-elements) (e.g. La x Fe 1-x O7-> {La: x, Fe: 1-x, O: 7}): `/convert/formula/composition`
- Classify material in classes (from the superconductors domain) using a rule-base table (e.g. "La Cu Fe" -> Cuprates): `/classify/formula`
- Tc classification (Tc, not-Tc): `/classify/tc` **for information please open an issue**
- Relation extraction given a sentence and two entities: `/process/link` **for information please open an issue**
- [Convert material name to formula](#convert-material-name-to-formula) (e.g. Lead -> Pb, Hydrogen -> H): `/convert/name/formula`
- [Decompose formula into structured dict of elements](#decompose-formula-into-structured-dict-of-elements) (e.g. La x Fe 1-x O7-> {La: x, Fe: 1-x, O: 7}): `/convert/formula/composition`
- Classify material in classes (from the superconductors domain) using a rule-base table (e.g. "La Cu Fe" -> Cuprates): `/classify/formula`
- Tc's classification (Tc, not-Tc): `/classify/tc` **for information please open an issue**
- Relation extraction given a sentence and two entities: `/process/link` **for information please open an issue**
- Material processing using Deep Learning models and rule-based processing `/process/material`

## Usage

The service is deployed on huggingface spaces, and [can be used right away](https://lfoppiano-grobid-superconductors-tools.hf.space/version).
For installing the service in your own environment see below.
The service is deployed on huggingface spaces, and [can be used right away](https://lfoppiano-grobid-superconductors-tools.hf.space/version). For installing the service in your own environment see below.


### Convert material name to formula

@@ -24,41 +25,76 @@ curl --location 'https://lfoppiano-grobid-superconductors-tools.hf.space/convert
--form 'input="Hydrogen"'
```

output:
output:

```
{"composition": {"H": "1"}, "name": "Hydrogen", "formula": "H"}
```

### Decompose formula in a structured dict of elements

Example:
Example:

```
curl --location 'https://lfoppiano-grobid-superconductors-tools.hf.space/convert/formula/composition' \
--form 'input="CaBr2-x"'
```

output:
output:

```
{"composition": {"Ca": "1", "Br": "2-x"}}
```
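
To make the shape of the `/convert/formula/composition` output concrete, here is a minimal Python sketch that reproduces the two examples given in this README. It is an illustration only and not the service's parser: the regex, the `decompose` helper, and the default amount of `"1"` are assumptions, and the sketch ignores parentheses, repeated elements, and unspaced cases such as `Bx`, where a variable directly follows a one-letter symbol.

```python
import re

# An element symbol (capital letter plus optional lowercase letter), optional
# whitespace, then everything up to the next capital letter as its amount.
ELEMENT_RE = re.compile(r"([A-Z][a-z]?)\s*([^A-Z]*)")


def decompose(formula: str) -> dict:
    """Split a simple formula into {element: amount} with amounts kept as strings."""
    composition = {}
    for symbol, amount in ELEMENT_RE.findall(formula):
        # An element with no explicit amount counts once.
        composition[symbol] = amount.strip() or "1"
    return composition


if __name__ == "__main__":
    print(decompose("CaBr2-x"))
    print(decompose("La x Fe 1-x O7"))
```

Amounts stay as strings because they may be symbolic (`2-x`, `x`), which matches the string values in the service's JSON output.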

### Classify materials in classes

Example:
Example:

```
curl --location 'https://lfoppiano-grobid-superconductors-tools.hf.space/classify/formula' \
--form 'input="(Mo 0.96 Zr 0.04 ) 0.85 B x "'
```

output:
output:

```
['Alloys']
```

## Installing in your environment
### Process material
This process includes a combination of everything listed above, after passing the material sequence through a DL model

Example:

```
curl --location 'https://lfoppiano-material-parsers.hf.space/process/material' \
--form 'text="(Mo 0.96 Zr 0.04 ) 0.85 B x "'
```

output:

```json
[
{
"formula": {
"rawValue": "(Mo 0.96 Zr 0.04 ) 0.85 B x"
},
"resolvedFormulas": [
{
"rawValue": "(Mo 0.96 Zr 0.04 ) 0.85 B x",
"formulaComposition": {
"Mo": "0.816",
"Zr": "0.034",
"B": "x"
}
}
]
}
]
```
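
The `formulaComposition` values above are consistent with distributing the outer subscript over the parenthesised group: 0.96 × 0.85 = 0.816 and 0.04 × 0.85 = 0.034. A minimal Python sketch of that arithmetic follows. The `distribute` helper is hypothetical, and how the service actually resolves formulas is not shown in this commit.

```python
from decimal import Decimal


def distribute(group: dict, multiplier: str) -> dict:
    """Multiply each numeric amount in a parenthesised group by the group's multiplier."""
    factor = Decimal(multiplier)
    # Decimal avoids binary floating-point noise; normalize() drops trailing zeros.
    return {
        element: str((Decimal(amount) * factor).normalize())
        for element, amount in group.items()
    }


# (Mo 0.96 Zr 0.04) 0.85 B x
resolved = distribute({"Mo": "0.96", "Zr": "0.04"}, "0.85")
resolved["B"] = "x"  # symbolic amounts are left untouched

if __name__ == "__main__":
    print(resolved)  # {'Mo': '0.816', 'Zr': '0.034', 'B': 'x'}
```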

## Installing in your environment

```
docker run -it lfoppiano/grobid-superconductors-tools:2.1
@@ -67,36 +103,52 @@ docker run -it lfoppiano/grobid-superconductors-tools:2.1
## References

If you use our work, and write about it, please cite [our paper](https://hal.inria.fr/hal-03776658):

```bibtex
@article{doi:10.1080/27660400.2022.2153633,
author = {Luca Foppiano and Pedro Baptista Castro and Pedro Ortiz Suarez and Kensei Terashima and Yoshihiko Takano and Masashi Ishii},
title = {Automatic extraction of materials and properties from superconductors scientific literature},
journal = {Science and Technology of Advanced Materials: Methods},
volume = {3},
number = {1},
pages = {2153633},
year = {2023},
publisher = {Taylor & Francis},
doi = {10.1080/27660400.2022.2153633},
URL = {
https://doi.org/10.1080/27660400.2022.2153633
},
eprint = {
https://doi.org/10.1080/27660400.2022.2153633
}
}
@article{doi:10.1080/27660400.2022.2153633,
author = {Luca Foppiano and Pedro Baptista Castro and Pedro Ortiz Suarez and Kensei Terashima and Yoshihiko Takano and Masashi Ishii},
title = {Automatic extraction of materials and properties from superconductors scientific literature},
journal = {Science and Technology of Advanced Materials: Methods},
volume = {3},
number = {1},
pages = {2153633},
year = {2023},
publisher = {Taylor & Francis},
doi = {10.1080/27660400.2022.2153633},
URL = {
https://doi.org/10.1080/27660400.2022.2153633
},
eprint = {
https://doi.org/10.1080/27660400.2022.2153633
}
}
```

## Overview of the repository

- [Converters](material_parsers/converters) TSV to/from Grobid XML files conversion
- [Linking](material_parsers/linking) module: A rule based python algorithm to link entities
- [Commons libraries](material_parsers/commons): contains common code shared between the various component. The Grobid client was borrowed from [here](https://github.com/kermitt2/grobid-client-python), the tokenizer from [there](https://github.com/kermitt2/delft).

## Overview of the repository
## Developer's notes

> python -m spacy download en_core_web_sm
```shell
conda install -c apple tensorflow-deps
```

- [Converters](grobid_superconductors/converters) TSV to/from Grobid XML files conversion
- [Linking](grobid_superconductors/linking) module: A rule based python algorithm to link entities
- [Commons libraries](grobid_superconductors/commons): contains common code shared between the various component. The Grobid client was borrowed from [here](https://github.com/kermitt2/grobid-client-python), the tokenizer from [there](https://github.com/kermitt2/delft).
```shell
pip install -r requirements.txt
```

```shell
conda install scikit-learn=1.0.1
```

We need to remove tensorflow, h5py, scikit-learn from the delft dependencies in setup.py

## Developer's notes
```shell
pip install -e ../../delft
```

> python -m spacy download en_core_web_sm