The BioChemInsight platform integrates chemical structure recognition with bioactivity analysis to automatically extract relationships between chemical structures and biological activities from the literature. It generates structured data, reducing the cost and complexity of data curation and speeding up the construction of high-quality datasets. The resulting data can be used to train machine learning and deep learning models and to support drug optimization and target screening research.
- Efficient Identification and Parsing: Automatically extracts compound structures and biological activity data from PDF documents.
- Deep Learning and OCR Technologies: Utilizes DECIMER Segmentation models and PaddleOCR for image and text data processing.
- Data Transformation: Converts unstructured data into structured formats for further analysis and application.
BioChemInsight combines image recognition and text extraction techniques to streamline the data extraction process. It includes two main modules:
- Chemical Structure Recognition: Detects and extracts compound structure images from PDFs using image segmentation models, converts the detected structures into SMILES format, and associates them with compound identifiers.
- Bioactivity Data Extraction: Extracts bioactivity data such as IC50, EC50, Ki, and other experimental results from PDFs using OCR, then uses large language models to improve the consistency and accuracy of the extracted data.
- PDF Segmentation and Image Conversion: Splits PDF documents into pages and converts them into image formats for processing.
- Chemical Structure Detection and Conversion: Locates compound images using DECIMER Segmentation and parses structures into SMILES format with MolScribe or MolVec.
- Compound Identifier Recognition: Recognizes compound numbers using the MiniCPM-V-2.6 model for robust detection and pairing.
- Bioactivity Data Extraction and Parsing: Extracts bioactivity results using PaddleOCR and refines data with large language models.
- Data Integration: Merges all extracted chemical and bioactivity data into structured formats such as CSV or Excel for downstream analysis.
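As a rough illustration of the bioactivity parsing step, the sketch below pulls activity values out of OCR-style text with a regular expression and normalizes them to nanomolar. It is not BioChemInsight's implementation, which relies on PaddleOCR and large language models; the function name `parse_activities`, the pattern, and the unit table are assumptions made for illustration only.

```python
import re

# Conversion factors to nanomolar for the units this sketch understands.
UNIT_TO_NM = {"pm": 1e-3, "nm": 1.0, "um": 1e3, "µm": 1e3, "μm": 1e3, "mm": 1e6}

# Matches e.g. "IC50 = 12.5 nM", "Ki: 2 uM", "EC50 > 10 µM".
ACTIVITY_RE = re.compile(
    r"\b(?P<type>IC50|EC50|Ki)\s*[:=]?\s*(?P<rel>[<>]?)\s*"
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>[pnuµμm]M)"
)


def parse_activities(text):
    """Return one dict per activity value found in the text."""
    results = []
    for m in ACTIVITY_RE.finditer(text):
        factor = UNIT_TO_NM[m.group("unit").lower()]
        results.append({
            "type": m.group("type"),
            "relation": m.group("rel") or "=",  # "<" or ">" for censored values
            "value_nM": float(m.group("value")) * factor,
        })
    return results


for hit in parse_activities("Compound 3: IC50 = 12.5 nM; Ki: 2 uM; EC50 > 10 uM"):
    print(hit)
```

A regular expression like this only covers cleanly formatted values; tables with merged cells, footnotes, or OCR noise are the cases where a language-model refinement step becomes useful.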
To set up BioChemInsight, follow the steps below:
git clone https://github.com/dahuilangda/BioChemInsight
cd BioChemInsight
The project uses a configuration file named constants.py for environment-specific variables. A template file constants_example.py is provided in the repository. To configure your environment:
- Rename constants_example.py to constants.py:
    mv constants_example.py constants.py
- Open constants.py and update the values for your environment:
    GEMINI_MODEL_NAME = 'gemini-1.5-flash'  # GEMINI model name
    GEMINI_API_KEY = 'sk-xxxx'  # API key for the Gemini model
    SECONDARY_MODEL_NAME = 'qwen'  # Secondary model name
    SECONDARY_MODEL_URL = 'http://xxxx:8000/v1'  # URL for the secondary model
    SECONDARY_MODEL_KEY = 'sk-xxxx'  # API key for the secondary model
    VISUAL_MODEL_URL = 'http://xxxx:8000/v1'  # URL for the visual model
    VISUAL_MODEL_KEY = 'sk-xxxx'  # API key for the visual model
    HTTP_PROXY = ''  # HTTP proxy (if needed)
    HTTPS_PROXY = ''  # HTTPS proxy (if needed)
    MOLVEC = '/path/to/BioChemInsight/bin/molvec-0.9.9-SNAPSHOT-jar-with-dependencies.jar'  # Path to MolVec JAR
- Save the changes.
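Before the first run it can help to confirm that no template placeholder is left in the configuration. The sketch below is a hypothetical helper (`find_placeholders` is not part of BioChemInsight) that scans a mapping of setting names to values for the `xxxx` marker used in constants_example.py. It assumes that empty proxy strings are intentional and that every placeholder contains that marker.

```python
def find_placeholders(settings, marker="xxxx"):
    """Return the sorted names of settings whose value still contains marker."""
    return sorted(
        name
        for name, value in settings.items()
        if isinstance(value, str) and marker in value
    )


# Example values mirroring the template. In practice the mapping could be
# obtained with `import constants; settings = vars(constants)` when run from
# the repository root.
settings = {
    "GEMINI_MODEL_NAME": "gemini-1.5-flash",
    "GEMINI_API_KEY": "sk-xxxx",
    "VISUAL_MODEL_URL": "http://xxxx:8000/v1",
    "HTTP_PROXY": "",
}
print(find_placeholders(settings))  # ['GEMINI_API_KEY', 'VISUAL_MODEL_URL']
```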
conda install -c conda-forge mamba
mamba create -n chem_ocr python=3.10
conda activate chem_ocr
mamba install -c conda-forge -c nvidia cuda-tools==11.8
mamba install pytorch==2.0.0 torchvision==0.15.0 torchaudio==2.0.0 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install decimer-segmentation molscribe -i https://pypi.tuna.tsinghua.edu.cn/simple
mamba install -c conda-forge jupyter pytesseract transformers
pip install paddleocr paddlepaddle-gpu PyMuPDF PyPDF2 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install openai -i https://pypi.tuna.tsinghua.edu.cn/simple
Run the BioChemInsight pipeline to extract both chemical structures and bioactivity data:
python pipeline.py data/sample.pdf \
--structure-start-page 242 \
--structure-end-page 267 \
--assay-start-page 270 \
--assay-end-page 272 \
--assay-name "FRET EC50" \
--output output
- Extract only chemical structures:
    python pipeline.py data/sample.pdf \
    --structure-start-page 242 \
    --structure-end-page 267 \
    --output output
- Extract multiple assays from different ranges:
    python pipeline.py data/sample.pdf \
    --assay-start-page 30 270 \
    --assay-end-page 40 272 \
    --assay-names "IC50,Ki" \
    --output output
- Merge structures and assays into a single file: After extracting structures and assays, the platform automatically merges the data into a consolidated CSV file.
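The multi-assay invocation above appears to pair each start page with the end page and assay name at the same position. The sketch below shows one way such arguments could be parsed with `argparse`. It is an assumption about the behavior, not the actual parser in pipeline.py, and `build_parser` and `pair_assays` are hypothetical names.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Sketch of multi-assay page ranges")
    parser.add_argument("pdf")
    # nargs="+" lets one flag carry several page numbers: --assay-start-page 30 270
    parser.add_argument("--assay-start-page", type=int, nargs="+", default=[])
    parser.add_argument("--assay-end-page", type=int, nargs="+", default=[])
    parser.add_argument("--assay-names", default="")
    parser.add_argument("--output", default="output")
    return parser


def pair_assays(args):
    """Zip names, start pages, and end pages into (name, start, end) tuples."""
    names = [n.strip() for n in args.assay_names.split(",") if n.strip()]
    starts, ends = args.assay_start_page, args.assay_end_page
    if not (len(names) == len(starts) == len(ends)):
        raise ValueError("need one name, start page, and end page per assay")
    for name, start, end in zip(names, starts, ends):
        if start > end:
            raise ValueError(f"{name}: start page {start} is after end page {end}")
    return list(zip(names, starts, ends))


args = build_parser().parse_args(
    ["data/sample.pdf", "--assay-start-page", "30", "270",
     "--assay-end-page", "40", "272", "--assay-names", "IC50,Ki"]
)
print(pair_assays(args))  # [('IC50', 30, 40), ('Ki', 270, 272)]
```

Validating that the three lists have equal length catches the most likely mistake, a page range supplied without a matching assay name.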
The platform generates structured data files, including:
- structures.csv: Contains detected compound IDs and their SMILES representations.
- assay_data.json: Stores extracted bioactivity data for each assay.
- merged.csv: Combines chemical structures with bioactivity data into a single file.
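To make the relationship between the three files concrete, the sketch below joins structure rows with assay values on a compound ID. The column names (`COMPOUND_ID`, `SMILES`) and the JSON shape (`{assay_name: {compound_id: value}}`) are assumptions for illustration; the actual schemas written by BioChemInsight may differ, and `merge_records` is a hypothetical helper.

```python
import csv
import io
import json


def merge_records(structures_csv, assay_json):
    """Join structure rows with assay values on the compound ID column."""
    assays = json.loads(assay_json)  # assumed shape: {assay_name: {compound_id: value}}
    rows = []
    for row in csv.DictReader(io.StringIO(structures_csv)):
        merged = dict(row)
        for assay_name, values in assays.items():
            # Compounds without a measurement get an empty cell.
            merged[assay_name] = values.get(row["COMPOUND_ID"], "")
        rows.append(merged)
    return rows


# In-memory stand-ins for structures.csv and assay_data.json.
structures_csv = "COMPOUND_ID,SMILES\n1,CCO\n2,c1ccccc1\n"
assay_json = json.dumps({"IC50": {"1": "12.5 nM"}})

rows = merge_records(structures_csv, assay_json)
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()), lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

Keeping compounds that lack a measurement, rather than dropping them, preserves the full structure list so gaps in the assay extraction remain visible in the merged file.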
- AI/ML Model Training: Supplies high-quality training datasets for machine learning and deep learning tasks in cheminformatics and bioinformatics.
- Drug Discovery: Supports drug optimization and target screening by providing precise structure-activity relationships.
- Literature Mining: Automates the extraction of key data from scientific articles, reducing manual labor.