It is generally shown, and validated in this study, that attention mechanisms in protein Language Models (pLMs) encode evolutionary structural information, facilitating accurate prediction of protein tertiary structure contact maps. Here, we explore whether this concept extends to nucleic acid Language Models (naLMs) for predicting RNA secondary and tertiary structure contact maps. However, our investigation reveals that current language models do not effectively capture structural information. The presented code contains all the trials we performed.
To repeat analysis install and run the Python scripts inside the conda environment:
conda env create -f attentionpred.yml
Usage instructions and installation steps for Miniconda / Anaconda refer to the official web page: https://docs.anaconda.com/free/miniconda/index.html
Note: Make sure you change the input / output directories correspondingly!
- All molecules can be sourced from PDB (https://www.rcsb.org/): the ones we used are contained in trRosetta.json
- Execute multicontacts.py to generate contact maps (browse inside to change input/target directory and threshold)
- Execute del_diag.py to delete nearby contacts (browse inside to change input/target directory and range)
- Execute p_calculation.py (browse inside to change input/target directory and threshold)
- Execute dataload.py for feature extraction
- Execute training.py and predict.py for retraining and evaluation
- Download molecules from PDB: dataset contained in redundants.json (contains dataset after redundancy reduction)
- Execute multicontacts_1mer.py or multicontacts_3mer.py to generate contact maps (browse inside to change target directory and cutoff, select script based on LM)
- Execute del_diag.py to delete nearby contacts (browse inside to change cutoff)
- Execute p_calculation.py (browse inside to change input/target directory and threshold)
- Execute 1mer_dataload.py or 3mer_dataload.py for feature extraction
- Execute training.py and predict.py for retraining and evaluation
- Download molecules from PDB: dataset contained in allfordnabert.json or allforrnabert.json (contains dataset after redundancy reduction)
- Execute p_calculation.py (browse inside to change input/target directory and threshold)
- Execute dataload.py for feature extraction
- Execute training.py and predict.py for retraining and evaluation