Transcription-Factor-Binding-Domain

Project Overview

This project is a collection of Python scripts designed to perform a multi-stage analysis on a hierarchically organized dataset of transcription factor protein sequences. The pipeline automates the process of parsing raw data, extracting functionally relevant protein domains, analyzing their biophysical properties, and summarizing their sequence composition.

Dataset Description/Information

This dataset is hierarchically organized as superclass → class → family → subfamily.

For example, the file 1/1.1/1.1.1/1.1.1.1.txt will be referred to as belonging to superclass 1, class 1, family 1, and subfamily 1.

Each subfamily contains one or more .txt files, and each .txt file can contain multiple transcription factors. Each transcription factor is a long chain of amino acids.

Note: In the original data, ANCHOR refers to the last column in the text file and is used as the DBD Flag.

  • DBD Region - Refers to the region where the protein binds to the DNA, which in turn activates/enables gene expression. In this dataset, this is identified by the ANCHOR column being "Yes".
  • Non-DBD Region - The remaining region of the protein that does not bind to DNA.

To separate the transcription factors within a single .txt file, we monitor the POS_IU column (the 5th column). When the residue number on the current line is less than that on the previous line, it signifies a reset, and we treat it as the end of one transcription factor and the beginning of the next. This process continues until the end of the file.
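The reset-detection logic can be sketched as follows (a minimal illustration; the function name and the assumption that POS_IU parses cleanly as an integer in the 5th column are ours, not taken from the repository's code):

```python
def split_transcription_factors(rows, pos_index=4):
    """Group parsed rows into transcription factors.

    A drop in the POS_IU residue number (column index `pos_index`,
    0-based) marks the end of one factor and the start of the next.
    """
    factors, current, prev_pos = [], [], None
    for row in rows:
        pos = int(row[pos_index])
        if prev_pos is not None and pos < prev_pos:
            # Reset detected: close the current factor and start a new one.
            factors.append(current)
            current = []
        current.append(row)
        prev_pos = pos
    if current:
        factors.append(current)
    return factors
```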


Script Descriptions

DBD-Non-DBD-Split.py

  • Purpose: This is the foundational preprocessing script. It reads the entire raw dataset and separates every transcription factor into two distinct parts: its DNA-Binding Domain (DBD) and its non-DBD region.

  • Input: The raw NR_HI_IU directory.

  • Process:

    1. Recursively scans every .txt file in the NR_HI_IU directory.
    2. Identifies the boundaries of individual transcription factors by detecting resets in the POS_IU column.
    3. For each transcription factor, it inspects the ANCHOR column (the 8th column).
    4. It writes all rows where ANCHOR is "Yes" to a new file in the DBD-Region directory.
    5. It concatenates all other rows (where ANCHOR is "No") and writes them to a corresponding file in the Non-DBD-Region directory.
    6. The output files are simplified to contain only the four IUPred-related columns (POS_IU, RES_IU, IU, ANCHOR).
  • Output: Creates two directories:

    • DBD-Region: Contains .txt files, each holding the isolated DBD region of a single transcription factor.
    • Non-DBD-Region: Contains .txt files, each holding the isolated non-DBD region of a single transcription factor.
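The ANCHOR-based partitioning in steps 3-6 can be sketched like this (a hedged illustration; the function name and the fixed 8-column layout with ANCHOR last are assumptions, not the script's actual code):

```python
def split_dbd_rows(rows, anchor_index=7, keep_last=4):
    """Partition one factor's rows into DBD (ANCHOR == "Yes") and
    non-DBD rows, keeping only the last `keep_last` columns
    (POS_IU, RES_IU, IU, ANCHOR)."""
    dbd, non_dbd = [], []
    for row in rows:
        trimmed = row[-keep_last:]
        if row[anchor_index] == "Yes":
            dbd.append(trimmed)
        else:
            non_dbd.append(trimmed)
    return dbd, non_dbd
```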

DBD-Splitting-Code.py

  • Purpose: A specialized version of the splitting script that extracts only the ANCHOR/DBD regions into a clean, organized directory structure, sorted by family. This is useful for analyses focused exclusively on DBDs.

  • Input: The raw NR_HI_IU directory.

  • Process:

    1. Identical to DBD-Non-DBD-Split.py but only performs the DBD extraction logic.
    2. It filters for rows where the ANCHOR column is "Yes".
    3. It selects and reformats only the last four columns: POS_IU, RES_IU, IU, and ANCHOR.
  • Output: Creates a new directory (e.g., DBD_Split or ANCHOR_regions_by_family) containing the extracted ANCHOR region files. These new files are sorted into subdirectories named after the family they belong to (e.g., 1.1.1, 1.1.2, etc.).


DBD-Disorder-Code.py

  • Purpose: To analyze the extracted ANCHOR/DBD regions and filter them based on their level of intrinsic disorder.

  • Input: A directory of split DBD regions (e.g., DBD_Split).

  • Process:

    1. Prompts the user to enter a disorder percentage threshold (e.g., 80).
    2. Iterates through every file in the input directory.
    3. For each file, it calculates the disorder percentage using the formula: Disorder % = (Number of amino acids with IUPred Score > 0.5 / Total number of amino acids) * 100
    4. It compares this calculated percentage to the user's input threshold.
  • Output: Creates a single .csv file named dynamically (e.g., DBD_disorder_above_80.csv). This file contains two columns: filename and disorder_percentage, listing only the files that met or exceeded the specified disorder threshold.
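The disorder-percentage formula in step 3 can be expressed directly (a minimal sketch; the function name is ours, and we assume the IUPred scores have already been parsed to floats):

```python
def disorder_percentage(iu_scores, threshold=0.5):
    """Disorder % = (residues with IUPred score > threshold / total) * 100."""
    if not iu_scores:
        return 0.0
    disordered = sum(1 for score in iu_scores if score > threshold)
    return disordered / len(iu_scores) * 100
```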


DBD-Non-DBD-Window-Code.py

  • Purpose: To perform a comprehensive compositional analysis on the cleaned and separated DBD and non-DBD regions.

  • Input: The DBD-Region and Non-DBD-Region directories generated by DBD-Non-DBD-Split.py.

  • Process:

    1. Runs two main jobs: one for the DBD directory, and one for the non-DBD directory.
    2. For each job, it initiates a loop that iterates through window sizes from 3 to 11.
    3. It uses a sliding window of the current size (e.g., 3 for triplets) to count the occurrences of every unique amino acid pattern within each region file.
  • Output: Creates two large master directories:

    • DBD-region-Window-Output: Structured by window size (e.g., 3/, 4/), containing the analysis for all DBD regions.
    • Non-DBD-Window-Output: Similarly structured, containing the analysis for all non-DBD regions.
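The sliding-window counting in step 3 reduces to a few lines (a sketch under the assumption that each region has been joined into a single amino-acid string; the function name is illustrative):

```python
from collections import Counter

def window_counts(sequence, size):
    """Count every contiguous amino-acid pattern of the given window size."""
    return Counter(sequence[i:i + size]
                   for i in range(len(sequence) - size + 1))
```

For window size 3, a sequence of length N yields N - 2 overlapping triplets.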

Occurence-CSV-generator.py

  • Purpose: To aggregate the detailed window analysis results into a structured, comparable matrix format for each superclass and region type.

  • Input: The DBD-region-Window-Output/3/ and Non-DBD-Window-Output/3/ directories.

  • Process:

    1. Runs multiple jobs, one for each superclass and region combination (e.g., Superclass 1 DBD, Superclass 1 non-DBD, etc.).
    2. Pass 1 (Header Discovery): For each job, it scans all relevant files to find every unique triplet that occurs 3 or more times. This set of triplets forms the header.
    3. Pass 2 (Data Population): It re-scans the files and populates a matrix where rows are transcription factors and columns are the frequent triplets. The cells contain the actual occurrence count for that triplet in that factor (or 0 if it's absent or has a count < 3).
  • Output: Generates four (or more) summary CSV files, such as superclass_1_DBD_summary.csv, superclass_1_nonDBD_summary.csv, etc.
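The two-pass header-discovery and matrix-population steps can be sketched as follows (the function name and the in-memory dict-of-Counters input are assumptions; the real script works over files on disk):

```python
def build_triplet_matrix(counts_by_factor, min_count=3):
    """Pass 1: collect every triplet that occurs >= min_count times in
    any factor (the header). Pass 2: fill the matrix, writing the actual
    count, or 0 when the triplet is absent or below the cutoff."""
    header = sorted({t for counts in counts_by_factor.values()
                     for t, c in counts.items() if c >= min_count})
    matrix = {}
    for factor, counts in counts_by_factor.items():
        matrix[factor] = [counts.get(t, 0) if counts.get(t, 0) >= min_count
                          else 0
                          for t in header]
    return header, matrix
```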


Amino-Acid-Distribution.py

  • Purpose: To perform a comparative biophysical analysis by visualizing the propensity of each amino acid to be in an ordered versus a disordered state.

  • Input: The DBD-Region and Non-DBD-Region directories.

  • Process:

    1. Runs two main jobs: one for DBDs and one for non-DBDs.
    2. For each job, it iterates through a list of superclasses.
    3. Within each superclass, it counts the total number of times each amino acid appears in an ordered state (IUPred < 0.5) and a disordered state (IUPred >= 0.5).
    4. It calculates a Disorder-to-Order Ratio for each amino acid.
    5. Generates a bar chart visualizing this ratio.
  • Output: Creates a directory (amino_acid_disorder_ratios) containing two subdirectories (DBD_ratios, nonDBD_ratios), which hold the .png bar chart images for each superclass.
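The ratio in step 4 can be sketched as below (the function name and the aa → (ordered, disordered) count mapping are our assumptions about the intermediate data):

```python
def disorder_order_ratio(counts):
    """Per amino acid, the ratio of disordered (IUPred >= 0.5) to
    ordered (IUPred < 0.5) occurrences.

    `counts` maps each amino acid to an (ordered, disordered) pair;
    an amino acid never seen ordered gets an infinite ratio.
    """
    return {aa: (disordered / ordered if ordered else float("inf"))
            for aa, (ordered, disordered) in counts.items()}
```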


Disorder-by-Order-Normalized.py

  • Purpose: To perform a more sophisticated biophysical analysis by calculating a normalized score that indicates an amino acid's preference for ordered or disordered states, relative to its overall abundance.

  • Input: The DBD-Region and Non-DBD-Region directories.

  • Process:

    1. Runs two main jobs, one for DBDs and one for non-DBDs.
    2. For each job and for each superclass, it aggregates four key values: the individual count of each amino acid in an ordered state (Oi), the individual count in a disordered state (Di), the total count of all ordered residues (Otot), and the total count of all disordered residues (Dtot).
    3. It then calculates the Normalized Disorder Preference Score for each amino acid using the formula: Score = (Di/Dtot - Oi/Otot) / (Di/Dtot + Oi/Otot)
    4. A score of +1 indicates a complete preference for disordered regions, -1 indicates a complete preference for ordered regions, and 0 indicates no preference.
  • Output: Creates a directory (amino_acid_normalized_disorder) containing subdirectories (DBD_normalized_scores, nonDBD_normalized_scores), which hold the .png bar chart images of these scores for each superclass.
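The normalized score from step 3 can be computed per amino acid as follows (a sketch; the function name and the zero-denominator guards are ours):

```python
def normalized_disorder_score(d_i, d_tot, o_i, o_tot):
    """Score = (Di/Dtot - Oi/Otot) / (Di/Dtot + Oi/Otot), in [-1, +1].

    +1 means the amino acid appears only in disordered regions,
    -1 only in ordered regions, and 0 means no preference.
    """
    d_frac = d_i / d_tot if d_tot else 0.0
    o_frac = o_i / o_tot if o_tot else 0.0
    denom = d_frac + o_frac
    return (d_frac - o_frac) / denom if denom else 0.0
```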


Excel-to-fasta-merged.py & convert-to-fasta.py

  • Purpose: These scripts handle the integration of an external dataset (provided as Human-TFs-PDB.xls and ExtraIDs.fasta) with the existing data.

  • Input: Human-TFs-PDB.xls and ExtraIDs.fasta.

  • Process:

    1. The first script (Excel-to-fasta-merged.py, also referred to as merge_sequences.py) intelligently parses the Excel and FASTA files. It matches IDs from the "ExtraIDs" sheet to their corresponding sequences in the FASTA file, resolving ID formatting inconsistencies (e.g., matching 7QOD with 7QOD_1).
    2. It appends these new sequences to the primary "All-Human" data sheet, creating a new, consolidated Human-TFs-PDB_MERGED.xlsx file.
    3. The second script (convert-to-fasta.py, or merged_excel_to_fasta.py) reads this new merged Excel file and converts it into a single, master FASTA file (all_sequences.fasta).
  • Output: The final all_sequences.fasta file, which contains every protein sequence from the original and supplementary datasets, ready for homology analysis.
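The final Excel-to-FASTA conversion step amounts to rendering (ID, sequence) pairs as FASTA records, which could look like this (a sketch; the function name and 60-character line wrapping are conventional choices, not confirmed details of the script):

```python
def to_fasta(records):
    """Render (identifier, sequence) pairs as FASTA text, wrapping
    sequences at 60 characters per line."""
    lines = []
    for seq_id, seq in records:
        lines.append(f">{seq_id}")
        lines.extend(seq[i:i + 60] for i in range(0, len(seq), 60))
    return "\n".join(lines) + "\n"
```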


less-than-25-similarity.py

  • Purpose: To identify and list all pairs of transcription factors from the master dataset that are highly dissimilar (i.e., share less than 25% sequence identity). This script works in conjunction with the command-line tool NCBI BLAST+.

  • Input: The master all_sequences.fasta file generated previously.

  • Process:

    1. BLAST Analysis (Manual Step): An all-vs-all blastp search is first performed on all_sequences.fasta to generate a comprehensive similar_pairs.tsv file containing all significant alignments.
    2. Filtering: The Python script (less-than-25-similarity.py, also referred to as filter_blast_results.py) reads this raw similar_pairs.tsv file.
    3. It inspects the percent identity (column 3) for every alignment reported by BLAST.
    4. It keeps only the pairs where the percent identity is explicitly less than 25%.
  • Output: A single CSV file (dissimilar_pairs_lt25_with_scores.csv) containing three columns: Sequence_1, Sequence_2, and Percent_Identity, providing a verifiable list of all highly divergent protein pairs.
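The filtering in steps 2-4 can be sketched over parsed BLAST tabular rows (outfmt 6 puts query, subject, and percent identity in the first three columns; the function name, the in-memory input, and the self-hit exclusion are our assumptions):

```python
def filter_dissimilar_pairs(blast_rows, max_identity=25.0):
    """From BLAST tabular (outfmt 6) rows, keep non-self pairs whose
    percent identity (3rd column) is strictly below the cutoff."""
    kept = []
    for fields in blast_rows:
        query, subject, pident = fields[0], fields[1], float(fields[2])
        if query != subject and pident < max_identity:
            kept.append((query, subject, pident))
    return kept
```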
