This project is a collection of Python scripts designed to perform a multi-stage analysis on a hierarchically organized dataset of transcription factor protein sequences. The pipeline automates the process of parsing raw data, extracting functionally relevant protein domains, analyzing their biophysical properties, and summarizing their sequence composition.
The dataset is organized hierarchically as superclass-class-family-subfamily.
For example, the file 1/1.1/1.1.1/1.1.1.1.txt is referred to as belonging to superclass 1, class 1, family 1, and subfamily 1.
Each subfamily contains one or more .txt files, and within each .txt file there can exist multiple transcription factors. Each transcription factor is a long chain of amino acids.
Note: In the original data, ANCHOR refers to the last column in the text file and is used as the DBD Flag.
- DBD Region - Refers to the region where the protein binds to the DNA, which in turn activates/enables gene expression. In this dataset, this is identified by the `ANCHOR` column being `"Yes"`.
- Non-DBD Region - The remaining region of the protein that does not bind to DNA.
To separate the transcription factors from each other within a single .txt file, we check the `POS_IU` column (the 5th column) line by line. When the residue number on the current line is less than the residue number on the previous line, it signifies a reset, and we treat it as the end of one transcription factor and the beginning of a new one. This process continues until the end of the file.
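The reset rule described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `split_transcription_factors` is hypothetical, and it assumes whitespace-delimited columns with `POS_IU` at 0-based index 4 (the 5th column).

```python
def split_transcription_factors(lines, pos_col=4):
    """Group data rows into one block per transcription factor.

    A new block starts whenever the POS_IU value drops below the
    value seen on the previous data row (assumed layout, see above).
    """
    factors, current, prev_pos = [], [], None
    for line in lines:
        fields = line.split()
        # Skip blank lines and header rows whose POS_IU field is not an integer.
        if len(fields) <= pos_col or not fields[pos_col].isdigit():
            continue
        pos = int(fields[pos_col])
        if prev_pos is not None and pos < prev_pos:
            # Residue numbering reset: close the current factor.
            factors.append(current)
            current = []
        current.append(fields)
        prev_pos = pos
    if current:
        factors.append(current)
    return factors
```

Each returned block is a list of split rows, so later steps can index columns directly.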
- Purpose: This is the foundational preprocessing script. It reads the entire raw dataset and separates every transcription factor into two distinct parts: its DNA-Binding Domain (DBD) and its non-DBD region.
- Input: The raw `NR_HI_IU` directory.
- Process:
  - Recursively scans every `.txt` file in the `NR_HI_IU` directory.
  - Identifies the boundaries of individual transcription factors by detecting resets in the `POS_IU` column.
  - For each transcription factor, it inspects the `ANCHOR` column (the 8th column).
  - It writes all rows where `ANCHOR` is `"Yes"` to a new file in the `DBD-Region` directory.
  - It concatenates all other rows (where `ANCHOR` is `"No"`) and writes them to a corresponding file in the `Non-DBD-Region` directory.
  - The output files are simplified to contain only the four IUPred-related columns (`POS_IU`, `RES_IU`, `IU`, `ANCHOR`).
- Output: Creates two directories:
  - `DBD-Region`: Contains `.txt` files, each holding the isolated DBD region of a single transcription factor.
  - `Non-DBD-Region`: Contains `.txt` files, each holding the isolated non-DBD region of a single transcription factor.
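As a rough sketch of the row-level split described above, the snippet below partitions the rows of one transcription factor by the `ANCHOR` value and keeps only the four IUPred-related columns. The function name `split_by_anchor` and the 0-based column indices (4 to 7) are assumptions made for illustration.

```python
def split_by_anchor(rows, anchor_col=7, keep_cols=(4, 5, 6, 7)):
    """Partition rows into DBD ("Yes") and non-DBD rows.

    Each row is a list of column values; only POS_IU, RES_IU, IU and
    ANCHOR (assumed to be the last four of eight columns) are kept.
    """
    dbd, non_dbd = [], []
    for fields in rows:
        trimmed = [fields[i] for i in keep_cols]
        if fields[anchor_col] == "Yes":
            dbd.append(trimmed)
        else:
            non_dbd.append(trimmed)
    return dbd, non_dbd
```

Writing the two lists to `DBD-Region` and `Non-DBD-Region` is then a matter of joining each trimmed row with a delimiter of your choice.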
- Purpose: A specialized version of the splitting script that extracts only the ANCHOR/DBD regions into a clean, organized directory structure, sorted by family. This is useful for analyses focused exclusively on DBDs.
- Input: The raw `NR_HI_IU` directory.
- Process:
  - Identical to `DBD-Non-DBD-Split.py` but only performs the DBD extraction logic.
  - It filters for rows where the `ANCHOR` column is `"Yes"`.
  - It selects and reformats only the last four columns: `POS_IU`, `RES_IU`, `IU`, and `ANCHOR`.
- Output: Creates a new directory (e.g., `DBD_Split` or `ANCHOR_regions_by_family`) containing the extracted ANCHOR region files. These new files are sorted into subdirectories named after the family they belong to (e.g., `1.1.1`, `1.1.2`, etc.).
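One possible way to derive the family subdirectory is to read it from the file's position in the hierarchy. The sketch below assumes paths relative to the dataset root in the form `superclass/class/family/...`, as in the `1/1.1/1.1.1/1.1.1.1.txt` example; the helper name `family_of` is hypothetical, and the actual script may determine the family differently.

```python
from pathlib import PurePosixPath


def family_of(relative_path):
    """Return the family label (third path component) of a dataset file.

    Assumes a path relative to the dataset root, laid out as
    superclass/class/family/<file or subfamily>.
    """
    parts = PurePosixPath(relative_path).parts
    if len(parts) < 3:
        raise ValueError(f"path is too shallow to contain a family: {relative_path}")
    return parts[2]
```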
- Purpose: To analyze the extracted ANCHOR/DBD regions and filter them based on their level of intrinsic disorder.
- Input: A directory of split DBD regions (e.g., `DBD_Split`).
- Process:
  - Prompts the user to enter a disorder percentage threshold (e.g., `80`).
  - Iterates through every file in the input directory.
  - For each file, it calculates the disorder percentage using the formula: `Disorder % = (Number of amino acids with IUPred Score > 0.5 / Total number of amino acids) * 100`
  - It compares this calculated percentage to the user's input threshold.
- Output: Creates a single `.csv` file named dynamically (e.g., `DBD_disorder_above_80.csv`). This file contains two columns, `filename` and `disorder_percentage`, listing only the files that met or exceeded the specified disorder threshold.
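The disorder calculation and threshold filter above could look roughly like this. The function names and the in-memory `scores_by_file` mapping (filename to a list of IUPred scores) are assumptions for illustration; reading the files and writing the CSV are left out.

```python
def disorder_percentage(scores, cutoff=0.5):
    """Percentage of residues whose IUPred score is strictly above the cutoff."""
    if not scores:
        return 0.0
    disordered = sum(1 for score in scores if score > cutoff)
    return 100.0 * disordered / len(scores)


def files_meeting_threshold(scores_by_file, threshold):
    """Return (filename, disorder_percentage) pairs at or above the threshold."""
    selected = []
    for filename, scores in scores_by_file.items():
        percentage = disorder_percentage(scores)
        if percentage >= threshold:
            selected.append((filename, percentage))
    return selected
```

Note that a score of exactly 0.5 does not count as disordered here, following the `> 0.5` condition in the formula.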
- Purpose: To perform a comprehensive compositional analysis on the cleaned and separated DBD and non-DBD regions.
- Input: The `DBD-Region` and `Non-DBD-Region` directories generated by `DBD-Non-DBD-Split.py`.
- Process:
  - Runs two main jobs: one for the DBD directory, and one for the non-DBD directory.
  - For each job, it initiates a loop that iterates through window sizes from 3 to 11.
  - It uses a sliding window of the current size (e.g., 3 for triplets) to count the occurrences of every unique amino acid pattern within each region file.
- Output: Creates two large master directories:
  - `DBD-region-Window-Output`: Structured by window size (e.g., `3/`, `4/`), containing the analysis for all DBD regions.
  - `Non-DBD-Window-Output`: Similarly structured, containing the analysis for all non-DBD regions.
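A minimal sketch of the sliding-window count is shown below. It assumes the region has already been joined into a single string of one-letter residue codes, that windows overlap with a step of 1, and that the range 3 to 11 is inclusive; the function names are hypothetical.

```python
from collections import Counter


def window_counts(sequence, size):
    """Count every overlapping substring of the given length in the sequence."""
    return Counter(sequence[i:i + size] for i in range(len(sequence) - size + 1))


def counts_for_all_sizes(sequence, sizes=range(3, 12)):
    """Run the sliding-window count for each window size (3 to 11 by default)."""
    return {size: window_counts(sequence, size) for size in sizes}
```

A sequence shorter than the window size simply yields an empty counter for that size.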
- Purpose: To aggregate the detailed window analysis results into a structured, comparable matrix format for each superclass and region type.
- Input: The `DBD-region-Window-Output/3/` and `Non-DBD-Window-Output/3/` directories.
- Process:
  - Runs multiple jobs, one for each superclass and region combination (e.g., Superclass 1 DBD, Superclass 1 non-DBD, etc.).
  - Pass 1 (Header Discovery): For each job, it scans all relevant files to find every unique triplet that occurs 3 or more times. This set of triplets forms the header.
  - Pass 2 (Data Population): It re-scans the files and populates a matrix where rows are transcription factors and columns are the frequent triplets. The cells contain the actual occurrence count for that triplet in that factor (or 0 if it's absent or has a count < 3).
- Output: Generates four (or more) summary CSV files, such as `superclass_1_DBD_summary.csv`, `superclass_1_nonDBD_summary.csv`, etc.
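The two-pass aggregation can be sketched as below. This assumes the triplet counts are already loaded as a mapping from factor name to `{triplet: count}`, and that the "3 or more times" rule is applied per factor, which is one reading of the description; `build_triplet_matrix` and the `factor` column label are hypothetical names.

```python
def build_triplet_matrix(counts_by_factor, min_count=3):
    """Build a (header, rows) matrix of frequent triplet counts.

    Pass 1 collects every triplet that reaches min_count in at least one
    factor; pass 2 fills one row per factor, writing 0 for triplets that
    are absent or below min_count in that factor.
    """
    # Pass 1: header discovery.
    frequent = sorted(
        {
            triplet
            for counts in counts_by_factor.values()
            for triplet, count in counts.items()
            if count >= min_count
        }
    )
    # Pass 2: data population.
    rows = []
    for factor, counts in counts_by_factor.items():
        values = []
        for triplet in frequent:
            count = counts.get(triplet, 0)
            values.append(count if count >= min_count else 0)
        rows.append([factor] + values)
    return ["factor"] + frequent, rows
```

The header and rows can be passed directly to `csv.writer` to produce one summary file per job.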
- Purpose: To perform a comparative biophysical analysis by visualizing the propensity of each amino acid to be in an ordered versus a disordered state.
- Input: The `DBD-Region` and `Non-DBD-Region` directories.
- Process:
  - Runs two main jobs: one for DBDs and one for non-DBDs.
  - For each job, it iterates through a list of superclasses.
  - Within each superclass, it counts the total number of times each amino acid appears in an ordered state (IUPred < 0.5) and a disordered state (IUPred >= 0.5).
  - It calculates a Disorder-to-Order Ratio for each amino acid.
  - Generates a bar chart visualizing this ratio.
- Output: Creates a directory (`amino_acid_disorder_ratios`) containing two subdirectories (`DBD_ratios`, `nonDBD_ratios`), which hold the `.png` bar chart images for each superclass.
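The counting and ratio step could be sketched as follows, leaving out the plotting. It assumes the ratio is the disordered count divided by the ordered count, and it returns `None` for an amino acid that never appears in an ordered state, which is a choice made here rather than something the description specifies. The function name is hypothetical.

```python
from collections import Counter


def disorder_order_ratios(residues, cutoff=0.5):
    """Compute disordered/ordered count ratios per amino acid.

    residues is an iterable of (amino_acid, iupred_score) pairs.
    Scores >= cutoff count as disordered, scores below it as ordered.
    """
    ordered, disordered = Counter(), Counter()
    for amino_acid, score in residues:
        if score >= cutoff:
            disordered[amino_acid] += 1
        else:
            ordered[amino_acid] += 1
    ratios = {}
    for amino_acid in set(ordered) | set(disordered):
        if ordered[amino_acid]:
            ratios[amino_acid] = disordered[amino_acid] / ordered[amino_acid]
        else:
            ratios[amino_acid] = None  # undefined: no ordered occurrences
    return ratios
```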
- Purpose: To perform a more sophisticated biophysical analysis by calculating a normalized score that indicates an amino acid's preference for ordered or disordered states, relative to its overall abundance.
- Input: The `DBD-Region` and `Non-DBD-Region` directories.
- Process:
  - Runs two main jobs, one for DBDs and one for non-DBDs.
  - For each job and for each superclass, it aggregates four key values: the individual count of each amino acid in an ordered state (`Oi`), the individual count in a disordered state (`Di`), the total count of all ordered residues (`Otot`), and the total count of all disordered residues (`Dtot`).
  - It then calculates the Normalized Disorder Preference Score for each amino acid using the formula: `Score = (Di/Dtot - Oi/Otot) / (Di/Dtot + Oi/Otot)`
  - A score of +1 indicates a complete preference for disordered regions, -1 indicates a complete preference for ordered regions, and 0 indicates no preference.
- Output: Creates a directory (`amino_acid_normalized_disorder`) containing subdirectories (`DBD_normalized_scores`, `nonDBD_normalized_scores`), which hold the `.png` bar chart images of these scores for each superclass.
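The score formula above translates into a few lines of Python. In this sketch, `ordered` and `disordered` are assumed to be dictionaries of per-amino-acid counts (`Oi` and `Di`), and the guards against empty totals are an addition for robustness; the function name is hypothetical.

```python
def normalized_disorder_scores(ordered, disordered):
    """Return (Di/Dtot - Oi/Otot) / (Di/Dtot + Oi/Otot) for each amino acid."""
    o_tot = sum(ordered.values())
    d_tot = sum(disordered.values())
    scores = {}
    for amino_acid in set(ordered) | set(disordered):
        d_frac = disordered.get(amino_acid, 0) / d_tot if d_tot else 0.0
        o_frac = ordered.get(amino_acid, 0) / o_tot if o_tot else 0.0
        denominator = d_frac + o_frac
        # With no occurrences at all there is no preference to report.
        scores[amino_acid] = (d_frac - o_frac) / denominator if denominator else 0.0
    return scores
```

An amino acid seen only in disordered positions scores +1, and one seen only in ordered positions scores -1, matching the interpretation given above.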
- Purpose: These scripts handle the integration of an external dataset (provided as `Human-TFs-PDB.xls` and `ExtraIDs.fasta`) with the existing data.
- Input: `Human-TFs-PDB.xls` and `ExtraIDs.fasta`.
- Process:
  - The first script (`Excel-to-fasta-merged.py`, also referred to as `merge_sequences.py`) intelligently parses the Excel and FASTA files. It matches IDs from the "ExtraIDs" sheet to their corresponding sequences in the FASTA file, resolving ID formatting inconsistencies (e.g., matching `7QOD` with `7QOD_1`).
  - It appends these new sequences to the primary "All-Human" data sheet, creating a new, consolidated `Human-TFs-PDB_MERGED.xlsx` file.
  - The second script (`convert-to-fasta.py`, or `merged_excel_to_fasta.py`) reads this new merged Excel file and converts it into a single, master FASTA file (`all_sequences.fasta`).
- Output: The final `all_sequences.fasta` file, which contains every protein sequence from the original and supplementary datasets, ready for homology analysis.
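One way the ID matching could work is to compare IDs by the part before the first underscore, so that `7QOD` finds `7QOD_1`. The sketch below covers only that matching step on FASTA text; the Excel handling is omitted, all function names are hypothetical, and taking the first matching record per base ID is an assumption.

```python
def parse_fasta(text):
    """Parse FASTA text into {record_id: sequence}, using the first header token as ID."""
    records, header, chunks = {}, None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header = (line[1:].split() or [""])[0]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


def base_id(identifier):
    """Normalize an ID by dropping any '_suffix' and upper-casing it."""
    return identifier.split("_")[0].upper()


def match_ids(wanted_ids, records):
    """Map each wanted ID to the first FASTA sequence sharing its base ID."""
    by_base = {}
    for record_id, sequence in records.items():
        by_base.setdefault(base_id(record_id), sequence)
    return {i: by_base[base_id(i)] for i in wanted_ids if base_id(i) in by_base}
```

IDs with no counterpart in the FASTA file are left out of the result, so they can be reported separately.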
- Purpose: To identify and list all pairs of transcription factors from the master dataset that are highly dissimilar (i.e., share less than 25% sequence identity). This script works in conjunction with the command-line tool NCBI BLAST+.
- Input: The master `all_sequences.fasta` file generated previously.
- Process:
  - BLAST Analysis (Manual Step): An all-vs-all `blastp` search is first performed on `all_sequences.fasta` to generate a comprehensive `similar_pairs.tsv` file containing all significant alignments.
  - Filtering: The Python script (`less-than-25-similarity.py`, also referred to as `filter_blast_results.py`) reads this raw `similar_pairs.tsv` file.
  - It inspects the percent identity (column 3) for every alignment reported by BLAST.
  - It keeps only the pairs where the percent identity is explicitly less than 25%.
- Output: A single CSV file (`dissimilar_pairs_lt25_with_scores.csv`) containing three columns: `Sequence_1`, `Sequence_2`, and `Percent_Identity`, providing a verifiable list of all highly divergent protein pairs.
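The filtering step can be sketched as below. It assumes `similar_pairs.tsv` is in BLAST's tab-separated tabular format, where the first three columns are the query ID, the subject ID, and the percent identity; the function names are hypothetical and the example works on in-memory text rather than files.

```python
import csv
import io


def filter_dissimilar(tsv_text, max_identity=25.0):
    """Return (query, subject, identity) for alignments strictly below max_identity."""
    pairs = []
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        # Skip blank, malformed, or comment lines.
        if len(row) < 3 or row[0].startswith("#"):
            continue
        identity = float(row[2])
        if identity < max_identity:
            pairs.append((row[0], row[1], identity))
    return pairs


def pairs_to_csv(pairs):
    """Render the filtered pairs as CSV text with the three output columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Sequence_1", "Sequence_2", "Percent_Identity"])
    writer.writerows(pairs)
    return buffer.getvalue()
```

To work with files, read `similar_pairs.tsv` into a string first and write the returned CSV text to the output path.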