split_fastqcats

A tool for processing and splitting concatenated FastQ reads based on primer sequences, particularly designed for long-read RNA sequencing data.

Overview

split_fastqcats is designed to handle FastQ reads with the following structure:

TSO---cDNA---polyA---UMI---revRTprimer

Example structure:

AAGCAGTGGTATCAACGCAGAGTGAAT---cDNA---polyA--N---UMI---GTACTCTGCGTTGATACCACTGCTT

The tool performs the following:

Identifies and validates primer sequences using Smith-Waterman alignment
Handles potential sequencing errors with configurable mismatch tolerance
Processes polyA tails and UMIs
Splits concatenated reads into individual segments
Provides detailed statistics about the processing

Installation

Using pip

pip install split_fastqcats

From source

git clone https://github.com/cribbslab/split_fastqcats.git
cd split_fastqcats
pip install -e .

Using conda/mamba

mamba env create -f conda/environments/split_fastqcats.yml
conda activate fastcats
pip install -e .

Usage

Command Line Interface

Basic usage:

split-fastqcats input.fastq.gz \
    -fp AAGCAGTGGTATCAACGCAGAGT \
    -rp ACTCTGCGTTGATACCACTGCTT \
    processed.fastq.gz \
    lowqual.fastq.gz \
    bin.fastq.gz \
    stats.csv

Arguments:

input.fastq.gz: Input FastQ file (gzipped)
-fp, --forward_primer: Forward primer sequence (default: AAGCAGTGGTATCAACGCAGAGT)
-rp, --reverse_primer: Reverse primer sequence (default: ACTCTGCGTTGATACCACTGCTT)
-i, --indexes: Optional index sequences as 'index:start_seq,end_seq'
processed.fastq.gz: Output file for correctly processed reads
lowqual.fastq.gz: Output file for low quality reads
bin.fastq.gz: Output file for binned reads
stats.csv: Output statistics file

Python API

from split_fastqcats import FastqSplitter

# Initialize the splitter
splitter = FastqSplitter(
    forward_primer="AAGCAGTGGTATCAACGCAGAGT",
    reverse_primer="ACTCTGCGTTGATACCACTGCTT",
    index_dict={'1': ['AAATTTGGGCCC', 'GGGCCCAAATTT']}
)

# Process reads
splitter.split_reads(
    input_file="input.fastq.gz",
    processed_output="processed.fastq.gz",
    lowqual_output="lowqual.fastq.gz",
    bin_output="bin.fastq.gz",
    stats_output="stats.csv"
)

Output Files

processed.fastq.gz: Contains reads that:
- Have valid primer pairs
- Meet quality thresholds
- Have correct index sequences (if specified)
lowqual.fastq.gz: Contains reads that:
- Have only one valid primer
- Have mismatched indexes
- Meet basic quality requirements but fail stricter criteria
bin.fastq.gz: Contains reads that:
- Lack valid primers
- Have too many primer matches (>10)
- Fail basic quality requirements
stats.csv: Contains processing statistics:
- Total sequences processed
- Number of processed reads
- Number of low-quality reads
- Number of binned reads

Quality Control Parameters

The tool implements several QC measures:

Smith-Waterman alignment for primer detection
Maximum 5 mismatches allowed in primer sequences
Minimum 75% match score required for primer identification
Index validation with 83% minimum match score
Position validation for index sequences

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use this tool in your research, please cite:

@software{split_fastqcats2024,
  author = {Cribbs, Adam},
  title = {split_fastqcats: A tool for processing concatenated FastQ reads},
  year = {2024},
  publisher = {GitHub},
  url = {https://github.com/cribbslab/split_fastqcats}
}

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
.github		.github
.vscode		.vscode
conda		conda
src		src
tests		tests
.DS_Store		.DS_Store
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

split_fastqcats

Overview

Installation

Using pip

From source

Using conda/mamba

Usage

Command Line Interface

Python API

Output Files

Quality Control Parameters

Contributing

License

Citation

About

Releases

Packages

Contributors 2

Languages

License

cribbslab/split_fastqcats

Folders and files

Latest commit

History

Repository files navigation

split_fastqcats

Overview

Installation

Using pip

From source

Using conda/mamba

Usage

Command Line Interface

Python API

Output Files

Quality Control Parameters

Contributing

License

Citation

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages