BaSeRPro - Batch Sequence Retrieval and Processing for GenBank

Application Preview

Features

  • Portable Application (Windows / Mac / Linux)
  • Retrieves GenBank sequences using the following arguments (comma-separated OR in multiple fields):
    • Multiple species / organism (at least one argument must be supplied)
    • Strain / Isolate / Title / Wildcard (Any)
    • Gene / Product / Locus tag (at least one argument must be supplied)
    • Sequence length / range
    • Country
    • Title of record (according to GenBank .GBFF format)
    • Three RefSeq modes: All sequences / RefSeq only / RefSeq priority (extracts RefSeq sequences first; if the target number of records is not reached, retrieval continues with non-RefSeq sequences)
    • Four extraction modes:
      • Single gene
      • Gene range (extracts the segment of sequence from one gene to another [requires two gene/product/locus tag arguments])
      • Gene order (extracts the segment of sequence when a specific order of genes is met [requires two or more gene/product/locus tag arguments])
      • Entire sequence
  • Progress bar for fetching GenBank sequences
  • Automatically retries sequence retrieval to reach the desired number of target sequences if a duplicate accession number is found (e.g. NZ_ vs. non-NZ_)
  • Automatically generates .xlsx file containing annotations/summary of extracted GenBank records
  • Automatically aligns sequences using FAMSA alignment
  • Automatically trims gaps in sequence using gap threshold (0-1)
    • Example: If the gap threshold is 0.7, at least 70% of sequences must contain a gap at a position for EACH species for that position to be trimmed
    • Default: 0.95
  • Automatically generates consensus sequence based on trimmed alignment using consensus threshold (0-1)
    • Example: If the consensus threshold is 0.2, at least 20% of sequences must contain the same base at a position for that base to be incorporated into the consensus sequence
    • Default: 0.1
    • Full support for ambiguous bases (both input and output)
    • Automatic detection of DNA, RNA and amino acids
    • Base frequencies of degenerate bases are shown in .xlsx summary file
  • Automatically generates conservation plot based on consensus sequence (only positions with degenerate bases from consensus sequence are shown)
  • If the API key is invalid, automatically retries the search using only the provided email
  • Advanced features
    • Expand extracted sequence range by number of base pairs upstream/downstream (used for checking gene boundaries)
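
The gap threshold and consensus threshold described above can be illustrated with a minimal Python sketch. This is not BaSeRPro's actual code: the function names `trim_gappy_columns` and `consensus` are hypothetical, and the sketch assumes upper-case DNA input without ambiguous bases, uses the total number of sequences as the denominator for base frequencies, and emits `N` where no base reaches the threshold.

```python
from collections import Counter

GAP = "-"

# IUPAC nucleotide codes keyed by the set of bases they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def trim_gappy_columns(alignment_by_species, gap_threshold=0.95):
    """Drop columns whose gap fraction is >= gap_threshold in EVERY species.

    alignment_by_species maps a species name to a list of aligned,
    equal-length sequences (hypothetical input layout).
    """
    groups = list(alignment_by_species.values())
    length = len(groups[0][0])
    keep = []
    for col in range(length):
        gappy_in_every_species = all(
            sum(seq[col] == GAP for seq in seqs) / len(seqs) >= gap_threshold
            for seqs in groups
        )
        if not gappy_in_every_species:
            keep.append(col)
    return {
        species: ["".join(seq[c] for c in keep) for seq in seqs]
        for species, seqs in alignment_by_species.items()
    }


def consensus(sequences, consensus_threshold=0.1):
    """Build a consensus; bases at or above the threshold are merged
    into a single IUPAC code per position."""
    total = len(sequences)
    result = []
    for column in zip(*sequences):
        counts = Counter(base for base in column if base != GAP)
        kept = frozenset(
            base for base, n in counts.items()
            if n / total >= consensus_threshold
        )
        # Assumption: positions with no qualifying base become "N".
        result.append(IUPAC.get(kept, "N"))
    return "".join(result)


aligned = {
    "species_a": ["AC-GT", "AC-GT", "ACTGT"],
    "species_b": ["AC-GA", "AC-GA"],
}
trimmed = trim_gappy_columns(aligned, gap_threshold=0.6)
all_seqs = [seq for seqs in trimmed.values() for seq in seqs]
print(trimmed)
print(consensus(all_seqs, consensus_threshold=0.2))
```

In this toy alignment the third column is a gap in two of three `species_a` sequences and in both `species_b` sequences, so it is removed at a gap threshold of 0.6 but kept at 0.7. A lower consensus threshold admits more minority bases, which is why low values produce more degenerate positions in the consensus.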