PyFLUte

Description:

pyFLUte is a Python package for handling messy or incomplete influenza genome datasets. It is primarily built as a wrapper for seqkit (doi:10.1371/journal.pone.0163962).

Dependencies:

seqkit
Biopython
Python ≥3.0

The issue pyFLUte addresses:

Acquiring sequences through in-house influenza genome sequencing pipelines or genome databases can yield 'uneven' or 'incomplete' influenza genome datasets, which makes data preparation tedious. pyFLUte remedies this by providing an integrable CLI tool that separates genome segments into per-segment files for downstream analysis.

Warning: .fasta header formatting requirements

GISAID: Isolate name|Isolate ID | Segment OR Isolate name|Isolate ID | Segment number

NCBI Influenza Virus Database : >{strain}_{segment}

Current segment header match cases supported

*_1
*|1
*_PB2
*|PB2
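The match cases above can be illustrated with a short sketch. This is not pyFLUte's actual implementation: the regular expression and the `segment_from_header` helper are hypothetical, and the number-to-name mapping is taken from the sorted file names shown in the example report (1_PB2.fasta through 8_NS.fasta).

```python
import re

# Segment numbers and names as they appear in the sorted output files
# (1_PB2.fasta ... 8_NS.fasta).
SEGMENTS = {1: "PB2", 2: "PB1", 3: "PA", 4: "HA", 5: "NP", 6: "NA", 7: "M", 8: "NS"}
NAME_TO_NUMBER = {name: number for number, name in SEGMENTS.items()}

# Hypothetical pattern: a '_' or '|' delimiter followed by a segment number
# or segment name at the very end of the header.
SEGMENT_SUFFIX = re.compile(
    r"[_|]\s*(PB2|PB1|PA|HA|NP|NA|NS|M|[1-8])\s*$", re.IGNORECASE
)


def segment_from_header(header):
    """Return (number, name) for a FASTA header, or None if no case matches."""
    match = SEGMENT_SUFFIX.search(header.lstrip(">").strip())
    if match is None:
        return None
    token = match.group(1).upper()
    number = int(token) if token.isdigit() else NAME_TO_NUMBER[token]
    return number, SEGMENTS[number]


print(segment_from_header(">A/Puerto Rico/8/1934_1"))    # (1, 'PB2')
print(segment_from_header(">A/Puerto Rico/8/1934|PB2"))  # (1, 'PB2')
print(segment_from_header(">A/Puerto Rico/8/1934_HA"))   # (4, 'HA')
```

Anchoring the pattern at the end of the header keeps underscores or pipes earlier in the isolate name or ID from being mistaken for the segment field.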

Usage

usage: pyflute.py [-h] [-i INPUT] [-o OUTPUT] [-r]

  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Input file for the FASTA file
  -o OUTPUT, --output OUTPUT
                        Output directory for the extracted sequences
  -r, --stats           Generates a report for each sorted segment file
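For readers who want to script around the tool, the sketch below shows how a parser with these flags could be declared using Python's standard `argparse` module. It is a reconstruction from the help text above, not the contents of pyflute.py; `build_parser` and the file names passed to it are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser that accepts the flags listed in the help text."""
    parser = argparse.ArgumentParser(prog="pyflute.py")
    parser.add_argument("-i", "--input", help="Input file for the FASTA file")
    parser.add_argument(
        "-o", "--output", help="Output directory for the extracted sequences"
    )
    # -r is a switch: present means True, absent means False.
    parser.add_argument(
        "-r",
        "--stats",
        action="store_true",
        help="Generates a report for each sorted segment file",
    )
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["-i", "genomes.fasta", "-o", "sorted", "-r"])
print(args.input, args.output, args.stats)
```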

Examples using the example data folder

Note: Example influenza genome data were acquired through the NCBI Influenza Virus Database. Accession numbers can be found in /example_input/NCBI_example_accessions.txt

Basic use case for sorting a single .fasta file containing incomplete segments:

python3 pyflute.py -i ./input_directory -o ./output_directory

Sort sequences and print a CLI report of the resulting per-segment files:

python3 pyflute.py -i ./input_directory -o ./output_directory -r

Output:

file	format	type	num_seqs	sum_len	min_len	avg_len	max_len
1_PB2.fasta	FASTA	DNA	77	179,064	1,942	2,325.5	2,341
2_PB1.fasta	FASTA	DNA	80	188,838	2,181	2,360.5	2,396
3_PA.fasta	FASTA	DNA	78	176,706	1,984	2,265.5	2,305
4_HA.fasta	FASTA	DNA	71	130,000	1,638	1,831	1,847
5_NP.fasta	FASTA	DNA	67	120,647	1,612	1,800.7	1,844
6_NA.fasta	FASTA	DNA	75	113,874	1,401	1,518.3	1,557
7_M.fasta	FASTA	DNA	59	67,602	1,046	1,145.8	1,189
8_NS.fasta	FASTA	DNA	55	58,410	1,024	1,062	1,068
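If the report needs to be consumed by another script, it can be parsed into numeric values. The following is a minimal sketch, assuming the report has been captured as text with whitespace-separated columns and comma thousands separators as shown above; `parse_stats_report` is a hypothetical helper, not part of pyFLUte.

```python
def parse_stats_report(text):
    """Parse the report into a list of dicts with numeric length fields."""
    lines = [line for line in text.splitlines() if line.strip()]
    columns = lines[0].split()
    rows = []
    for line in lines[1:]:
        row = dict(zip(columns, line.split()))
        # Strip thousands separators before converting to numbers.
        for key in ("num_seqs", "sum_len", "min_len", "max_len"):
            row[key] = int(row[key].replace(",", ""))
        row["avg_len"] = float(row["avg_len"].replace(",", ""))
        rows.append(row)
    return rows


report = (
    "file\tformat\ttype\tnum_seqs\tsum_len\tmin_len\tavg_len\tmax_len\n"
    "1_PB2.fasta\tFASTA\tDNA\t77\t179,064\t1,942\t2,325.5\t2,341\n"
    "8_NS.fasta\tFASTA\tDNA\t55\t58,410\t1,024\t1,062\t1,068\n"
)
for row in parse_stats_report(report):
    print(row["file"], row["num_seqs"], row["avg_len"])
```

Comparing `num_seqs` across segment files is a quick way to see how uneven a dataset is: in the example report, 77 PB2 sequences were recovered but only 55 NS sequences.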

Glossary:

'Incomplete Genome' OR 'Genome Completeness': In this documentation, an 'incomplete genome' refers to a genome that is not fully recovered by our RT-PCR to sequencing pipeline, NOT to defective or defective interfering viral genomes (DVGs), which have been observed to play a role in viral pathogenesis (Defective viral genomes are key drivers of the virus–host interaction, Nature Microbiology).
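One way to make this definition concrete is to check which of the eight segments were recovered for each isolate. The sketch below is illustrative only and assumes headers have already been resolved to (strain, segment number) pairs; `missing_segments` and the strain names are hypothetical.

```python
from collections import defaultdict

# Eight segments, numbered 1 (PB2) through 8 (NS) as in the sorted output files.
ALL_SEGMENTS = set(range(1, 9))


def missing_segments(records):
    """Map each strain to the sorted list of segment numbers it lacks."""
    seen = defaultdict(set)
    for strain, segment in records:
        seen[strain].add(segment)
    return {strain: sorted(ALL_SEGMENTS - found) for strain, found in seen.items()}


records = [("A/example/1/2020", n) for n in range(1, 9)]
records += [("A/example/2/2020", n) for n in (1, 2, 3, 4, 6)]

for strain, missing in missing_segments(records).items():
    status = "complete" if not missing else f"missing segments {missing}"
    print(strain, status)
```

Under this reading, a genome counts as complete only when all eight segment numbers are present for the same strain.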

Future Additions

Interactive report summarizing complete and incomplete genome segments.
