title | permalink | layout | nav_order | parent |
---|---|---|---|---|
Tutorial: File formats |
/tutorials/formats |
default |
1 |
Tutorials |
{: .no_toc }
{: .no_toc .text-delta}
- TOC {:toc}
FASTA files are the de-facto standard for storing and sharing DNA/RNA/amino acid
sequences. It is a text-file format with a header line followed by one or more lines of sequence.
The header line starts with a '>' character, which may be used for parsing the file. The remainder
of the header line consists of a name, optionally followed by a comment. The name and comment are
separated by whitespace. The name must not contain a whitespace (
, or \t
) character.
The remaining lines consist of DNA, RNA, or amino acid sequences using IUPAC notation. There are no technical specifications for line length or wrapping, but it is common practice for long sequences (genes/genomes) to be have a consistent line length. However, it is not required for the sequence to have any breaks. Blank lines and any whitespace in the sequence line(s) should be ignored.
FASTA is the most common format for distributing reference genome sequences. Reference genomes will contain one sequence for each chromosome. In addition to chromosomes, some reference genomes will also include unplaced contigs, which are sequences known to be part of the organism's genome, but the exact position is unclear.
FASTA files are commonly stored as gzip
compressed files, due to their large size and general space inefficiency. However,
files that need to be randomly accessed need to be stored uncompressed, or compressed with a tool like bgzip
(although the
latter is less common).
>sequence1 This is the comment line
ATCGATCGATCGACTACGACTACGACGACGACATCGACATCTACT
GGCGCGCGCTAGAGCTAGCTTGAGATAAATCGACTAGCGACTGAG
CTATCTTCTCTATATATTTAAAAAGCGCAACTACTGACTA
>seq2
AATAGCGCGCGCGCGCTCATATATCTATATATAAAAACCTACTAC
GACTACGACTATCGATCGATTATCGGTATCGTATCGGTATTATTA
TTTAATGCGCGCGCGCCGACTAGCTAGCTATCGATCGATCGATCG
ACTACGACTACGACGACGACATCGACATCTACT
FASTA files can be indexed and queried with the program samtools faidx
. This requires a well-formatted FASTA file with
consistent line lengths.
ngsutilsj contains a number of tools for managing FASTA files, including indexed FASTA files. These tools including tagging sequence names, masking regions of sequences, splitting a FASTA file by sequence name, changing the line wrapping for a file, or generating mock FASTQ reads.
https://en.wikipedia.org/wiki/FASTA_format
FASTQ is the most common sequencing read format. FASTQ files are text files with a four-line record for each sequence. The first line starts with an '@' character and the unique name for the read. The second line is the sequence. The third line starts with the '+' character, and may optionally contain the read name again. The fourth line is the quality score for each of the basecalls in Phred scale.
The sequence and quality lines in the FASTQ record may be word-wrapped, like FASTA files, but this is not seen very often.
FASTQ files are almost always stored compressed, usually with gzip
, although bzip2
or xz
compression can
also be used. FASTQ record information can also be stored in other compressed formats, such as unaligned BAM files. However, almost all publicly available data is distributed as gzip-compressed FASTQ files.
Paired-end sequencing data is commonly stored as two separate FASTQ files, one for each read. But, it is also possible
to store both reads in the same file. These are called interleaved FASTQ files. Many aligners support interleaved
files out of the box for paired end data. For those that don't an adapter tool can be used to convert interleaved files
to non-interleaved files using named FIFO pipes. Interleaved FASTQ files are slightly have slightly more efficient
compression ratios when compared to using two separate FASTQ files, but the main benefit is the need to only manage
one file per sample. Some filtering steps can be more efficient using interleaved files and some filtering steps with the ngsutilsj fastq-filter tool require interleaved files (e.g. --paired
).
@seq1 This is the comment line
ATCGATCGATCGACTACGACTACGACGACGACATCGACATCTACT
+
BBBB,(,,7AA,<((,A<,,,FKAF,,,F,F7F,7,,,AA##@!@
https://en.wikipedia.org/wiki/FASTQ_format
BAM files are typically associated with read alignments to a genome, but they can also be used to store unaligned/raw sequences too. Unaligned BAM files store the same read information as a FASTQ file (name, sequence, and quality scores) and they can also store multiple reads in the same file (like interleaved FASTQ files). Also, unaligned BAM files are compressed, and store quality score information in an optimized manner versus character encoding. Even with the extra overhead of the BAM format, these factors make this a slightly more efficient way to store raw sequencing reads over FASTQ files.
Both FASTQ (compressed or not) and unaligned BAM files can be used as a source for sequencing reads with ngsutilsj. Compression status is determined automatically by ngsutilsj.