Genome fasta and GTF files #2
Genome References
The latest GTF (V43) references from GENCODE are available here: https://www.gencodegenes.org/human/
We downloaded the top-level, soft-masked fasta and the GTF file from Ensembl: https://useast.ensembl.org/info/data/ftp/index.html
TOPLEVEL
These files contain all sequence regions flagged as toplevel in an Ensembl schema. This includes chromosomes, regions not assembled into chromosomes and N-padded haplotype/patch regions.
Download Files
# Fasta File
wget https://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.toplevel.fa.gz
# Annotations
wget https://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/Homo_sapiens.GRCh38.109.chr.gtf.gz
The genome assembly downloaded from Ensembl corresponds to GenBank assembly ID GCA_000001405.28, which is GRCh38.p13. However, GRCh38.p13 has been replaced by GRCh38.p14. https://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/README
The genome assembly GRCh38.p14 can be downloaded from NCBI here (the GCF_ prefix in the path marks the RefSeq copy of the assembly): https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/
The NCBI fasta headers look like this:
Using the Ensembl genome fasta the headers look like this:
The GTF from ENCODE has the format:
We can do a quick check to compare the Ensembl and GenBank references. They should be the same. The Ensembl chromosome names appear to be a better match for the GTF from ENCODE.
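One way to run such a quick check is to compare sequence lengths, since the two references name their sequences differently. The sketch below is not the command used in this thread; `fasta_lengths` is a hypothetical helper, and `ncbi_genome.fna.gz` is a placeholder for whatever fasta file is downloaded from the NCBI link. The primary chromosomes should have identical lengths in p13 and p14, while patch and alternate scaffolds can differ.

```shell
# Hypothetical helper: read a FASTA stream on stdin and print
# "<sequence name><TAB><length>" for every record.
fasta_lengths() {
  awk '
    /^>/ {
      if (name != "") print name "\t" len   # flush the previous record
      name = substr($1, 2)                  # first word of the header, without ">"
      len = 0
      next
    }
    { len += length($0) }                   # sequence line: add its length
    END { if (name != "") print name "\t" len }
  '
}

# Example usage (file names on the NCBI side are placeholders):
#   zcat Homo_sapiens.GRCh38.dna_sm.toplevel.fa.gz | fasta_lengths | cut -f2 | sort -n > ensembl.lengths
#   zcat ncbi_genome.fna.gz | fasta_lengths | cut -f2 | sort -n > ncbi.lengths
#   comm -3 ensembl.lengths ncbi.lengths    # lengths present in only one reference
```

Comparing only the length column sidesteps the naming difference; it shows that the sequences have the same sizes, not that they are identical base for base.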
GTF format from ENSEMBL before and after subsetting miRNA sequences.
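The subsetting command itself is not preserved in this thread. As a minimal sketch, assuming miRNA records are identified by the `gene_biotype "miRNA"` attribute (Ensembl style) or `gene_type "miRNA"` (GENCODE style), a filter could look like the following. The function name `gtf_subset_mirna` and the output file name are hypothetical.

```shell
# Hypothetical helper: read a GTF stream on stdin, keep the "#" header lines
# and every record whose attribute column marks the gene as miRNA.
gtf_subset_mirna() {
  awk -F '\t' '
    /^#/ { print; next }                            # keep header comment lines
    $9 ~ /gene_(biotype|type) "miRNA"/ { print }    # column 9 holds the attributes
  '
}

# Example usage:
#   zcat Homo_sapiens.GRCh38.109.chr.gtf.gz | gtf_subset_mirna > mirna_subset.gtf
```

This matches on the gene-level biotype only, so a miRNA transcript annotated under a gene of another biotype would not be kept.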
NCBI reference vs. Ensembl reference: the NCBI reference has uppercase Ns and the Ensembl reference has lowercase Ns (n) when looking at the headers. However, both seem to be soft-masked genomic DNA, i.e. all repeats and low-complexity regions have been replaced with lowercased versions of their nucleic bases. (Screenshots below).
Ensembl reference:
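The masking observation can also be checked numerically rather than by eye. The sketch below counts uppercase bases, lowercase bases, and uppercase and lowercase Ns in a FASTA stream; `fasta_case_counts` is a hypothetical helper, not something used in this thread. A soft-masked file should show a large `lower_acgt` count, and the N/n difference between the two references should show up in the last two rows.

```shell
# Hypothetical helper: read a FASTA stream on stdin and count bases by case.
fasta_case_counts() {
  awk '
    /^>/ { next }                               # skip header lines
    {
      line = $0
      upper_n += gsub(/N/, "", line)            # gsub returns the number of matches
      lower_n += gsub(/n/, "", line)
      lower   += gsub(/[acgt]/, "", line)       # soft-masked bases
      upper   += gsub(/[ACGT]/, "", line)       # unmasked bases
    }
    END {
      printf "upper_ACGT\t%.0f\nlower_acgt\t%.0f\nupper_N\t%.0f\nlower_n\t%.0f\n", upper, lower, upper_n, lower_n
    }
  '
}

# Example usage:
#   zcat Homo_sapiens.GRCh38.dna_sm.toplevel.fa.gz | fasta_case_counts
```

On a whole genome this is slow in awk, so running it on a single chromosome first is a reasonable shortcut.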
In the pipeline developed in 2018, the genome file used is:
/fdb/igenomes/Homo_sapiens/NCBI/GRCh38/Sequence/WholeGenomeFasta/genome.fa
And the GTF annotation file is:
ENCFF628BVT.gtf
The same GTF file as above can be downloaded from the ENCODE website here:
https://www.encodeproject.org/files/ENCFF628BVT/
There are 12,279 miRNA entries in this GTF file and three types of genomic features: gene, transcript and exon:
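Counts like these can be reproduced by tallying column 3 (the feature type) over the miRNA records. The sketch below assumes the miRNA records carry `gene_type "miRNA"` or `gene_biotype "miRNA"` in the attribute column, which has not been verified against this particular file; `gtf_mirna_feature_counts` is a hypothetical name.

```shell
# Hypothetical helper: read a GTF stream on stdin and print
# "<feature type><TAB><count>" for records annotated as miRNA.
gtf_mirna_feature_counts() {
  awk -F '\t' '
    /^#/ { next }                                    # skip header comment lines
    $9 ~ /gene_(biotype|type) "miRNA"/ { n[$3]++ }   # column 3 is the feature type
    END { for (f in n) print f "\t" n[f] }
  ' | sort
}

# Example usage:
#   gtf_mirna_feature_counts < ENCFF628BVT.gtf
```

Summing the second column gives the total number of miRNA entries.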