Genome fasta and GTF files #2

Open
komaljain3 opened this issue Jun 7, 2023 · 4 comments

@komaljain3
Collaborator

In the pipeline developed in 2018, the genome file used is:

/fdb/igenomes/Homo_sapiens/NCBI/GRCh38/Sequence/WholeGenomeFasta/genome.fa

And the GTF annotation file is:

ENCFF628BVT.gtf

The same GTF file can be downloaded from the ENCODE website:

https://www.encodeproject.org/files/ENCFF628BVT/

There are 12,279 miRNA entries in this GTF file, and three types of genomic features: gene, transcript and exon:

Screenshot 2023-06-07 at 3 36 59 PM
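
The feature types and miRNA count can be re-derived from the GTF itself. The following is a minimal sketch, not the command used for the figure above; the function names are hypothetical, and it assumes the ENCODE-style `gene_type` attribute key (the Ensembl GTF uses `gene_biotype` instead).

```shell
# Tally column 3 (feature type) of a GTF, skipping '#' comment lines.
gtf_feature_counts() {
    awk -F'\t' '!/^#/ { n[$3]++ } END { for (f in n) print f "\t" n[f] }' "$1" | sort
}

# Count "gene" records whose attribute column carries gene_type "miRNA".
gtf_mirna_gene_count() {
    awk -F'\t' '!/^#/ && $3 == "gene" && $9 ~ /gene_type "miRNA"/ { c++ }
                END { print c + 0 }' "$1"
}

# Example (not run here):
#   gtf_feature_counts ENCFF628BVT.gtf
#   gtf_mirna_gene_count ENCFF628BVT.gtf
```

Counting only `gene` records avoids counting each miRNA three times (once per gene, transcript and exon line).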
@komaljain3
Collaborator Author

komaljain3 commented Jun 7, 2023

Genome References

The latest GTF annotation (v43) from GENCODE is available here:

https://www.gencodegenes.org/human/

We downloaded the top-level, soft-masked FASTA and the GTF file from Ensembl:

https://useast.ensembl.org/info/data/ftp/index.html

TOPLEVEL

These files contain all sequence regions flagged as toplevel in an Ensembl schema. This includes chromosomes, regions not assembled into chromosomes and N-padded haplotype/patch regions.

EXAMPLES

Toplevel sequences unmasked:
Homo_sapiens.GRCh37.dna.toplevel.fa.gz

Toplevel soft/hard masked sequences:
Homo_sapiens.GRCh37.dna_sm.toplevel.fa.gz
Homo_sapiens.GRCh37.dna_rm.toplevel.fa.gz

Sequence Type

  • 'dna' - unmasked genomic DNA sequences.
  • 'dna_rm' - masked genomic DNA. Interspersed repeats and low complexity regions are detected with the RepeatMasker tool and masked by replacing repeats with 'N's.
  • 'dna_sm' - soft-masked genomic DNA. All repeats and low complexity regions have been replaced with lowercased versions of their nucleic base
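
To tell which of these sequence types a given FASTA actually is, one can count the character classes in it. This is a rough sketch with a hypothetical function name: lowercase bases indicate soft-masking, while N/n covers both assembly gaps and hard-masked repeats, so the N count alone cannot separate the two.

```shell
# Print counts of uppercase, lowercase and N/n characters in the sequence
# lines of a FASTA file. LC_ALL=C keeps [a-z] and [A-Z] strictly ASCII.
fasta_mask_summary() {
    LC_ALL=C awk '!/^>/ {
            line = $0
            n  += gsub(/[Nn]/, "", line)   # gaps or hard-masked bases
            lo += gsub(/[a-z]/, "", line)  # remaining lowercase: soft-masked
            up += gsub(/[A-Z]/, "", line)  # remaining uppercase: unmasked
        }
        END { printf "upper=%d lower=%d N=%d\n", up, lo, n }' "$1"
}

# Example (not run here; slow on a whole genome):
#   fasta_mask_summary Homo_sapiens.GRCh38.dna_sm.toplevel.fa
```

A `dna` file would be expected to show no lowercase bases, a `dna_sm` file a large lowercase count, and a `dna_rm` file a much larger N count than the other two.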

Download Files

# Fasta File
wget https://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.toplevel.fa.gz

# Annotations
wget https://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/Homo_sapiens.GRCh38.109.chr.gtf.gz 

@komaljain3
Collaborator Author

komaljain3 commented Jun 7, 2023

The genome assembly downloaded from Ensembl corresponds to GenBank assembly accession GCA_000001405.28, which is GRCh38.p13. However, GRCh38.p13 has since been replaced by GRCh38.p14.

https://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/README

Screenshot 2023-06-07 at 4 39 21 PM Screenshot 2023-06-07 at 4 39 32 PM

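The GRCh38.p14 assembly can be downloaded from NCBI. Note that the GCF_ prefix in the path below denotes the RefSeq copy of the assembly; GenBank assembly accessions start with GCA_: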

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/

The NCBI FASTA headers look like this:

 jaink4$ grep ">" GCF_000001405.40_GRCh38.p14_genomic.fna
>NC_000001.11 Homo sapiens chromosome 1, GRCh38.p14 Primary Assembly
>NT_187361.1 Homo sapiens chromosome 1 unlocalized genomic scaffold, GRCh38.p14 Primary Assembly HSCHR1_CTG1_UNLOCALIZED
>NT_187362.1 Homo sapiens chromosome 1 unlocalized genomic scaffold, GRCh38.p14 Primary Assembly HSCHR1_CTG2_UNLOCALIZED
>NT_187363.1 Homo sapiens chromosome 1 unlocalized genomic scaffold, GRCh38.p14 Primary Assembly HSCHR1_CTG3_UNLOCALIZED
>NT_187364.1 Homo sapiens chromosome 1 unlocalized genomic scaffold, GRCh38.p14 Primary Assembly HSCHR1_CTG4_UNLOCALIZED
>NT_187365.1 Homo sapiens chromosome 1 unlocalized genomic scaffold, GRCh38.p14 Primary Assembly HSCHR1_CTG5_UNLOCALIZED
>NT_187366.1 Homo sapiens chromosome 1 unlocalized genomic scaffold, GRCh38.p14 Primary Assembly HSCHR1_CTG6_UNLOCALIZED
>NT_187367.1 Homo sapiens chromosome 1 unlocalized genomic scaffold, GRCh38.p14 Primary Assembly HSCHR1_CTG7_UNLOCALIZED
>NT_187368.1 Homo sapiens chromosome 1 unlocalized genomic scaffold, GRCh38.p14 Primary Assembly HSCHR1_CTG8_UNLOCALIZED
>NT_187369.1 Homo sapiens chromosome 1 unlocalized genomic scaffold, GRCh38.p14 Primary Assembly HSCHR1_CTG9_UNLOCALIZED
>NC_000002.12 Homo sapiens chromosome 2, GRCh38.p14 Primary Assembly
>NT_187370.1 Homo sapiens chromosome 2 unlocalized genomic scaffold, GRCh38.p14 Primary Assembly HSCHR2_RANDOM_CTG1
>NT_187371.1 Homo sapiens chromosome 2 unlocalized genomic scaffold, GRCh38.p14 Primary Assembly HSCHR2_RANDOM_CTG2
>NC_000003.12 Homo sapiens chromosome 3, GRCh38.p14 Primary Assembly
>NT_167215.1 Homo sapiens chromosome 3 unlocalized genomic scaffold, GRCh38.p14 Primary Assembly HSCHR3UN_CTG2
>NC_000004.12 Homo sapiens chromosome 4, GRCh38.p14 Primary Assembly
>NT_113793.3 Homo sapiens chromosome 4 unlocalized genomic scaffold, GRCh38.p14 Primary Assembly HSCHR4_RANDOM_CTG4
>NC_000005.10 Homo sapiens chromosome 5, GRCh38.p14 Primary Assembly

The Ensembl genome FASTA headers look like this:

jaink4@nci-cgr:/DCEG/CGF/Research/RD168_Chernobyl_TN-Pairs/ANALYSIS_miR/2023-05-30-miRNA-pipeline-test/ref$ grep ">" Homo_sapiens.GRCh38.dna_sm.toplevel.fa
>1 dna_sm:chromosome chromosome:GRCh38:1:1:248956422:1 REF
>2 dna_sm:chromosome chromosome:GRCh38:2:1:242193529:1 REF
>3 dna_sm:chromosome chromosome:GRCh38:3:1:198295559:1 REF
>4 dna_sm:chromosome chromosome:GRCh38:4:1:190214555:1 REF
>5 dna_sm:chromosome chromosome:GRCh38:5:1:181538259:1 REF
>6 dna_sm:chromosome chromosome:GRCh38:6:1:170805979:1 REF
>7 dna_sm:chromosome chromosome:GRCh38:7:1:159345973:1 REF
>8 dna_sm:chromosome chromosome:GRCh38:8:1:145138636:1 REF

The GTF from ENCODE has this format:

jaink4@nci-cgr:/DCEG/CGF/Research/RD168_Chernobyl_TN-Pairs/ANALYSIS_miR/2023-05-30-miRNA-pipeline-test/ref$ head ENCFF628BVT.gtf
chr1	ENSEMBL	gene	17369	17436	.	-	.	gene_id "ENSG00000278267.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; level 3;
chr1	ENSEMBL	transcript	17369	17436	.	-	.	gene_id "ENSG00000278267.1"; transcript_id "ENST00000619216.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; transcript_type "miRNA"; transcript_status "KNOWN"; transcript_name "MIR6859-1-201"; level 3; tag "basic"; transcript_support_level "NA";
chr1	ENSEMBL	exon	17369	17436	.	-	.	gene_id "ENSG00000278267.1"; transcript_id "ENST00000619216.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; transcript_type "miRNA"; transcript_status "KNOWN"; transcript_name "MIR6859-1-201"; exon_number 1; exon_id "ENSE00003746039.1"; level 3; tag "basic"; transcript_support_level "NA";
chr1	ENSEMBL	gene	30366	30503	.	+	.	gene_id "ENSG00000274890.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR1302-2"; level 3;
chr1	ENSEMBL	transcript	30366	30503	.	+	.	gene_id "ENSG00000274890.1"; transcript_id "ENST00000607096.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR1302-2"; transcript_type "miRNA"; transcript_status "KNOWN"; transcript_name "MIR1302-2-201"; level 3; tag "basic"; transcript_support_level "NA";
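
Since the listings above use three different naming schemes (NC_000001.11, 1, chr1), it may help to check mechanically which GTF sequence names have no matching FASTA record. A minimal sketch with hypothetical function names; it treats the first whitespace-delimited token of each FASTA header as the sequence name.

```shell
# Sequence names in a FASTA: first token of each header, without the '>'.
fasta_seq_names() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1" | sort -u
}

# GTF column-1 names that do not occur as a FASTA sequence name.
# usage: gtf_names_missing_from_fasta <fasta> <gtf>
gtf_names_missing_from_fasta() {
    awk -F'\t' '
        FNR == NR {                       # first file: the FASTA
            if (/^>/) { split(substr($0, 2), h, /[ \t]/); seen[h[1]] = 1 }
            next
        }
        !/^#/ && NF && !(($1) in seen) && !(($1) in reported) {
            reported[$1] = 1              # print each missing name once
            print $1
        }
    ' "$1" "$2"
}

# Example (not run here):
#   gtf_names_missing_from_fasta Homo_sapiens.GRCh38.dna_sm.toplevel.fa ENCFF628BVT.gtf
```

An empty result means every sequence name in the GTF is present in the FASTA; any name printed would need renaming in one of the two files before alignment and counting.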

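We can do a quick check to compare the Ensembl and NCBI references; the chromosome sequences should be the same. The Ensembl chromosome names (1, 2, ...) are closer to those in the ENCODE GTF (chr1, chr2, ...) than the NCBI accessions (NC_000001.11, ...), although they still differ by the "chr" prefix.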
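
One way such a quick check could look is to checksum a single chromosome from each file after removing line wrapping and letter case, since the two files wrap lines at different widths and may mask differently. This is a sketch with a hypothetical function name; `cksum` is a CRC, which is adequate for a sanity check but not a cryptographic guarantee.

```shell
# Checksum one FASTA record, ignoring line wrapping and upper/lower case.
# $1 = FASTA path, $2 = record name (first header token, without '>').
fasta_record_cksum() {
    awk -v want="$2" '
        /^>/ { split(substr($0, 2), h, /[ \t]/); keep = (h[1] == want); next }
        keep { printf "%s", toupper($0) }
    ' "$1" | cksum
}

# Example (not run here):
#   fasta_record_cksum GCF_000001405.40_GRCh38.p14_genomic.fna NC_000001.11
#   fasta_record_cksum Homo_sapiens.GRCh38.dna_sm.toplevel.fa 1
```

If both commands print the same checksum and byte count, chromosome 1 is identical in the two references apart from masking and formatting.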

@komaljain3
Collaborator Author

GTF format from Ensembl before and after subsetting to miRNA entries:

jaink4@nci-cgr:/DCEG/CGF/Research/RD168_Chernobyl_TN-Pairs/ANALYSIS_miR/2023-05-30-miRNA-pipeline-test/ref$ head Homo_sapiens.GRCh38.109.gtf
#!genome-build GRCh38.p13
#!genome-version GRCh38
#!genome-date 2013-12
#!genome-build-accession GCA_000001405.28
#!genebuild-last-updated 2022-11
1	ensembl_havana	gene	1471765	1497848	.	+	.	gene_id "ENSG00000160072"; gene_version "20"; gene_name "ATAD3B"; gene_source "ensembl_havana"; gene_biotype "protein_coding";
1	ensembl_havana	transcript	1471765	1497848	.	+	.	gene_id "ENSG00000160072"; gene_version "20"; transcript_id "ENST00000673477"; transcript_version "1"; gene_name "ATAD3B"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "ATAD3B-206"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS30"; tag "basic"; tag "Ensembl_canonical"; tag "MANE_Select";
1	ensembl_havana	exon	1471765	1472089	.	+	.	gene_id "ENSG00000160072"; gene_version "20"; transcript_id "ENST00000673477"; transcript_version "1"; exon_number "1"; gene_name "ATAD3B"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "ATAD3B-206"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS30"; exon_id "ENSE00003889014"; exon_version "1"; tag "basic"; tag "Ensembl_canonical"; tag "MANE_Select";
1	ensembl_havana	CDS	1471885	1472089	.	+	0	gene_id "ENSG00000160072"; gene_version "20"; transcript_id "ENST00000673477"; transcript_version "1"; exon_number "1"; gene_name "ATAD3B"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "ATAD3B-206"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS30"; protein_id "ENSP00000500094"; protein_version "1"; tag "basic"; tag "Ensembl_canonical"; tag "MANE_Select";
1	ensembl_havana	start_codon	1471885	1471887	.	+	0	gene_id "ENSG00000160072"; gene_version "20"; transcript_id "ENST00000673477"; transcript_version "1"; exon_number "1"; gene_name "ATAD3B"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "ATAD3B-206"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS30"; tag "basic"; tag "Ensembl_canonical"; tag "MANE_Select";
jaink4@nci-cgr:/DCEG/CGF/Research/RD168_Chernobyl_TN-Pairs/ANALYSIS_miR/2023-05-30-miRNA-pipeline-test/ref$ head Homo_sapiens.GRCh38.109.miRNA.gtf
1	mirbase	gene	187891	187958	.	-	.	gene_id "ENSG00000273874"; gene_version "1"; gene_name "MIR6859-2"; gene_source "mirbase"; gene_biotype "miRNA";
1	mirbase	transcript	187891	187958	.	-	.	gene_id "ENSG00000273874"; gene_version "1"; transcript_id "ENST00000612080"; transcript_version "1"; gene_name "MIR6859-2"; gene_source "mirbase"; gene_biotype "miRNA"; transcript_name "MIR6859-2-201"; transcript_source "mirbase"; transcript_biotype "miRNA"; tag "basic"; tag "Ensembl_canonical"; transcript_support_level "NA";
1	mirbase	exon	187891	187958	.	-	.	gene_id "ENSG00000273874"; gene_version "1"; transcript_id "ENST00000612080"; transcript_version "1"; exon_number "1"; gene_name "MIR6859-2"; gene_source "mirbase"; gene_biotype "miRNA"; transcript_name "MIR6859-2-201"; transcript_source "mirbase"; transcript_biotype "miRNA"; exon_id "ENSE00003737837"; exon_version "1"; tag "basic"; tag "Ensembl_canonical"; transcript_support_level "NA";
1	mirbase	gene	5564071	5564143	.	+	.	gene_id "ENSG00000264341"; gene_version "1"; gene_source "mirbase"; gene_biotype "miRNA";
1	mirbase	transcript	5564071	5564143	.	+	.	gene_id "ENSG00000264341"; gene_version "1"; transcript_id "ENST00000579887"; transcript_version "1"; gene_source "mirbase"; gene_biotype "miRNA"; transcript_source "mirbase"; transcript_biotype "miRNA"; tag "basic"; tag "Ensembl_canonical"; transcript_support_level "NA";
1	mirbase	exon	5564071	5564143	.	+	.	gene_id "ENSG00000264341"; gene_version "1"; transcript_id "ENST00000579887"; transcript_version "1"; exon_number "1"; gene_source "mirbase"; gene_biotype "miRNA"; transcript_source "mirbase"; transcript_biotype "miRNA"; exon_id "ENSE00002721598"; exon_version "1"; tag "basic"; tag "Ensembl_canonical"; transcript_support_level "NA";
1	mirbase	gene	5862672	5862741	.	-	.	gene_id "ENSG00000264101"; gene_version "1"; gene_name "MIR4689"; gene_source "mirbase"; gene_biotype "miRNA";
1	mirbase	transcript	5862672	5862741	.	-	.	gene_id "ENSG00000264101"; gene_version "1"; transcript_id "ENST00000582517"; transcript_version "1"; gene_name "MIR4689"; gene_source "mirbase"; gene_biotype "miRNA"; transcript_name "MIR4689-201"; transcript_source "mirbase"; transcript_biotype "miRNA"; tag "basic"; tag "Ensembl_canonical"; transcript_support_level "NA";
1	mirbase	exon	5862672	5862741	.	-	.	gene_id "ENSG00000264101"; gene_version "1"; transcript_id "ENST00000582517"; transcript_version "1"; exon_number "1"; gene_name "MIR4689"; gene_source "mirbase"; gene_biotype "miRNA"; transcript_name "MIR4689-201"; transcript_source "mirbase"; transcript_biotype "miRNA"; exon_id "ENSE00002689481"; exon_version "1"; tag "basic"; tag "Ensembl_canonical"; transcript_support_level "NA";
1	mirbase	gene	18883202	18883275	.	-	.	gene_id "ENSG00000265606"; gene_version "1"; gene_name "MIR4695"; gene_source "mirbase"; gene_biotype "miRNA";
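
The thread does not show how the miRNA subset was produced. A filter along the following lines would give output of the same shape; this is an assumption rather than the command actually used, and the function name is hypothetical. It keeps records whose attribute column contains `gene_biotype "miRNA"` and drops the `#!` header lines, as in the listing above.

```shell
# Keep only GTF records annotated with gene_biotype "miRNA".
# Header comment lines (starting with '#') are dropped.
subset_mirna_gtf() {
    awk -F'\t' '!/^#/ && $9 ~ /gene_biotype "miRNA"/' "$1"
}

# Example (not run here):
#   subset_mirna_gtf Homo_sapiens.GRCh38.109.gtf > Homo_sapiens.GRCh38.109.miRNA.gtf
```

Filtering on `gene_biotype` keeps the gene, transcript and exon lines of each miRNA gene together, because Ensembl repeats the gene-level attributes on every child record.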

@komaljain3
Collaborator Author

NCBI reference vs. Ensembl reference:

The NCBI reference starts with uppercase Ns and the Ensembl reference with lowercase Ns (n), as the first lines of chromosome 1 show. However, both appear to be soft-masked genomic DNA, i.e. all repeats and low-complexity regions have been replaced with lowercase versions of their nucleotides (screenshots below).

(base) NCI-02225697-ML:microRNA jaink4$ head GCF_000001405.40_GRCh38.p14_genomic.fna
>NC_000001.11 Homo sapiens chromosome 1, GRCh38.p14 Primary Assembly
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Ensembl reference:

jaink4@nci-cgr:/DCEG/CGF/Research/RD168_Chernobyl_TN-Pairs/ANALYSIS_miR/2023-05-30-miRNA-pipeline-test/ref$ head  Homo_sapiens.GRCh38.dna_sm.toplevel.fa
>1 dna_sm:chromosome chromosome:GRCh38:1:1:248956422:1 REF
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
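
Because `head` only reaches the leading run of Ns, it says nothing about masking of real sequence. A small sketch (hypothetical function name) that jumps to the first sequence line containing a base other than N/n:

```shell
# Print "<line number>: <line>" for the first sequence line that contains
# at least one character other than N or n, then stop reading.
first_non_n_line() {
    awk '!/^>/ && /[^Nn]/ { print NR ": " $0; exit }' "$1"
}

# Example (not run here):
#   first_non_n_line Homo_sapiens.GRCh38.dna_sm.toplevel.fa
#   first_non_n_line GCF_000001405.40_GRCh38.p14_genomic.fna
```

Mixed upper- and lowercase bases on the reported line would be consistent with soft-masking in both files.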

ENSEMBL
Screenshot 2023-06-07 at 6 18 43 PM

GenBank
Screenshot 2023-06-07 at 6 19 41 PM
