Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

STAR index creation and mapping test #1

Open
komaljain3 opened this issue Jun 7, 2023 · 5 comments
Open

STAR index creation and mapping test #1

komaljain3 opened this issue Jun 7, 2023 · 5 comments
Assignees

Comments

@komaljain3
Copy link
Collaborator

The command for creating STAR index is:

STAR \
	--runThreadN $SLURM_CPUS_PER_TASK \
	--runMode genomeGenerate \
	--genomeDir star_index \
	--sjdbGTFfile /DCEG/CGF/Research/RD168_Chernobyl_TN-Pairs/ANALYSIS_miR/2023-05-30-miRNA-pipeline-test/ref/Homo_sapiens.GRCh38.109.gtf \
	--sjdbOverhang 1 \
	--limitGenomeGenerateRAM 180000000000 \
 	--genomeFastaFiles /DCEG/CGF/Research/RD168_Chernobyl_TN-Pairs/ANALYSIS_miR/2023-05-30-miRNA-pipeline-test/ref/Homo_sapiens.GRCh38.dna_sm.toplevel.fa

From Star Manual:

--sjdbOverhang default: 100

int>0: length of the donor/acceptor sequence on each side of the junctions, ideally = (mate length - 1)

From Google search

The --sjdbOverhang is used only at the genome generation step and tells STAR how many bases to concatenate from donor and acceptor sides of the junctions. If you have 100b reads, the ideal value of --sjdbOverhang is 99, which allows the 100b read to map 99b on one side, 1b on the other side. One can think of --sjdbOverhang as the maximum possible overhang for your reads.
On the other hand, --alignSJDBoverhangMin is used at the mapping step to define the minimum allowed overhang over splice junctions. For example, the default value of 3 would prohibit overhangs of 1b or 2b

Alex Dobin response in a Google Group

https://groups.google.com/g/rna-star/c/RBWvAGFooMU

Hi Eugene,

--sjdbOverhang 1 is a hack to prohibit splicing over annotated junctions in the GTF, but still use the GTF for counting reads over genes.
If you do not need counting over genes in the GTF file, you can omit the --sjdbGTFfile and --sjdbOverhang parameters altogether.

Cheers
Alex

@komaljain3
Copy link
Collaborator Author

komaljain3 commented Jun 7, 2023

Using the Ensembl genome fasta the headers look like this:

jaink4@nci-cgr:/DCEG/CGF/Research/RD168_Chernobyl_TN-Pairs/ANALYSIS_miR/2023-05-30-miRNA-pipeline-test/ref$ grep ">" Homo_sapiens.GRCh38.dna_sm.toplevel.fa
>1 dna_sm:chromosome chromosome:GRCh38:1:1:248956422:1 REF
>2 dna_sm:chromosome chromosome:GRCh38:2:1:242193529:1 REF
>3 dna_sm:chromosome chromosome:GRCh38:3:1:198295559:1 REF
>4 dna_sm:chromosome chromosome:GRCh38:4:1:190214555:1 REF
>5 dna_sm:chromosome chromosome:GRCh38:5:1:181538259:1 REF
>6 dna_sm:chromosome chromosome:GRCh38:6:1:170805979:1 REF
>7 dna_sm:chromosome chromosome:GRCh38:7:1:159345973:1 REF
>8 dna_sm:chromosome chromosome:GRCh38:8:1:145138636:1 REF

The gtf from Encode has the format:

jaink4@nci-cgr:/DCEG/CGF/Research/RD168_Chernobyl_TN-Pairs/ANALYSIS_miR/2023-05-30-miRNA-pipeline-test/ref$ head ENCFF628BVT.gtf
chr1	ENSEMBL	gene	17369	17436	.	-	.	gene_id "ENSG00000278267.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; level 3;
chr1	ENSEMBL	transcript	17369	17436	.	-	.	gene_id "ENSG00000278267.1"; transcript_id "ENST00000619216.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; transcript_type "miRNA"; transcript_status "KNOWN"; transcript_name "MIR6859-1-201"; level 3; tag "basic"; transcript_support_level "NA";
chr1	ENSEMBL	exon	17369	17436	.	-	.	gene_id "ENSG00000278267.1"; transcript_id "ENST00000619216.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; transcript_type "miRNA"; transcript_status "KNOWN"; transcript_name "MIR6859-1-201"; exon_number 1; exon_id "ENSE00003746039.1"; level 3; tag "basic"; transcript_support_level "NA";
chr1	ENSEMBL	gene	30366	30503	.	+	.	gene_id "ENSG00000274890.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR1302-2"; level 3;
chr1	ENSEMBL	transcript	30366	30503	.	+	.	gene_id "ENSG00000274890.1"; transcript_id "ENST00000607096.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR1302-2"; transcript_type "miRNA"; transcript_status "KNOWN"; transcript_name "MIR1302-2-201"; level 3; tag "basic"; transcript_support_level "NA";

When the gtf from ENCODE is provided to the Ensembl fasta, due to the naming incompatibility, there is an error in STAR index generation.

Failed Results

Fatal INPUT FILE error, no valid exon lines in the GTF file: star_index/ENCFF628BVT.gtf
Solution: check the formatting of the GTF file. One likely cause is the difference in chromosome naming between GTF and FASTA file.

May 31 10:07:40 ...... FATAL ERROR, exiting

Solution 1: Generate new miRNA GTF file by subsetting full GTF from Ensembl

cat Homo_sapiens.GRCh38.109.gtf | grep miRNA > Homo_sapiens.GRCh38.109.miRNA.gtf

Reran star_align, change --sjdbGTFfile

SAMPLE=ACBW0PANXX_L1
INPUT_FASTQ=/DCEG/Projects/Exome/SequencingData/DAATeam/Xin/ad_hoc/MyeloidSpike95/Illumina/HiSeq/PostRun_Analysis/Data/180329_D00620_0114_ACBW0PANXX/CASAVA/L1/Undetermined_S0_L001_R1_001.fastq.gz
GTF_ANNOTATION=Homo_sapiens.GRCh38.109.miRNA.gtf

sbatch star_align.sh $SAMPLE $INPUT_FASTQ $GTF_ANNOTATION

Results

Runtime: 25 minutes

ACBW0PANXX_L1/
├── ACBW0PANXX_L1Aligned.sortedByCoord.out.bam
├── ACBW0PANXX_L1Aligned.toTranscriptome.out.bam
├── ACBW0PANXX_L1Log.final.out
├── ACBW0PANXX_L1Log.out
├── ACBW0PANXX_L1Log.progress.out
├── ACBW0PANXX_L1ReadsPerGene.out.tab
├── ACBW0PANXX_L1SJ.out.tab
├── ACBW0PANXX_L1Signal.Unique.str1.out.wig
├── ACBW0PANXX_L1Signal.Unique.str2.out.wig
├── ACBW0PANXX_L1Signal.UniqueMultiple.str1.out.wig
├── ACBW0PANXX_L1Signal.UniqueMultiple.str2.out.wig
├── ACBW0PANXX_L1Unmapped.out.mate1
├── ACBW0PANXX_L1_STARgenome
│   ├── exonGeTrInfo.tab
│   ├── exonInfo.tab
│   ├── geneInfo.tab
│   ├── sjdbInfo.txt
│   ├── sjdbList.fromGTF.out.tab
│   ├── sjdbList.out.tab
│   └── transcriptInfo.tab

ACBW0PANXX_L1ReadsPerGene.out.tab


N_unmapped	10125126	10125126	10125126
N_multimapping	424700	424700	424700
N_noFeature	115648	115648	532049
N_ambiguous	3058	0	0
ENSG00000273874	0	0	0
ENSG00000264341	0	0	0
ENSG00000264101	0	0	0
ENSG00000265606	0	0	0
ENSG00000284357	63	63	0
ENSG00000207941	0	0	0
ENSG00000264773	0	0	0
ENSG00000264698	0	0	0
ENSG00000265596	0	0	0
ENSG00000283724	0	0	0
ENSG00000284410	2	2	0
ENSG00000283477	0	0	0
ENSG00000264834	0	0	0
ENSG00000263675	0	0	0
ENSG00000266188	0	0	0
ENSG00000265815	0	0	0
ENSG00000263908	0	0	0
ENSG00000283690	0	0	0
ENSG00000276081	0	0	0
ENSG00000283749	1	1	0
ENSG00000284247	0	0	0
ENSG00000284202	0	0	0
ENSG00000198974	2723	2723	0
ENSG00000207962	4	4	0

Solution 2: Update miRNA Annotation File from Gencode

Error occured because fasta and full GTF annotation file used for creating star index had different naming conventions for chromosome. Fasta file did not have "chr" as prefix of chromosome. To comply with this the miRNA annotaiton file could be updated.

sed 's/^chr//g' ENCFF628BVT.gtf > ENCFF628BVT_rename.gtf 

Original:

chr1	ENSEMBL	gene	17369	17436	.	-	.	gene_id "ENSG00000278267.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; level 3;
chr1	ENSEMBL	transcript	17369	17436	.	-	.	gene_id "ENSG00000278267.1"; transcript_id "ENST00000619216.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; transcript_type "miRNA"; transcript_status "KNOWN"; transcript_name "MIR6859-1-201"; level 3; tag "basic"; transcript_support_level "NA";

Updated:

1	ENSEMBL	gene	17369	17436	.	-	.	gene_id "ENSG00000278267.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; level 3;
1	ENSEMBL	transcript	17369	17436	.	-	.	gene_id "ENSG00000278267.1"; transcript_id "ENST00000619216.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; transcript_type "miRNA"; transcript_status "KNOWN"; transcript_name "MIR6859-1-201"; level 3; tag "basic"; transcript_support_level "NA";

Reran star_align, change --sjdbGTFfile star_index/ENCFF628BVT_rename.gtf

SAMPLE=ACBW0PANXX_L1
INPUT_FASTQ=/DCEG/Projects/Exome/SequencingData/DAATeam/Xin/ad_hoc/MyeloidSpike95/Illumina/HiSeq/PostRun_Analysis/Data/180329_D00620_0114_ACBW0PANXX/CASAVA/L1/Undetermined_S0_L001_R1_001.fastq.gz
GTF_ANNOTATION=star_index/ENCFF628BVT_rename.gtf

sbatch star_align.sh $SAMPLE $INPUT_FASTQ $GTF_ANNOTATION

Results

N_unmapped	10125126	10125126	10125126
N_multimapping	424700	424700	424700
N_noFeature	113772	113891	534988
N_ambiguous	0	0	0
ENSG00000278267.1	0	0	0
ENSG00000274890.1	0	0	0
ENSG00000273874.1	0	0	0
ENSG00000275135.1	0	0	0
ENSG00000276171.1	0	0	0
ENSG00000278791.1	0	0	0
ENSG00000277294.1	0	0	0
ENSG00000207730.2	2517	2517	0
ENSG00000207607.2	2749	2749	0
ENSG00000198976.1	361	361	0

@komaljain3 komaljain3 changed the title STAR index creation with parameter --sjdbOverhang 1 STAR index creation and mapping test Jun 12, 2023
@komaljain3
Copy link
Collaborator Author

STAR mapper was tested with 2 different outputs from cutadapt using 3 merged subjects (each subject has 2 replicates) :

  1. When had the -q option enabled (i.e. the reads are not trimmed from the ends) and other parameters stayed the same as finalized in the command of Comparison of vendor provided cutadapt command with the pipeline command #5 .

cutadapt -a file:${ADAPTERS} -m 15 -M 31 -O 5
--too-short-output=trimmed/${SAMPLE}_too_short.fastq.gz
--too-long-output=trimmed/${SAMPLE}_too_long.fastq.gz
-q 10,10

The reports for this test are present here:

/DCEG/CGF/Research/RD168_Chernobyl_TN-Pairs/ANALYSIS_miR/2023-05-30-miRNA-pipeline-test/cutadapt_trim_test/workflow_run/Gencode_microRNA-seq/star_align/log/star_align_multiqc_report.html

The results from MultiQC are shown here:

Screenshot 2023-06-13 at 11 07 30 AM

Overall, the uniquely mapped reads are ~50% ranging from 6-11 million reads. This is acceptable for short reads based on the review papers.

https://www.sciencedirect.com/science/article/pii/S266591312030131X

  1. The second test after removing -q and -M (upper limit on read length to filter) is underway.

cutadapt -a file:${ADAPTERS} -m 15 -O 5
--too-short-output=trimmed/${SAMPLE}_too_short.fastq.gz
--too-long-output=trimmed/${SAMPLE}_too_long.fastq.gz

The results will be posted in this comment later

@bentjordan
Copy link
Collaborator

bentjordan commented Jun 14, 2023

Results of cutadapt after removing -q and -M

cutadapt -a file:${ADAPTERS} -m 15 -O 5
--too-short-output=trimmed/${SAMPLE}_too_short.fastq.gz
--too-long-output=trimmed/${SAMPLE}_too_long.fastq.gz

The percent of uniquely mapped reads range from 25% to 36% and contain between 7 to 12 million reads

Screenshot 2023-06-14 at 12 15 46 PM

percentage
Screenshot 2023-06-14 at 12 20 20 PM

@komaljain3
Copy link
Collaborator Author

After removing -q option for end trimming and -M for long read removal, the number of mapped reads increased slightly. However, -q option will generate higher quality reads and the difference in not significant enough to remove the -q option. We should stick to -q and -M for STAR mapping. The number of mapped reads are acceptable based on the old results and the review paper (attached).
Screenshot 2023-06-14 at 8 42 01 PM
1-s2.0-S266591312030131X-main.pdf
QC sequencing metrics_RD168_miR_ALL_v1.xlsx

@komaljain3
Copy link
Collaborator Author

Update on -M option for cutadapt and -outFilterMismatchNmax 1 option for STAR align:

We decided not to put upper length filter (-M 31) on long read filtering by cutadapt for many reasons. The old CGR pipeline uses -M 31 filter but the Encode pipeline's cutadapt command doesn't.

  1. We tested our new GenCode_microRNA-seq pipeline using cutadapt -M 31 (remove reads longer than 31 bp) and without -M 31 using MiSeq QC data and the read count matrix is almost identical in term of read counts.

  2. We also looked at the miRNA lengths in the GTF file. Even though most papers suggest a length of 22-25 bp, the length in the Encode GTF file is longer (col 6)

Screenshot 2023-06-26 at 12 21 37 PM
  1. Most importantly, the parameter -outFilterMismatchNmax 1 in STAR alignment step restricts alignment to only mismatch. So, reads with more than 1 mismatches to the reference will not be mapped. Instead of filtering the reads upfront, we let the mapper decide whether the read is mappable or not. Another good thing about this parameter is that the behavior resembles the Bowtie1 mapper which is used by mirDeep, a popular tool for miRNAs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants