STAR index creation and mapping test #1

komaljain3 · 2023-06-07T19:21:59Z

The command for creating STAR index is:

STAR \
	--runThreadN $SLURM_CPUS_PER_TASK \
	--runMode genomeGenerate \
	--genomeDir star_index \
	--sjdbGTFfile /DCEG/CGF/Research/RD168_Chernobyl_TN-Pairs/ANALYSIS_miR/2023-05-30-miRNA-pipeline-test/ref/Homo_sapiens.GRCh38.109.gtf \
	--sjdbOverhang 1 \
	--limitGenomeGenerateRAM 180000000000 \
 	--genomeFastaFiles /DCEG/CGF/Research/RD168_Chernobyl_TN-Pairs/ANALYSIS_miR/2023-05-30-miRNA-pipeline-test/ref/Homo_sapiens.GRCh38.dna_sm.toplevel.fa

From Star Manual:

--sjdbOverhang default: 100

int>0: length of the donor/acceptor sequence on each side of the junctions, ideally = (mate length - 1)

From Google search

The --sjdbOverhang is used only at the genome generation step and tells STAR how many bases to concatenate from donor and acceptor sides of the junctions. If you have 100b reads, the ideal value of --sjdbOverhang is 99, which allows the 100b read to map 99b on one side, 1b on the other side. One can think of --sjdbOverhang as the maximum possible overhang for your reads.
On the other hand, --alignSJDBoverhangMin is used at the mapping step to define the minimum allowed overhang over splice junctions. For example, the default value of 3 would prohibit overhangs of 1b or 2b

Alex Dobin response in a Google Group

https://groups.google.com/g/rna-star/c/RBWvAGFooMU

Hi Eugene,

--sjdbOverhang 1 is a hack to prohibit splicing over annotated junctions in the GTF, but still use the GTF for counting reads over genes.
If you do not need counting over genes in the GTF file, you can omit the --sjdbGTFfile and --sjdbOverhang parameters altogether.

Cheers
Alex

The text was updated successfully, but these errors were encountered:

komaljain3 · 2023-06-07T20:22:56Z

Using the Ensembl genome fasta the headers look like this:

jaink4@nci-cgr:/DCEG/CGF/Research/RD168_Chernobyl_TN-Pairs/ANALYSIS_miR/2023-05-30-miRNA-pipeline-test/ref$ grep ">" Homo_sapiens.GRCh38.dna_sm.toplevel.fa
>1 dna_sm:chromosome chromosome:GRCh38:1:1:248956422:1 REF
>2 dna_sm:chromosome chromosome:GRCh38:2:1:242193529:1 REF
>3 dna_sm:chromosome chromosome:GRCh38:3:1:198295559:1 REF
>4 dna_sm:chromosome chromosome:GRCh38:4:1:190214555:1 REF
>5 dna_sm:chromosome chromosome:GRCh38:5:1:181538259:1 REF
>6 dna_sm:chromosome chromosome:GRCh38:6:1:170805979:1 REF
>7 dna_sm:chromosome chromosome:GRCh38:7:1:159345973:1 REF
>8 dna_sm:chromosome chromosome:GRCh38:8:1:145138636:1 REF

The gtf from Encode has the format:

jaink4@nci-cgr:/DCEG/CGF/Research/RD168_Chernobyl_TN-Pairs/ANALYSIS_miR/2023-05-30-miRNA-pipeline-test/ref$ head ENCFF628BVT.gtf
chr1	ENSEMBL	gene	17369	17436	.	-	.	gene_id "ENSG00000278267.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; level 3;
chr1	ENSEMBL	transcript	17369	17436	.	-	.	gene_id "ENSG00000278267.1"; transcript_id "ENST00000619216.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; transcript_type "miRNA"; transcript_status "KNOWN"; transcript_name "MIR6859-1-201"; level 3; tag "basic"; transcript_support_level "NA";
chr1	ENSEMBL	exon	17369	17436	.	-	.	gene_id "ENSG00000278267.1"; transcript_id "ENST00000619216.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; transcript_type "miRNA"; transcript_status "KNOWN"; transcript_name "MIR6859-1-201"; exon_number 1; exon_id "ENSE00003746039.1"; level 3; tag "basic"; transcript_support_level "NA";
chr1	ENSEMBL	gene	30366	30503	.	+	.	gene_id "ENSG00000274890.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR1302-2"; level 3;
chr1	ENSEMBL	transcript	30366	30503	.	+	.	gene_id "ENSG00000274890.1"; transcript_id "ENST00000607096.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR1302-2"; transcript_type "miRNA"; transcript_status "KNOWN"; transcript_name "MIR1302-2-201"; level 3; tag "basic"; transcript_support_level "NA";

When the gtf from ENCODE is provided to the Ensembl fasta, due to the naming incompatibility, there is an error in STAR index generation.

Failed Results

Fatal INPUT FILE error, no valid exon lines in the GTF file: star_index/ENCFF628BVT.gtf
Solution: check the formatting of the GTF file. One likely cause is the difference in chromosome naming between GTF and FASTA file.

May 31 10:07:40 ...... FATAL ERROR, exiting

Solution 1: Generate new miRNA GTF file by subsetting full GTF from Ensembl

cat Homo_sapiens.GRCh38.109.gtf | grep miRNA > Homo_sapiens.GRCh38.109.miRNA.gtf

Reran star_align, change --sjdbGTFfile

SAMPLE=ACBW0PANXX_L1
INPUT_FASTQ=/DCEG/Projects/Exome/SequencingData/DAATeam/Xin/ad_hoc/MyeloidSpike95/Illumina/HiSeq/PostRun_Analysis/Data/180329_D00620_0114_ACBW0PANXX/CASAVA/L1/Undetermined_S0_L001_R1_001.fastq.gz
GTF_ANNOTATION=Homo_sapiens.GRCh38.109.miRNA.gtf

sbatch star_align.sh $SAMPLE $INPUT_FASTQ $GTF_ANNOTATION

Results

Runtime: 25 minutes

ACBW0PANXX_L1/
├── ACBW0PANXX_L1Aligned.sortedByCoord.out.bam
├── ACBW0PANXX_L1Aligned.toTranscriptome.out.bam
├── ACBW0PANXX_L1Log.final.out
├── ACBW0PANXX_L1Log.out
├── ACBW0PANXX_L1Log.progress.out
├── ACBW0PANXX_L1ReadsPerGene.out.tab
├── ACBW0PANXX_L1SJ.out.tab
├── ACBW0PANXX_L1Signal.Unique.str1.out.wig
├── ACBW0PANXX_L1Signal.Unique.str2.out.wig
├── ACBW0PANXX_L1Signal.UniqueMultiple.str1.out.wig
├── ACBW0PANXX_L1Signal.UniqueMultiple.str2.out.wig
├── ACBW0PANXX_L1Unmapped.out.mate1
├── ACBW0PANXX_L1_STARgenome
│   ├── exonGeTrInfo.tab
│   ├── exonInfo.tab
│   ├── geneInfo.tab
│   ├── sjdbInfo.txt
│   ├── sjdbList.fromGTF.out.tab
│   ├── sjdbList.out.tab
│   └── transcriptInfo.tab

ACBW0PANXX_L1ReadsPerGene.out.tab


N_unmapped	10125126	10125126	10125126
N_multimapping	424700	424700	424700
N_noFeature	115648	115648	532049
N_ambiguous	3058	0	0
ENSG00000273874	0	0	0
ENSG00000264341	0	0	0
ENSG00000264101	0	0	0
ENSG00000265606	0	0	0
ENSG00000284357	63	63	0
ENSG00000207941	0	0	0
ENSG00000264773	0	0	0
ENSG00000264698	0	0	0
ENSG00000265596	0	0	0
ENSG00000283724	0	0	0
ENSG00000284410	2	2	0
ENSG00000283477	0	0	0
ENSG00000264834	0	0	0
ENSG00000263675	0	0	0
ENSG00000266188	0	0	0
ENSG00000265815	0	0	0
ENSG00000263908	0	0	0
ENSG00000283690	0	0	0
ENSG00000276081	0	0	0
ENSG00000283749	1	1	0
ENSG00000284247	0	0	0
ENSG00000284202	0	0	0
ENSG00000198974	2723	2723	0
ENSG00000207962	4	4	0

Solution 2: Update miRNA Annotation File from Gencode

Error occured because fasta and full GTF annotation file used for creating star index had different naming conventions for chromosome. Fasta file did not have "chr" as prefix of chromosome. To comply with this the miRNA annotaiton file could be updated.

sed 's/^chr//g' ENCFF628BVT.gtf > ENCFF628BVT_rename.gtf

Original:

chr1	ENSEMBL	gene	17369	17436	.	-	.	gene_id "ENSG00000278267.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; level 3;
chr1	ENSEMBL	transcript	17369	17436	.	-	.	gene_id "ENSG00000278267.1"; transcript_id "ENST00000619216.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; transcript_type "miRNA"; transcript_status "KNOWN"; transcript_name "MIR6859-1-201"; level 3; tag "basic"; transcript_support_level "NA";

Updated:

1	ENSEMBL	gene	17369	17436	.	-	.	gene_id "ENSG00000278267.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; level 3;
1	ENSEMBL	transcript	17369	17436	.	-	.	gene_id "ENSG00000278267.1"; transcript_id "ENST00000619216.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; transcript_type "miRNA"; transcript_status "KNOWN"; transcript_name "MIR6859-1-201"; level 3; tag "basic"; transcript_support_level "NA";

Reran star_align, change --sjdbGTFfile star_index/ENCFF628BVT_rename.gtf

SAMPLE=ACBW0PANXX_L1
INPUT_FASTQ=/DCEG/Projects/Exome/SequencingData/DAATeam/Xin/ad_hoc/MyeloidSpike95/Illumina/HiSeq/PostRun_Analysis/Data/180329_D00620_0114_ACBW0PANXX/CASAVA/L1/Undetermined_S0_L001_R1_001.fastq.gz
GTF_ANNOTATION=star_index/ENCFF628BVT_rename.gtf

sbatch star_align.sh $SAMPLE $INPUT_FASTQ $GTF_ANNOTATION

Results

N_unmapped	10125126	10125126	10125126
N_multimapping	424700	424700	424700
N_noFeature	113772	113891	534988
N_ambiguous	0	0	0
ENSG00000278267.1	0	0	0
ENSG00000274890.1	0	0	0
ENSG00000273874.1	0	0	0
ENSG00000275135.1	0	0	0
ENSG00000276171.1	0	0	0
ENSG00000278791.1	0	0	0
ENSG00000277294.1	0	0	0
ENSG00000207730.2	2517	2517	0
ENSG00000207607.2	2749	2749	0
ENSG00000198976.1	361	361	0

komaljain3 · 2023-06-13T15:13:06Z

STAR mapper was tested with 2 different outputs from cutadapt using 3 merged subjects (each subject has 2 replicates) :

When had the -q option enabled (i.e. the reads are not trimmed from the ends) and other parameters stayed the same as finalized in the command of Comparison of vendor provided cutadapt command with the pipeline command #5 .

cutadapt -a file:${ADAPTERS} -m 15 -M 31 -O 5
--too-short-output=trimmed/${SAMPLE}_too_short.fastq.gz
--too-long-output=trimmed/${SAMPLE}_too_long.fastq.gz
-q 10,10

The reports for this test are present here:

/DCEG/CGF/Research/RD168_Chernobyl_TN-Pairs/ANALYSIS_miR/2023-05-30-miRNA-pipeline-test/cutadapt_trim_test/workflow_run/Gencode_microRNA-seq/star_align/log/star_align_multiqc_report.html

The results from MultiQC are shown here:

Overall, the uniquely mapped reads are ~50% ranging from 6-11 million reads. This is acceptable for short reads based on the review papers.

https://www.sciencedirect.com/science/article/pii/S266591312030131X

The second test after removing -q and -M (upper limit on read length to filter) is underway.

cutadapt -a file:${ADAPTERS} -m 15 -O 5
--too-short-output=trimmed/${SAMPLE}_too_short.fastq.gz
--too-long-output=trimmed/${SAMPLE}_too_long.fastq.gz

The results will be posted in this comment later

bentjordan · 2023-06-14T16:21:24Z

Results of cutadapt after removing -q and -M

cutadapt -a file:${ADAPTERS} -m 15 -O 5
--too-short-output=trimmed/${SAMPLE}_too_short.fastq.gz
--too-long-output=trimmed/${SAMPLE}_too_long.fastq.gz

The percent of uniquely mapped reads range from 25% to 36% and contain between 7 to 12 million reads

percentage

komaljain3 · 2023-06-15T00:45:13Z

After removing -q option for end trimming and -M for long read removal, the number of mapped reads increased slightly. However, -q option will generate higher quality reads and the difference in not significant enough to remove the -q option. We should stick to -q and -M for STAR mapping. The number of mapped reads are acceptable based on the old results and the review paper (attached).

1-s2.0-S266591312030131X-main.pdf
QC sequencing metrics_RD168_miR_ALL_v1.xlsx

komaljain3 · 2023-06-26T17:10:14Z

Update on -M option for cutadapt and -outFilterMismatchNmax 1 option for STAR align:

We decided not to put upper length filter (-M 31) on long read filtering by cutadapt for many reasons. The old CGR pipeline uses -M 31 filter but the Encode pipeline's cutadapt command doesn't.

We tested our new GenCode_microRNA-seq pipeline using cutadapt -M 31 (remove reads longer than 31 bp) and without -M 31 using MiSeq QC data and the read count matrix is almost identical in term of read counts.
We also looked at the miRNA lengths in the GTF file. Even though most papers suggest a length of 22-25 bp, the length in the Encode GTF file is longer (col 6)

Most importantly, the parameter -outFilterMismatchNmax 1 in STAR alignment step restricts alignment to only mismatch. So, reads with more than 1 mismatches to the reference will not be mapped. Instead of filtering the reads upfront, we let the mapper decide whether the read is mappable or not. Another good thing about this parameter is that the behavior resembles the Bowtie1 mapper which is used by mirDeep, a popular tool for miRNAs.

komaljain3 assigned belyndahicks and bentjordan Jun 7, 2023

komaljain3 changed the title ~~STAR index creation with parameter --sjdbOverhang 1~~ STAR index creation and mapping test Jun 12, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

STAR index creation and mapping test #1

STAR index creation and mapping test #1

komaljain3 commented Jun 7, 2023

komaljain3 commented Jun 7, 2023 •

edited

Loading

komaljain3 commented Jun 13, 2023

bentjordan commented Jun 14, 2023 •

edited

Loading

komaljain3 commented Jun 15, 2023

komaljain3 commented Jun 26, 2023

STAR index creation and mapping test #1

STAR index creation and mapping test #1

Comments

komaljain3 commented Jun 7, 2023

From Star Manual:

From Google search

Alex Dobin response in a Google Group

komaljain3 commented Jun 7, 2023 • edited Loading

Solution 1: Generate new miRNA GTF file by subsetting full GTF from Ensembl

Solution 2: Update miRNA Annotation File from Gencode

komaljain3 commented Jun 13, 2023

bentjordan commented Jun 14, 2023 • edited Loading

Results of cutadapt after removing -q and -M

komaljain3 commented Jun 15, 2023

komaljain3 commented Jun 26, 2023

Update on -M option for cutadapt and -outFilterMismatchNmax 1 option for STAR align:

komaljain3 commented Jun 7, 2023 •

edited

Loading

bentjordan commented Jun 14, 2023 •

edited

Loading