STAR index creation and mapping test #1
Using the Ensembl genome FASTA, the headers look like this:
The GTF from ENCODE has the format:
When the GTF from ENCODE is used with the Ensembl FASTA, STAR index generation fails with an error because of the chromosome naming incompatibility.
Failed Results
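A mismatch like this can be checked before building the index by comparing the sequence names in the FASTA headers with column 1 of the GTF. The following is a minimal sketch, not part of the pipeline in this issue; the function name and the file names in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash
# Print every GTF sequence name (column 1) that has no matching FASTA header.
# A FASTA sequence name is taken as the text after ">" up to the first space or tab.
# Any output means those annotation records cannot be placed on the genome.
gtf_names_missing_from_fasta() {
    awk '
        NR == FNR {
            # First file (FASTA): remember each header name, skip sequence lines.
            if (/^>/) {
                name = substr($0, 2)
                sub(/[ \t].*$/, "", name)
                in_fasta[name] = 1
            }
            next
        }
        /^#/ || NF == 0 { next }   # skip GTF comment lines and blank lines
        {
            split($0, field, "\t")
            if (!(field[1] in in_fasta) && !(field[1] in reported)) {
                print field[1]
                reported[field[1]] = 1
            }
        }
    ' "$1" "$2"
}

# Hypothetical usage:
# gtf_names_missing_from_fasta genome.fa annotation.gtf
```

With an Ensembl FASTA and a GTF that uses "chr" prefixes, this would list names such as chr1; an empty result means every annotated sequence name exists in the genome.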
Solution 1: Generate new miRNA GTF file by subsetting full GTF from Ensembl
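One possible way to produce such a subset is sketched here, assuming the full Ensembl GTF marks miRNA genes with the attribute gene_biotype "miRNA" (the Ensembl convention; GENCODE uses gene_type instead). The function name and the input file name are assumptions; how the subset was actually generated for this issue is not shown.

```shell
#!/usr/bin/env bash
# Keep "#" header lines plus every record whose attribute column (column 9)
# carries gene_biotype "miRNA". This retains the gene, transcript and exon
# records that belong to miRNA genes.
subset_mirna_gtf() {
    awk -F '\t' '/^#/ || $9 ~ /gene_biotype "miRNA"/' "$1"
}

# Hypothetical usage (input name assumed from the output name used below):
# subset_mirna_gtf Homo_sapiens.GRCh38.109.gtf > Homo_sapiens.GRCh38.109.miRNA.gtf
```

Because the subset comes from the same Ensembl release family as the FASTA, the chromosome names already match and no renaming is needed.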
Reran star_align with these variables changed:
SAMPLE=ACBW0PANXX_L1
INPUT_FASTQ=/DCEG/Projects/Exome/SequencingData/DAATeam/Xin/ad_hoc/MyeloidSpike95/Illumina/HiSeq/PostRun_Analysis/Data/180329_D00620_0114_ACBW0PANXX/CASAVA/L1/Undetermined_S0_L001_R1_001.fastq.gz
GTF_ANNOTATION=Homo_sapiens.GRCh38.109.miRNA.gtf
sbatch star_align.sh $SAMPLE $INPUT_FASTQ $GTF_ANNOTATION
Results
Runtime: 25 minutes
ACBW0PANXX_L1ReadsPerGene.out.tab
Solution 2: Update miRNA Annotation File from Gencode
The error occurred because the FASTA and the full GTF annotation file used for creating the STAR index had different naming conventions for chromosomes. The FASTA file did not have "chr" as a chromosome prefix. To comply with this, the miRNA annotation file could be updated.
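The renaming described above can be sketched as follows. This is an illustration, not necessarily the command used here: the function name and the input file name are hypothetical, and the sketch assumes a tab-separated GTF. It also maps chrM to MT, since Ensembl names the mitochondrial sequence MT rather than M.

```shell
#!/usr/bin/env bash
# Rewrite column 1 of a GTF from GENCODE/UCSC style (chr1, chrX, chrM)
# to Ensembl style (1, X, MT). Comment lines are passed through unchanged.
strip_chr_prefix() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ { print; next }
         $1 == "chrM" { $1 = "MT" }
         { sub(/^chr/, "", $1); print }' "$1"
}

# Hypothetical usage (input name assumed from the output name used below):
# strip_chr_prefix ENCFF628BVT.gtf > star_index/ENCFF628BVT_rename.gtf
```

Only the first column is edited, so gene and transcript IDs in the attribute column stay exactly as they are in the Gencode file.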
Original:
chr1 ENSEMBL gene 17369 17436 . - . gene_id "ENSG00000278267.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; level 3;
chr1 ENSEMBL transcript 17369 17436 . - . gene_id "ENSG00000278267.1"; transcript_id "ENST00000619216.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; transcript_type "miRNA"; transcript_status "KNOWN"; transcript_name "MIR6859-1-201"; level 3; tag "basic"; transcript_support_level "NA";
Updated:
1 ENSEMBL gene 17369 17436 . - . gene_id "ENSG00000278267.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; level 3;
1 ENSEMBL transcript 17369 17436 . - . gene_id "ENSG00000278267.1"; transcript_id "ENST00000619216.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; transcript_type "miRNA"; transcript_status "KNOWN"; transcript_name "MIR6859-1-201"; level 3; tag "basic"; transcript_support_level "NA";
Reran star_align with these variables changed:
SAMPLE=ACBW0PANXX_L1
INPUT_FASTQ=/DCEG/Projects/Exome/SequencingData/DAATeam/Xin/ad_hoc/MyeloidSpike95/Illumina/HiSeq/PostRun_Analysis/Data/180329_D00620_0114_ACBW0PANXX/CASAVA/L1/Undetermined_S0_L001_R1_001.fastq.gz
GTF_ANNOTATION=star_index/ENCFF628BVT_rename.gtf
sbatch star_align.sh $SAMPLE $INPUT_FASTQ $GTF_ANNOTATION
Results
N_unmapped 10125126 10125126 10125126
N_multimapping 424700 424700 424700
N_noFeature 113772 113891 534988
N_ambiguous 0 0 0
ENSG00000278267.1 0 0 0
ENSG00000274890.1 0 0 0
ENSG00000273874.1 0 0 0
ENSG00000275135.1 0 0 0
ENSG00000276171.1 0 0 0
ENSG00000278791.1 0 0 0
ENSG00000277294.1 0 0 0
ENSG00000207730.2 2517 2517 0
ENSG00000207607.2 2749 2749 0
ENSG00000198976.1 361 361 0
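In STAR's ReadsPerGene.out.tab, column 2 holds unstranded counts, column 3 counts for read 1 on the same strand as the RNA, and column 4 counts for the reverse strand. A small sketch for totalling the three count columns over all gene rows is given below; the function name is hypothetical and the file is assumed to be tab-separated, as STAR writes it.

```shell
#!/usr/bin/env bash
# Sum the three count columns of a ReadsPerGene.out.tab file over gene rows,
# skipping the N_unmapped / N_multimapping / N_noFeature / N_ambiguous summary rows.
summarise_reads_per_gene() {
    awk -F '\t' '
        $1 !~ /^N_/ { unstranded += $2; forward += $3; reverse += $4 }
        END { printf "unstranded=%d forward=%d reverse=%d\n", unstranded, forward, reverse }
    ' "$1"
}

# Usage:
# summarise_reads_per_gene ACBW0PANXX_L1ReadsPerGene.out.tab
```

For the ten gene rows shown above, this gives unstranded=5627 forward=5627 reverse=0. Column 3 matching column 2 while column 4 stays at zero is what a forward-stranded library would be expected to produce, although ten rows are too few to settle the strandedness of the whole file.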
The STAR mapper was tested with 2 different outputs from cutadapt using 3 merged subjects (each subject has 2 replicates):
cutadapt -a file:${ADAPTERS} -m 15 -M 31 -O 5
The reports for this test are present here: /DCEG/CGF/Research/RD168_Chernobyl_TN-Pairs/ANALYSIS_miR/2023-05-30-miRNA-pipeline-test/cutadapt_trim_test/workflow_run/Gencode_microRNA-seq/star_align/log/star_align_multiqc_report.html
The results from MultiQC are shown here:
Overall, about 50% of reads are uniquely mapped, ranging from 6 to 11 million reads. This is acceptable for short reads based on the review papers. https://www.sciencedirect.com/science/article/pii/S266591312030131X
cutadapt -a file:${ADAPTERS} -m 15 -O 5
The results will be posted in this comment later
After removing the -q option for end trimming and -M for long-read removal, the number of mapped reads increased slightly. However, the -q option yields higher-quality reads, and the difference is not large enough to justify removing it. We should stick to -q and -M for STAR mapping. The number of mapped reads is acceptable based on the old results and the review paper (attached).
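To put numbers on a comparison like this outside of MultiQC, the uniquely mapped percentage can be read directly from each run's STAR Log.final.out. This is a minimal sketch; the function name and the paths in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash
# Print the "Uniquely mapped reads %" value from a STAR Log.final.out file,
# without the percent sign. STAR separates the label and the value with "|".
uniquely_mapped_pct() {
    awk -F '|' '/Uniquely mapped reads %/ { gsub(/[ \t%]/, "", $2); print $2 }' "$1"
}

# Hypothetical usage, one call per trimming setting:
# uniquely_mapped_pct with_q_M/sampleLog.final.out
# uniquely_mapped_pct without_q_M/sampleLog.final.out
```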
The command for creating STAR index is:
From the STAR manual:
--sjdbOverhang default: 100
int>0: length of the donor/acceptor sequence on each side of the junctions, ideally = (mate length - 1)
From Google search
The --sjdbOverhang is used only at the genome generation step and tells STAR how many bases to concatenate from donor and acceptor sides of the junctions. If you have 100b reads, the ideal value of --sjdbOverhang is 99, which allows the 100b read to map 99b on one side, 1b on the other side. One can think of --sjdbOverhang as the maximum possible overhang for your reads.
On the other hand, --alignSJDBoverhangMin is used at the mapping step to define the minimum allowed overhang over splice junctions. For example, the default value of 3 would prohibit overhangs of 1b or 2b.
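Following the "read length - 1" rule quoted above, a candidate --sjdbOverhang value can be derived from the longest read in a trimmed FASTQ. This is a minimal sketch; the function name is hypothetical, and it assumes a gzip-compressed FASTQ with four lines per record.

```shell
#!/usr/bin/env bash
# Print (longest read length - 1) for a gzipped FASTQ file.
# Sequence lines are every 4th line starting at line 2.
suggest_sjdb_overhang() {
    gzip -cd "$1" | awk '
        NR % 4 == 2 { if (length($0) > max) max = length($0) }
        END { if (max > 0) print max - 1 }
    '
}

# Hypothetical usage:
# suggest_sjdb_overhang trimmed_R1.fastq.gz
```

For reads capped at 31 nt by cutadapt -M 31, the rule gives at most 30.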
Alex Dobin response in a Google Group
https://groups.google.com/g/rna-star/c/RBWvAGFooMU
Hi Eugene,
--sjdbOverhang 1 is a hack to prohibit splicing over annotated junctions in the GTF, but still use the GTF for counting reads over genes.
If you do not need counting over genes in the GTF file, you can omit the --sjdbGTFfile and --sjdbOverhang parameters altogether.
Cheers
Alex
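Applied to index generation, the hack from this reply might look like the sketch below. This is not the index command used in this issue, which is not shown here; the function name, the -n dry-run switch, and all paths are hypothetical. Only standard STAR genomeGenerate options are used.

```shell
#!/usr/bin/env bash
# Build a STAR index that keeps the GTF for gene counting but sets
# --sjdbOverhang 1 so that splicing over annotated junctions is prohibited.
# Pass -n as the first argument to print the command instead of running it.
star_genome_generate() {
    local dry_run=0
    if [ "$1" = "-n" ]; then dry_run=1; shift; fi
    local genome_dir="$1" fasta="$2" gtf="$3" threads="${4:-4}"
    # Reuse the function's positional parameters to hold the command line.
    set -- STAR --runMode genomeGenerate \
        --runThreadN "$threads" \
        --genomeDir "$genome_dir" \
        --genomeFastaFiles "$fasta" \
        --sjdbGTFfile "$gtf" \
        --sjdbOverhang 1
    if [ "$dry_run" -eq 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical usage:
# star_genome_generate -n star_index genome.fa mirna.gtf 8
```

The dry-run switch makes it easy to review the exact command before submitting it to the cluster.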