Simulate test sets using Arabidopsis genome and chloroplast sequence #6

greatfireball · 2018-01-15T06:23:34Z

Generate test sets to evaluate assembler performance.

To do so, take the Arabidopsis genome and chloroplast sequences from GenBank and simulate short-read libraries with the following characteristics:

Read length (100, 150, 250 bp)
Insert size (overlapping by 50 %, overlapping by 10 %, 100, 200, 500 bp)
Different ratios of genomic DNA to chloroplast DNA (500:1, 200:1, 100:1, 50:1, 10:1, 1:1, 1:10, 1:50, 1:100, 1:200, 1:500)
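
The combinations listed above can be enumerated programmatically. What follows is only a minimal sketch under two assumptions that this issue does not state: "overlapping by X %" is read as the two mates overlapping by X % of the read length, and the fixed values (100, 200, 500 bp) are read as the gap between the two mates. The function names are hypothetical.

```python
from itertools import product

READ_LENGTHS = [100, 150, 250]  # bp, from the list above

# (kind, value): "overlap" is a fraction of the read length, "gap" is in bp.
# Both interpretations are assumptions made for this sketch.
INSERT_SPECS = [("overlap", 0.5), ("overlap", 0.1),
                ("gap", 100), ("gap", 200), ("gap", 500)]

# (genome, chloroplast) ratios from the list above
RATIOS = [(500, 1), (200, 1), (100, 1), (50, 1), (10, 1), (1, 1),
          (1, 10), (1, 50), (1, 100), (1, 200), (1, 500)]


def fragment_length(read_len, spec):
    """Fragment length implied by a read length and an insert specification."""
    kind, value = spec
    if kind == "overlap":
        # Two mates that share `value` of a read length.
        return 2 * read_len - round(read_len * value)
    # Two mates separated by a gap of `value` bp.
    return 2 * read_len + value


def parameter_grid():
    """Yield one dict per simulated library."""
    for read_len, spec, ratio in product(READ_LENGTHS, INSERT_SPECS, RATIOS):
        yield {
            "read_len": read_len,
            "fragment_len": fragment_length(read_len, spec),
            "genome_to_chloroplast": ratio,
        }


grid = list(parameter_grid())
print(len(grid), "libraries")
```

If the fixed values are meant as total fragment lengths instead, only the last line of `fragment_length` would change, but then 100 bp would be shorter than a 150 bp or 250 bp read.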

Take care of the circular sequence of the chloroplast genome!
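
One common way to deal with a circular genome when the simulator expects a linear reference is to append the first bases of the sequence to its end, so that fragments spanning the artificial break point can still be drawn. The sketch below assumes a plain string as input; `extend_circular` is a hypothetical helper, and the choice of overhang (maximum fragment length minus one) is an assumption. Coverage around the junction is only approximately uniform with this trick.

```python
def extend_circular(seq: str, max_fragment_len: int) -> str:
    """Return `seq` with its first (max_fragment_len - 1) bases appended.

    A fragment starting at the last base of the original sequence can then
    still reach its full length by running into the appended copy.
    """
    if max_fragment_len < 1:
        raise ValueError("max_fragment_len must be at least 1")
    overhang = min(max_fragment_len - 1, len(seq))
    return seq + seq[:overhang]


toy = "ACGTACGTTT"
print(extend_circular(toy, 4))
```

Reads from the appended region map back to the start of the original sequence, so no new sequence content is introduced.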

Use simulation software that accepts a random seed to ensure reproducibility. Maybe this paper gives some ideas about which tool to use.
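
As an illustration of the seed requirement: `art_illumina` (the simulator used further down in this thread) has, to my knowledge, an `-rs` option for the random seed; please check `art_illumina --help` for your version. The sketch below only builds the argument list and does not run anything. The file names, the seed value and the helper name `art_command` are placeholders, and the default values are simply copied from the commands in this thread.

```python
import shlex


def art_command(reference, prefix, read_len=150, coverage=100,
                mean_frag=500, sd_frag=150, seed=42, binary="art_illumina"):
    """Build a paired-end art_illumina call with a fixed random seed."""
    return [
        binary, "-p",
        "-rs", str(seed),        # fixed seed, so a re-run gives the same reads
        "-i", reference,
        "-l", str(read_len),
        "-f", str(coverage),
        "-m", str(mean_frag),
        "-s", str(sd_frag),
        "-o", prefix,
    ]


cmd = art_command("reference.fa", "sim_seed42")  # hypothetical file names
print(shlex.join(cmd))
```

Storing the seed in the output prefix, as in the placeholder above, is one simple way to keep track of which seed produced which library.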

PfaffS · 2018-01-26T12:24:13Z

Available Data from F1:

Simulating Data with ART: http://www.niehs.nih.gov/research/resources/software/biostatistics/art/
Command:
art_illumina [options] -i <INPUT_SEQ_FILE> -l <READ_LEN> -f <FOLD_COVERAGE> -o <OUTPUT_FILE_PREFIX> -m <MEAN_FRAG_LEN> -s <STD_DEV>

   1:1 : ./art_illumina -p -i sequence-arabidopsis-thaliana-kern-chl.fa -l 150 -f 16 -o a_thaliana_1_1_sim -m 500 -s 150

(only 16x coverage, because the reads of the chloroplast and A. thaliana were used 6x)

   1:10 : ./art_illumina -p -i sequence-arabidopsis-thaliana-kern-chl-1zu10.fa -l 150 -f 100 -o a_thaliana_1_10_sim -m 500 -s 150

   1:100 :  ./art_illumina -p -i sequence-arabidopsis-thaliana-kern-chl-1zu100.fa -l 150 -f 100 -o a_thaliana_1_100_sim -m 500 -s 150

   1:1000 :  ./art_illumina -p -i sequence-arabidopsis-thaliana-kern-chl-1zu1000.fa -l 150 -f 100 -o a_thaliana_1_1000_sim -m 500 -s 150

  chl-only: ../program/art_bin_MountRainier/art_illumina -p -i sequence-arabidopsis-thaliana-chl.txt -l 150 -f 100 -o a_thaliana_chl_only_sim -m 500 -s 150
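
The file names above (e.g. `...-1zu10.fa`) suggest that each ratio was realised as a single FASTA file containing both sequences, but how those files were built is not stated here. One possible way, shown purely as a sketch: repeat the chloroplast record n times under distinct headers, so that a uniform fold coverage over the whole file gives the chloroplast n times the coverage of the genome. `mixed_reference` and `to_fasta` are hypothetical helpers and the toy sequences are made up.

```python
def mixed_reference(nuclear, chloroplast, chloroplast_copies):
    """Combine records; both inputs are lists of (header, sequence) tuples."""
    records = list(nuclear)
    for copy in range(1, chloroplast_copies + 1):
        for header, seq in chloroplast:
            # Distinct headers keep the FASTA valid for tools that index by name.
            records.append((f"{header}_copy{copy}", seq))
    return records


def to_fasta(records, width=60):
    """Render (header, sequence) records as FASTA text."""
    lines = []
    for header, seq in records:
        lines.append(f">{header}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


toy_nuclear = [("chr1", "ACGT" * 5)]      # made-up toy data
toy_chloroplast = [("chl", "TTGCA" * 2)]  # made-up toy data
records = mixed_reference(toy_nuclear, toy_chloroplast, 10)
print(to_fasta(records))
```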

greatfireball · 2018-02-20T13:34:08Z

I think we should try chloroplasts as contamination as well.
I would suggest genome-to-chloroplast ratios of
10:1
100:1
1000:1
This setting might simulate extracted nuclei with a little bit of chloroplast contamination.

Opinions @PfaffS @iimog ?

iimog · 2018-02-20T13:49:50Z

I don't hate the idea. However, one thing to consider is that if we target 200x chloroplast coverage, the last dataset (1000:1) would require a genomic coverage of 200,000x.
I don't think that is feasible or realistic. Even if we want to attempt assembly at 20x chloroplast coverage I can't (currently) imagine a genome sequenced to 20,000x coverage.
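
The arithmetic behind these numbers, as a small sketch. It assumes the copy-number definition of the ratio, where coverage scales linearly with the number of copies; the function name is hypothetical.

```python
def required_genome_coverage(chloroplast_coverage, genome_copies,
                             chloroplast_copies=1):
    """Genome coverage needed to reach a given chloroplast coverage."""
    return chloroplast_coverage * genome_copies / chloroplast_copies


for target in (200, 20):
    needed = required_genome_coverage(target, genome_copies=1000)
    print(f"{target}x chloroplast at 1000:1 needs {needed:,.0f}x genome")
```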

greatfireball · 2018-02-20T14:27:22Z

100% agree, but I would like to know what will happen if we only provide rare chloroplast sequences. Wrong assemblies? Error messages? Anything else?

Note that we define a 1:1 ratio as one complete host genome to one complete chloroplast genome. Another definition is also possible: 1:1 would then mean that for every read from the host genome there is one read from the chloroplast genome. (I just wanted to state that here so that we remember our definition later.) :)
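
To make the difference between the two definitions concrete, here is a rough sketch that converts a copy-based ratio into the fraction of reads that come from the chloroplast. The sequence lengths are approximate values assumed for illustration (roughly 119 Mbp for the genome and roughly 154 kbp for the chloroplast), and reads are assumed to be sampled uniformly over all bases.

```python
NUCLEAR_BP = 119_000_000   # assumed approximate genome length
CHLOROPLAST_BP = 154_000   # assumed approximate chloroplast length


def chloroplast_read_fraction(genome_copies, chloroplast_copies,
                              nuclear_bp=NUCLEAR_BP,
                              chloroplast_bp=CHLOROPLAST_BP):
    """Fraction of reads from the chloroplast under uniform sampling."""
    nuclear_bases = genome_copies * nuclear_bp
    chloroplast_bases = chloroplast_copies * chloroplast_bp
    return chloroplast_bases / (nuclear_bases + chloroplast_bases)


# Copy-based 1:1: only a small share of the reads is chloroplast.
print(f"{chloroplast_read_fraction(1, 1):.4%}")

# Chloroplast copies per genome copy needed for a read-based 1:1.
copies_for_equal_reads = NUCLEAR_BP / CHLOROPLAST_BP
print(round(copies_for_equal_reads))
```

With these assumed lengths, a read-based 1:1 corresponds to several hundred chloroplast copies per genome copy, so the two definitions are far apart.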

iimog · 2018-02-22T08:33:32Z

Yeah, good point. I'd suggest we first try it with 10:1 then. We could use a genome covered 500x (so the chloroplast will be covered 50x). With default parameters I expect ChloroExtractor to fail when it tries to scale reads to 200x coverage. We can then re-run ChloroExtractor with a target coverage of 40x to see what happens. I'm also curious how the other tools behave.
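
The numbers in this proposal can be checked with a few lines. This is generic subsampling arithmetic and not ChloroExtractor's actual implementation; the function name is hypothetical.

```python
def subsample_fraction(available_coverage, target_coverage):
    """Fraction of reads to keep, or None if the target cannot be reached."""
    if available_coverage <= 0:
        raise ValueError("available_coverage must be positive")
    fraction = target_coverage / available_coverage
    # Subsampling can only remove reads, so a fraction above 1 is not possible.
    return fraction if fraction <= 1 else None


chloroplast_coverage = 500 / 10  # 500x genome at a 10:1 ratio
print(subsample_fraction(chloroplast_coverage, 200))  # target above what is available
print(subsample_fraction(chloroplast_coverage, 40))   # target below what is available
```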

greatfireball assigned PfaffS Jan 15, 2018

greatfireball mentioned this issue Jan 15, 2018

Evaluation of assembly parameters for different genome/chloroplast ratios #4

Closed
