Simulate test sets using Arabidopsis genome and chloroplast sequence #6

Open
greatfireball opened this issue Jan 15, 2018 · 5 comments

@greatfireball
Member

Generate test sets to evaluate assembler performance.

To do so, use the Arabidopsis nuclear genome and chloroplast sequences from GenBank and simulate short-read libraries with the following characteristics:

  1. Read length (100, 150, 250 bp)
  2. Insert size (overlapping by 50 %, overlapping by 10 %, 100, 200, 500 bp)
  3. Different ratios of genomic DNA to chloroplast DNA (500:1, 200:1, 100:1, 50:1, 10:1, 1:1, 1:10, 1:50, 1:100, 1:200, 1:500)
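
A minimal sketch of how the parameter grid above could be enumerated. The interpretation of the insert-size settings is an assumption, not something stated in this issue: "overlapping by X %" is read as a fraction of the read length, and the fixed values (100, 200, 500 bp) are read as the unsequenced gap between the two mates. The names `fragment_length` and `grid` are hypothetical.

```python
from itertools import product

READ_LENGTHS = [100, 150, 250]  # bp, from the list above
# Two overlap settings and three fixed insert sizes in bp, from the list above.
INSERT_SETTINGS = ["overlap50", "overlap10", 100, 200, 500]
# Ratios as (nuclear genome copies, chloroplast copies).
RATIOS = [(500, 1), (200, 1), (100, 1), (50, 1), (10, 1), (1, 1),
          (1, 10), (1, 50), (1, 100), (1, 200), (1, 500)]


def fragment_length(read_len, setting):
    """Return the fragment length under the ASSUMED interpretations."""
    if setting == "overlap50":
        # ASSUMPTION: mates overlap by 50 % of the read length.
        return 2 * read_len - read_len // 2
    if setting == "overlap10":
        # ASSUMPTION: mates overlap by 10 % of the read length.
        return 2 * read_len - read_len // 10
    # ASSUMPTION: fixed values are the gap between the two mates.
    return 2 * read_len + setting


grid = [
    {"read_len": rl, "insert": ins,
     "fragment_len": fragment_length(rl, ins), "ratio": ratio}
    for rl, ins, ratio in product(READ_LENGTHS, INSERT_SETTINGS, RATIOS)
]
print(len(grid), "parameter combinations")
```

If the fixed insert sizes are instead meant as total fragment lengths, only the last line of `fragment_length` would need to change.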

Take the circular topology of the chloroplast genome into account!
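
One common workaround for simulators that treat every reference as linear (an assumption about the simulator, and not necessarily what was done for this project) is to append the start of the sequence to its end, so that fragments can span the artificial junction. A sketch, with the hypothetical helper `extend_circular`:

```python
def extend_circular(seq, max_fragment_len):
    """Append the first (max_fragment_len - 1) bases to the end of seq.

    With this overhang, a fragment of length max_fragment_len can start at
    every position of the original circle exactly once. Shorter fragments
    get slightly more start positions inside the overhang, so coverage is
    only approximately uniform when fragment lengths vary.
    """
    if not seq:
        raise ValueError("sequence must not be empty")
    if max_fragment_len < 1:
        raise ValueError("max_fragment_len must be at least 1")
    overhang = max_fragment_len - 1
    # Wrap around more than once if the overhang exceeds the sequence length.
    full_copies, rest = divmod(overhang, len(seq))
    return seq + seq * full_copies + seq[:rest]


print(extend_circular("ACGTAC", 4))
```

Reads that come from the appended overhang correspond to positions at the start of the original sequence, which matters when mapping them back.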

Use simulation software that supports a random seed to ensure reproducibility. Maybe this paper gives some ideas about which tool to use.
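
As an illustration of the seeding requirement: ART's `art_illumina`, the simulator used in the commands later in this thread, accepts a seed through its `-rs` option. The sketch below only assembles such a command line and does not run it. The function name `art_command`, the file name `reference.fa`, the output prefix, and the seed value are placeholders, not values from this project.

```python
import shlex


def art_command(reference, read_len, fold_coverage, prefix,
                mean_frag, sd_frag, seed):
    """Build a paired-end art_illumina call with a fixed random seed."""
    return [
        "art_illumina", "-p",
        "-rs", str(seed),          # fixed seed for reproducible output
        "-i", reference,
        "-l", str(read_len),
        "-f", str(fold_coverage),
        "-o", prefix,
        "-m", str(mean_frag),
        "-s", str(sd_frag),
    ]


# Placeholder arguments; adapt to the actual input files.
cmd = art_command("reference.fa", 150, 100, "sim_seed42", 500, 150, seed=42)
print(shlex.join(cmd))
```

Keeping the command as a list of arguments makes it easy to pass to `subprocess.run` later and to log alongside the seed.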

@PfaffS

PfaffS commented Jan 26, 2018

Available Data from F1:

Simulating Data with ART: http://www.niehs.nih.gov/research/resources/software/biostatistics/art/
Command:
art_illumina [options] -i <INPUT_SEQ_FILE> -l <READ_LEN> -f <FOLD_COVERAGE> -o <OUTPUT_FILE_PREFIX> -m <MEAN_FRAG_LEN> -s <STD_DEV>

   1:1 : ./art_illumina -p -i sequence-arabidopsis-thaliana-kern-chl.fa -l 150 -f 16 -o a_thaliana_1_1_sim -m 500 -s 150

(only 16x coverage, because the chloroplast and A. thaliana reads were used 6x)

   1:10 : ./art_illumina -p -i sequence-arabidopsis-thaliana-kern-chl-1zu10.fa -l 150 -f 100 -o a_thaliana_1_10_sim -m 500 -s 150

   1:100 :  ./art_illumina -p -i sequence-arabidopsis-thaliana-kern-chl-1zu100.fa -l 150 -f 100 -o a_thaliana_1_100_sim -m 500 -s 150

   1:1000 :  ./art_illumina -p -i sequence-arabidopsis-thaliana-kern-chl-1zu1000.fa -l 150 -f 100 -o a_thaliana_1_1000_sim -m 500 -s 150

  chl-only: ../program/art_bin_MountRainier/art_illumina -p -i sequence-arabidopsis-thaliana-chl.txt -l 150 -f 100 -o a_thaliana_chl_only_sim -m 500 -s 150
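
The file names above suggest one mixed reference per ratio. How those files were built is not stated here; one possible approach, shown purely as an assumption, is to write the chloroplast record several times next to a single copy of the nuclear genome, so that uniform fold coverage over all records yields the desired copy ratio. The function `mixed_reference` and the record names are hypothetical.

```python
def mixed_reference(nuclear, chloroplast, nuclear_copies=1, chloroplast_copies=1):
    """Return FASTA text with each record repeated the given number of times.

    nuclear, chloroplast: dicts mapping record name -> sequence.
    ASSUMPTION: the simulator applies the same fold coverage to every record,
    so repeating a record N times multiplies its share of reads by N.
    """
    lines = []
    for records, copies in ((nuclear, nuclear_copies),
                            (chloroplast, chloroplast_copies)):
        for copy in range(1, copies + 1):
            for name, seq in records.items():
                # Unique header per copy so simulated reads stay traceable.
                lines.append(f">{name}_copy{copy}")
                lines.append(seq)
    return "\n".join(lines) + "\n"


# Toy sequences only; a 1:3 nuclear-to-chloroplast copy ratio.
fasta = mixed_reference({"chr1": "ACGTACGT"}, {"chl": "GGCC"}, 1, 3)
print(fasta, end="")
```

For the real genomes the sequences would be streamed to disk rather than held in one string, but the record layout would be the same.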

@greatfireball
Member Author

I think we should try chloroplasts as contamination as well.
I would suggest the following genome-to-chloroplast ratios:
10:1
100:1
1000:1
This setting might simulate extracted nuclei with a little bit of chloroplast contamination.

Opinions @PfaffS @iimog ?

@iimog
Member

iimog commented Feb 20, 2018

I don't hate the idea. However, one thing to consider is that if we target 200x chloroplast coverage, the last dataset (1000:1) would require a genomic coverage of 200,000x.
I don't think that is feasible or realistic. Even if we want to attempt assembly at 20x chloroplast coverage, I can't (currently) imagine a genome sequenced to 20,000x coverage.
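
The arithmetic behind these numbers, as a small sketch under the copy-number definition of the ratio (one genome copy per chloroplast copy at 1:1). The function names are hypothetical.

```python
def required_genome_coverage(chloroplast_coverage, genome_copies, chloroplast_copies):
    """Genome coverage needed to reach a given chloroplast coverage."""
    return chloroplast_coverage * genome_copies / chloroplast_copies


def resulting_chloroplast_coverage(genome_coverage, genome_copies, chloroplast_copies):
    """Chloroplast coverage obtained from a given genome coverage."""
    return genome_coverage * chloroplast_copies / genome_copies


# 1000:1 genome to chloroplast, at the two target coverages discussed above.
for target in (200, 20):
    print(target, "x chloroplast needs",
          required_genome_coverage(target, 1000, 1), "x genome")
```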

@greatfireball
Member Author

100% agree, but I would like to know what happens if chloroplast sequences are only rarely present in the input. Wrong assemblies? Error messages? Anything else?

Nevertheless, we define a ratio of 1:1 as one complete host genome to one complete chloroplast genome. Another definition is also possible: there, 1:1 means that for every read from the host genome there is one read from the chloroplast genome. (I just wanted to state that here so that we remember our definition later.) :)
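
To make the difference between the two definitions concrete, here is a sketch that converts a copy ratio into the fraction of reads coming from the chloroplast. It assumes uniform coverage and equal read lengths. The sequence lengths in the example are approximate values for A. thaliana (about 119 Mb nuclear, 154,478 bp chloroplast) and are assumptions for illustration, not numbers taken from this issue.

```python
def chloroplast_read_fraction(genome_copies, chloroplast_copies,
                              genome_len, chloroplast_len):
    """Fraction of reads from the chloroplast for a given copy ratio."""
    chloroplast_bases = chloroplast_copies * chloroplast_len
    total_bases = genome_copies * genome_len + chloroplast_bases
    return chloroplast_bases / total_bases


def chloroplast_copies_for_equal_reads(genome_len, chloroplast_len):
    """Chloroplast copies per genome copy that give a 1:1 read ratio."""
    return genome_len / chloroplast_len


# ASSUMED approximate lengths in bp, for illustration only.
NUCLEAR_LEN = 119_000_000
CHLOROPLAST_LEN = 154_478

fraction = chloroplast_read_fraction(1, 1, NUCLEAR_LEN, CHLOROPLAST_LEN)
print(f"copy ratio 1:1 gives {fraction:.4%} chloroplast reads")
print("copies for a 1:1 read ratio:",
      round(chloroplast_copies_for_equal_reads(NUCLEAR_LEN, CHLOROPLAST_LEN)))
```

Under these assumed lengths, a 1:1 copy ratio corresponds to only about 0.13 % chloroplast reads, so the two definitions differ by several hundredfold.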

@iimog
Member

iimog commented Feb 22, 2018

Yeah, good point. I'd suggest we first try it with 10:1 then. We could use a genome covered 500x (so the chloroplast will be covered 50x). With default parameters, I expect ChloroExtractor to fail when it tries to scale reads to 200x coverage. We can then re-run ChloroExtractor with a target coverage of 40x to see what happens. I'm also curious how the other tools behave.
