
Commit fa69a5f

author
lshzhang
committed
upload local files
1 parent 7217ca6 commit fa69a5f

25 files changed

+7451
-0
lines changed

ART/FPA_normal.fa

+356
Large diffs are not rendered by default.

ART/FPA_normal_with_mutations.fa

+362
Large diffs are not rendered by default.

ART/FPB_normal.fa

+356
Large diffs are not rendered by default.

ART/FPB_normal_with_mutations.fa

+362
Large diffs are not rendered by default.

ART/README.md

+21
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,21 @@
1+
# ART Folder
2+
3+
example bash script to generate artificial TruSight Tumor 26 fastq files
4+
5+
6+
## FPA_normal.fa and FPB_normal.fa
7+
these are fasta files created from the Trusight manifest files that include
8+
both the probes and the target region sequence together, you can use this as
9+
your reference when using the trusight_tumor_fastq_generator.sh
10+
11+
12+
## FPA_normal_with_mutations.fa and FPB_normal_with_mutations.fa
13+
exactly the same as FPA_normal.fa and FPB_normal.fa, except that the last
14+
three fasta sequences modify EGFR, KIT, or PTEN amplicons to create
15+
the mutations described in the fasta description line
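
To see which mutations a `*_with_mutations.fa` file encodes, the description lines of its last three records can be printed. This is a minimal sketch, assuming every record starts with a standard `>` header line; the function name `last_fasta_headers` is illustrative and not part of this repository.

```bash
# Print the description lines of the last N records in a FASTA file.
# Assumes standard FASTA where every record header starts with '>'.
# N defaults to 3, matching the three modified amplicons described above.
last_fasta_headers() {
    local fasta="$1"
    local count="${2:-3}"
    grep '^>' "$fasta" | tail -n "$count"
}

# Example (run from the ART folder):
# last_fasta_headers FPA_normal_with_mutations.fa
```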
16+
17+
18+
## trusight_tumor_fastq_generator.sh
19+
a simple script file calling the ART program that makes an artificial fastq file
20+
similar to trusight tumor 26, users will need to modify the top portion of the
21+
script to call the appropriate fasta files and ART program

ART/trusight_tumor_fastq_generator.sh

+62
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,62 @@
1+
#!/bin/bash
2+
3+
##MODIFY AS NECESSARY FOR YOUR OWN FOLDER
4+
#illumina test examples
5+
#art=/home/tom/Documents/bioinformatics/ART/bin/art_illumina
6+
basedir=$(cd ../ && pwd)
7+
art=$basedir/art_bin_ChocolateCherryCake/art_illumina
8+
#reference1=$PWD/FPA_normal_with_mutations.fa
9+
#reference2=$PWD/FPB_normal_with_mutations.fa
10+
reference1=$PWD/FPA_normal.fa
11+
reference2=$PWD/FPB_normal.fa
12+
#################################
13+
14+
15+
# 7) amplicon read simulation: generate one pair of 121bp paired-end reads, one from each end, for each amplicon reference
16+
17+
18+
#####
19+
###DEPTH OF 10 TO MAKE REALLY SMALL FILE FOR PURPOSES OF DEBUGGING PIPELINE AS IT IS VERY
20+
###FAST TO PROCESS
21+
#$art -i $reference1 -amp -o ./normal_with_mutations_S0_R -p -l 121 -f 10 --seqSys MS
22+
#######
23+
#$art -i $reference1 -amp -o ./normal_with_mutations_S0_R -p -l 121 -f 10 --seqSys MS
24+
25+
#rm ./normal_with_mutations_S0_R1.aln
26+
#rm ./normal_with_mutations_S0_R2.aln
27+
28+
29+
$art -i $reference1 -amp -o ./normal_S0_R -p -l 121 -f 10 --seqSys MS
30+
31+
rm ./normal_S0_R1.aln
32+
rm ./normal_S0_R2.aln
33+
34+
mv normal_S0_R1.fq normal_S0_R1.fastq
35+
mv normal_S0_R2.fq normal_S0_R2.fastq
36+
37+
38+
#######
39+
###DEPTH OF 10 TO MAKE REALLY SMALL FILE FOR PURPOSES OF DEBUGGING PIPELINE AS IT IS VERY
40+
###FAST TO PROCESS
41+
#$art -i $reference2 -amp -o ./normal_with_mutations_S1_R -p -l 121 -f 10 --seqSys MS
42+
#######
43+
44+
#$art -i $reference2 -amp -o ./normal_S1_R -p -l 121 -f 10 --seqSys MS
45+
46+
47+
#rm ./normal_with_mutations_S1_R1.aln
48+
#rm ./normal_with_mutations_S1_R2.aln
49+
50+
51+
$art -i $reference2 -amp -o ./normal_S1_R -p -l 121 -f 10 --seqSys MS
52+
53+
54+
rm ./normal_S1_R1.aln
55+
rm ./normal_S1_R2.aln
56+
57+
mv normal_S1_R1.fq normal_S1_R1.fastq
58+
mv normal_S1_R2.fq normal_S1_R2.fastq
59+
60+
61+
62+

Manifest_Folder/TruSightTumor-FPA-Manifest_RevB.txt

+447
Large diffs are not rendered by default.

Manifest_Folder/TruSightTumor-FPB-Manifest_RevB.txt

+447
Large diffs are not rendered by default.

README.md

+278
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,278 @@
1+
# OTA-pipeline
2+
Open source tumor amplicon pipeline, i.e. an alternative bioinformatics pipeline for AmpliconDS that works for any AmpliconDS library given a proper manifest file
3+
4+
This program is designed to run through a NextSeq or MiSeq run directory looking for
5+
fastq files located in ${current directory or specified directory}/Data/Intensities/BaseCalls/
6+
Note: MiSeq folders start with the structure YYMMDD_machinename_NNNN where machinename starts with an 'M', e.g. 140729_M01382_0050_000000000-AAE8K
7+
and NextSeq folders start with the structure YYMMDD_machinename_NNNN where machinename starts with an 'N', e.g. 140729_N01382_0050_000000000-AAE8K
8+
AltAmpDs expects YYMMDD_machinename with machinename starting with 'M' or 'N'. To run it on HiSeq data, either modify the code or rename the run folders to match.
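
The naming rule above can be sketched as a small bash check. This is an illustration only, assuming the date prefix is exactly six digits; the function name `classify_run_folder` is hypothetical and is not taken from runAltPipeline.sh.

```bash
# Classify a run folder by its name: prints "MiSeq", "NextSeq" or "unknown".
# Assumes the YYMMDD_machinename prefix described above (six digits, an
# underscore, then a machine name starting with M or N).
classify_run_folder() {
    local name
    name=$(basename "$1")
    if [[ "$name" =~ ^[0-9]{6}_M ]]; then
        echo "MiSeq"
    elif [[ "$name" =~ ^[0-9]{6}_N ]]; then
        echo "NextSeq"
    else
        echo "unknown"
        return 1
    fi
}

# Example:
# classify_run_folder "$PWD"
```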
9+
10+
The main script file to run is runAltPipeline.sh.
11+
12+
## example usage
13+
```bash
14+
bash /<OTA-pipeline directory>/runAltPipeline.sh -h #to get help and see the different parameters
15+
bash /<OTA-pipeline directory>/runAltPipeline.sh -s /<OTA-pipeline directory>/trusight_tumor_pipeline.sh > output_alt_pipeline_run.txt 2>&1 &
16+
nohup bash /<OTA-pipeline directory>/runAltPipeline.sh -debugging true -validation true > output_alt_pipeline_run.txt 2>&1 &
17+
```
18+
19+
It is highly recommended to create shell aliases to make running the pipeline easier
20+
```bash
21+
cd ~
22+
vim ./.bashrc
23+
```
24+
In the .bashrc file, under the `# User specific aliases and functions` section, add the following (modify as appropriate for your machine)<br />
25+
26+
```
27+
alias runAltPipeline='nohup bash /<OTA-pipeline directory>/runAltPipeline.sh > output_alt_pipeline_run.txt 2>&1&'
28+
alias debugRunAltPipeline='nohup bash /<OTA-pipeline directory>/runAltPipeline.sh -debugging true -validation true> output_alt_pipeline_run.txt 2>&1&'
29+
alias validationRunAltPipeline='nohup bash /<OTA-pipeline directory>/runAltPipeline.sh -validation true > output_alt_pipeline_run.txt 2>&1&'
30+
31+
```
32+
where runAltPipeline is the default; debugRunAltPipeline and validationRunAltPipeline do not delete temporary files, and debugRunAltPipeline additionally relaxes the region depth restrictions (use it when testing the pipeline with very small artificial fastqs) <br />
33+
34+
**Note: when running the above code the user needs to be in the top directory of a NextSeq or MiSeq folder
35+
as a home_dir was not specified** <br />
36+
37+
38+
39+
40+
41+
42+
43+
44+
45+
## Parameters
46+
PIPELINE_DIR - this variable needs to be set in the ~/.bashrc file
47+
```bash
48+
cd ~
49+
vim ./.bashrc
50+
###add this under the alias section, modify to point to the main directory folder of this repository
51+
PIPELINE_DIR=/home/ec2-user/ampDsTs;export PIPELINE_DIR
52+
```
53+
THREADS - number of threads to use when calling functions that support multi-threaded workflow (default 25) <br />
54+
this parameter can be changed by specifying the -threads parameter when calling runAltPipeline.sh <br />
55+
MEMORY - integer: the amount of memory for the Java virtual machine to use (default 16) <br />
56+
active_case_limit - integer: number of cases to process at one time, default is 8 <br />
57+
```bash
58+
nohup sh $PIPELINE_DIR/runAltPipeline.sh -threads 25 -memory 16 -active_case_limit 8 > output_alt_pipeline_run.txt 2>&1 &
59+
```
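
For readers who want to see how such options map onto defaults, here is a hedged sketch of a parser for the three parameters documented above. It is not the code in runAltPipeline.sh; the function name `parse_pipeline_args` and the variable `ACTIVE_CASE_LIMIT` are assumptions made for illustration, and other options such as `-debugging` or `-validation` are deliberately not handled.

```bash
# Hypothetical option parser; the defaults mirror the values documented above.
THREADS=25
MEMORY=16
ACTIVE_CASE_LIMIT=8

parse_pipeline_args() {
    while [ "$#" -gt 0 ]; do
        # Every option in this sketch takes exactly one value.
        if [ "$#" -lt 2 ]; then
            echo "missing value for $1" >&2
            return 1
        fi
        case "$1" in
            -threads)           THREADS="$2" ;;
            -memory)            MEMORY="$2" ;;
            -active_case_limit) ACTIVE_CASE_LIMIT="$2" ;;
            *) echo "unknown option: $1" >&2; return 1 ;;
        esac
        shift 2
    done
}

# Example:
# parse_pipeline_args -threads 8 -memory 32
```

Options that are not given keep their defaults, which matches the behaviour described for `-threads`.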
60+
61+
## Dependencies
62+
63+
There is a script file called download_dependencies.sh that will help you download all of these programs if running on ubuntu,
64+
similar code for Red-hat is commented out and can be uncommented if necessary. Please note that this file will NOT download GATK and Annovar, as those programs have license agreements. The script downloads all dependencies into the directory it resides in; to launch it, type the following in the terminal:
65+
```bash
66+
bash download_dependencies.sh
67+
```
68+
69+
bash
70+
-variables need to be set in the ~/.bashrc file
71+
-some of the code uses bash syntax, so make sure bash is installed on your linux distribution
72+
73+
To install, proceed to install in the order below
74+
75+
git
76+
```bash
77+
sudo yum install git #Red-hat
78+
sudo apt-get install git #ubuntu
79+
```
80+
unzip
81+
```bash
82+
sudo yum install unzip #Red-hat
83+
sudo apt-get install unzip #ubuntu
84+
```
85+
86+
java
87+
```bash
88+
sudo yum install java-1.8.0-openjdk-devel #Red-hat
89+
sudo apt-get install openjdk-8-jdk #Ubuntu
90+
91+
```
92+
wget
93+
```bash
94+
sudo yum install wget #Red-hat
95+
sudo apt-get install wget #Ubuntu
96+
```
97+
98+
gcc
99+
```bash
100+
sudo yum install gcc #red-hat
101+
sudo apt-get install gcc #ubuntu
102+
```
103+
104+
python-devel
105+
```bash
106+
sudo yum install python-devel #Red-hat
107+
sudo apt-get install python-dev #Ubuntu
108+
```
109+
zlib
110+
```bash
111+
sudo yum install zlib-devel #Red-hat
112+
sudo apt-get install zlib1g-dev #ubuntu
113+
```
114+
g++
115+
```bash
116+
sudo yum install gcc-c++ #red-hat
117+
sudo apt-get install g++ #ubuntu
118+
```
119+
120+
curses
121+
```bash
122+
sudo apt-get install libncurses5-dev libncursesw5-dev #ubuntu
123+
sudo yum install ncurses-devel ncurses #red-hat
124+
```
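
After installing the packages above, it can help to confirm that the command-line tools are actually on the `PATH`. The following is a minimal sketch; the function name `missing_commands` is illustrative, and it only checks executables, not development headers such as zlib or ncurses.

```bash
# Report which of the given commands are missing from PATH.
# Prints one missing command per line and returns 1 if any are missing.
missing_commands() {
    local status=0
    local cmd
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "$cmd"
            status=1
        fi
    done
    return "$status"
}

# Example:
# missing_commands git unzip java wget gcc g++
```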
125+
126+
download the git repository
127+
```bash
128+
git clone https://github.com/schneiderthomas/AltAmpDs
129+
```
130+
131+
### Python Dependencies
132+
133+
#### pip
134+
```bash
135+
wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm #red-hat
136+
sudo yum install epel-release-latest-7.noarch.rpm #red-hat
137+
sudo yum install python-pip #red-hat
138+
sudo apt-get install python-pip #ubuntu
139+
```
140+
#### biopython (1.66)
141+
```bash
142+
sudo pip install biopython==1.66
143+
```
144+
#### pysam (0.8.4)
145+
```bash
146+
sudo pip install pysam==0.8.4
147+
```
148+
#### pyvcf (0.6.7)
149+
```bash
150+
sudo pip install pyvcf==0.6.7
151+
```
152+
#### Pandas (0.16.2)
153+
```bash
154+
sudo pip install pandas==0.16.2
155+
```
156+
#### regex (2015.3.18)
157+
```bash
158+
sudo pip install regex==2015.3.18
159+
```
160+
161+
### Linux/Shell Dependencies
162+
#### Zenity
163+
to display dialog boxes from shell script (to let tech know that processing is done)
164+
```bash
165+
sudo yum install zenity #red-hat
166+
sudo apt-get install zenity #ubuntu
167+
```
168+
169+
xterm
170+
```bash
171+
sudo yum install xterm #red-hat
172+
sudo yum install xorg-x11-xauth.x86_64 xorg-x11-server-utils.x86_64 dbus-x11.x86_64 #red-hat
173+
sudo apt-get install xterm xorg dbus #ubuntu
174+
175+
```
176+
177+
bcl2fastq (v2.17)
178+
to convert files from bcl to fastq files
179+
```bash
180+
#optional, already provided as zip
181+
#Red-hat
182+
wget 'ftp://webdata2:[email protected]/downloads/software/bcl2fastq/bcl2fastq2-v2.17.1.14-Linux-x86_64.zip'
183+
unzip bcl2fastq2-v2.17.1.14-Linux-x86_64.zip
184+
sudo yum localinstall bcl2fastq2-v2.17.1.14-Linux-x86_64.rpm
185+
#if ubuntu
186+
sudo apt-get install alien dpkg-dev debhelper build-essential #needed for ubuntu
187+
sudo alien bcl2fastq2-v2.17.1.14-Linux-x86_64.rpm
188+
sudo dpkg -i bcl2fastq2-v2.17.1.14-Linux-x86_64.deb
189+
190+
```
191+
192+
193+
194+
## Major Program dependencies
195+
these programs need to be downloaded and/or compiled and their resulting directories need to be placed in this
196+
directory. More recent versions may be used, but they may cause compatibility issues with the pipeline as written
197+
198+
GATK 3.5 (VERY IMPORTANT AT LEAST 3.5) <br />
199+
annovar - 2014-11-12 <br />
200+
201+
freebayes v0.9.20 <br />
202+
bcftools-1.2 <br />
203+
FastQC v0.11.3 <br />
204+
htslib-1.2.1 <br />
205+
IGVTools 2.3.57 <br />
206+
picard 2.10 <br />
207+
samtools_1.2 <br />
208+
snpeff 4.1g 2015-05-17 <br />
209+
varscan v2.3.9 <br />
210+
bwa 0.7.10 <br />
211+
vcflib v.1.0.0 <br />
212+
CoverageQC - for debugging <br />
213+
bedtools2 -> Version 2.26.0 <br />
214+
Trimmomatic 0.33 <br />
215+
216+
217+
218+
Please note that annovar and GATK have license agreements that must be accepted before you download them, and therefore they cannot be downloaded using the above script.
219+
Instructions to download these files are below:
220+
221+
# annovar
222+
Please download annovar; version 2014-11-12 was used originally and is therefore the preferred version to ensure compatibility. To download Annovar click [here](http://www.openbioinformatics.org/annovar/annovar_download_form.php). After downloading, place the folder named "annovar" in the current directory.
223+
Note: the default splicing threshold for annovar is 2; this can be changed by opening the file table_annovar.pl and modifying the line
224+
```perl
225+
$sc = "annotate_variation.pl -geneanno -buildver $buildver -dbtype $protocol -hgvs -outfile $tempfile.$protocol -exonsort $queryfile $dbloc";
226+
```
227+
to
228+
```perl
229+
$sc = "annotate_variation.pl -geneanno -buildver $buildver -dbtype $protocol -splicing_threshold 5 -hgvs -outfile $tempfile.$protocol -exonsort $queryfile $dbloc";
230+
```
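
The same edit can be applied from the shell instead of by hand. This is a sketch under the assumption that table_annovar.pl contains the `-geneanno` line exactly as quoted above; the function name `set_splicing_threshold` is illustrative and not part of annovar or this repository. A `.bak` copy of the original file is kept next to it.

```bash
# Insert "-splicing_threshold <N>" before "-hgvs" on the -geneanno command line.
# Lines that already contain -splicing_threshold are left untouched, so running
# the function twice does not insert the option twice.
set_splicing_threshold() {
    local script="$1"
    local threshold="$2"
    local expr='/-splicing_threshold/!s/\(annotate_variation\.pl -geneanno .*\) -hgvs /\1 -splicing_threshold '"$threshold"' -hgvs /'
    sed -i.bak -e "$expr" "$script"
}

# Example:
# set_splicing_threshold annovar/table_annovar.pl 5
```

Checking the result with `grep -n splicing_threshold annovar/table_annovar.pl` before running the pipeline is advisable, since other annovar versions may word this line differently.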
231+
232+
# GATK
233+
version 3.5 is being used for this pipeline
234+
get the latest software [here](https://software.broadinstitute.org/gatk/download/)
235+
if you download a version higher than 3.5, you need to change line 34 in amplicon_ds_pipeline.sh
236+
as appropriate
237+
238+
239+
### Extra
240+
241+
In this repository there is a folder called ART; in it you will find a shell script that can be used to create
242+
artificial FASTQ files similar to an ampliconDS run. ART version ChocolateCherryCake-03-19-2015 was used in these scripts.
243+
244+
Download the latest ART program [here](https://www.niehs.nih.gov/research/resources/software/biostatistics/art/).
245+
246+
247+
248+
## Reference Files
249+
### hg19
250+
will be downloaded if you use the download_dependencies.sh script
251+
252+
## ANNOVAR reference files
253+
download_dependencies.sh will install clinvar, cosmic, exac, snp and 1000g
254+
in the annovar directory, see download_dependencies.sh if curious
255+
256+
257+
258+
#### NOTES ON MAJOR FILES
259+
260+
261+
##### runAltPipeline.sh
262+
263+
-the shell script which runs through the current directory (unless given) and feeds files to the pipeline shell script (location can be specified with -s command but default parameters are at the
264+
top of the shell script, which can be changed if one moves the directory) <br />
265+
266+
# ASSUMPTIONS: <br />
267+
- the directory has a folder structure <br />
268+
BaseDirectory -> Data -> Intensities -> BaseCalls <br />
269+
will exit if this is not seen <br />
270+
<br />
271+
- there needs to be an even number of fastq files (not including the Undetermined FASTQ files) because there will always be either two fastq files (or 8 for a NextSeq folder with no lane splitting) in the Amplicon DS pipeline; will exit if this is not the case <br />
272+
<br />
273+
- if no FASTQ files are present then there need to be tiffs in the BaseDirectory -> Images folder or bcl files in BaseFolder -> Data -> Intensities -> BaseCalls -> L001 & L002 & L003 & L004 so bcl2fastq can turn the images or bcl files into fastq files <br />
274+
<br />
275+
- The BaseFolder name has to start like 160518_N or 160518_M, where the first part is a number followed by an underscore and either the letter N or the letter M (this tells the script whether it is dealing with a NextSeq or MiSeq folder); if the name does not start
276+
like this it should exit with an error <br />
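
The first two assumptions above can be sketched as a standalone pre-flight check. This is an illustration, not the logic in runAltPipeline.sh: the function name `check_run_folder` is hypothetical, and the FASTQ file-name patterns (`*.fastq`, `*.fastq.gz`) are an assumption.

```bash
# Check a run folder against the assumptions listed above:
#   1. <run>/Data/Intensities/BaseCalls exists
#   2. the number of FASTQ files (excluding Undetermined*) is even
# Prints the FASTQ count on success. A count of 0 also passes, since the
# README describes falling back to bcl2fastq in that case.
check_run_folder() {
    local basecalls="$1/Data/Intensities/BaseCalls"
    if [ ! -d "$basecalls" ]; then
        echo "missing $basecalls" >&2
        return 1
    fi
    local count
    count=$(find "$basecalls" -maxdepth 1 -type f \
        \( -name '*.fastq' -o -name '*.fastq.gz' \) \
        ! -name 'Undetermined*' | wc -l)
    count=$((count + 0))   # strip any whitespace padding from wc
    if [ $((count % 2)) -ne 0 ]; then
        echo "odd number of FASTQ files: $count" >&2
        return 1
    fi
    echo "$count"
}

# Example (from the top directory of a run folder):
# check_run_folder "$PWD"
```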
277+
278+

Workflow pipeline.png

874 KB
