|
| 1 | +# OTA-pipeline |
| 2 | +Open-source tumor amplicon pipeline, i.e. an alternative bioinformatics pipeline for AmpliconDS that works for any AmpliconDS library given a proper manifest file. |
| 3 | + |
| 4 | +This program is designed to run through a NextSeq or MiSeq run directory looking for |
| 5 | +fastq files located in ${current directory or specified directory}/Data/Intensities/BaseCalls/ |
| 6 | +Note: MiSeq folders start with the structure YYMMDD_machinename_NNNN, where machinename starts with an 'M', e.g. 140729_M01382_0050_000000000-AAE8K, |
| 7 | +and NextSeq folders start with the structure YYMMDD_machinename_NNNN, where machinename starts with an 'N', e.g. 140729_N01382_0050_000000000-AAE8K. |
| 8 | +AltAmpDs expects YYMMDD_machinename with machinename starting with 'M' or 'N'. To run on HiSeq data, either modify the code or rename the run folders. |
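
The naming rule above can be sketched as a small shell function. This is only an illustration, not code taken from the pipeline: `detect_platform` is a hypothetical helper name, and it assumes the date part is exactly six digits.

```bash
# Hypothetical helper (not part of the pipeline): classify a run folder by name.
detect_platform() {
    local name
    # Keep only the folder name, dropping any leading path or trailing slash.
    name=$(basename -- "$1")
    case "$name" in
        [0-9][0-9][0-9][0-9][0-9][0-9]_M*) echo "MiSeq" ;;
        [0-9][0-9][0-9][0-9][0-9][0-9]_N*) echo "NextSeq" ;;
        *) echo "unknown"; return 1 ;;
    esac
}

detect_platform "140729_M01382_0050_000000000-AAE8K"   # prints MiSeq
```

A folder name that matches neither pattern (a HiSeq run, for example) prints `unknown` and returns a non-zero status.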
| 9 | + |
| 10 | +The main script file to run is runAltPipeline.sh. |
| 11 | + |
| 12 | +## example usage |
| 13 | +```bash |
| 14 | +bash /<OTA-pipeline directory>/runAltPipeline.sh -h #to get help and see the different parameters |
| 15 | +bash /<OTA-pipeline directory>/runAltPipeline.sh -s /<OTA-pipeline directory>/trusight_tumor_pipeline.sh > output_alt_pipeline_run.txt 2>&1 & |
| 16 | +nohup bash /<OTA-pipeline directory>/runAltPipeline.sh -debugging true -validation true > output_alt_pipeline_run.txt 2>&1 & |
| 17 | +``` |
| 18 | + |
| 19 | +It is highly recommended to create script aliases to make running the pipeline easier: |
| 20 | +```bash |
| 21 | +cd ~ |
| 22 | +vim ./.bashrc |
| 23 | +``` |
| 24 | +In the .bashrc file, under the `# User specific aliases and functions` section, add the following (modify as appropriate for your machine):<br /> |
| 25 | + |
| 26 | +``` |
| 27 | +alias runAltPipeline='nohup bash /<OTA-pipeline directory>/runAltPipeline.sh > output_alt_pipeline_run.txt 2>&1&' |
| 28 | +alias debugRunAltPipeline='nohup bash /<OTA-pipeline directory>/runAltPipeline.sh -debugging true -validation true > output_alt_pipeline_run.txt 2>&1 &' |
| 29 | +alias validationRunAltPipeline='nohup bash /<OTA-pipeline directory>/runAltPipeline.sh -validation true > output_alt_pipeline_run.txt 2>&1 &' |
| 30 | +
|
| 31 | +``` |
| 32 | +where runAltPipeline is the default. debugRunAltPipeline and validationRunAltPipeline do not remove temporary files, and debugRunAltPipeline also has looser region-depth restrictions (for testing the pipeline with very small artificial FASTQs). <br /> |
| 33 | + |
| 34 | +**Note: when running the above commands, the user needs to be in the top directory of a NextSeq or MiSeq folder, |
| 35 | +because no home_dir is specified** <br /> |
| 36 | + |
| 37 | + |
| 38 | + |
| 39 | + |
| 40 | + |
| 41 | + |
| 42 | + |
| 43 | + |
| 44 | + |
| 45 | +## Parameters |
| 46 | +PIPELINE_DIR - this variable needs to be set in the ~/.bashrc file |
| 47 | +```bash |
| 48 | +cd ~ |
| 49 | +vim ./.bashrc |
| 50 | +###add this under the alias section, modify to point to the main directory folder of this repository |
| 51 | +PIPELINE_DIR=/home/ec2-user/ampDsTs;export PIPELINE_DIR |
| 52 | +``` |
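
After reloading `~/.bashrc`, a quick sanity check along these lines can confirm the variable is usable. This is a minimal sketch, not part of the repository: `check_pipeline_dir` is a hypothetical name, and the only assumption is that `runAltPipeline.sh` sits directly inside `$PIPELINE_DIR`.

```bash
# Hypothetical helper (not part of the pipeline): verify PIPELINE_DIR is usable.
check_pipeline_dir() {
    if [ -z "${PIPELINE_DIR:-}" ]; then
        echo "PIPELINE_DIR is not set" >&2
        return 1
    fi
    # Assumption: the main script lives directly in PIPELINE_DIR.
    if [ ! -f "$PIPELINE_DIR/runAltPipeline.sh" ]; then
        echo "runAltPipeline.sh not found in $PIPELINE_DIR" >&2
        return 1
    fi
    echo "PIPELINE_DIR looks OK: $PIPELINE_DIR"
}

# Usage: source ~/.bashrc && check_pipeline_dir
```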
| 53 | +THREADS - integer: number of threads to use when calling functions that support multi-threading (default 25); <br /> |
| 54 | +set it with the -threads parameter when calling runAltPipeline.sh <br /> |
| 55 | +MEMORY - integer: the amount of memory for the Java virtual machine to use (default 16); set it with -memory <br /> |
| 56 | +active_case_limit - integer: number of cases to process at one time (default 8); set it with -active_case_limit <br /> |
| 57 | +```bash |
| 58 | +nohup bash $PIPELINE_DIR/runAltPipeline.sh -threads 25 -memory 16 -active_case_limit 8 > output_alt_pipeline_run.txt 2>&1 & |
| 59 | +``` |
| 60 | +
|
| 61 | +## Dependencies |
| 62 | +
|
| 63 | +A script called download_dependencies.sh will help you download all of these programs if you are running on Ubuntu; |
| 64 | +similar code for Red Hat is commented out, and the comment markers can be removed if necessary. Please note that this script will NOT download GATK and Annovar, as those programs have license agreements. The script downloads all dependencies into the directory it currently resides in. To launch it, type the following in the terminal: |
| 65 | +```bash |
| 66 | +bash download_dependencies.sh |
| 67 | +``` |
| 68 | +
|
| 69 | +bash |
| 70 | +-variables need to be set in the ~/.bashrc file |
| 71 | +-some of the code uses bash syntax, so make sure bash is installed on your Linux distribution |
| 72 | +
|
| 73 | +Install the following in the order listed below. |
| 74 | +
|
| 75 | +git |
| 76 | +```bash |
| 77 | +sudo yum install git #Red-hat |
| 78 | +sudo apt-get install git #ubuntu |
| 79 | +``` |
| 80 | +unzip |
| 81 | +```bash |
| 82 | +sudo yum install unzip #Red-hat |
| 83 | +sudo apt-get install unzip #ubuntu |
| 84 | +``` |
| 85 | +
|
| 86 | +java |
| 87 | +```bash |
| 88 | +sudo yum install java-1.8.0-openjdk-devel #Red-hat |
| 89 | +sudo apt-get install openjdk-8-jdk #Ubuntu |
| 90 | +
|
| 91 | +``` |
| 92 | +wget |
| 93 | +```bash |
| 94 | +sudo yum install wget #Red-hat |
| 95 | +sudo apt-get install wget #Ubuntu |
| 96 | +``` |
| 97 | +
|
| 98 | +gcc |
| 99 | +```bash |
| 100 | +sudo yum install gcc #red-hat |
| 101 | +sudo apt-get install gcc #ubuntu |
| 102 | +``` |
| 103 | +
|
| 104 | +python-devel |
| 105 | +```bash |
| 106 | +sudo yum install python-devel #Red-hat |
| 107 | +sudo apt-get install python-dev #Ubuntu |
| 108 | +``` |
| 109 | +zlib |
| 110 | +```bash |
| 111 | +sudo yum install zlib-devel #Red-hat |
| 112 | +sudo apt-get install zlib1g-dev #ubuntu |
| 113 | +``` |
| 114 | +g++ |
| 115 | +```bash |
| 116 | +sudo yum install gcc-c++ #red-hat |
| 117 | +sudo apt-get install g++ #ubuntu |
| 118 | +``` |
| 119 | +
|
| 120 | +curses |
| 121 | +```bash |
| 122 | +sudo apt-get install libncurses5-dev libncursesw5-dev #ubuntu |
| 123 | +sudo yum install ncurses-devel ncurses #red-hat |
| 124 | +``` |
| 125 | +
|
| 126 | +download the git repository |
| 127 | +```bash |
| 128 | +git clone https://github.com/schneiderthomas/AltAmpDs |
| 129 | +``` |
| 130 | +
|
| 131 | +### Python Dependencies |
| 132 | +
|
| 133 | +#### pip |
| 134 | +```bash |
| 135 | +wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm #red-hat |
| 136 | +sudo yum install epel-release-latest-7.noarch.rpm #red-hat |
| 137 | +sudo yum install python-pip #red-hat |
| 138 | +sudo apt-get install python-pip #ubuntu |
| 139 | +``` |
| 140 | +#### biopython (1.66) |
| 141 | +```bash |
| 142 | +sudo pip install biopython==1.66 |
| 143 | +``` |
| 144 | +#### pysam (0.8.4) |
| 145 | +```bash |
| 146 | +sudo pip install pysam==0.8.4 |
| 147 | +``` |
| 148 | +#### pyvcf (0.6.7) |
| 149 | +```bash |
| 150 | +sudo pip install pyvcf==0.6.7 |
| 151 | +``` |
| 152 | +#### Pandas (0.16.2) |
| 153 | +```bash |
| 154 | +sudo pip install pandas==0.16.2 |
| 155 | +``` |
| 156 | +#### regex (2015.3.18) |
| 157 | +```bash |
| 158 | +sudo pip install regex==2015.3.18 |
| 159 | +``` |
| 160 | +
|
| 161 | +### Linux/Shell Dependencies |
| 162 | +#### Zenity |
| 163 | +to display dialog boxes from shell script (to let tech know that processing is done) |
| 164 | +```bash |
| 165 | +sudo yum install zenity #red-hat |
| 166 | +sudo apt-get install zenity #ubuntu |
| 167 | +``` |
| 168 | +
|
| 169 | +xterm |
| 170 | +```bash |
| 171 | +sudo yum install xterm #red-hat |
| 172 | +sudo yum install xorg-x11-xauth.x86_64 xorg-x11-server-utils.x86_64 dbus-x11.x86_64 #red-hat |
| 173 | +sudo apt-get install xterm xorg dbus #ubuntu |
| 174 | +
|
| 175 | +``` |
| 176 | +
|
| 177 | +bcl2fastq (v2.17) |
| 178 | +to convert files from bcl to fastq files |
| 179 | +```bash |
| 180 | +#optional, already provided as zip |
| 181 | +#Red-hat |
| 182 | +wget 'ftp://webdata2: [email protected]/downloads/software/bcl2fastq/bcl2fastq2-v2.17.1.14-Linux-x86_64.zip ' |
| 183 | +unzip bcl2fastq2-v2.17.1.14-Linux-x86_64.zip |
| 184 | +sudo yum localinstall bcl2fastq2-v2.17.1.14-Linux-x86_64.rpm |
| 185 | +#if ubuntu |
| 186 | +sudo apt-get install alien dpkg-dev debhelper build-essential #needed for ubuntu |
| 187 | +sudo alien bcl2fastq2-v2.17.1.14-Linux-x86_64.rpm |
| 188 | +sudo dpkg -i bcl2fastq2-v2.17.1.14-Linux-x86_64.deb |
| 189 | +
|
| 190 | +``` |
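
Once the packages above are installed, a short check like the following can report anything still missing from `PATH`. It is only a sketch: `missing_tools` is a hypothetical helper, and the command names are the ones the install steps above are expected to provide.

```bash
# Hypothetical helper (not part of the pipeline): print each command not on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Command names assumed from the install steps above.
missing=$(missing_tools git unzip java wget gcc g++ pip zenity xterm bcl2fastq)
if [ -n "$missing" ]; then
    # Left unquoted on purpose so the names print on one line.
    echo "Missing tools:" $missing
else
    echo "All listed tools found."
fi
```

Libraries such as zlib, ncurses and python-devel do not install a command, so they are not covered by this check.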
| 191 | +
|
| 192 | +
|
| 193 | +
|
| 194 | +## Major Program dependencies |
| 195 | +These programs need to be downloaded and/or compiled, and their resulting directories placed in this |
| 196 | +directory. A more recent version may be used, but there may be some compatibility issues with the pipeline as it is. |
| 197 | +
|
| 198 | +GATK 3.5 (VERY IMPORTANT AT LEAST 3.5) <br /> |
| 199 | +annovar - 2014-11-12 <br /> |
| 200 | +
|
| 201 | +freebayes v0.9.20 <br /> |
| 202 | +bcftools-1.2 <br /> |
| 203 | +FastQC v0.11.3 <br /> |
| 204 | +htslib-1.2.1 <br /> |
| 205 | +IGVTools 2.3.57 <br /> |
| 206 | +picard 2.10 <br /> |
| 207 | +samtools_1.2 <br /> |
| 208 | +snpeff 4.1g 2015-05-17 <br /> |
| 209 | +varscan v2.3.9 <br /> |
| 210 | +bwa 0.7.10 <br /> |
| 211 | +vcflib v.1.0.0 <br /> |
| 212 | +CoverageQC - for debugging <br /> |
| 213 | +bedtools2 -> Version 2.26.0 <br /> |
| 214 | +Trimmomatic 0.33 <br /> |
| 215 | +
|
| 216 | +
|
| 217 | +
|
| 218 | +Please note that annovar and GATK have license agreements that must be accepted before you download them, so they cannot be downloaded using the above script. |
| 219 | +Instructions for downloading them are below: |
| 220 | +
|
| 221 | +# annovar |
| 222 | +Please download annovar. Version 2014-11-12 was used originally and is therefore the preferred version to ensure compatibility. To download annovar click [here](http://www.openbioinformatics.org/annovar/annovar_download_form.php). After downloading, place the folder named "annovar" in the current directory. |
| 223 | +Note: annovar's default splicing threshold is 2. It can be changed by editing table_annovar.pl and changing the line |
| 224 | +```perl |
| 225 | +$sc = "annotate_variation.pl -geneanno -buildver $buildver -dbtype $protocol -hgvs -outfile $tempfile.$protocol -exonsort $queryfile $dbloc"; |
| 226 | +``` |
| 227 | +to |
| 228 | +```perl |
| 229 | +$sc = "annotate_variation.pl -geneanno -buildver $buildver -dbtype $protocol -splicing_threshold 5 -hgvs -outfile $tempfile.$protocol -exonsort $queryfile $dbloc"; |
| 230 | +``` |
| 231 | +
|
| 232 | +# GATK |
| 233 | +Version 3.5 is used for this pipeline. |
| 234 | +Get the latest software [here](https://software.broadinstitute.org/gatk/download/). |
| 235 | +If you download a version higher than 3.5, change line 34 in amplicon_ds_pipeline.sh |
| 236 | +as appropriate. |
| 237 | +
|
| 238 | +
|
| 239 | +### Extra |
| 240 | +
|
| 241 | +This repository contains a folder called ART with shell scripts that can be used to create |
| 242 | +artificial FASTQ files similar to an AmpliconDS run. ART version ChocolateCherryCake-03-19-2015 was used in these scripts. |
| 243 | +
|
| 244 | +Download the latest ART program [here](https://www.niehs.nih.gov/research/resources/software/biostatistics/art/). |
| 245 | +
|
| 246 | +
|
| 247 | +
|
| 248 | +## Reference Files |
| 249 | +### hg19 |
| 250 | +Will be downloaded if you use the download_dependencies.sh script. |
| 251 | +
|
| 252 | +## ANNOVAR reference files |
| 253 | +download_dependencies.sh will install clinvar, cosmic, exac, snp and 1000g |
| 254 | +in the annovar directory; see download_dependencies.sh for details. |
| 255 | +
|
| 256 | +
|
| 257 | +
|
| 258 | +#### NOTES ON MAJOR FILES |
| 259 | +
|
| 260 | +
|
| 261 | +##### runAltPipeline.sh |
| 262 | +
|
| 263 | +-the shell script that runs through the current directory (unless another is given) and feeds files to the pipeline shell script. The pipeline script's location can be specified with the -s option; the default parameters are at the |
| 264 | +top of the shell script and can be changed if the directory is moved. <br /> |
| 265 | +
|
| 266 | +# ASSUMPTIONS: <br /> |
| 267 | +- the directory has a folder structure <br /> |
| 268 | + BaseDirectory -> Data -> Intensities -> BaseCalls <br /> |
| 269 | + will exit if this is not seen <br /> |
| 270 | + <br /> |
| 271 | +- there needs to be an even number of FASTQ files (not including the Undetermined FASTQ files), because there are always either two FASTQ files (or 8 for a NextSeq folder with no lane splitting) in the Amplicon DS pipeline; the script will exit if it does not see this <br /> |
| 272 | + <br /> |
| 273 | +- if no FASTQ files are present, then there need to be TIFF images in BaseDirectory -> Images or bcl files in BaseDirectory -> Data -> Intensities -> BaseCalls -> L001 & L002 & L003 & L004, so that bcl2fastq can convert the images or bcl files to FASTQ files <br /> |
| 274 | + <br /> |
| 275 | +- The BaseDirectory name has to start like 160518_N or 160518_M: a number, then an underscore, then either the letter N or the letter M (this tells the script whether it is dealing with a NextSeq or MiSeq folder). If the name does not start |
| 276 | + like this, the script should exit with an error <br /> |
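
The first two assumptions can be sketched as a pre-flight check. This is not the code in runAltPipeline.sh: `check_run_dir` is a hypothetical helper, and the `*.fastq` / `*.fastq.gz` file extensions are an assumption.

```bash
# Hypothetical helper (not part of the pipeline): check the folder layout and
# that the number of FASTQ files, excluding Undetermined ones, is even.
check_run_dir() {
    local basecalls="$1/Data/Intensities/BaseCalls"
    if [ ! -d "$basecalls" ]; then
        echo "missing $basecalls" >&2
        return 1
    fi
    local count=0 f
    # Assumption: FASTQ files end in .fastq or .fastq.gz.
    for f in "$basecalls"/*.fastq "$basecalls"/*.fastq.gz; do
        [ -e "$f" ] || continue        # an unmatched glob stays literal; skip it
        case "$(basename -- "$f")" in
            Undetermined*) continue ;;
        esac
        count=$((count + 1))
    done
    if [ $((count % 2)) -ne 0 ]; then
        echo "odd number of FASTQ files: $count" >&2
        return 1
    fi
    echo "$count"
}

# Usage: check_run_dir /path/to/140729_M01382_0050_000000000-AAE8K
```

A result of `0` means no FASTQ files were found, which is the case where the bcl or image files have to be converted first.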
| 277 | + |
| 278 | + |