- Scripts are formatted for Stanford internal use. We use SLURM to send jobs to Stanford's Sherlock cluster.
- to get access and support, ask your PI to e-mail research-computing-support[at]stanford.edu and request access for you
- [Sherlock Wiki](http://sherlock.stanford.edu/mediawiki/index.php/Main_Page)
- create a project directory in
$PI_SCRATCH/
(20TB space available) - this gives you more space and is accessible to other users in the lab
- files are deleted from both your and your PI's scratch every 6 months. Regular backups (use "gdrive" or "rclone") are necessary.
- your personal directory
$HOME
is 15GB - you can also use this space for your research, but it is not easily accessible to other lab members
- more info on Sherlock storage
- go to the directory where you want your raw sequence files
- use the command-line option from GNomEx to download files to Sherlock
- remember to put the GNomEx .jar file in the same directory where you want your files to go
- back up raw files to Palumbi Team Drive
- configure rclone to be set up with the Google Team Drive account: http://moo.nac.uci.edu/~hjm/HOWTO-rclone-to-Gdrive.html
- use rclone.sh script in the shared scripts folder
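- a minimal backup sketch, assuming you have configured an rclone remote (here called "gdrive") pointing at the Team Drive; the remote name and folder paths are placeholders:
rclone copy $PI_SCRATCH/myproject/raw_reads gdrive:backups/myproject/raw_reads --progress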
- to log in:
ssh <SUNetID>@login.sherlock.stanford.edu # then type in your password to gain access
- to move around:
cd $HOME
  - your personal directory - limited to 15GB storage
  - not accessible to other users
- all common use scripts:
cd $PI_HOME/scripts
- all common use programs:
cd $PI_HOME/programs
- to add a program to Sherlock:
cd $PI_HOME/programs
wget <linktofiledownload>
gunzip <file.gz>
- modifying your path to include scripts & programs:
  - this allows you to access these directories from any directory without having to hardcode the path
cd ~ # move to your account's home folder
ls -a # list hidden items
nano .bashrc # open this hidden file in the nano text editor
# paste the following at the end of your .bashrc file:
export PATH="$PI_HOME/programs:$PATH"
export PATH="$PI_HOME/scripts:$PATH"
# ctrl+O to save (overwrite), ctrl+X to exit
- ways to submit jobs:
  - login_node: the default; not for running jobs, just for moving around the cluster
  - sdev: for running small test jobs straight from the command line - one hour time limit
    - use sdev -t 2:00:00 to get the max of 2 hrs
  - bash: for running small jobs in the terminal
  - sbatch: submits the job generated by your script to the cluster
- to run your job on different nodes, add one of these lines to your batch script:
#SBATCH -p owners
  - access to 600 nodes available only to owners; you will be kicked off an owner node if that owner logs on, so always check your slurm file to see if your job was aborted
#SBATCH -p spalumbi
  - our 256GB memory node
#SBATCH -p hns
  - a set of nodes just for Humanities & Sciences users
#SBATCH -p spalumbi,hns
  - will submit your job to whichever partition is available first
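- a minimal sbatch header sketch combining these options (job name, time, and memory are placeholder values to adjust for your job):
#!/bin/bash
#SBATCH --job-name=myjob          # placeholder job name
#SBATCH -p spalumbi,hns           # run on whichever partition opens up first
#SBATCH --time=24:00:00           # placeholder wall-time request
#SBATCH --mem=16G                 # placeholder memory request
#SBATCH -o slurm-%j.out           # slurm output file (check it for errors/aborts)
bash myscript.sh                  # placeholder for the actual job command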
- to check on the status of a job:
squeue -u <username>
squeue | grep 'spalumbi' # to see our spalumbi node's usage
- to cancel a job:
scancel <jobID> # to cancel one job
scancel -u <username> # to cancel all of your jobs
- to see how much memory your job has used so far:
sstat --format JobID,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize $JOBNUMBER
Back up your scratch directory!
- You can use the Sherlock data transfer node to move large datasets onto a backup hard drive
kinit <SUNetID>@stanford.edu
rsync -avz --progress --stats -e 'ssh -o GSSAPIAuthentication=yes' <SUNetID>@dtn.sherlock.stanford.edu:/<sherlock directory> <backup location>
- use the rclone.sh script mentioned above to back up intermediate files
- check to see how much space you're taking up in the shared SCRATCH directory:
du -sh * | sort -h
- chmod 775 *
  - to change permissions on programs, scripts, and files you create so that others can use them; this is important in the shared scripts and programs folders
- chmod 444 <file>
  - for files you don't want to accidentally write over
- chmod -R ###
  - the -R flag is needed when changing permissions on directories
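- for example (file and directory names are placeholders):
chmod 775 $PI_HOME/scripts/*.sh               # let other lab members read & execute shared scripts
chmod -R 775 $PI_HOME/programs/<newprogram>   # -R applies the change to a directory and its contents
chmod 444 <finished_assembly.fa>              # make a finished file read-only so it isn't overwritten by accident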
- before downloading a program, check to see if it's already an installed module on Sherlock: https://www.sherlock.stanford.edu/docs/software/modules/
- if downloading a program for the first time, move its executable to the programs folder; this makes the program easier to call if your .bashrc adds the programs directory to your PATH
mv <program> $PI_HOME/programs
- if downloading a new version of a program, rename it with its version number so you don't overwrite any executables already in that directory
- try to create scripts that are not hard-coded so that others can easily reuse them
- comment your scripts!
- Tips for checking outputs along the way:
  - always look at your slurm files for errors
cat slurm*
  - when making scripts that split up data into TEMP files for parallel processing, add a line that prints the input file name to the slurm output (see the sketch after this list); if $1 is your input file:
echo $1
  - to count lines in a file
wc -l <filename>
  - to count contigs in a file
grep -c ">" <filename>
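- a minimal sketch of the echo $1 pattern inside a batch script (the script contents are placeholders):
#!/bin/bash
#SBATCH --time=1:00:00      # placeholder time request
echo $1                     # prints the TEMP input file name into the slurm-*.out file
grep -c ">" $1              # placeholder for the real work done on the input file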
- Trimmomatic removes adapters, low quality seqs, etc.
- webpage
- for paired end reads:
bash batch-trimmomatic-pe.sh *_1.txt
- outputs 4 files: 1_paired.fq, 2_paired.fq, 1_unpaired.fq, 2_unpaired.fq
- for single end reads:
batch-trimmomatic-se.sh
- FastQC is useful for picking samples to use in your Trinity assembly
- it is also helpful for PE data, where you use the lengths of your sequences when merging reads (e.g. with FLASH)
- usage:
bash batch-fastqc.sh *.fastq
- to move html files from the cluster to your computer:
  - open a terminal that is not logged in to the cluster
rsync <SUNetID>@login.sherlock.stanford.edu:<path on sherlock>/*.html <directory path on your computer>
  - rsync -a will recursively transfer an entire directory
- FLASH merges your two paired-end reads (_1 and _2) into single fragments for transcriptome assembly; this step is necessary for PE data but not for SE data
- usage:
bash batch-flash.sh *_1_paired.fastq
- webpage
- get the lengths of your trimmed seqs from FastQC above and use them in the FLASH script
- pick samples relevant to your biological question
- restrict sample # for computational capacity (e.g. 3)
- examples:
- Balanus natural cline: 3 trinity transcriptomes, one for each population that includes 3 individuals
- Coral stress physiology: 1 trinity transcriptome with 2 experimental individuals and 1 control individual
- You need to hardcode this script for yourself
- an example:
example_Trinity_mkm.sbatch
- [Trinity wiki](https://github.com/trinityrnaseq/trinityrnaseq/wiki)
- to see the insides of Trinity:
Trinity --show_full_usage_info
- you need to update this example for your assembly
- Trinity computing requirements
  - ~1GB RAM per 1M reads
  - Example: 3 Balanus samples: 11,649,437 + 11,728,787 + 11,556,893
  - total read pairs = 34,935,117, so request ~35GB RAM and ~35 hours of time
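- a quick way to count read pairs yourself, assuming gzipped paired fastq files (the file name is a placeholder); each read takes 4 fastq lines:
echo $(( $(zcat sample_1_paired.fq.gz | wc -l) / 4 ))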
- to get median contig length:
samtools faidx Bg_2assembliesof3.fa
module load R
R
index<-read.delim('<file.fa.fai>', header=F)
median(index[,2])
table(cut(index[,2],30)) # bins contig lengths into 30 groups
table(index[,2]>300) # how many contigs are longer than 300bp
- to get basic statistics, including N50:
perl abyss-fac.pl <assembly.fa>
- to keep only the longest isoform of each Trinity gene:
perl longestisoform_trinity.pl <input.fasta> <output.fasta>
- this reduces computational load
- rename the contigs in each Trinity file by adding a unique letter to each contig name, for ex: "TRINITYa_####" (see the sed sketch below)
- separate Trinity runs will have overlapping names that are not the same contig
- if assembling multiple Trinity assemblies in CAP3 you will have duplicate names
- Then combine all Trinity assemblies into one file:
cat Trinityrun1.fa Trinityrun2.fa > all_assemblies.fasta
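- a hedged sketch of the renaming step, assuming default Trinity headers that start with ">TRINITY_" (output file names are placeholders):
sed 's/>TRINITY_/>TRINITYa_/' Trinityrun1.fa > Trinityrun1_renamed.fa
sed 's/>TRINITY_/>TRINITYb_/' Trinityrun2.fa > Trinityrun2_renamed.fa
cat Trinityrun1_renamed.fa Trinityrun2_renamed.fa > all_assemblies.fasta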
- paper
- manual
- CAP3 is an overlap consensus assembler that will merge contigs that would not assemble together in Trinity due to high heterozygosity
- usage:
sbatch cap3.sh all_assemblies.fasta
- after running:
  - merge your contigs and singlets files into one assembly file:
cat file.fasta.cap.contigs file.fasta.cap.singlets > newfile.fasta
  - contigs are what CAP3 merged; singlets did not merge and still have their Trinity output names
  - to look at contig #s after CAP3:
grep -c '>' file.fasta
- to check for contig name duplicates:
grep ">" <assembly_file.fa> | perl histogram.pl | head
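- a shell-only alternative if histogram.pl is not handy (prints any header line that appears more than once):
grep ">" <assembly_file.fa> | sort | uniq -d | head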
- to cluster redundant contigs with CD-HIT:
sbatch cdhit.sh input.fa outfilename
- if you don't have this resource, move to 8b
- download cDNA files for blasting your nucleotides against real transcripts
- use the makeblastdb tool from ncbi to make a database:
- to filter a transcriptome, start by blasting to full genome scaffolds
- make a nucleotide database:
makeblastdb -in <infile.fasta> -dbtype nucl -out <outfile>
- blast to scaffold database with blastn:
- example scripts: batch-blastn-disco.sh (to split your assembly up for parallel processing), blastn-disco.sh
- take only the top hit for each query from your blast output (the awk one-liner keeps the first line it sees for each query ID, which is the best hit because blast sorts hits within each query):
awk '!x[$1]++' file.blast.out > tophits.blast.out
- for annotation purposes, you could try blasting to the protein genomic resources:
- make a protein database:
makeblastdb -in <infile> -dbtype prot -out <outfile>
- use blastx, example scripts where you will need to change the database to your genome: batch-blastx-uniprot.sh, blastx-uniprot.sh
### Filter Assembly without available genome
- Uniprot is a more curated database and is recommended over NCBI-nr
- download the Swiss-Prot & TrEMBL databases
wget <link>
gunzip <file>.gz
# merge the two database files into one database
cat uniprot_sprot.fasta uniprot_trembl.fasta > uniprot_db.fasta
# make your new fasta into a database
makeblastdb -in uniprot_db.fasta -dbtype prot -out uniprot_db
- usage:
bash batch-blast-uniprot.sh <infile>
- the script above splits your assembly into smaller files and calls the blastx on your uniprot database
- after running, check if you get an error during your blasts
cat slurm*
- most often, your blast may time out
- if you get an error:
grep -B 1 error slurm*.out
- for any file with an error, take the line before it (the TEMP file name)
cat <TEMP_with_error_1.fa> <TEMP_with_error_2.fa> > <didnotfinish.fa>
- this concatenates all TEMP files that contain an error into a new file
- you may want to reduce the # of contigs the batch-blast-uniprot.sh script generates for each TEMP file
bash batch-blast-uniprot.sh <didnotfinish.fa>
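- a hedged sketch of that collection step, assuming the slurm line printed just before each error names the TEMP fasta file:
grep -B 1 error slurm*.out | grep -o 'TEMP[^ ]*\.fa' | sort -u > failed_temps.txt
cat $(cat failed_temps.txt) > didnotfinish.fa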
- make sure the local database on Sherlock is up to date
- link to ncbi databases: ftp://ftp.ncbi.nlm.nih.gov/blast/db/
- nt = nucleotide, nr = protein
- for ex:
wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/nr.*.tar.gz
- to unpack the .tar.gz files downloaded from GenBank:
for i in *.tar.gz ; do tar -xvf $i ; done &
- ncbi files are already formatted databases, no need to run makeblastdb
- to run the blast script:
bash batch-blast-ncbiprot.sh input.fasta
- this script splits your assembly into TEMP files for parallel processing
- after running, check if you got an error during your blasts
cat slurm*
- your blast search may time out if it runs too long on the cluster
- if you get an error:
grep -B 1 error slurm*.out
- for any file with an error, take the line before it (the TEMP file name)
cat <TEMP_with_error_1.fa> <TEMP_with_error_2.fa> > <didnotfinish.fa>
- this concatenates all TEMP files that contain an error into a new file
- you may want to reduce the # of contigs the batch-blast-ncbiprot.sh script generates for each TEMP file
bash batch-blast-ncbiprot.sh <didnotfinish.fa>
- Both the uniprot and nr batch scripts above output XML formatted results
- XML files are difficult to work with; pick the information you want and parse it into a tab-delimited format
- XML files currently hold the most information about each blast result
- i.e. gene name, taxonomy
- to parse uniprot results:
bash batch-parse-uniprot.sh
- this script calls parse-uniprot-xml.py on your many temp files
- to parse ncbi-protein results:
bash batch-parse-nr.sh TEMP*.fa.blast.out
- if you BLAST to a general database and not to a specific genome, this step is necessary to filter out any contamination
- remove blast hits that are likely environmental contamination, e.g. bacteria, fungi, viruses, alveolata, viridiplantae, haptophyceae, etc.
- For uniprot results:
- first, merge all of your parsed blast results into one file
cat *_parsed.txt > all_parsed.txt
- second, create a file of only 'good contigs', i.e. metazoans:
bash grep-good-contigs.sh all_parsed.txt assembly.fa
- third, if you only want to use uniprot results, pull only good contigs from your assembly fasta file
sbatch batch-filter-assembly.sh assembly.fa goodcontigs.txt
- to check how filtering went, count how many contigs you have before and after filtering:
grep -c ">" filteredassembly.fa
- For ncbi-protein results:
- first, merge all of your parsed blast results into one file
cat *_parsed.txt > all_parsed.txt
- second, create a taxonomic tree to use for filtering
- make a tab delimited file with contig names and their associated GI numbers
  - e.g., import the parsed file into excel, keep only those two columns, and export as a txt file (or see the command-line alternative below)
- use this GI list to get taxonomic tree
sbatch batch-taxonID-from-gi.sh type('n' or 'p') GIlist output
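- a command-line alternative to excel, assuming the contig name and GI number are the first two columns of the parsed file (adjust the column numbers if not):
cut -f1,2 all_parsed.txt > contig_gi_list.txt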
- third, grep good contigs from uniprot
bash grep-good-contigs-ncbiprot.sh uniprot_goodcontigs.txt
- combine the results from the uniprot & ncbi-protein databases:
cat uniprot_goodcontigs ncbi_goodcontigs > combined_goodcontigs_duplicates.txt
- then take only the unique contigs from the combined file:
cat combined_goodcontigs_duplicates.txt | sort | uniq > combined_goodcontigs_uniq.txt
- you can check the # contigs remaining after this to see how filtering went
- finally, pull only good contigs from your assembly
sbatch batch-filter-assembly.sh assembly.fa combined_goodcontigs_uniq.txt
- BUSCO provides quantitative measures for the assessment of genome assembly, gene set, and transcriptome completeness, based on evolutionarily-informed expectations of gene content from near-universal single-copy orthologs
# Install and configure (this has to be done for each individual's Sherlock account)
- Edit busco's config.ini file to include paths to necessary dependencies on Sherlock
cd ~/.config/
mkdir busco
cd busco
nano config.ini # copy this file from busco's github directory
#update the following paths in the config.ini file:
[tblastn]
# path to tblastn
path = /share/software/user/open/ncbi-blast+/2.7.1/bin
[makeblastdb]
# path to makeblastdb
path = /share/software/user/open/ncbi-blast+/2.7.1/bin
[augustus]
# path to augustus
path = /share/software/user/open/augustus/3.3.2/bin/
[etraining]
# path to augustus etraining
path = /share/software/user/open/augustus/3.3.2/bin/
# path to augustus perl scripts, redeclare it for each new script
[gff2gbSmallDNA.pl]
path = /share/software/user/open/augustus/3.3.2/scripts/
[new_species.pl]
path = /share/software/user/open/augustus/3.3.2/scripts/
[optimize_augustus.pl]
path = /share/software/user/open/augustus/3.3.2/scripts/
[hmmsearch]
# path to HMMsearch executable
path = /home/groups/spalumbi/programs/busco-master/hmmer-3.2.1/src/
# Run
- sbatch busco.sh input.fa output transcriptome
# create a hisat2 index
hisat2-build assembly.fa <output-hisat2-index>
#call the program through a batch script
bash batch-hisat2-fq-paired-eas.sh hisat2-index chunksize *_1.txt.gz
- chunksize is how many samples to run per one Sherlock job submitted, for example 2
- for single end reads, batch-hisat2-fq-single_dedup_eas.sh
- for mapping to RNAseq data to a genome (i.e. takes splice variants into account), batch-hisat2-RNAseq-paired-cyp.sh
- step 1: go to a directory with all of your bam files in it
- step 2:
mkdir vcfout # make a directory for all split/parallelized files to be placed into
- to call:
bash bcftools_parallel_eas.sh assembly.fa vcfout <#cpus> *.bam
- number of cpus can be something like 24
- can filter for: read depth, read mapping quality, base quality, minor allele frequency, min number of reads mapped, etc.
- First, combine all of your VCF outputs from the parallel processing into one file called combined:
bash fastVCFcombine.sh combined *.vcf
- Next, use VCFtools to filter your SNPs:
sbatch vcftools-snpfilter.sh combined.vcf
- this script removes indels, takes only biallelic SNPs, requires 7 mapped reads per allele at each SNP, and sets your min/max allele frequencies
- you can also allow some missing data with this script using --max-missing (1 means no missing data allowed, 0 means allow all missing data)
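- a hedged sketch of the kind of vcftools call that script makes (the exact thresholds in vcftools-snpfilter.sh may differ; the values below are illustrative):
vcftools --vcf combined.vcf --remove-indels --min-alleles 2 --max-alleles 2 \
  --minDP 7 --maf 0.05 --max-missing 0.8 --recode --recode-INFO-all --out combined_filtered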
- to make an 012 genotype matrix:
bash vcftools-012genotype-matrix.sh <combined_filtered_file.vcf> <outfile>
  - this is a simple format used by many downstream applications
# create a list of SNP names
# here, the file of SNPs we want to keep is called goodsnps.txt; it's formatted: contig_name \t SNP_location_in_contig
awk -v OFS="\t" '{print $1,$2-1,$2}' goodsnps.txt > goodsnps.bed #reformats above into a bed file
bedtools intersect -b goodsnps.bed -a allsnps.vcf > goodsnps.vcf
#get open reading frame predictions for your assembly
#usage without blastx info:
sbatch get-orf-predictors.sh assembly.fa <emptyfile> outfile
#get ORF headers
grep ">" orf.pep | sed 's/>//g' | sed 's/+//g' > orfpredictor_headers.txt
#get snp protein changes
snp_protein_changes.py <vcf> <ref.fa> orfpredictor_headers.txt <OUT>
- this will output a file with
- the contig and snp position
- the reference allele and alternate allele
- the reference and alternate protein
- if the protein change was synonymous (S), nonsynonymous (NS), or in a UTR region
- the position of the change in the codon (CPOS)
- download program
- make a .bsc file, see R script:
Bayescan_generate_input_file.R
- rsync .bsc file to sherlock
- on Sherlock cluster, run Bayescan
sbatch bayescan.sh input.bsc
- can take SNPs from three separate Bayescan runs
- rsync FST file back to your computer and analyze with R scripts: Bayescan_plot_R.r and Bayescan_plot_results_sample.R
- to get read counts from your bam files:
bash get-bam-counts.sh *.bam
- see guide-to-R-scripts.md for more analyses