Repository of scripts and shared data for chytrid phylogenomics analyses.
An emphasis of the workflow described here is that arbitrarily choosing the "best hits" for a phylogenomic marker may not always produce the best gene trees, which negatively impacts concatenated analyses and other downstream steps.
- Clone this repository:

```
git clone https://github.com/Michigan-Mycology/Chytrid-Phylogenomics.git
```
- Set an environment variable that points to where you have this repository downloaded. Do this by adding the following line to the file at `/home/uniqname/.bashrc` or just typing it into your current session. Adding it to `/home/uniqname/.bashrc` will make it permanent.

```
export CHYTRID_PHYLO=/path/to/Chytrid-Phylogenomics/scripts/python
```
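To make sure the variable is set in your current session (the `source` step only applies if you added the line to `.bashrc`):

```
# Reload your shell configuration, then check the variable
source ~/.bashrc
echo $CHYTRID_PHYLO
# Should print: /path/to/Chytrid-Phylogenomics/scripts/python
```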
- Install SCGid. Some of the python scripts included in this repository require a FASTA-reading module that is part of SCGid. If you are going to use them, you need to have SCGid installed on your system and to be using it in a python3 virtual environment. Doing this will also bring along some other dependencies that are used by these and other scripts.

  Start following the installation instructions at the SCGid github repository, but you can stop after you enter:

```
python setup.py develop
```
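For reference, here is a minimal sketch of creating and activating a python3 virtual environment to install SCGid into; the environment name `scgid-env` is just an example, and the SCGid instructions may recommend a different setup:

```
# Create and activate a python3 virtual environment (name is illustrative)
python3 -m venv scgid-env
source scgid-env/bin/activate

# Then install SCGid inside the environment per its instructions,
# stopping after: python setup.py develop
```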
- You need genomes (specifically, predicted proteomes) for the taxa that you want to include in phylogenomic trees. You can get these by sequencing them yourself, from public databases like the JGI Genome Portal or NCBI GenBank, or, most commonly, some combination of both. The sequence headers in the FASTAs you get from different places are going to be formatted differently. The first step after downloading all of your predicted proteomes needs to be getting all of those names into the standard NCBI format:

```
LocusTagPrefix|ProteinHeader OptionalDescription
```
There are a variety of ways to do this, and the right approach depends a lot on what the headers in each file look like. I don't have any batch scripts for doing this yet, but feel free to ask for help. Some useful command line tools for the job include `sed` and `awk`; a sketch of one approach, using a JGI-style proteome, is shown at the end of this step. Put the final predicted proteome FASTA files at `$CHYTRID_PHYLO/data`. You will have to make this directory first:

```
mkdir $CHYTRID_PHYLO/data
```
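As an illustration, here is a sketch of the kind of `sed` one-liner you might use, assuming JGI-style headers along the lines of `>jgi|Spizel1|12345|model_name` (the portal name becomes the locus tag prefix). Check your own files first, since header formats vary, and the file names here are just placeholders.

```
# Turn ">jgi|Spizel1|12345|model_name" into ">Spizel1|12345 model_name"
# (assumes JGI-style headers; adjust the pattern to match your files)
sed -E 's/^>jgi\|([^|]+)\|([^|]+)\|(.*)$/>\1|\2 \3/' Spizel1_proteins.fasta > Spizel1_proteins.renamed.fasta
```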
- You need phylogenomic markers that you want to pull out of the predicted proteomes for the taxa you want to include in phylogenomic trees. For this particular project we opted to use the markers used by BUSCO and contained in the `fungi_odb10` database. There are other varieties of BUSCO markers that are tuned for different groups of organisms. You can download `fungi_odb10` to your computer from the terminal by entering:

```
# Download from the internet
wget https://busco-data.ezlab.org/v4/data/lineages/fungi_odb10.2019-12-13.tar.gz

# Untar/Unzip compressed archive
tar xvf fungi_odb10.2019-12-13.tar.gz

# Now you should have a directory called `fungi_odb10`
```
The odb10 archive comes with several directories and files. We don't need all of them to run this pipeline, so I like to copy the important stuff (for us) out of the archive and put it in a working directory that I'll use for the entire pipeline.
```
# Make a directory for the markers
mkdir $CHYTRID_PHYLO/markers
mkdir $CHYTRID_PHYLO/markers/hmm

# Copy the markers over
# `*.hmm` means all files whose names end with ".hmm"
cp fungi_odb10/hmms/*.hmm $CHYTRID_PHYLO/markers/hmm/.

# Copy the cutoff scores that BUSCO uses for each marker - we'll use this later
cp fungi_odb10/scores_cutoff $CHYTRID_PHYLO/odb10_busco_scores_cutoff
```
Make sure you have markers and genomes downloaded and FASTA headers reformatted as discussed above before continuing.
- Combine all the hmm markers into a single file and format it as a database.

  Install/load hmmer. If you are on greatlakes, hmmer is already installed, but you need to load it:

```
module load Bioinformatics hmmer
```

  Combine the files and build the database like this:

```
cd $CHYTRID_PHYLO/markers/
cat hmm/*.hmm > odb10_combined.hmm
hmmpress odb10_combined.hmm
```
Now you should have `.h3p`, `.h3m`, `.h3f`, and `.h3i` files in addition to `odb10_combined.hmm`.
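If you want to double-check, listing the database files should show the new indices alongside the combined hmm file:

```
# hmmpress writes its indices next to the input file
ls $CHYTRID_PHYLO/markers/odb10_combined.hmm*
```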
- Search the predicted proteomes with the marker hmm models to get hits in domtbl format.
The essential command that needs to be called for each proteome is:
```
hmmsearch --cpu 1 -E 1e-5 --domtblout predicted_proteome.domtbl /path/to/odb10_combined.hmm /path/to/predicted_proteome.fasta
```
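If you are not on a cluster (or just want to run things serially), a minimal sketch of a loop over a directory of proteomes might look like this; the paths and the `domtbl` output directory name are illustrative:

```
# Run hmmsearch on every proteome FASTA in a directory (paths are illustrative)
mkdir -p domtbl
for fasta in /path/to/FASTA/directory/*.fasta; do
    name=$(basename "$fasta" .fasta)
    hmmsearch --cpu 1 -E 1e-5 --domtblout domtbl/"${name}".domtbl /path/to/odb10_combined.hmm "$fasta"
done
```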
If you are working on greatlakes through the James Lab, there is a python helper script you can use to autogenerate slurm scripts that run this command for all of your FASTA files, as long as they are stored together in a single directory:

```
python $CHYTRID_PHYLO/scripts/slurm/1_hmmsearch.py /path/to/FASTA/directory /path/to/odb10_combined.hmm
```
Now you should have 10 `HmSe_X.sh` files, each of which can be submitted to slurm. Collect the results in a separate directory:

```
mkdir $CHYTRID_PHYLO/search
cd $CHYTRID_PHYLO/search
for i in /path/to/slurm/scripts/HmSe_*.sh; do
    sbatch $i
done
```
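Once the jobs are submitted, you can keep an eye on them with standard slurm tools:

```
# List your queued and running jobs
squeue -u $USER
```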