Commit
merge from release-0.1.0 branch for release
Showing 238 changed files with 11,595 additions and 2,676 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,142 +1,59 @@ | ||
# ViCA | ||
|
||
#ViCA | ||
## Classifying virus from metagenomic and metatranscriptomic contigs | ||
Classifying virus from metagenomic and metatranscriptomic contigs | ||
|
||
|
||
|
||
# Dependencies | ||
* [GenemarkS version 4.29](http://exon.gatech.edu/GeneMark/) | ||
* [RefTree](https://bitbucket.org/berkeleylab/jgi_reftree) | ||
* [Task Farmer](http://jgi.goe.gov) | ||
* [Python v2.74](https://www.python.org/) | ||
* [Scikit-learn](https://scikits.appspot.com/scikit-learn) | ||
* [Biopython](http://biopython.org) | ||
* [simplejson](https://github.com/simplejson/simplejson) | ||
* [numpy](http://www.numpy.org/) | ||
* [scipy](http://www.scipy.org/) | ||
* [matplotlib](http://matplotlib.org/) | ||
* [khmer v1.4](https://pypi.python.org/pypi/khmer/1.4/) | ||
* Pfam/Vfam HMMER | ||
pip install khmer==1.4 | ||
* [Spark] | ||
|
||
# Preparation | ||
## Spark: | ||
|
||
You may come across warnings like the ones below if you run it on your laptop: | ||
``` | ||
17/05/04 14:34:34 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS | ||
17/05/04 14:34:34 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS | ||
``` | ||
These warnings appear because Spark could not load a native BLAS implementation | ||
and falls back to a slower JVM one. This may affect the speed of model training and prediction. | ||
|
||
If you want to avoid this, you can build Spark from source with native BLAS support. | ||
|
||
Please refer to this page: | ||
http://www.spark.tc/blas-libraries-in-mllib/ | ||
|
||
|
||
# Use Case | ||
## Usage | ||
|
||
This package includes a model trained on simulated data from | ||
RefSeq genomes. Tools are provided if the users want to train the model | ||
themselves with their own data. | ||
RefSeq genomes. | ||
|
||
|
||
## Model Tuning | ||
split into training and testing | ||
```angular2html | ||
scripts/5_create_training_testing_with_seq_name.py | ||
``` | ||
model training | ||
```angular2html | ||
spark-submit ./scripts/spark_training_model.py training.vect training.vect_model | ||
``` | ||
|
||
model evaluation | ||
``` | ||
spark-submit ./scripts/spark_evaluating_model.py testing.vect training.vect_model/ testing.vect.prediction testing.vect.report testing.vect.prc.png | ||
``` | ||
Tools are provided for users who want to train the model | ||
themselves on their own data. Please refer to the documentation. | ||
|
||
|
||
## Training | ||
|
||
model training (small data set) | ||
```angular2html | ||
spark-submit ./scripts/spark_training_model.py training.vect training.vect_model | ||
``` | ||
|
||
## Prediction | ||
There are three use cases for doing the prediction: | ||
|
||
- Large scale prediction - pipeline (in NextFlow) used for prediction on large | ||
### 1. Prediction for large number of sequences | ||
pipeline (in NextFlow) used for prediction on large | ||
number of sequences using HPC or Cloud system | ||
|
||
a. feature extraction using nextflow workflow management | ||
#### Step 1. feature extraction using Nextflow workflow management | ||
```angular2html | ||
scripts/feature_extraction.nf | ||
``` | ||
b. using spark to do prediction on the vectors | ||
#### Step 2. using spark to do prediction on the vectors | ||
```angular2html | ||
scripts/spark_prediction.py | ||
$SPARK_PATH/bin/spark-submit spark_prediction.py | ||
usage: spark_prediction.py [-h] libsvm model scaler outfile | ||
``` | ||
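The invocation above can be assembled programmatically. The following is a minimal sketch, assuming the four positional arguments from the usage line (`libsvm model scaler outfile`) and a Spark installation rooted at `spark_path`. The helper name `build_prediction_command`, the default `/opt/spark` location, and the example file names are hypothetical and not part of this repository.

```python
import os


def build_prediction_command(spark_path, libsvm, model, scaler, outfile,
                             script="scripts/spark_prediction.py"):
    """Return the argv list for spark-submit with the four positional arguments."""
    spark_submit = os.path.join(spark_path, "bin", "spark-submit")
    # Order follows the usage string: libsvm model scaler outfile
    return [spark_submit, script, libsvm, model, scaler, outfile]


if __name__ == "__main__":
    cmd = build_prediction_command(
        os.environ.get("SPARK_PATH", "/opt/spark"),  # hypothetical default location
        "contigs.vect", "model_dir", "scaler_dir", "contigs.prediction")
    print(" ".join(cmd))
    # To actually launch the job, pass the list to subprocess.check_call(cmd).
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.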
|
||
- Small scale prediction - downloadable package used for prediction on small | ||
### 2. Prediction for small number of sequences | ||
downloadable package used for prediction on small | ||
number of sequences running locally (like a laptop) | ||
```angular2html | ||
~/scripts/prediction_pipeline_lite.py | ||
usage: prediction_pipeline_lite.py [-h] | ||
input_file output_file genemark_path | ||
hmmer_path hmmer_db spark_path feature_file | ||
model_directory | ||
model_directory scaler_directory | ||
``` | ||
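The usage line above lists only positional arguments. As an illustration of how such an interface is typically declared, here is a sketch using `argparse`. It assumes the nine-argument form (including `scaler_directory`) and is not taken from `prediction_pipeline_lite.py` itself; the `POSITIONAL_ARGS` list and `build_parser` helper are hypothetical names.

```python
import argparse

# Argument names copied from the usage string; the order matters for positionals.
POSITIONAL_ARGS = [
    "input_file", "output_file", "genemark_path", "hmmer_path", "hmmer_db",
    "spark_path", "feature_file", "model_directory", "scaler_directory",
]


def build_parser():
    """Build a parser that accepts the nine positional arguments in order."""
    parser = argparse.ArgumentParser(prog="prediction_pipeline_lite.py")
    for name in POSITIONAL_ARGS:
        parser.add_argument(name)
    return parser


if __name__ == "__main__":
    # Placeholder values only; real paths depend on the local installation.
    demo = ["in.fa", "out.txt", "gm", "hmmer", "db.hmm", "spark", "feat", "model", "scaler"]
    print(build_parser().parse_args(demo))
```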
|
||
|
||
- Web Application - a web interface where the users can submit small number of | ||
### 3. Web Application | ||
a web interface where the users can submit small number of | ||
sequences for prediction | ||
```angular2html | ||
~/web/server.py | ||
``` | ||
|
||
### Other helper scripts | ||
select vectors from specific features: | ||
```angular2html | ||
scripts/pick_vectors_by_feature.py | ||
``` | ||
## Installation | ||
Please refer to documentation for dependency and installation details | ||
|
||
## Diagram of program | ||
|
||
 | ||
|
||
## Description of scripts | ||
|
||
scripts/sklearn_training_model.py : training model using sklearn, with weighted option | ||
scripts/sklearn_evaluate_model.py: evaluate model using sklearn | ||
|
||
HMMER installation | ||
======= | ||
Tutorial: | ||
http://eddylab.org/software/hmmer3/3.1b2/Userguide.pdf | ||
|
||
on genepool: | ||
|
||
module load hmmer/3.1b2 | ||
|
||
|
||
Pfam database | ||
======== | ||
You can download the Pfam database from | ||
ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam30.0/Pfam-A.hmm.gz | ||
|
||
Vfam database | ||
====== | ||
http://derisilab.ucsf.edu/software/vFam/vFam-B_2014.hmm | ||
|
||
Prepare Pfam/Vfam database for HMMER run | ||
==== | ||
hmmpress Pfam-A.hmm | ||
hmmpress vFam-B_2014.hmm | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,80 @@ | ||
.. GeneLearn documentation master file, created by | ||
sphinx-quickstart on Thu Jul 9 13:38:57 2015. | ||
You can adapt this file completely to your liking, but it should at least | ||
contain the root `toctree` directive. | ||
Welcome to GeneLearn's documentation! | ||
===================================== | ||
|
||
Contents: | ||
|
||
.. toctree:: | ||
:maxdepth: 2 | ||
|
||
|
||
|
||
Indices and tables | ||
================== | ||
|
||
* :ref:`genindex` | ||
* :ref:`modindex` | ||
* :ref:`search` | ||
|
||
|
||
|
||
|
||
# Dependencies | ||
* [GenemarkS version 4.29](http://exon.gatech.edu/GeneMark/) | ||
* [RefTree](https://bitbucket.org/berkeleylab/jgi_reftree) | ||
* [Task Farmer](http://jgi.goe.gov) | ||
* [Python v2.74](https://www.python.org/) | ||
* [Scikit-learn](https://scikits.appspot.com/scikit-learn) | ||
* [Biopython](http://biopython.org) | ||
* [simplejson](https://github.com/simplejson/simplejson) | ||
* [numpy](http://www.numpy.org/) | ||
* [scipy](http://www.scipy.org/) | ||
* [matplotlib](http://matplotlib.org/) | ||
* [khmer v1.4](https://pypi.python.org/pypi/khmer/1.4/) pip install khmer==1.4 | ||
* [Pfam/Vfam HMMER] | ||
* [Spark 2.1.0] | ||
|
||
# Preparation | ||
## Spark: | ||
|
||
You may come across warnings like the ones below if you run it on your laptop: | ||
``` | ||
17/05/04 14:34:34 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS | ||
17/05/04 14:34:34 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS | ||
``` | ||
These warnings appear because Spark could not load a native BLAS implementation | ||
and falls back to a slower JVM one. This may affect the speed of model training and prediction. | ||
|
||
If you want to avoid this, you can build Spark from source with native BLAS support. | ||
|
||
Please refer to this page: | ||
http://www.spark.tc/blas-libraries-in-mllib/ | ||
|
||
|
||
HMMER installation | ||
======= | ||
Tutorial: | ||
http://eddylab.org/software/hmmer3/3.1b2/Userguide.pdf | ||
|
||
on genepool: | ||
|
||
module load hmmer/3.1b2 | ||
|
||
|
||
Pfam database | ||
======== | ||
You can download the Pfam database from | ||
ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam30.0/Pfam-A.hmm.gz | ||
|
||
Vfam database | ||
====== | ||
http://derisilab.ucsf.edu/software/vFam/vFam-B_2014.hmm | ||
|
||
Prepare Pfam/Vfam database for HMMER run | ||
==== | ||
hmmpress Pfam-A.hmm | ||
hmmpress vFam-B_2014.hmm |
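`hmmpress` writes four binary index files next to the input database, with the suffixes `.h3f`, `.h3i`, `.h3m` and `.h3p`. A small check such as the sketch below can confirm that a database has been pressed before running the pipeline. The helper name `missing_index_files` is hypothetical and not part of this repository.

```python
import os
import tempfile

# Suffixes of the index files that hmmpress creates alongside the .hmm file.
HMMPRESS_SUFFIXES = (".h3f", ".h3i", ".h3m", ".h3p")


def missing_index_files(hmm_path):
    """Return the hmmpress index files that do not exist for hmm_path."""
    return [hmm_path + suffix for suffix in HMMPRESS_SUFFIXES
            if not os.path.exists(hmm_path + suffix)]


if __name__ == "__main__":
    # Demonstration in a temporary directory with an unpressed placeholder file.
    workdir = tempfile.mkdtemp()
    hmm = os.path.join(workdir, "Pfam-A.hmm")
    open(hmm, "w").close()
    print(missing_index_files(hmm))  # all four are missing until hmmpress is run
```

An empty list means the database is ready for use; otherwise run `hmmpress` on the file first.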
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,45 @@ | ||
.. GeneLearn documentation master file, created by | ||
sphinx-quickstart on Thu Jul 9 13:38:57 2015. | ||
You can adapt this file completely to your liking, but it should at least | ||
contain the root `toctree` directive. | ||
Welcome to GeneLearn's documentation! | ||
===================================== | ||
|
||
Contents: | ||
|
||
.. toctree:: | ||
:maxdepth: 2 | ||
|
||
|
||
|
||
Indices and tables | ||
================== | ||
|
||
* :ref:`genindex` | ||
* :ref:`modindex` | ||
* :ref:`search` | ||
|
||
|
||
## Model Tuning | ||
split into training and testing | ||
```angular2html | ||
scripts/5_create_training_testing_with_seq_name.py | ||
``` | ||
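To illustrate the idea of the split step, the sketch below randomly partitions LIBSVM-style lines (one labelled vector per line) into training and testing sets. It is not the logic of `5_create_training_testing_with_seq_name.py`, which is not shown here. The function name, the 20% test fraction, and the fixed seed are assumptions chosen for the example.

```python
import random


def split_vectors(lines, test_fraction=0.2, seed=42):
    """Randomly partition LIBSVM-style lines into (training, testing) lists."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    shuffled = [ln for ln in lines if ln.strip()]  # drop blank lines
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    example = ["1 1:0.5 2:0.1", "0 1:0.2 2:0.9", "1 1:0.7 2:0.3",
               "0 1:0.1 2:0.8", "1 1:0.6 2:0.2"]
    train, test = split_vectors(example)
    print(len(train), len(test))
```

The two returned lists could then be written to `training.vect` and `testing.vect` for the steps that follow.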
model training | ||
```angular2html | ||
$SPARKPATH/bin/spark-submit ./scripts/spark_training_model.py training.vect training.vect_model | ||
``` | ||
|
||
model evaluation | ||
``` | ||
$SPARKPATH/bin/spark-submit ./scripts/spark_evaluating_model.py testing.vect training.vect_model/ testing.vect.prediction testing.vect.report testing.vect.prc.png | ||
``` | ||
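The output name `testing.vect.prc.png` suggests that the evaluation plots a precision-recall curve; that reading is an assumption. As a plain-Python illustration of what such a curve contains, the sketch below computes precision and recall at every distinct score threshold. It does not use Spark and is not the implementation in `spark_evaluating_model.py`.

```python
def precision_recall_points(labels, scores):
    """Return (threshold, precision, recall) tuples, one per distinct score.

    labels: iterable of 0/1 ground-truth values (1 = positive class).
    scores: predicted scores, where higher means more likely positive.
    """
    total_pos = sum(1 for y in labels if y == 1)
    points = []
    for t in sorted(set(scores), reverse=True):
        predicted = [s >= t for s in scores]
        tp = sum(1 for y, p in zip(labels, predicted) if p and y == 1)
        fp = sum(1 for y, p in zip(labels, predicted) if p and y != 1)
        precision = tp / float(tp + fp)  # tp + fp >= 1 because t is one of the scores
        recall = tp / float(total_pos) if total_pos else 0.0
        points.append((t, precision, recall))
    return points


if __name__ == "__main__":
    for point in precision_recall_points([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]):
        print(point)
```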
|
||
|
||
## Training | ||
|
||
model training (small data set) | ||
```angular2html | ||
$SPARKPATH/bin/spark-submit ./scripts/spark_training_model.py training.vect training.vect_model | ||
``` |