merge from release-0.1.0 branch for release
qingpeng committed May 23, 2017
2 parents 56e44d4 + a55863d commit 6671d43
Showing 238 changed files with 11,595 additions and 2,676 deletions.
3 changes: 3 additions & 0 deletions .idea/dictionaries/qingpeng.xml

638 changes: 270 additions & 368 deletions .idea/workspace.xml

18 changes: 18 additions & 0 deletions CHANGELOG.md
@@ -6,6 +6,24 @@ and this project adheres to [Semantic Versioning](http://semver.org/).

## [Unreleased]

## [0.1.0] - 2017-05-22
### Added
- Implement script for prediction using MLlib DataFrame-based API
- Implement script for prediction evaluation using DataFrame-based API
- Add testing script and test file for DataFrame-based Spark scripts

### Changed
- Update the model to the one trained on the full-scale training set, using the
Spark MLlib DataFrame-based API, with features 0-3, k-mer, codon, Pfam, vFam
but no IMG virus HMM hits, and with a scaler (unit standard deviation only).
- Update Spark running version from 2.0.0 to 2.1.0
- Update prediction pipeline using Spark MLlib DataFrame-based API

### Fixed
- Fix web application for DataFrame-based API, feature 0-3.
- Update and fix documentation


## [0.1.0-alpha] - 2017-05-16
### Added
- start this changelog file
121 changes: 19 additions & 102 deletions Readme.md
@@ -1,142 +1,59 @@
# ViCA

#ViCA
## Classifying virus from metagenomic and metatranscriptomic contigs
Classifying virus from metagenomic and metatranscriptomic contigs



# Dependencies
* [GenemarkS version 4.29](http://exon.gatech.edu/GeneMark/)
* [RefTree](https://bitbucket.org/berkeleylab/jgi_reftree)
* [Task Farmer](http://jgi.goe.gov)
* [Python v2.74](https://www.python.org/)
* [Scikit-learn](https://scikits.appspot.com/scikit-learn)
* [Biopython](http://biopython.org)
* [simplejson](https://github.com/simplejson/simplejson)
* [numpy](http://www.numpy.org/)
* [scipy](http://www.scipy.org/)
* [matplotlib](http://matplotlib.org/)
* [khmer v1.4](https://pypi.python.org/pypi/khmer/1.4/)
* Pfam/Vfam HMMER
pip install khmer==1.4
* [Spark]

# Preparation
## Spark:

You may come across warnings like the ones below if you run it on your laptop:
```
17/05/04 14:34:34 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
17/05/04 14:34:34 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
```
These warnings appear because the native BLAS is not used, which may slow down
model training and prediction.

To avoid this, you can build Spark from source.

Please refer to this page:
http://www.spark.tc/blas-libraries-in-mllib/


# User Case
## Usage

With this package, a model is offered with training using simulated data from
RefSeq genomes. Tools are provided if the users want to train the model
themselves with their own data.
RefSeq genomes.


## Model Tuning
split into training and testing
```angular2html
scripts/5_create_training_testing_with_seq_name.py
```
model training
```angular2html
spark-submit ./scripts/spark_training_model.py training.vect training.vect_model
```

model evaluation
```
spark-submit ./scripts/spark_evaluating_model.py testing.vect training.vect_model/ testing.vect.prediction testing.vect.report testing.vect.prc.png
```
Tools are provided if the users want to train the model
themselves with their own data. Please refer to documentation.


## Training

model training (small data set)
```angular2html
spark-submit ./scripts/spark_training_model.py training.vect training.vect_model
```

## Prediction
There are three use cases for doing the prediction:

- Large scale prediction - pipeline (in NextFlow) used for prediction on large
### 1. Prediction for large number of sequences
pipeline (in NextFlow) used for prediction on large
number of sequences using HPC or Cloud system

a. feature extraction using nextflow workflow management
#### Step 1. feature extraction using Nextflow workflow management
```angular2html
scripts/feature_extraction.nf
```
b. using spark to do prediction on the vectors
#### Step 2. using spark to do prediction on the vectors
```angular2html
scripts/spark_prediction.py
$SPARK_PATH/bin/spark-submit spark_prediction.py
usage: spark_prediction.py [-h] libsvm model scaler outfile
```
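The usage line above lists four positional arguments. As a rough illustration only (this is not the actual source of `spark_prediction.py`), a parser with the same command-line shape could be sketched with `argparse`. The help strings and the example file names are assumptions about what each argument means.

```python
import argparse


def build_parser():
    """Build a parser whose positional arguments mirror the usage line."""
    parser = argparse.ArgumentParser(prog="spark_prediction.py")
    # The help strings below are guesses at each argument's role.
    parser.add_argument("libsvm", help="feature vectors in LIBSVM format (assumed)")
    parser.add_argument("model", help="directory of the saved model (assumed)")
    parser.add_argument("scaler", help="directory of the saved scaler (assumed)")
    parser.add_argument("outfile", help="file to write predictions to (assumed)")
    return parser


# Hypothetical file names, passed explicitly instead of reading sys.argv.
args = build_parser().parse_args(
    ["testing.vect", "model_dir", "scaler_dir", "predictions.txt"]
)
print(args.libsvm, args.model, args.scaler, args.outfile)
```

The order of the positional arguments matters here: the vector file comes first, then the model, then the scaler, then the output file.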

- Small scale prediction - downloadable package used for prediction on small
### 2. Prediction for small number of sequences
downloadable package used for prediction on small
number of sequences running locally (like a laptop)
```angular2html
~/scripts/prediction_pipeline_lite.py
usage: prediction_pipeline_lite.py [-h]
input_file output_file genemark_path
hmmer_path hmmer_db spark_path feature_file
model_directory
model_directory scaler_directory
```


- Web Application - a web interface where the users can submit small number of
### 3. Web Application
a web interface where the users can submit small number of
sequences for prediction
```angular2html
~/web/server.py
```

### Other helper scripts
select vectors from specific features:
```angular2html
scripts/pick_vectors_by_feature.py
```
## Installation
Please refer to documentation for dependency and installation details

## Diagram of program

![Diagram](./doc/images/vica.png)

## Description of scripts

scripts/sklearn_training_model.py : training model using sklearn, with weighted option
scripts/sklearn_evaluate_model.py: evaluate model using sklearn

HMMER installation
=======
Tutorial:
http://eddylab.org/software/hmmer3/3.1b2/Userguide.pdf

on genepool:

module load hmmer/3.1b2


Pfam database
========
You can download the Pfam database from
ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam30.0/Pfam-A.hmm.gz

Vfam database
======
http://derisilab.ucsf.edu/software/vFam/vFam-B_2014.hmm

Prepare Pfam/Vfam database for HMMER run
====
hmmpress Pfam-A.hmm
hmmpress vFam-B_2014.hmm


47 changes: 47 additions & 0 deletions doc/Pipeline.md
@@ -77,3 +77,50 @@ spark evaluating model!
```
~/Downloads/spark-2.0.0-bin-hadoop2.7/bin/spark-submit ~/Dropbox/Development/Github/jgi-ViCA/scripts/spark_evaluating_model.py testing.vect training.vect_model/ testing.vect.prediction testing.vect.report testing.vect.prc.png
```

creating vector files for unit-testing
====
Generate vectors for 100 virus segments and 100 non-virus segments for testing purposes.
```angular2html
$ python ~/Dropbox/Development/Github/jgi-ViCA/scripts/subsample_training_100.py training.vect 100 100 training.vect.200
$ more training.vect.200|cut -f 1 -d ' '|grep -c '0'
111
$ more training.vect.200|cut -f 1 -d ' '|grep -c '1'
86
$ python ~/Dropbox/Development/Github/jgi-ViCA/scripts/subsample_training_100.py testing.vect 100 100 testing.vect.200
$ more testing.vect.200|cut -f 1 -d ' '|grep -c '1'
108
$ more testing.vect.200|cut -f 1 -d ' '|grep -c '0'
112
```
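Note that `grep -c '0'` counts every line whose first field contains the character `0` anywhere, so a label such as `10` would be counted as well. A minimal Python sketch for counting each label exactly is shown below. It assumes the vector files use the LIBSVM layout (`<label> <index>:<value> ...`); the function name and the in-memory example are hypothetical.

```python
from collections import Counter


def count_labels(lines):
    """Count the first whitespace-separated field (the label) of each line."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts


# Tiny in-memory example in LIBSVM layout (made-up values).
example = ["1 1:0.5 2:0.1", "0 1:0.2", "10 3:0.7", "0 2:0.9"]
print(count_labels(example))
```

To run it on a real file, pass an open file handle, for example `count_labels(open("training.vect.200"))`.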


Get Spark ML model for prediction
======


```angular2html
/global/projectb/scratch/qpzhang/Run_Genelearn/Full_nextflow/Test_Spark/Spark_1X> python ~/Github/jgi-ViCA/scripts/subsample_random.py ../../^C
```

Training the model on Cori, using the 1x non-virus - all virus training data:
```angular2html
spark-submit --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=$SCRATCH/spark/ \
  --conf spark.driver.maxResultSize=60g \
  --driver-memory 60G --executor-memory 60G \
  /global/homes/q/qpzhang/Github/jgi-ViCA/scripts/spark_training_model_dataframe.py \
  /global/projectb/scratch/qpzhang/Run_Genelearn/Full_nextflow/Test_Spark/all_segment.fasta.vect.family.training.svmlib.no4.1x_nonvirus \
  /global/projectb/scratch/qpzhang/Run_Genelearn/Full_nextflow/Test_Spark/all_segment.fasta.vect.family.training.svmlib.no4.1x_nonvirus_model \
  /global/projectb/scratch/qpzhang/Run_Genelearn/Full_nextflow/Test_Spark/all_segment.fasta.vect.family.training.svmlib.no4.1x_nonvirus_scaler
```
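The training command above is long and easy to mistype. The sketch below assembles the same argument list in Python so that the model and scaler paths are derived from the vector file path. The helper name `build_training_command` and its parameters are hypothetical; only the `spark-submit` options and the `_model` / `_scaler` suffixes are taken from the command above.

```python
def build_training_command(script, vectors, event_dir, memory="60G",
                           spark_submit="spark-submit"):
    """Return the spark-submit argument list for a training run.

    The model and scaler output paths are built by appending "_model" and
    "_scaler" to the vector file path, following the command shown above.
    """
    return [
        spark_submit,
        "--conf", "spark.eventLog.enabled=true",
        "--conf", "spark.eventLog.dir=" + event_dir,
        "--conf", "spark.driver.maxResultSize=" + memory.lower(),
        "--driver-memory", memory,
        "--executor-memory", memory,
        script,
        vectors,
        vectors + "_model",
        vectors + "_scaler",
    ]


# Hypothetical short paths, used only to show the resulting command line.
cmd = build_training_command(
    "scripts/spark_training_model_dataframe.py",
    "training.svmlib",
    "/tmp/spark-events/",
)
print(" ".join(cmd))
```

The list can then be passed to `subprocess.run(cmd, check=True)` on a machine where `spark-submit` is on the `PATH`.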

Evaluate the performance of the model
```angular2html
~/Downloads/spark-2.1.0-bin-hadoop2.7/bin/spark-submit ~/Dropbox/Development/Github/jgi-ViCA/scripts/spark_evaluating_model_dataframe.py ../testing.vect all_segment.fasta.vect.family.training.svmlib.no4.1x_nonvirus_model all_segment.fasta.vect.family.training.svmlib.no4.1x_nonvirus_scaler all_segment.fasta.vect.family.training.svmlib.no4.1x_nonvirus_model.report all_segment.fasta.vect.family.training.svmlib.no4.1x_nonvirus_model.png
```
80 changes: 80 additions & 0 deletions doc/installation.rst
@@ -0,0 +1,80 @@
.. GeneLearn documentation master file, created by
sphinx-quickstart on Thu Jul 9 13:38:57 2015.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
Welcome to GeneLearn's documentation!
=====================================

Contents:

.. toctree::
:maxdepth: 2



Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`




# Dependencies
* [GenemarkS version 4.29](http://exon.gatech.edu/GeneMark/)
* [RefTree](https://bitbucket.org/berkeleylab/jgi_reftree)
* [Task Farmer](http://jgi.goe.gov)
* [Python v2.74](https://www.python.org/)
* [Scikit-learn](https://scikits.appspot.com/scikit-learn)
* [Biopython](http://biopython.org)
* [simplejson](https://github.com/simplejson/simplejson)
* [numpy](http://www.numpy.org/)
* [scipy](http://www.scipy.org/)
* [matplotlib](http://matplotlib.org/)
* [khmer v1.4](https://pypi.python.org/pypi/khmer/1.4/) pip install khmer==1.4
* [Pfam/Vfam HMMER]
* [Spark 2.1.0]

# Preparation
## Spark:

You may come across warnings like the ones below if you run it on your laptop:
```
17/05/04 14:34:34 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
17/05/04 14:34:34 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
```
These warnings appear because the native BLAS is not used, which may slow down
model training and prediction.

To avoid this, you can build Spark from source.

Please refer to this page:
http://www.spark.tc/blas-libraries-in-mllib/


HMMER installation
=======
Tutorial:
http://eddylab.org/software/hmmer3/3.1b2/Userguide.pdf

on genepool:

module load hmmer/3.1b2


Pfam database
========
You can download the Pfam database from
ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam30.0/Pfam-A.hmm.gz

Vfam database
======
http://derisilab.ucsf.edu/software/vFam/vFam-B_2014.hmm

Prepare Pfam/Vfam database for HMMER run
====
hmmpress Pfam-A.hmm
hmmpress vFam-B_2014.hmm
45 changes: 45 additions & 0 deletions doc/model_training.rst
@@ -0,0 +1,45 @@
.. GeneLearn documentation master file, created by
sphinx-quickstart on Thu Jul 9 13:38:57 2015.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
Welcome to GeneLearn's documentation!
=====================================

Contents:

.. toctree::
:maxdepth: 2



Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`


## Model Tuning
split into training and testing
```angular2html
scripts/5_create_training_testing_with_seq_name.py
```
model training
```angular2html
$SPARKPATH/bin/spark-submit ./scripts/spark_training_model.py training.vect training.vect_model
```

model evaluation
```
$SPARKPATH/bin/spark-submit ./scripts/spark_evaluating_model.py testing.vect training.vect_model/ testing.vect.prediction testing.vect.report testing.vect.prc.png
```


## Training

model training (small data set)
```angular2html
$SPARKPATH/bin/spark-submit ./scripts/spark_training_model.py training.vect training.vect_model
```