merge from release-0.1.0 branch for release

qingpeng · May 23, 2017 · 6671d43
2 parents 56e44d4 + a55863d

Showing 238 changed files with 11,595 additions and 2,676 deletions.
diff --git a/.idea/dictionaries/qingpeng.xml b/.idea/dictionaries/qingpeng.xml
diff --git a/.idea/workspace.xml b/.idea/workspace.xml
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -6,6 +6,24 @@ and this project adheres to [Semantic Versioning](http://semver.org/).
 
 ## [Unreleased]
 
+## [0.1.0] - 2017-05-22
+### Added
+- Implement script for prediction using MLlib DataFrame-based API
+- Implement script for prediction evaluation using DataFrame-based API
+- Add testing script and test file for DataFrame-based Spark scripts
+
+### Changed
+- Update model to the one trained on the full-scale training set using the Spark MLlib
+DataFrame-based API, with features 0-3, k-mer, codon, Pfam, Vfam,
+but no IMG virus HMM hits, and with a scaler, sd unit only.
+- Update Spark running version from 2.0.0 to 2.1.0
+- Update prediction pipeline using Spark MLlib DataFrame-based API
+
+### Fixed
+- Fix web application for DataFrame-based API, feature 0-3.
+- Update and fix documentation
+
+
 ## [0.1.0-alpha] - 2017-05-16
 ### Added
 - start this changelog file
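
Reviewer note on the "with scaler, sd unit only" changelog entry: one plausible reading is that each feature is divided by its standard deviation without subtracting the mean (in Spark MLlib terms, a `StandardScaler` with standard-deviation scaling on and mean-centering off). That reading is an assumption, not something this commit confirms. The pure-Python sketch below illustrates the idea only; the function names `fit_std` and `scale_unit_std` are hypothetical and do not come from the repository.

```python
import math


def fit_std(rows):
    """Return the per-feature sample standard deviation (n - 1 denominator)."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(dims)]
    return [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in rows) / (n - 1))
        for j in range(dims)
    ]


def scale_unit_std(rows, stds):
    """Divide each feature by its standard deviation; the mean is NOT subtracted.

    Features with zero standard deviation are mapped to 0.0 (a choice made
    for this sketch).
    """
    return [
        [x / s if s > 0 else 0.0 for x, s in zip(row, stds)]
        for row in rows
    ]


# Two toy feature vectors: the second feature is constant.
rows = [[1.0, 5.0], [3.0, 5.0]]
stds = fit_std(rows)
scaled = scale_unit_std(rows, stds)
```

Skipping the mean subtraction keeps zero entries at zero, so sparse feature vectors stay sparse after scaling.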

diff --git a/Readme.md b/Readme.md
@@ -1,142 +1,59 @@
+# ViCA
 
-#ViCA
-##   Classifying virus from metagenomic and metatransciptomic contigs
+Classifying viruses from metagenomic and metatranscriptomic contigs
 
 
 
-# Dependencies
-* [GenemarkS version 4.29](http://exon.gatech.edu/GeneMark/)
-* [RefTree](https://bitbucket.org/berkeleylab/jgi_reftree)
-* [Task Farmer](http://jgi.goe.gov)
-* [Python v2.74](https://www.python.org/)
-* [Scikit-learn](https://scikits.appspot.com/scikit-learn)
-* [Biopython](http://biopython.org)
-* [simplejson](https://github.com/simplejson/simplejson)
-* [numpy](http://www.numpy.org/)
-* [scipy](http://www.scipy.org/)
-* [matplotlib](http://matplotlib.org/)
-* [khmer v1.4](https://pypi.python.org/pypi/khmer/1.4/)
-* Pfam/Vfam HMMER
-pip install khmer==1.4
-* [Spark]
-
-# Preparation
-## Spark:
-
-You may come cross warnings like below if you run it on your laptop
-```
-17/05/04 14:34:34 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
-17/05/04 14:34:34 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
-```
-This is because the native BLAS is not used. This may affect the speed of 
-model training and prediction.
-
-If you want to avoid this, you may want to build the Spark from source code. 
-
-Please refer to this page:
-http://www.spark.tc/blas-libraries-in-mllib/
 
-
-# User Case
+## Usage
 
 With this package, a model is offered with training using simulated data from
-RefSeq genomes. Tools are provided if the users want to train the model 
-themselves with their own data. 
+RefSeq genomes. 
 
 
-## Model Tuning
-split into training and testing
-```angular2html
-scripts/5_create_training_testing_with_seq_name.py
-```
-model training
-```angular2html
-spark-submit ./scripts/spark_training_model.py training.vect training.vect_model
-```
-
-model evaluation
-```
-spark-submit ./scripts/spark_evaluating_model.py testing.vect training.vect_model/ testing.vect.prediction testing.vect.report testing.vect.prc.png
-```
+Tools are provided for users who want to train the model
+themselves on their own data. Please refer to the documentation.
 
 
-## Training
-
-model training (small data set)
-```angular2html
-spark-submit ./scripts/spark_training_model.py training.vect training.vect_model
-```
-
-## Prediction
 There are three use cases for doing the prediction:
 
-- Large scale prediction - pipeline (in NextFlow) used for prediction on large 
+### 1. Prediction for a large number of sequences
+A pipeline (in Nextflow) used for prediction on a large
 number of sequences using an HPC or cloud system
 
-a. feature extraction using nextflow workflow management
+#### Step 1. Feature extraction using Nextflow workflow management
 ```angular2html
 scripts/feature_extraction.nf
 ```
-b. using spark to do prediction on the vectors
+#### Step 2. Use Spark to run prediction on the vectors
 ```angular2html
-scripts/spark_prediction.py
+$SPARK_PATH/bin/spark-submit spark_prediction.py
+usage: spark_prediction.py [-h] libsvm model scaler outfile
 ```
 
-- Small scale prediction  - downloadable package used for prediction on small
+### 2. Prediction for a small number of sequences
+A downloadable package used for prediction on a small
 number of sequences, run locally (e.g., on a laptop)
 ```angular2html
 ~/scripts/prediction_pipeline_lite.py
 usage: prediction_pipeline_lite.py [-h]
                                    input_file output_file genemark_path
                                    hmmer_path hmmer_db spark_path feature_file
-                                   model_directory
-
+                                   model_directory scaler_directory
 ```
 
-
-- Web Application - a web interface where the users can submit small number of
+### 3. Web Application
+A web interface where users can submit a small number of
 sequences for prediction
 ```angular2html
 ~/web/server.py
 ```
 
-### Other helper scripts
-select vectors from specific features:
-```angular2html
-scripts/pick_vectors_by_feature.py
-```
+## Installation
+Please refer to the documentation for dependency and installation details.
 
 ## Diagram of program
 
 ![Diagram](./doc/images/vica.png)
 
-## Description of scripts
-
-scripts/sklearn_training_model.py : training model using sklearn, with weighted option
-scripts/sklearn_evaluate_model.py: evaluate model using sklearn
-
-HMMER installation
-=======
-Tutorial:
-http://eddylab.org/software/hmmer3/3.1b2/Userguide.pdf
-
-on genepool:
-
-module load hmmer/3.1b2
-
-
-Pfam database
-========
-You can download Pfam database from
-ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam30.0/Pfam-A.hmm.gz
-
-Vfam database
-======
-http://derisilab.ucsf.edu/software/vFam/vFam-B_2014.hmm
-
-Prepare Pfam/Vfam database for HMMER run
-====
-hmmpress Pfam-A.hmm
-hmmpress vFam-B_2014.hmm
-
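
Reviewer note on the new `spark_prediction.py` usage line in the Readme: the four positional arguments (`libsvm model scaler outfile`) could be declared with `argparse` roughly as below. This is a minimal sketch that only reproduces the usage string shown in the Readme; the help texts are guesses, and the actual script in the repository may be organized differently.

```python
import argparse


def build_parser():
    """Build a parser whose usage line matches the one quoted in the Readme."""
    parser = argparse.ArgumentParser(prog="spark_prediction.py")
    # Help strings below are assumptions made for illustration.
    parser.add_argument("libsvm", help="input feature vectors (assumed LIBSVM format)")
    parser.add_argument("model", help="directory of the saved Spark model")
    parser.add_argument("scaler", help="directory of the saved scaler")
    parser.add_argument("outfile", help="file to write predictions to")
    return parser


# Parse an explicit argument list so the sketch runs without a command line.
args = build_parser().parse_args(["in.vect", "model_dir", "scaler_dir", "out.txt"])
print(args.libsvm, args.model, args.scaler, args.outfile)
```

Keeping all four arguments positional mirrors the documented usage; the scaler is a required input alongside the model, consistent with the `scaler_directory` argument added to `prediction_pipeline_lite.py` in this commit.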
 
diff --git a/doc/Pipeline.md b/doc/Pipeline.md
@@ -77,3 +77,50 @@ spark evaluating model!
 ```
  ~/Downloads/spark-2.0.0-bin-hadoop2.7/bin/spark-submit ~/Dropbox/Development/Github/jgi-ViCA/scripts/spark_evaluating_model.py testing.vect training.vect_model/ testing.vect.prediction testing.vect.report testing.vect.prc.png
 ```
+
+Creating vector files for unit testing
+====
+Generate vectors for 100 virus segments and 100 non-virus segments for testing purposes.
+```angular2html
+$ python ~/Dropbox/Development/Github/jgi-ViCA/scripts/subsample_training_100.py training.vect 100 100 training.vect.200
+
+$ more training.vect.200|cut -f 1 -d ' '|grep -c '0'
+111
+$ more training.vect.200|cut -f 1 -d ' '|grep -c '1'
+86
+
+$ python ~/Dropbox/Development/Github/jgi-ViCA/scripts/subsample_training_100.py testing.vect 100 100 testing.vect.200
+$ more testing.vect.200|cut -f 1 -d ' '|grep -c '1'
+108
+$ more testing.vect.200|cut -f 1 -d ' '|grep -c '0'
+112
+
+```
+
+
+Get Spark ML model for prediction
+======
+
+
+```angular2html
+/global/projectb/scratch/qpzhang/Run_Genelearn/Full_nextflow/Test_Spark/Spark_1X> python ~/Github/jgi-ViCA/scripts/subsample_random.py ../../^C
+```
+
+Training model on Cori
+
+Using 1x non-virus - all virus training data
+
+```angular2html
+spark-submit --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=$SCRATCH/spark/ \
+  --conf spark.driver.maxResultSize=60g --driver-memory 60G --executor-memory 60G \
+  /global/homes/q/qpzhang/Github/jgi-ViCA/scripts/spark_training_model_dataframe.py \
+  /global/projectb/scratch/qpzhang/Run_Genelearn/Full_nextflow/Test_Spark/all_segment.fasta.vect.family.training.svmlib.no4.1x_nonvirus \
+  /global/projectb/scratch/qpzhang/Run_Genelearn/Full_nextflow/Test_Spark/all_segment.fasta.vect.family.training.svmlib.no4.1x_nonvirus_model \
+  /global/projectb/scratch/qpzhang/Run_Genelearn/Full_nextflow/Test_Spark/all_segment.fasta.vect.family.training.svmlib.no4.1x_nonvirus_scaler
+
+```
+
+Evaluate the performance of the model
+```angular2html
+ ~/Downloads/spark-2.1.0-bin-hadoop2.7/bin/spark-submit ~/Dropbox/Development/Github/jgi-ViCA/scripts/spark_evaluating_model_dataframe.py ../testing.vect all_segment.fasta.vect.family.training.svmlib.no4.1x_nonvirus_model all_segment.fasta.vect.family.training.svmlib.no4.1x_nonvirus_scaler all_segment.fasta.vect.family.training.svmlib.no4.1x_nonvirus_model.report all_segment.fasta.vect.family.training.svmlib.no4.1x_nonvirus_model.png
+ 
+```
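
Reviewer note on the label counts in the unit-testing section of `doc/Pipeline.md`: `grep -c '0'` counts every line whose first field contains the character `0` anywhere, so a label written as `1.0` or `10` would also be counted as a match. Assuming the `.vect` files are in LIBSVM format (a label followed by `index:value` pairs, which is an assumption about these files), an exact per-label count could be sketched as follows. The helper name `count_labels` is hypothetical.

```python
from collections import Counter


def count_labels(lines):
    """Count class labels in LIBSVM-style lines ("<label> <idx>:<val> ...")."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # Normalize "1" and "1.0" to the same integer label.
        counts[int(float(fields[0]))] += 1
    return counts


sample = ["0 1:0.5", "1 1:0.2 3:1.0", "1.0 2:0.1", ""]
counts = count_labels(sample)
print(dict(counts))
```

To run it on a real file, pass an open file handle, for example `count_labels(open("training.vect.200"))`.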
diff --git a/doc/installation.rst b/doc/installation.rst
@@ -0,0 +1,80 @@
+.. GeneLearn documentation master file, created by
+   sphinx-quickstart on Thu Jul  9 13:38:57 2015.
+   You can adapt this file completely to your liking, but it should at least
+   contain the root `toctree` directive.
+
+Welcome to GeneLearn's documentation!
+=====================================
+
+Contents:
+
+.. toctree::
+   :maxdepth: 2
+
+
+
+Indices and tables
+==================
+
+* :ref:`genindex`
+* :ref:`modindex`
+* :ref:`search`
+
+
+
+
+# Dependencies
+* [GenemarkS version 4.29](http://exon.gatech.edu/GeneMark/)
+* [RefTree](https://bitbucket.org/berkeleylab/jgi_reftree)
+* [Task Farmer](http://jgi.goe.gov)
+* [Python v2.74](https://www.python.org/)
+* [Scikit-learn](https://scikits.appspot.com/scikit-learn)
+* [Biopython](http://biopython.org)
+* [simplejson](https://github.com/simplejson/simplejson)
+* [numpy](http://www.numpy.org/)
+* [scipy](http://www.scipy.org/)
+* [matplotlib](http://matplotlib.org/)
+* [khmer v1.4](https://pypi.python.org/pypi/khmer/1.4/) pip install khmer==1.4
+* [Pfam/Vfam HMMER]
+* [Spark 2.1.0]
+
+# Preparation
+## Spark:
+
+You may come across warnings like the ones below if you run it on your laptop:
+```
+17/05/04 14:34:34 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
+17/05/04 14:34:34 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
+```
+This is because the native BLAS is not used. This may affect the speed of
+model training and prediction.
+
+If you want to avoid this, you may want to build Spark from source.
+
+Please refer to this page:
+http://www.spark.tc/blas-libraries-in-mllib/
+
+
+HMMER installation
+=======
+Tutorial:
+http://eddylab.org/software/hmmer3/3.1b2/Userguide.pdf
+
+on genepool:
+
+module load hmmer/3.1b2
+
+
+Pfam database
+========
+You can download the Pfam database from
+ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam30.0/Pfam-A.hmm.gz
+
+Vfam database
+======
+http://derisilab.ucsf.edu/software/vFam/vFam-B_2014.hmm
+
+Prepare Pfam/Vfam database for HMMER run
+====
+hmmpress Pfam-A.hmm
+hmmpress vFam-B_2014.hmm
diff --git a/doc/model_training.rst b/doc/model_training.rst
@@ -0,0 +1,45 @@
+.. GeneLearn documentation master file, created by
+   sphinx-quickstart on Thu Jul  9 13:38:57 2015.
+   You can adapt this file completely to your liking, but it should at least
+   contain the root `toctree` directive.
+
+Welcome to GeneLearn's documentation!
+=====================================
+
+Contents:
+
+.. toctree::
+   :maxdepth: 2
+
+
+
+Indices and tables
+==================
+
+* :ref:`genindex`
+* :ref:`modindex`
+* :ref:`search`
+
+
+## Model Tuning
+split into training and testing
+```angular2html
+scripts/5_create_training_testing_with_seq_name.py
+```
+model training
+```angular2html
+$SPARKPATH/bin/spark-submit ./scripts/spark_training_model.py training.vect training.vect_model
+```
+
+model evaluation
+```
+$SPARKPATH/bin/spark-submit ./scripts/spark_evaluating_model.py testing.vect training.vect_model/ testing.vect.prediction testing.vect.report testing.vect.prc.png
+```
+
+
+## Training
+
+model training (small data set)
+```angular2html
+$SPARKPATH/bin/spark-submit ./scripts/spark_training_model.py training.vect training.vect_model
+```