bradenkatzman
diff --git a/‎README.md
Lines changed: 149 additions & 49 deletions b/‎README.md
diff --git a/‎RNASeqData.py
Lines changed: 8 additions & 6 deletions b/‎RNASeqData.py
diff --git a/‎RNASeq_SingleCellClassification_Results/!DS & !CV/1knn_04-19-2016_08-43-05.txt
Lines changed: 11 additions & 0 deletions b/‎RNASeq_SingleCellClassification_Results/!DS & !CV/1knn_04-19-2016_08-43-05.txt
diff --git a/‎RNASeq_SingleCellClassification_Results/!DS & !CV/1knn_04-19-2016_08-44-20.txt
Lines changed: 11 additions & 0 deletions b/‎RNASeq_SingleCellClassification_Results/!DS & !CV/1knn_04-19-2016_08-44-20.txt
diff --git a/‎RNASeq_SingleCellClassification_Results/!DS & !CV/1knn_04-19-2016_09-35-26.txt
Lines changed: 11 additions & 0 deletions b/‎RNASeq_SingleCellClassification_Results/!DS & !CV/1knn_04-19-2016_09-35-26.txt
diff --git a/‎RNASeq_SingleCellClassification_Results/!DS & !CV/1knn_04-19-2016_09-36-47.txt
Lines changed: 11 additions & 0 deletions b/‎RNASeq_SingleCellClassification_Results/!DS & !CV/1knn_04-19-2016_09-36-47.txt
diff --git a/‎RNASeq_SingleCellClassification_Results/!DS & !CV/1knn_04-19-2016_09-38-04.txt
Lines changed: 11 additions & 0 deletions b/‎RNASeq_SingleCellClassification_Results/!DS & !CV/1knn_04-19-2016_09-38-04.txt
diff --git a/‎RNASeq_SingleCellClassification_Results/!DS & !CV/2knn_04-19-2016_09-39-32.txt
Lines changed: 11 additions & 0 deletions b/‎RNASeq_SingleCellClassification_Results/!DS & !CV/2knn_04-19-2016_09-39-32.txt
diff --git a/‎RNASeq_SingleCellClassification_Results/!DS & !CV/2knn_04-19-2016_09-40-54.txt
Lines changed: 11 additions & 0 deletions b/‎RNASeq_SingleCellClassification_Results/!DS & !CV/2knn_04-19-2016_09-40-54.txt
diff --git a/‎RNASeq_SingleCellClassification_Results/!DS & !CV/2knn_04-19-2016_09-42-17.txt
Lines changed: 11 additions & 0 deletions b/‎RNASeq_SingleCellClassification_Results/!DS & !CV/2knn_04-19-2016_09-42-17.txt
@@ -1,49 +1,149 @@
-```html
-<h3>Braden Katzman & Emily Berghoff<br>
-Columbia University Spring 2016<br>
-COMS 4761 - Computational Genomics<br>
-Professor Itsik Pe'er<br>
-Final Project</h3>
-
-
-<br><br>
-<div>
-	<h4>Description (3/13/16):</h4>
-	<p>
-	This program classifies single cell types from RNA-seq data in mice. Our approach is to use supervised machine learning algorithms on a single cell RNA-seq dataset with a 80% marked (training) and 20% unmarked (testing) partition. We use 5 machine learning algorithms with the hopes of determining the best supervised learning algorithm to classify unmarked cells.
-	</p>
-</div>
-
-<br><br>
-<div>
-	<h4>Technical Notes:</h4>
-	<p>
-	This project is written in Python using version 2.7. All machine learning algorithms are from the scikit-learn library which uses SciPy, NumPy and matplotlib to implement data mining and data analysis tools. Preprocessing of data is also implemented using scikit
-	</p>
-	<p><em>Scikit-learn: http://scikit-learn.org/stable/</em></p>
-</div>
-
-
-<br><br>
-<div>
-<h4>Machine Learning Algorithms:</h4>
-<ul>
-	<li>Naive Bayes</li>
-	<li>SVMs</li>
-	<li>Random Forests, Decision Trees</li>
-	<li>KNNs</li>
-	<li>Neural Networks</li>
-</ul>
-</div>
-
-<div>
-	<h4>Conclusions</h4>
-
-</div>
-
-<div>
-	<h4>Cell Classifications and RNA-Seq Data obtained from:</h4>
-
-	<p><em>Zeisel, A. et al. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347, 1138-1142 (2015). </em></p>
-</div>
-```
+### Braden Katzman & Emily Berghoff
+### Columbia University Spring 2016
+### COMS W4761 - Computational Genomics
+### Professor Itsik Pe'er
+### Final Project
+
+
+
+## Abstract:
+
+	This program classifies single-cell types from mouse RNA-seq data. Our approach applies supervised machine learning algorithms
+	to a single-cell RNA-seq dataset with a marked (labeled) training set and an unmarked testing set. We compared 4 machine learning
+	algorithms to determine which supervised learning algorithm best classifies unmarked cells.
+
+## Machine Learning Algorithms:
+- Support Vector Machine using Radial Basis Function Kernel
+
+- Random Forests, Decision Trees
+
+- KNN
+
+- Multi-Layer Perceptron (Neural Network)
+
+## Conclusions:
+	We found that the Neural Network gave the best performance, and the Support Vector Machine gave the worst performance.
+	
+
+## Technical Notes:
+
+	This project is written in Python 2.7. All machine learning algorithms come from the scikit-learn library, which
+	uses SciPy, NumPy and matplotlib to implement data mining and data analysis tools. The stable version of scikit-learn is available here:
+
+	- Scikit-learn: http://scikit-learn.org/stable/
+	
+	*** IMPORTANT NOTE: The Multi-Layer Perceptron is not supported in the stable version of scikit-learn (as of 5/7/16). To
+	use this classifier, the development version is required. For this reason, the import
+	for the Neural Network file is commented out in main.py so that the program runs on the stable
+	version of scikit-learn. To use the Multi-Layer Perceptron, download and build the
+	development version of scikit-learn and uncomment 'import neuralNetwork_RNASeq' on line 7 of main.py.
+	The neural network classifier can then be selected.
+
+## Input Files:
+
+- GSE60361C13005Expression.txt: this file holds the raw RNA-seq data. It is organized by cells (columns) and
+their gene expression levels (rows). It is roughly 3000 cells x 20000 genes in size.
+
+- expressionmRNAAnnotations.txt: this file holds the annotations for the raw data including molecule count
+and cell type (1-9) classification
+
+	*** IMPORTANT NOTE:
+	GitHub's file size limit is 100MB, and both the raw data and annotations files exceed it. As such, both files are compressed into a single archive, 'RawData_Annotations_Compress.zip'. After cloning the repository, unpack this .zip and either pass the paths of the extracted files to main.py, or move the 2 files into the root directory of the project and supply just the file names.
+
+## Output Files:
+
+- RNASeq_SingleCellClassification_Results: this directory holds the results from each of the classifiers.
+There are 2 preprocessing options, down sampling (on or off) and cross validation (10-fold, or a single
+fold when disabled), giving 4 run options in total. Each run option has its own subdirectory in the
+results folder. To converge on the 'true' results of the classifiers, each classifier is run 5 times
+on each of the 4 run options. In addition, because the user can supply the number of neighbors used
+for KNN, there are 5 runs for each K = 1-8 on each of the 4 run options.
+
+## How to Run the Program:
+
+	The usage of the program is as follows:
+
+- python main.py <raw_data> <data_annotations> <classifier 1-4> <down_sample 0 or 1> <cross_validate 0 or 1> <n_neighbors>
+
+	* To select a classifier, supply a number 1-4 corresponding to:
+	- 1 == SVM
+	- 2 == Neural Network
+	- 3 == KNN
+	- 4 == Random Forest
+
+	* To enable down sampling, supply 1. To disable, supply 0
+
+	* To enable cross validation, supply 1. To disable, supply 0
+
+	*** ONLY IF USING KNN:
+	* Supply a valid integer n_neighbors
+
+
+Examples:
+
+	KNN with 5 neighbors, down sampling enabled, cross validation disabled:
+
+	- python main.py GSE60361C13005Expression.txt expressionmRNAAnnotations.txt 3 1 0 5
+
+	SVM with down sampling disabled, cross validation enabled:
+
+	- python main.py GSE60361C13005Expression.txt expressionmRNAAnnotations.txt 1 0 1
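The argument order documented above can be checked with a small validator. This is a minimal sketch only: `parse_args` and `CLASSIFIERS` are hypothetical names, and the parsing code actually used in main.py is not shown in this README. The sketch is written to run on Python 2.7 and Python 3 alike.

```python
# Hypothetical sketch of validating the documented command line.
# The real parsing logic in main.py may differ.
CLASSIFIERS = {1: "SVM", 2: "Neural Network", 3: "KNN", 4: "Random Forest"}


def parse_args(argv):
    """Validate arguments in the documented order (argv excludes the script name)."""
    if len(argv) not in (5, 6):
        raise ValueError("expected 5 arguments, plus n_neighbors when using KNN")

    raw_data, annotations = argv[0], argv[1]

    classifier = int(argv[2])
    if classifier not in CLASSIFIERS:
        raise ValueError("classifier must be a number 1-4")

    down_sample, cross_validate = int(argv[3]), int(argv[4])
    if down_sample not in (0, 1) or cross_validate not in (0, 1):
        raise ValueError("down_sample and cross_validate must be 0 or 1")

    # n_neighbors is only meaningful for KNN (classifier 3).
    n_neighbors = None
    if CLASSIFIERS[classifier] == "KNN":
        if len(argv) != 6:
            raise ValueError("KNN requires n_neighbors")
        n_neighbors = int(argv[5])
        if n_neighbors < 1:
            raise ValueError("n_neighbors must be a positive integer")

    return {
        "raw_data": raw_data,
        "annotations": annotations,
        "classifier": CLASSIFIERS[classifier],
        "down_sample": bool(down_sample),
        "cross_validate": bool(cross_validate),
        "n_neighbors": n_neighbors,
    }
```

For instance, passing the arguments of the KNN example above (`[..., "3", "1", "0", "5"]`) yields the KNN classifier with 5 neighbors, down sampling on, and cross validation off.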
+
+
+
+## Description of Each File:
+
+- main.py: 
+	This file processes the command line arguments and runs the program with the selected classifier. Based on user input,
+	the program decides whether to down sample, whether to cross validate, and which classifier to use. This file makes four classifiers
+	available to the user: Support Vector Machine using Radial Basis Function Kernel, Multi-Layer Perceptron (Neural Network),
+	K-Nearest Neighbor, and Random Forest. After calling the preprocessing code and running the classification, this file sends the
+	results to an analysis file that performs evaluations and writes the results to file.
+
+- RNASeqData.py
+	This class represents the RNA-seq data. It holds the raw data and the annotations,
+	and provides methods for partitioning the data. The partitioning methods (with or without
+	down sampling, and with or without cross validation) randomly split the data into
+	training and testing sets, and keep the annotations for the randomly selected
+	testing data. The class also provides accessor methods for all data,
+	annotations, training data, testing data, and training data target values to evaluate performance.
+
+- preprocess.py
+	This file does all of the preprocessing work before classification runs. It loads the
+	raw data and annotations into memory. It is then used to down sample by both
+	cluster size and molecule count.
+
+- analysis.py
+	After classification, this file evaluates the performance of the classifier and writes the results to an output file.
+	First, it creates a confusion matrix and then computes the accuracy, sensitivity, specificity, MCC, and F1 score at both
+	the class level and the global level across the supplied number of folds (10 folds with cross validation and 1 fold
+	without it). This file also keeps a basic metric, a plain count of correct classifications, that was used
+	initially to check the performance of the classifiers.
+
+- knn_RNASeq.py
+	This file defines the K Nearest Neighbor classifier. It allows the user to specify the number of neighbors used
+	and then fits the training data and the samples to the classifier. Then, it takes testing data and makes predictions,
+	returning the results of the predictions.
+
+- neuralNetwork_RNASeq.py
+	This file defines the Multi-Layer Perceptron (Neural Network). It fits the training data and the
+	samples to the classifier. Then, it takes testing data and makes predictions, returning the
+	results of the predictions.
+
+- randomForest_RNASeq.py
+	This file defines the Random Forest Classifier. It fits the training data and the
+	samples to the classifier. Then, it takes testing data and makes predictions, returning the
+	results of the predictions.
+
+- rbfSVC_RNASeq.py
+	This file defines the Support Vector Machine using a Radial Basis Function Kernel. It fits the training data and the
+	samples to the classifier. Then, it takes testing data and makes predictions, returning the
+	results of the predictions.
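The evaluation step described for analysis.py (confusion matrix, then accuracy, sensitivity, specificity, MCC, and F1 per class) can be illustrated with the textbook one-vs-rest definitions. This is a sketch under assumptions: `class_metrics` is a hypothetical name, and analysis.py may define or aggregate these quantities differently, so the values in the result files need not match these formulas exactly.

```python
from __future__ import division  # true division on Python 2.7 as well

import math


def class_metrics(confusion, k):
    """One-vs-rest metrics for class k.

    confusion[i][j] is the number of cells of true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)

    tp = confusion[k][k]                                # class k predicted as k
    fn = sum(confusion[k]) - tp                         # class k predicted as something else
    fp = sum(confusion[i][k] for i in range(n)) - tp    # other classes predicted as k
    tn = total - tp - fn - fp                           # everything else

    def ratio(num, den):
        # Return 0.0 instead of dividing by zero for empty classes.
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, total),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }
```

Treating each class as "this class versus all others" is one common way to get class-level numbers out of a multi-class confusion matrix; a global figure could then be formed by averaging over classes or over folds, which is a design choice this sketch leaves open.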
+
+
+
+## Bibliography:
+	Project motivated by data from:
+
+- Zeisel, A. et al. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347, 1138-1142 (2015).
@@ -1,6 +1,14 @@
 import sys
 import random
 
+# File: RNASeqData.py
+#	This class represents the RNA-seq data. It holds the raw data and the annotations,
+#	and provides methods for partitioning the data. The partitioning methods (with or without
+#	down sampling, and with or without cross validation) randomly split the data into
+#	training and testing sets, and keep the annotations for the randomly selected
+#	testing data. The class also provides accessor methods for all data,
+#	annotations, training data, testing data, and training data target values to evaluate performance.
+
 class RNASeqData(object):
 
 	def __init__(self, raw_data_file, annotations_file):
@@ -32,9 +40,6 @@ def setRandIndicesFromDS(self, randIndices):
 		# put the indices in ascending order
 		self.randIndices = sorted(randIndices)
 
-	def setFeatures(self, features):
-		self.features = features
-
 	def makeDSTrainingAndTestingData(self):
 		print "\npartitioning data set - 70% training, 30% testing"
 		# randomly select 70% of each cluster for training, 30% for testing
@@ -596,9 +601,6 @@ def getDSTrainingData(self):
 	def getDSTestingData(self):
 		return self.dsTestingData
 
-	def getFeatures(self):
-		return self.features
-
 	def getDSTargetValues(self):
 		return self.dsTargetValues
 
 
@@ -0,0 +1,11 @@
+ 	Accuracy	Sensitivity	Specificity	MCC	F1
+Interneuron	0.802325581395	0.779069767442	0.920343137255	0.581159869962	0.614678899083
+S1 Pyramidal	0.870967741935	0.790322580645	0.925449871465	0.651694278074	0.7
+CA1 Pyramidal	0.905109489051	0.978102189781	0.875796178344	0.807580533168	0.864516129032
+Oligodendrocyte	0.988721804511	0.962406015038	0.883647798742	0.800962861154	0.859060402685
+Microglia	0.857142857143	0.771428571429	0.912341407151	0.415204065959	0.391304347826
+Endothelial	0.977272727273	0.909090909091	0.90675990676	0.517471856237	0.487804878049
+Astrocyte	0.875	0.946428571429	0.904255319149	0.577184147337	0.557894736842
+Ependymal	0.666666666667	0.166666666667	0.911830357143	0.0224440857175	0.0232558139535
+Mural	0.545454545455	0.727272727273	0.909090909091	0.234207216884	0.16
+Global Evaluations	0.906873614191	0.781198666533	0.9055016539	0.511989879388	0.51761280083
@@ -0,0 +1,11 @@
+ 	Accuracy	Sensitivity	Specificity	MCC	F1
+Interneuron	0.831578947368	0.810526315789	0.900867410161	0.575944503663	0.611111111111
+S1 Pyramidal	0.990909090909	0.872727272727	0.893939393939	0.62771233558	0.662068965517
+CA1 Pyramidal	0.967272727273	0.949090909091	0.866028708134	0.77211913861	0.841935483871
+Oligodendrocyte	0.95	0.970833333333	0.86253776435	0.767602933508	0.826241134752
+Microglia	0.636363636364	0.606060606061	0.902186421174	0.297521184948	0.289855072464
+Endothelial	0.964285714286	0.767857142857	0.899527186761	0.461508532646	0.467391304348
+Astrocyte	0.929577464789	0.87323943662	0.892900120337	0.552623221599	0.558558558559
+Ependymal	0.833333333333	0.5	0.893973214286	0.102909451519	0.0576923076923
+Mural	0.875	0.5625	0.897291196388	0.193308219062	0.155172413793
+Global Evaluations	0.891352549889	0.768092779609	0.889916823948	0.483472169015	0.496669594679
@@ -0,0 +1,11 @@
+ 	Accuracy	Sensitivity	Specificity	MCC	F1
+Interneuron	0.796116504854	0.796116504854	0.909887359199	0.596743807096	0.63813229572
+S1 Pyramidal	0.97520661157	0.826446280992	0.90781049936	0.63700158683	0.682593856655
+CA1 Pyramidal	0.954385964912	0.947368421053	0.87358184765	0.784038442062	0.85308056872
+Oligodendrocyte	0.95358649789	0.978902953586	0.867669172932	0.778775856914	0.833034111311
+Microglia	0.75	0.71875	0.903448275862	0.355949154456	0.330935251799
+Endothelial	0.905660377358	0.811320754717	0.902237926973	0.484066848897	0.480446927374
+Astrocyte	0.78431372549	0.901960784314	0.896592244418	0.518585799308	0.497297297297
+Ependymal	0.428571428571	0.285714285714	0.901675977654	0.0548666305032	0.0412371134021
+Mural	0.538461538462	0.846153846154	0.897637795276	0.279916675052	0.191304347826
+Global Evaluations	0.89689578714	0.790303759043	0.895615677703	0.49888275568	0.505340196678
@@ -0,0 +1,11 @@
+ 	Accuracy	Sensitivity	Specificity	MCC	F1
+Interneuron	0.885057471264	0.850574712644	0.917791411043	0.624603991257	0.649122807018
+S1 Pyramidal	0.974789915966	0.873949579832	0.916985951469	0.685962079027	0.722222222222
+CA1 Pyramidal	0.982269503546	0.950354609929	0.893548387097	0.810150168908	0.87012987013
+Oligodendrocyte	0.952991452991	0.974358974359	0.889221556886	0.802084098082	0.850746268657
+Microglia	0.807692307692	0.807692307692	0.914383561644	0.391758493941	0.344262295082
+Endothelial	0.91935483871	0.806451612903	0.919047619048	0.544343961767	0.555555555556
+Astrocyte	0.984375	0.859375	0.915274463007	0.573721449886	0.578947368421
+Ependymal	0.6	0.4	0.914158305463	0.0825118465075	0.047619047619
+Mural	0.739130434783	0.869565217391	0.912400455063	0.397889278282	0.333333333333
+Global Evaluations	0.911308203991	0.82136911275	0.910312412302	0.545891707517	0.550215418671
@@ -0,0 +1,11 @@
+ 	Accuracy	Sensitivity	Specificity	MCC	F1
+Interneuron	0.8	0.8	0.928223844282	0.603287758046	0.630541871921
+S1 Pyramidal	0.940677966102	0.864406779661	0.924744897959	0.694924266087	0.731182795699
+CA1 Pyramidal	0.965397923875	0.9723183391	0.890701468189	0.827285510013	0.882260596546
+Oligodendrocyte	0.9921875	0.9609375	0.899380804954	0.816058833422	0.867724867725
+Microglia	0.846153846154	0.769230769231	0.921232876712	0.387381849357	0.347826086957
+Endothelial	0.931818181818	0.863636363636	0.91958041958	0.521766887341	0.503311258278
+Astrocyte	0.857142857143	0.952380952381	0.914183551847	0.624913378898	0.615384615385
+Ependymal	0.625	0.25	0.922818791946	0.0601693100582	0.0506329113924
+Mural	0.666666666667	0.777777777778	0.919683257919	0.333858043753	0.271844660194
+Global Evaluations	0.916851441242	0.801187609087	0.915616657043	0.541071759664	0.544523296011
@@ -0,0 +1,11 @@
+ 	Accuracy	Sensitivity	Specificity	MCC	F1
+Interneuron	0.84375	0.8125	0.920595533499	0.620744303246	0.655462184874
+S1 Pyramidal	0.825	0.933333333333	0.90537084399	0.70402782253	0.732026143791
+CA1 Pyramidal	0.992805755396	0.928057553957	0.900641025641	0.799787856485	0.862876254181
+Oligodendrocyte	0.976377952756	0.972440944882	0.884259259259	0.804250424871	0.857638888889
+Microglia	0.928571428571	0.928571428571	0.908466819222	0.450786154569	0.388059701493
+Endothelial	0.926829268293	0.878048780488	0.910569105691	0.496229299809	0.467532467532
+Astrocyte	0.884615384615	0.942307692308	0.907058823529	0.567323541787	0.544444444444
+Ependymal	0.25	0.166666666667	0.919101123596	0.035807897284	0.046511627907
+Mural	0.619047619048	0.571428571429	0.917139614075	0.252175153407	0.22641509434
+Global Evaluations	0.909090909091	0.792594996848	0.908133572056	0.525681383776	0.531218534161
@@ -0,0 +1,11 @@
+ 	Accuracy	Sensitivity	Specificity	MCC	F1
+Interneuron	0.93023255814	0.848837209302	0.908088235294	0.600248217395	0.623931623932
+S1 Pyramidal	0.903225806452	0.887096774194	0.904884318766	0.676775021313	0.714285714286
+CA1 Pyramidal	0.988970588235	0.926470588235	0.892063492063	0.785154296508	0.851351351351
+Oligodendrocyte	0.96694214876	0.97520661157	0.875757575758	0.789170696189	0.842857142857
+Microglia	0.84375	0.8125	0.905747126437	0.409247828466	0.371428571429
+Endothelial	0.877192982456	0.877192982456	0.904142011834	0.539560590515	0.531914893617
+Astrocyte	0.984375	0.890625	0.903341288783	0.566275584237	0.564356435644
+Ependymal	0.222222222222	0.222222222222	0.909294512878	0.0452217126527	0.0434782608696
+Mural	0.5625	0.5	0.909706546275	0.182262035811	0.153846153846
+Global Evaluations	0.90243902439	0.771127931998	0.901447234232	0.510435109232	0.521938905315
@@ -0,0 +1,11 @@
+ 	Accuracy	Sensitivity	Specificity	MCC	F1
+Interneuron	0.959459459459	0.891891891892	0.910628019324	0.608198495062	0.616822429907
+S1 Pyramidal	0.806451612903	0.959677419355	0.901028277635	0.718662414919	0.74375
+CA1 Pyramidal	0.962585034014	0.918367346939	0.904605263158	0.801895854226	0.868167202572
+Oligodendrocyte	0.987124463519	0.952789699571	0.893871449925	0.791331028153	0.844106463878
+Microglia	0.852941176471	0.794117647059	0.913594470046	0.42560906873	0.397058823529
+Endothelial	0.964285714286	0.875	0.91134751773	0.551052623211	0.544444444444
+Astrocyte	0.984126984127	0.888888888889	0.910607866508	0.578369155432	0.577319587629
+Ependymal	0.428571428571	0.285714285714	0.913966480447	0.0619845294688	0.046511627907
+Mural	0.588235294118	0.529411764706	0.916384180791	0.209725487819	0.18
+Global Evaluations	0.909090909091	0.788428771569	0.908448169507	0.527425406336	0.535353397763
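The README describes running each classifier 5 times per run option to converge on stable results. A minimal sketch of combining such runs, assuming the tab-separated layout of the result files above (a header row, then one row per cell type, then a `Global Evaluations` row): `global_row` and `mean_global` are hypothetical helpers, not part of the repository as far as this diff shows.

```python
def global_row(text):
    """Return {metric: value} from the 'Global Evaluations' row of one results file."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # The first header cell is blank; the remaining cells name the metrics.
    header = lines[0].split("\t")[1:]
    for ln in lines[1:]:
        cells = ln.split("\t")
        if cells[0] == "Global Evaluations":
            return dict(zip(header, [float(c) for c in cells[1:]]))
    raise ValueError("no 'Global Evaluations' row found")


def mean_global(texts):
    """Average each global metric over the contents of several results files."""
    rows = [global_row(t) for t in texts]
    return dict((m, sum(r[m] for r in rows) / len(rows)) for m in rows[0])


# Abbreviated contents of two runs, copied from the result files above
# (per-class rows omitted for brevity).
run_a = (" \tAccuracy\tSensitivity\tSpecificity\tMCC\tF1\n"
         "Global Evaluations\t0.906873614191\t0.781198666533\t0.9055016539\t0.511989879388\t0.51761280083\n")
run_b = (" \tAccuracy\tSensitivity\tSpecificity\tMCC\tF1\n"
         "Global Evaluations\t0.891352549889\t0.768092779609\t0.889916823948\t0.483472169015\t0.496669594679\n")

averages = mean_global([run_a, run_b])
```

In practice the strings would be read from the files in one results subdirectory, for example all runs that share the same K for KNN.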