Commit 4c56069

* added all input files and results. Compressed input files to meet size requirement. README.md complete
1 parent 5d5ab0b commit 4c56069

235 files changed

+2680
-425
lines changed

Diff for: README.md

+149-49
@@ -1,49 +1,149 @@
1-
```html
2-
<h3>Braden Katzman & Emily Berghoff<br>
3-
Columbia University Spring 2016<br>
4-
COMS 4761 - Computational Genomics<br>
5-
Professor Itsik Pe'er<br>
6-
Final Project</h3>
7-
8-
9-
<br><br>
10-
<div>
11-
<h4>Description (3/13/16):</h4>
12-
<p>
13-
This program classifies single cell types from RNA-seq data in mice. Our approach is to use supervised machine learning algorithms on a single cell RNA-seq dataset with a 80% marked (training) and 20% unmarked (testing) partition. We use 5 machine learning algorithms with the hopes of determining the best supervised learning algorithm to classify unmarked cells.
14-
</p>
15-
</div>
16-
17-
<br><br>
18-
<div>
19-
<h4>Technical Notes:</h4>
20-
<p>
21-
This project is written in Python using version 2.7. All machine learning algorithms are from the scikit-learn library which uses SciPy, NumPy and matplotlib to implement data mining and data analysis tools. Preprocessing of data is also implemented using scikit
22-
</p>
23-
<p><em>Scikit-learn: http://scikit-learn.org/stable/</em></p>
24-
</div>
25-
26-
27-
<br><br>
28-
<div>
29-
<h4>Machine Learning Algorithms:</h4>
30-
<ul>
31-
<li>Naive Bayes</li>
32-
<li>SVMs</li>
33-
<li>Random Forests, Decision Trees</li>
34-
<li>KNNs</li>
35-
<li>Neural Networks</li>
36-
</ul>
37-
</div>
38-
39-
<div>
40-
<h4>Conclusions</h4>
41-
42-
</div>
43-
44-
<div>
45-
<h4>Cell Classifications and RNA-Seq Data obtained from:</h4>
46-
47-
<p><em>Zeisel, A. et al. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347, 1138-1142 (2015). </em></p>
48-
</div>
49-
```
1+
### Braden Katzman & Emily Berghoff
2+
### Columbia University Spring 2016
3+
### COMS W4761 - Computational Genomics
4+
### Professor Itsik Pe'er
5+
### Final Project
6+
7+
8+
9+
## Abstract:
10+
11+
This program classifies single cell types from RNA-seq data in mice. Our approach is to use supervised machine learning algorithms
12+
on a single cell RNA-seq dataset with a marked training set and unmarked testing set. We used 4 machine learning algorithms with
13+
the hopes of determining the best supervised learning algorithm to classify unmarked cells.
14+
15+
## Machine Learning Algorithms:
16+
- Support Vector Machine using Radial Basis Function Kernel
17+
18+
- Random Forests, Decision Trees
19+
20+
- KNN
21+
22+
- Multi-Layer Perceptron (Neural Network)
23+
24+
## Conclusions:
25+
We found that the Neural Network gave the best performance, and the Support Vector Machine gave the worst performance.
26+
27+
28+
## Technical Notes:
29+
30+
This project is written in Python using version 2.7. All machine learning algorithms are from the scikit-learn library which
31+
uses SciPy, NumPy and matplotlib to implement data mining and data analysis tools. The stable version of scikit-learn is available here:
32+
33+
- Scikit-learn: http://scikit-learn.org/stable/
34+
35+
*** IMPORTANT NOTE: The Multi-Layer Perceptron is not supported in the stable version of scikit-learn (as of 5/7/16). In order
36+
to use this classifier, the development version needs to be used. As such, the import
37+
for the Neural Network file is commented out in main.py so that the program runs on the stable
38+
version of scikit-learn. In order to use the Multi-Layer Perceptron, download and compile the
39+
development version of scikit-learn and uncomment 'import neuralNetwork_RNASeq' on line 7 of main.py.
40+
Then, the neural network classifier can be used.
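
As an alternative to commenting the import in and out by hand, the import could be guarded. This is only a minimal sketch and not how main.py currently works: it assumes the development version exposes `sklearn.neural_network.MLPClassifier` (the location used from scikit-learn 0.18 onward), and `HAVE_MLP` and `require_mlp` are hypothetical names.

```python
# Hypothetical guard (not part of the project): try the import once and
# remember whether the installed scikit-learn provides the MLP classifier.
try:
    from sklearn.neural_network import MLPClassifier
    HAVE_MLP = True
except ImportError:
    # Covers both "scikit-learn not installed" and "version without MLPClassifier".
    MLPClassifier = None
    HAVE_MLP = False


def require_mlp():
    """Return the MLPClassifier class, or exit with a readable message."""
    if not HAVE_MLP:
        raise SystemExit(
            "The Multi-Layer Perceptron needs a scikit-learn build that "
            "provides sklearn.neural_network.MLPClassifier"
        )
    return MLPClassifier
```

With a guard like this, the other three classifiers would keep working on the stable version, and only selecting the neural network would report the missing dependency.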
41+
42+
## Input Files:
43+
44+
- GSE60361C13005Expression.txt: this file holds the raw RNA seq data. It is organized by cells (columns) and
45+
their gene expression levels (rows). It is roughly 3000 cells x 20000 genes in size
46+
47+
- expressionmRNAAnnotations.txt: this file holds the annotations for the raw data including molecule count
48+
and cell type (1-9) classification
49+
50+
*** IMPORTANT NOTE:
51+
GitHub's file size limit is 100MB. Both the raw data and annotations files exceed this limit. As such, both files are compressed into a single archive, 'RawData_Annotations_Compress.zip'. After cloning the source code, unpack this .zip and either pass main.py the paths to the files inside the unpacked directory, or move the 2 files into the root directory of the project and supply just the file names.
52+
53+
## Output Files:
54+
55+
- RNASeq_SingleCellClassification_Results: this directory holds the results from each of the classifiers.
56+
There are 4 run options in terms of preprocessing, and each of these options has a subdirectory in the
57+
results folder. The two options which have a total of 4 combinations are down sampling or not down
58+
sampling, and 10-fold cross validation or 1-fold cross validation. In order to converge on the 'true'
59+
results of the classifiers, each classifier is run 5 times on each of the 4 preprocessing run options.
60+
In addition, as the user has the ability to supply the number of neighbors to be used for KNN, there are 5
61+
runs for each K = 1-8, on each of the 4 preprocessing run options.
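
The 10-fold option above can be illustrated with a small index-splitting routine. This is a sketch under assumptions, not the project's own fold logic (which is not shown here): `k_fold_indices` and `train_test_splits` are hypothetical names, and the code covers only the k-fold case, not the single train/test partition used when cross validation is disabled.

```python
import random


def k_fold_indices(n_samples, n_folds, seed=None):
    """Shuffle the sample indices and deal them into n_folds test folds."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # Fold i takes every n_folds-th shuffled index starting at position i,
    # so fold sizes differ by at most one.
    return [sorted(indices[i::n_folds]) for i in range(n_folds)]


def train_test_splits(n_samples, n_folds, seed=None):
    """Yield one (train_indices, test_indices) pair per fold."""
    for test in k_fold_indices(n_samples, n_folds, seed):
        held_out = set(test)
        train = [i for i in range(n_samples) if i not in held_out]
        yield train, test
```

Each cell lands in exactly one test fold, so averaging the per-fold metrics uses every cell for testing once.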
62+
63+
## How to Run the Program:
64+
65+
The usage of the program is as follows:
66+
67+
- python main.py <raw_data> <data_annotations> <classifier 1-4> <down_sample 0 or 1> <cross_validate 0 or 1> <n_neighbors>
68+
69+
* To select a classifier, supply a number 1-4 corresponding to:
70+
- 1 == SVM
71+
- 2 == Neural Network
72+
- 3 == KNN
73+
- 4 == Random Forest
74+
75+
* To enable down sampling, supply 1. To disable, supply 0
76+
77+
* To enable cross validation, supply 1. To disable, supply 0
78+
79+
*** ONLY IF USING KNN:
80+
* Supply a valid integer n_neighbors
81+
82+
83+
Examples:
84+
85+
KNN with 5 neighbors, down sampling enabled, cross validation disabled:
86+
87+
- python main.py GSE60361C13005Expression.txt expressionmRNAAnnotations.txt 3 1 0 5
88+
89+
SVM with down sampling disabled, cross validation enabled:
90+
91+
- python main.py GSE60361C13005Expression.txt expressionmRNAAnnotations.txt 1 0 1
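
The positional arguments described above could be validated roughly as follows. This is a hedged sketch, not the parsing code in main.py: `parse_args` and `CLASSIFIERS` are hypothetical names, and the choice to ignore a trailing n_neighbors for non-KNN classifiers is an assumption.

```python
# Classifier numbers as documented in the usage section.
CLASSIFIERS = {1: "SVM", 2: "Neural Network", 3: "KNN", 4: "Random Forest"}


def parse_flag(name, value):
    """Turn the string '0' or '1' into a boolean, rejecting anything else."""
    if value not in ("0", "1"):
        raise ValueError("%s must be 0 or 1" % name)
    return value == "1"


def parse_args(argv):
    """Validate the arguments; argv excludes the script name (sys.argv[1:])."""
    if len(argv) not in (5, 6):
        raise ValueError("expected 5 arguments, plus n_neighbors when using KNN")
    classifier = int(argv[2])
    if classifier not in CLASSIFIERS:
        raise ValueError("classifier must be a number 1-4")
    n_neighbors = None
    if CLASSIFIERS[classifier] == "KNN":
        if len(argv) != 6:
            raise ValueError("KNN requires n_neighbors")
        n_neighbors = int(argv[5])
        if n_neighbors < 1:
            raise ValueError("n_neighbors must be a positive integer")
    return {
        "raw_data": argv[0],
        "annotations": argv[1],
        "classifier": CLASSIFIERS[classifier],
        "down_sample": parse_flag("down_sample", argv[3]),
        "cross_validate": parse_flag("cross_validate", argv[4]),
        "n_neighbors": n_neighbors,
    }
```

For the KNN example above, `parse_args(sys.argv[1:])` would report the KNN classifier with down sampling on, cross validation off, and 5 neighbors.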
92+
93+
94+
95+
## Description of Each File:
96+
97+
- main.py:
98+
This file processes the command line arguments supplied and runs the program using the selected classifier. Based on user input,
99+
the program decides whether to down sample, cross validate, and which classifier to use. This file defines the four classifiers
100+
supplied to the user for classification: Support Vector Machine using Radial Basis Function Kernel, Multi-Layer Perceptron (Neural Network),
101+
K-Nearest Neighbor, and Random Forest. After calling the preprocessing code and running the classification, this file sends the
102+
results to an analysis file that performs evaluations and writes the results to file.
103+
104+
- RNASeqData.py
105+
This class object represents the RNA Seq Data. It holds the raw data, the annotations,
106+
and provides methods for partitioning the data. The partitions (for both down sampling
107+
and non down sampling and cross validation and no cross validation) randomly make partitions
108+
of the data for both training and testing, while simultaneously holding the annotations
109+
for the randomly selected testing data. The class also provides accessor methods for all data,
110+
annotations, training data, testing data, and training data target values to evaluate performance.
111+
112+
- preprocess.py
113+
This file does all of the preprocessing work before running classification. This file loads
114+
raw data and annotations into memory. Next, this file is used to down sample by both
115+
cluster size and molecule count.
116+
117+
- analysis.py
118+
After classification, this file is used to evaluate the performance of the classifier and write the results to an output file.
119+
First, this file creates a confusion matrix and then computes the accuracy, sensitivity, specificity, MCC, and F1 Score at both
120+
the class level and the global level across the supplied number of folds (10 folds for cross validation and 1 fold for
121+
non-cross validation). This class also uses a basic metric of merely counting the number of correct classifications that was used
122+
initially to check the performance of the classifiers. After evaluating, the results are written to a file.
123+
124+
- knn_RNASeq.py
125+
This file defines the K Nearest Neighbor classifier. It allows the user to specify the number of neighbors used
126+
and then fits the training data and the samples to the classifier. Then, it takes testing data and makes predictions,
127+
returning the results of the predictions.
128+
129+
- neuralNetwork_RNASeq.py
130+
This file defines the Multi-Layer Perceptron (Neural Network). It fits the training data and the
131+
samples to the classifier. Then, it takes testing data and makes predictions, returning the
132+
results of the predictions.
133+
134+
- randomForest_RNASeq.py
135+
This file defines the Random Forest Classifier. It fits the training data and the
136+
samples to the classifier. Then, it takes testing data and makes predictions, returning the
137+
results of the predictions.
138+
139+
- rbfSVC_RNASeq.py
140+
This file defines the Support Vector Machine using a Radial Basis Function Kernel. It fits the training data and the
141+
samples to the classifier. Then, it takes testing data and makes predictions, returning the
142+
results of the predictions.
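
The random partitioning described for RNASeqData.py can be sketched as a per-cluster split. This is an illustration under assumptions, not the class's actual code: the 70/30 ratio is taken from the print statement in `makeDSTrainingAndTestingData`, while `stratified_split` and its parameters are hypothetical names.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, seed=None):
    """Split sample indices so each cluster contributes ~train_fraction to training."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for index, label in enumerate(labels):
        by_cluster[label].append(index)

    train, test = [], []
    for label in sorted(by_cluster):
        members = by_cluster[label]
        rng.shuffle(members)
        cut = int(round(train_fraction * len(members)))
        train.extend(members[:cut])
        test.extend(members[cut:])
    # Ascending order keeps indices aligned with the annotation columns.
    return sorted(train), sorted(test)
```

Splitting inside each cluster, rather than over the whole data set, keeps rare cell types represented in both the training and the testing partition.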
143+
144+
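
The class-level metrics listed for analysis.py can be derived from a confusion matrix with one-vs-rest counts. The sketch below uses the standard textbook definitions. The exact formulas in analysis.py are not shown here and may differ, and `per_class_metrics` is a hypothetical name.

```python
import math


def per_class_metrics(confusion, k):
    """One-vs-rest metrics for class k.

    confusion[i][j] counts cells whose true class is i and predicted class is j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp                        # class k, predicted as another class
    fp = sum(confusion[i][k] for i in range(n)) - tp   # other classes, predicted as k
    tn = total - tp - fn - fp

    def ratio(numerator, denominator):
        # Undefined ratios (empty denominators) are reported as 0.0.
        return numerator / float(denominator) if denominator else 0.0

    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, total),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }
```

Calling the function once per class gives the class-level rows, and averaging over classes or folds is one possible way to obtain a global row.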
145+
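
The four classifier files share the same fit-then-predict pattern. The project relies on scikit-learn for this, so the plain-Python K Nearest Neighbor below is only an illustrative sketch of the idea: `knn_predict` is a hypothetical name, and Euclidean distance with majority voting is an assumption about the configuration.

```python
from collections import Counter


def knn_predict(train_rows, train_labels, test_rows, n_neighbors):
    """Label each test row by majority vote among its nearest training rows."""
    predictions = []
    for row in test_rows:
        # Squared Euclidean distance to every training row, nearest first.
        distances = sorted(
            (sum((a - b) ** 2 for a, b in zip(row, other)), index)
            for index, other in enumerate(train_rows)
        )
        nearest = distances[:n_neighbors]
        votes = Counter(train_labels[index] for _, index in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return predictions
```

The training rows and their labels play the role of the fitted model, and the testing rows are the only input at prediction time.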
146+
## Bibliography:
147+
Project motivated by data from:
148+
149+
- Zeisel, A. et al. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347, 1138-1142 (2015).

Diff for: RNASeqData.py

+8-6
@@ -1,6 +1,14 @@
11
import sys
22
import random
33

4+
# File: RNASeqData.py
5+
# This class object represents the RNA Seq Data. It holds the raw data, the annotations,
6+
# and provides methods for partitioning the data. The partitions (for both down sampling
7+
# and non down sampling and cross validation and no cross validation) randomly make partitions
8+
# of the data for both training and testing, while simultaneously holding the annotations
9+
# for the randomly selected testing data. The class also provides accessor methods for all data,
10+
# annotations, training data, testing data, and training data target values to evaluate performance.
11+
412
class RNASeqData(object):
513

614
def __init__(self, raw_data_file, annotations_file):
@@ -32,9 +40,6 @@ def setRandIndicesFromDS(self, randIndices):
3240
# put the indices in ascending order
3341
self.randIndices = sorted(randIndices)
3442

35-
def setFeatures(self, features):
36-
self.features = features
37-
3843
def makeDSTrainingAndTestingData(self):
3944
print "\npartitioning data set - 70% training, 30% testing"
4045
# randomly select 70% of each cluster for training, 30% for testing
@@ -596,9 +601,6 @@ def getDSTrainingData(self):
596601
def getDSTestingData(self):
597602
return self.dsTestingData
598603

599-
def getFeatures(self):
600-
return self.features
601-
602604
def getDSTargetValues(self):
603605
return self.dsTargetValues
604606

@@ -0,0 +1,11 @@
1+
Accuracy Sensitivity Specificity MCC F1
2+
Interneuron 0.802325581395 0.779069767442 0.920343137255 0.581159869962 0.614678899083
3+
S1 Pyramidal 0.870967741935 0.790322580645 0.925449871465 0.651694278074 0.7
4+
CA1 Pyramidal 0.905109489051 0.978102189781 0.875796178344 0.807580533168 0.864516129032
5+
Oligodendrocyte 0.988721804511 0.962406015038 0.883647798742 0.800962861154 0.859060402685
6+
Microglia 0.857142857143 0.771428571429 0.912341407151 0.415204065959 0.391304347826
7+
Endothelial 0.977272727273 0.909090909091 0.90675990676 0.517471856237 0.487804878049
8+
Astrocyte 0.875 0.946428571429 0.904255319149 0.577184147337 0.557894736842
9+
Ependymal 0.666666666667 0.166666666667 0.911830357143 0.0224440857175 0.0232558139535
10+
Mural 0.545454545455 0.727272727273 0.909090909091 0.234207216884 0.16
11+
Global Evaluations 0.906873614191 0.781198666533 0.9055016539 0.511989879388 0.51761280083
@@ -0,0 +1,11 @@
1+
Accuracy Sensitivity Specificity MCC F1
2+
Interneuron 0.831578947368 0.810526315789 0.900867410161 0.575944503663 0.611111111111
3+
S1 Pyramidal 0.990909090909 0.872727272727 0.893939393939 0.62771233558 0.662068965517
4+
CA1 Pyramidal 0.967272727273 0.949090909091 0.866028708134 0.77211913861 0.841935483871
5+
Oligodendrocyte 0.95 0.970833333333 0.86253776435 0.767602933508 0.826241134752
6+
Microglia 0.636363636364 0.606060606061 0.902186421174 0.297521184948 0.289855072464
7+
Endothelial 0.964285714286 0.767857142857 0.899527186761 0.461508532646 0.467391304348
8+
Astrocyte 0.929577464789 0.87323943662 0.892900120337 0.552623221599 0.558558558559
9+
Ependymal 0.833333333333 0.5 0.893973214286 0.102909451519 0.0576923076923
10+
Mural 0.875 0.5625 0.897291196388 0.193308219062 0.155172413793
11+
Global Evaluations 0.891352549889 0.768092779609 0.889916823948 0.483472169015 0.496669594679
@@ -0,0 +1,11 @@
1+
Accuracy Sensitivity Specificity MCC F1
2+
Interneuron 0.796116504854 0.796116504854 0.909887359199 0.596743807096 0.63813229572
3+
S1 Pyramidal 0.97520661157 0.826446280992 0.90781049936 0.63700158683 0.682593856655
4+
CA1 Pyramidal 0.954385964912 0.947368421053 0.87358184765 0.784038442062 0.85308056872
5+
Oligodendrocyte 0.95358649789 0.978902953586 0.867669172932 0.778775856914 0.833034111311
6+
Microglia 0.75 0.71875 0.903448275862 0.355949154456 0.330935251799
7+
Endothelial 0.905660377358 0.811320754717 0.902237926973 0.484066848897 0.480446927374
8+
Astrocyte 0.78431372549 0.901960784314 0.896592244418 0.518585799308 0.497297297297
9+
Ependymal 0.428571428571 0.285714285714 0.901675977654 0.0548666305032 0.0412371134021
10+
Mural 0.538461538462 0.846153846154 0.897637795276 0.279916675052 0.191304347826
11+
Global Evaluations 0.89689578714 0.790303759043 0.895615677703 0.49888275568 0.505340196678
@@ -0,0 +1,11 @@
1+
Accuracy Sensitivity Specificity MCC F1
2+
Interneuron 0.885057471264 0.850574712644 0.917791411043 0.624603991257 0.649122807018
3+
S1 Pyramidal 0.974789915966 0.873949579832 0.916985951469 0.685962079027 0.722222222222
4+
CA1 Pyramidal 0.982269503546 0.950354609929 0.893548387097 0.810150168908 0.87012987013
5+
Oligodendrocyte 0.952991452991 0.974358974359 0.889221556886 0.802084098082 0.850746268657
6+
Microglia 0.807692307692 0.807692307692 0.914383561644 0.391758493941 0.344262295082
7+
Endothelial 0.91935483871 0.806451612903 0.919047619048 0.544343961767 0.555555555556
8+
Astrocyte 0.984375 0.859375 0.915274463007 0.573721449886 0.578947368421
9+
Ependymal 0.6 0.4 0.914158305463 0.0825118465075 0.047619047619
10+
Mural 0.739130434783 0.869565217391 0.912400455063 0.397889278282 0.333333333333
11+
Global Evaluations 0.911308203991 0.82136911275 0.910312412302 0.545891707517 0.550215418671
@@ -0,0 +1,11 @@
1+
Accuracy Sensitivity Specificity MCC F1
2+
Interneuron 0.8 0.8 0.928223844282 0.603287758046 0.630541871921
3+
S1 Pyramidal 0.940677966102 0.864406779661 0.924744897959 0.694924266087 0.731182795699
4+
CA1 Pyramidal 0.965397923875 0.9723183391 0.890701468189 0.827285510013 0.882260596546
5+
Oligodendrocyte 0.9921875 0.9609375 0.899380804954 0.816058833422 0.867724867725
6+
Microglia 0.846153846154 0.769230769231 0.921232876712 0.387381849357 0.347826086957
7+
Endothelial 0.931818181818 0.863636363636 0.91958041958 0.521766887341 0.503311258278
8+
Astrocyte 0.857142857143 0.952380952381 0.914183551847 0.624913378898 0.615384615385
9+
Ependymal 0.625 0.25 0.922818791946 0.0601693100582 0.0506329113924
10+
Mural 0.666666666667 0.777777777778 0.919683257919 0.333858043753 0.271844660194
11+
Global Evaluations 0.916851441242 0.801187609087 0.915616657043 0.541071759664 0.544523296011
@@ -0,0 +1,11 @@
1+
Accuracy Sensitivity Specificity MCC F1
2+
Interneuron 0.84375 0.8125 0.920595533499 0.620744303246 0.655462184874
3+
S1 Pyramidal 0.825 0.933333333333 0.90537084399 0.70402782253 0.732026143791
4+
CA1 Pyramidal 0.992805755396 0.928057553957 0.900641025641 0.799787856485 0.862876254181
5+
Oligodendrocyte 0.976377952756 0.972440944882 0.884259259259 0.804250424871 0.857638888889
6+
Microglia 0.928571428571 0.928571428571 0.908466819222 0.450786154569 0.388059701493
7+
Endothelial 0.926829268293 0.878048780488 0.910569105691 0.496229299809 0.467532467532
8+
Astrocyte 0.884615384615 0.942307692308 0.907058823529 0.567323541787 0.544444444444
9+
Ependymal 0.25 0.166666666667 0.919101123596 0.035807897284 0.046511627907
10+
Mural 0.619047619048 0.571428571429 0.917139614075 0.252175153407 0.22641509434
11+
Global Evaluations 0.909090909091 0.792594996848 0.908133572056 0.525681383776 0.531218534161
@@ -0,0 +1,11 @@
1+
Accuracy Sensitivity Specificity MCC F1
2+
Interneuron 0.93023255814 0.848837209302 0.908088235294 0.600248217395 0.623931623932
3+
S1 Pyramidal 0.903225806452 0.887096774194 0.904884318766 0.676775021313 0.714285714286
4+
CA1 Pyramidal 0.988970588235 0.926470588235 0.892063492063 0.785154296508 0.851351351351
5+
Oligodendrocyte 0.96694214876 0.97520661157 0.875757575758 0.789170696189 0.842857142857
6+
Microglia 0.84375 0.8125 0.905747126437 0.409247828466 0.371428571429
7+
Endothelial 0.877192982456 0.877192982456 0.904142011834 0.539560590515 0.531914893617
8+
Astrocyte 0.984375 0.890625 0.903341288783 0.566275584237 0.564356435644
9+
Ependymal 0.222222222222 0.222222222222 0.909294512878 0.0452217126527 0.0434782608696
10+
Mural 0.5625 0.5 0.909706546275 0.182262035811 0.153846153846
11+
Global Evaluations 0.90243902439 0.771127931998 0.901447234232 0.510435109232 0.521938905315
@@ -0,0 +1,11 @@
1+
Accuracy Sensitivity Specificity MCC F1
2+
Interneuron 0.959459459459 0.891891891892 0.910628019324 0.608198495062 0.616822429907
3+
S1 Pyramidal 0.806451612903 0.959677419355 0.901028277635 0.718662414919 0.74375
4+
CA1 Pyramidal 0.962585034014 0.918367346939 0.904605263158 0.801895854226 0.868167202572
5+
Oligodendrocyte 0.987124463519 0.952789699571 0.893871449925 0.791331028153 0.844106463878
6+
Microglia 0.852941176471 0.794117647059 0.913594470046 0.42560906873 0.397058823529
7+
Endothelial 0.964285714286 0.875 0.91134751773 0.551052623211 0.544444444444
8+
Astrocyte 0.984126984127 0.888888888889 0.910607866508 0.578369155432 0.577319587629
9+
Ependymal 0.428571428571 0.285714285714 0.913966480447 0.0619845294688 0.046511627907
10+
Mural 0.588235294118 0.529411764706 0.916384180791 0.209725487819 0.18
11+
Global Evaluations 0.909090909091 0.788428771569 0.908448169507 0.527425406336 0.535353397763
@@ -0,0 +1,11 @@
1+
Accuracy Sensitivity Specificity MCC F1
2+
Interneuron 0.925 0.875 0.914841849148 0.620125427616 0.636363636364
3+
S1 Pyramidal 0.965517241379 0.870689655172 0.917302798982 0.680721355008 0.716312056738
4+
CA1 Pyramidal 0.993103448276 0.941379310345 0.897058823529 0.809968032065 0.872204472843
5+
Oligodendrocyte 0.98 0.96 0.89263803681 0.803557214931 0.857142857143
6+
Microglia 0.964285714286 0.857142857143 0.913043478261 0.425444934323 0.375
7+
Endothelial 0.981132075472 0.830188679245 0.916372202591 0.526405870215 0.52380952381
8+
Astrocyte 0.933333333333 0.933333333333 0.90973871734 0.594379790639 0.583333333333
9+
Ependymal 1.0 0.75 0.912026726058 0.153013529833 0.0697674418605
10+
Mural 0.571428571429 0.52380952381 0.920544835414 0.23437590317 0.21568627451
11+
Global Evaluations 0.911308203991 0.837949262116 0.910396385348 0.5386657842 0.538846621845
@@ -0,0 +1,11 @@
1+
Accuracy Sensitivity Specificity MCC F1
2+
Interneuron 0.882978723404 0.840425531915 0.904702970297 0.601961991556 0.632
3+
S1 Pyramidal 0.796610169492 0.898305084746 0.897959183673 0.663656388489 0.697368421053
4+
CA1 Pyramidal 0.989399293286 0.925795053004 0.885298869144 0.77988656593 0.850649350649
5+
Oligodendrocyte 0.995867768595 0.96694214876 0.872727272727 0.778696109415 0.835714285714
6+
Microglia 0.96 0.88 0.898517673888 0.389034884738 0.323529411765
7+
Endothelial 0.96 0.78 0.904929577465 0.461484077084 0.458823529412
8+
Astrocyte 0.983870967742 0.887096774194 0.89880952381 0.549116828303 0.544554455446
9+
Ependymal 0.444444444444 0.333333333333 0.903695408735 0.0789964763096 0.0612244897959
10+
Mural 0.684210526316 0.526315789474 0.906002265006 0.204147840429 0.178571428571
11+
Global Evaluations 0.89800443459 0.782023746158 0.896960304972 0.500775684695 0.509159485823