| 1 | -```html
| 2 | -<h3>Braden Katzman & Emily Berghoff<br>
| 3 | -Columbia University Spring 2016<br>
| 4 | -COMS 4761 - Computational Genomics<br>
| 5 | -Professor Itsik Pe'er<br>
| 6 | -Final Project</h3>
| 7 | -
| 8 | -
| 9 | -<br><br>
| 10 | -<div>
| 11 | - <h4>Description (3/13/16):</h4>
| 12 | - <p>
| 13 | - This program classifies single cell types from RNA-seq data in mice. Our approach is to use supervised machine learning algorithms on a single cell RNA-seq dataset with a 80% marked (training) and 20% unmarked (testing) partition. We use 5 machine learning algorithms with the hopes of determining the best supervised learning algorithm to classify unmarked cells.
| 14 | - </p>
| 15 | -</div>
| 16 | -
| 17 | -<br><br>
| 18 | -<div>
| 19 | - <h4>Technical Notes:</h4>
| 20 | - <p>
| 21 | - This project is written in Python using version 2.7. All machine learning algorithms are from the scikit-learn library which uses SciPy, NumPy and matplotlib to implement data mining and data analysis tools. Preprocessing of data is also implemented using scikit
| 22 | - </p>
| 23 | - <p><em>Scikit-learn: http://scikit-learn.org/stable/</em></p>
| 24 | -</div>
| 25 | -
| 26 | -
| 27 | -<br><br>
| 28 | -<div>
| 29 | -<h4>Machine Learning Algorithms:</h4>
| 30 | -<ul>
| 31 | - <li>Naive Bayes</li>
| 32 | - <li>SVMs</li>
| 33 | - <li>Random Forests, Decision Trees</li>
| 34 | - <li>KNNs</li>
| 35 | - <li>Neural Networks</li>
| 36 | -</ul>
| 37 | -</div>
| 38 | -
| 39 | -<div>
| 40 | - <h4>Conclusions</h4>
| 41 | -
| 42 | -</div>
| 43 | -
| 44 | -<div>
| 45 | - <h4>Cell Classifications and RNA-Seq Data obtained from:</h4>
| 46 | -
| 47 | - <p><em>Zeisel, A. et al. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347, 1138-1142 (2015). </em></p>
| 48 | -</div>
| 49 | -```
| 1 | +### Braden Katzman & Emily Berghoff |
| 2 | +### Columbia University Spring 2016 |
| 3 | +### COMS W4761 - Computational Genomics |
| 4 | +### Professor Itsik Pe'er |
| 5 | +### Final Project |
| 6 | + |
| 7 | + |
| 8 | + |
| 9 | +## Abstract: |
| 10 | + |
| 11 | + This program classifies the types of single cells from RNA-seq data in mice. Our approach is to use supervised machine learning algorithms
| 12 | + on a single-cell RNA-seq dataset with a marked training set and an unmarked testing set. We used 4 machine learning algorithms in
| 13 | + the hope of determining the best supervised learning algorithm to classify unmarked cells.
| 14 | + |
| 15 | +## Machine Learning Algorithms: |
| 16 | +- Support Vector Machine using Radial Basis Function Kernel |
| 17 | + |
| 18 | +- Random Forests, Decision Trees |
| 19 | + |
| 20 | +- KNN |
| 21 | + |
| 22 | +- Multi-Layer Perceptron (Neural Network) |
| 23 | + |
| 24 | +## Conclusions: |
| 25 | + We found that the Neural Network gave the best performance, and the Support Vector Machine gave the worst performance. |
| 26 | + |
| 27 | + |
| 28 | +## Technical Notes: |
| 29 | + |
| 30 | + This project is written in Python 2.7. All machine learning algorithms are from the scikit-learn library, which
| 31 | + uses SciPy, NumPy and matplotlib to implement data mining and data analysis tools. The stable version of scikit-learn is available here:
| 32 | + |
| 33 | + - Scikit-learn: http://scikit-learn.org/stable/ |
| 34 | + |
| 35 | + *** IMPORTANT NOTE: The Multi-Layer Perceptron is not supported in the stable version of scikit-learn (as of 5/7/16). In order
| 36 | + to use this classifier, the development version needs to be used. As such, the import
| 37 | + for the Neural Network file is commented out in main.py so that the program runs on the stable
| 38 | + version of scikit-learn. In order to use the Multi-Layer Perceptron, download and compile the
| 39 | + development version of scikit-learn and uncomment 'import neuralNetwork_RNASeq' on line 7 of main.py.
| 40 | + Then, the neural network classifier can be used.
| 41 | + |
| 42 | +## Input Files: |
| 43 | + |
| 44 | +- GSE60361C13005Expression.txt: this file holds the raw RNA-seq data. It is organized by cells (columns) and
| 45 | +their gene expression levels (rows). It is roughly 3000 cells x 20000 genes in size.
| 46 | +
| 47 | +- expressionmRNAAnnotations.txt: this file holds the annotations for the raw data, including molecule count
| 48 | +and cell type (1-9) classification.
| 49 | + |
| 50 | + *** IMPORTANT NOTE: |
| 51 | + GitHub's file size limit is 100MB. Both the raw data and annotations files exceed this limit. As such, both files are compressed into an archive, 'RawData_Annotations_Compress.zip'. Upon cloning the source code, unpack this .zip and either supply the path to each file in the unpacked directory when running main.py, or move the 2 files into the root directory of the project and supply just the file names.
| 52 | + |
| 53 | +## Output Files: |
| 54 | + |
| 55 | +- RNASeq_SingleCellClassification_Results: this directory holds the results from each of the classifiers.
| 56 | +There are 4 run options in terms of preprocessing, and each of these options has a subdirectory in the
| 57 | +results folder. The 4 options are the combinations of two settings: down sampling or no down
| 58 | +sampling, and 10-fold cross validation or a single fold (no cross validation). In order to converge on the 'true'
| 59 | +results of the classifiers, each classifier is run 5 times on each of the 4 preprocessing run options.
| 60 | +In addition, as the user has the ability to supply the number of neighbors to be used for KNN, there are 5
| 61 | +runs for each K = 1-8, on each of the 4 preprocessing run options.
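The run grid described above can be enumerated with a short sketch. This is only an illustration of the counts stated in this section, not code from the repository: `enumerate_runs` and the tuple layout are hypothetical names, and the classifier numbering is taken from the usage section of this README.

```python
from itertools import product

# Classifier numbering as listed in "How to Run the Program".
CLASSIFIERS = {1: "SVM", 2: "Neural Network", 3: "KNN", 4: "Random Forest"}
RUNS_PER_OPTION = 5          # each configuration is repeated 5 times
KNN_NEIGHBORS = range(1, 9)  # K = 1-8


def enumerate_runs():
    """Yield (classifier_id, down_sample, cross_validate, n_neighbors, run_index)."""
    for clf_id, down_sample, cross_validate in product(sorted(CLASSIFIERS), (0, 1), (0, 1)):
        # Only KNN takes a neighbor count; the other classifiers use None.
        neighbor_options = KNN_NEIGHBORS if CLASSIFIERS[clf_id] == "KNN" else (None,)
        for n_neighbors, run_index in product(neighbor_options, range(1, RUNS_PER_OPTION + 1)):
            yield clf_id, down_sample, cross_validate, n_neighbors, run_index


runs = list(enumerate_runs())
knn_runs = [run for run in runs if CLASSIFIERS[run[0]] == "KNN"]
print(len(runs), len(knn_runs))
```

Under these assumptions, KNN accounts for 8 x 4 x 5 runs and each of the other three classifiers for 4 x 5 runs.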
| 62 | + |
| 63 | +## How to Run the Program: |
| 64 | + |
| 65 | + The usage of the program is as follows: |
| 66 | + |
| 67 | +- python main.py <raw_data> <data_annotations> <classifier 1-4> <down_sample 0 or 1> <cross_validate 0 or 1> <n_neighbors> |
| 68 | + |
| 69 | + * To select a classifier, supply a number 1-4 corresponding to: |
| 70 | + - 1 == SVM |
| 71 | + - 2 == Neural Network |
| 72 | + - 3 == KNN |
| 73 | + - 4 == Random Forest |
| 74 | + |
| 75 | + * To enable down sampling, supply 1. To disable, supply 0.
| 76 | +
| 77 | + * To enable cross validation, supply 1. To disable, supply 0.
| 78 | + |
| 79 | + *** ONLY IF USING KNN: |
| 80 | + * Supply a valid integer n_neighbors |
| 81 | + |
| 82 | + |
| 83 | +Examples: |
| 84 | + |
| 85 | + KNN with 5 neighbors, down sampling enabled, cross validation disabled: |
| 86 | + |
| 87 | + - python main.py GSE60361C13005Expression.txt expressionmRNAAnnotations.txt 3 1 0 5 |
| 88 | + |
| 89 | + SVM with down sampling disabled, cross validation enabled:
| 90 | + |
| 91 | + - python main.py GSE60361C13005Expression.txt expressionmRNAAnnotations.txt 1 0 1 |
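The argument rules above can be sketched as a small validation function. This is a minimal illustration and not the parsing code in main.py, which is not shown here: `parse_args`, `USAGE`, and the keys of the returned dictionary are hypothetical names, and ignoring a stray n_neighbors for non-KNN classifiers is an assumption.

```python
# Classifier numbering as listed in this README.
CLASSIFIERS = {1: "SVM", 2: "Neural Network", 3: "KNN", 4: "Random Forest"}

USAGE = ("usage: python main.py <raw_data> <data_annotations> <classifier 1-4> "
         "<down_sample 0 or 1> <cross_validate 0 or 1> <n_neighbors>")


def parse_args(args):
    """Validate the arguments that follow 'main.py'; raise ValueError if invalid."""
    if len(args) not in (5, 6):
        raise ValueError(USAGE)
    classifier, down_sample, cross_validate = (int(a) for a in args[2:5])
    if classifier not in CLASSIFIERS:
        raise ValueError("classifier must be a number 1-4")
    if down_sample not in (0, 1) or cross_validate not in (0, 1):
        raise ValueError("down_sample and cross_validate must each be 0 or 1")
    n_neighbors = None
    if CLASSIFIERS[classifier] == "KNN":
        # n_neighbors is only required (and only read) when KNN is selected.
        if len(args) != 6:
            raise ValueError("KNN requires n_neighbors")
        n_neighbors = int(args[5])
        if n_neighbors < 1:
            raise ValueError("n_neighbors must be a positive integer")
    return {
        "raw_data": args[0],
        "annotations": args[1],
        "classifier": CLASSIFIERS[classifier],
        "down_sample": bool(down_sample),
        "cross_validate": bool(cross_validate),
        "n_neighbors": n_neighbors,
    }


# The two example invocations from this section, minus 'python main.py'.
knn_run = parse_args("GSE60361C13005Expression.txt expressionmRNAAnnotations.txt 3 1 0 5".split())
svm_run = parse_args("GSE60361C13005Expression.txt expressionmRNAAnnotations.txt 1 0 1".split())
```

Returning a dictionary keeps the later steps (preprocessing, classification, analysis) independent of argument positions.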
| 92 | + |
| 93 | + |
| 94 | + |
| 95 | +## Description of Each File: |
| 96 | + |
| 97 | +- main.py: |
| 98 | + This file processes the command line arguments supplied and runs the program using the selected classifier. Based on user input,
| 99 | + the program decides whether to down sample, whether to cross validate, and which classifier to use. This file defines the four classifiers
| 100 | + supplied to the user for classification: Support Vector Machine using Radial Basis Function Kernel, Multi-Layer Perceptron (Neural Network),
| 101 | + K-Nearest Neighbor, and Random Forest. After calling the preprocessing code and running the classification, this file sends the
| 102 | + results to an analysis file that performs evaluations and writes the results to a file.
| 103 | + |
| 104 | +- RNASeqData.py |
| 105 | + This class object represents the RNA-seq data. It holds the raw data and the annotations,
| 106 | + and provides methods for partitioning the data. The partitioning methods (with or without down
| 107 | + sampling, and with or without cross validation) randomly split
| 108 | + the data into training and testing sets, while also holding the annotations
| 109 | + for the randomly selected testing data. The class also provides accessor methods for all data,
| 110 | + annotations, training data, testing data, and training data target values to evaluate performance. |
| 111 | + |
| 112 | +- preprocess.py |
| 113 | + This file does all of the preprocessing work before running classification. It loads the
| 114 | + raw data and annotations into memory and is then used to down sample by both
| 115 | + cluster size and molecule count.
| 116 | + |
| 117 | +- analysis.py |
| 118 | + After classification, this file is used to evaluate the performance of the classifier and write the results to an output file.
| 119 | + First, this file creates a confusion matrix and then computes the accuracy, sensitivity, specificity, MCC, and F1 Score at both
| 120 | + the class level and the global level across the supplied number of folds (10 folds for cross validation and 1 fold for
| 121 | + non-cross validation). This file also computes a basic metric, the count of correct classifications, which was used
| 122 | + initially to check the performance of the classifiers. After evaluating, the results are written to a file.
| 123 | + |
| 124 | +- knn_RNASeq.py |
| 125 | + This file defines the K Nearest Neighbor classifier. It allows the user to specify the number of neighbors used
| 126 | + and then fits the classifier to the training data and its target values. Then, it takes the testing data and makes predictions,
| 127 | + returning the results of the predictions.
| 128 | + |
| 129 | +- neuralNetwork_RNASeq.py |
| 130 | + This file defines the Multi-Layer Perceptron (Neural Network). It fits the classifier to the training data and
| 131 | + its target values. Then, it takes the testing data and makes predictions, returning the
| 132 | + results of the predictions.
| 133 | + |
| 134 | +- randomForest_RNASeq.py |
| 135 | + This file defines the Random Forest Classifier. It fits the classifier to the training data and
| 136 | + its target values. Then, it takes the testing data and makes predictions, returning the
| 137 | + results of the predictions.
| 138 | + |
| 139 | +- rbfSVC_RNASeq.py |
| 140 | + This file defines the Support Vector Machine using a Radial Basis Function Kernel. It fits the classifier to the training data and
| 141 | + its target values. Then, it takes the testing data and makes predictions, returning the
| 142 | + results of the predictions.
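The random partitioning described for RNASeqData.py can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the class's actual code: `make_folds`, `partitions`, and `single_split` are hypothetical names, cells are represented only by their column indices, and the 20% testing fraction for the single-fold case is an assumption.

```python
import random


def make_folds(num_cells, num_folds, seed=None):
    """Randomly split cell indices 0..num_cells-1 into disjoint testing folds."""
    indices = list(range(num_cells))
    random.Random(seed).shuffle(indices)
    # Striding over the shuffled list gives folds whose sizes differ by at most 1.
    return [indices[i::num_folds] for i in range(num_folds)]


def partitions(num_cells, num_folds=10, seed=None):
    """Yield (training_indices, testing_indices), one pair per fold."""
    for testing in make_folds(num_cells, num_folds, seed):
        held_out = set(testing)
        training = [i for i in range(num_cells) if i not in held_out]
        yield training, testing


def single_split(num_cells, test_fraction=0.2, seed=None):
    """One random training/testing split for the no-cross-validation case."""
    indices = list(range(num_cells))
    random.Random(seed).shuffle(indices)
    num_test = int(round(num_cells * test_fraction))
    return indices[num_test:], indices[:num_test]


# About 3000 cells, as stated in the Input Files section.
folds = list(partitions(3000, num_folds=10, seed=0))
train_once, test_once = single_split(3000, seed=0)
```

With 10 folds, every cell appears in exactly one testing set, so its annotation is used for evaluation exactly once.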
| 143 | + |
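Down sampling by cluster size, mentioned for preprocess.py, could look like the sketch below. It assumes that "down sampling by cluster size" means keeping the same number of randomly chosen cells from every cell type, equal to the size of the smallest type; the README does not state the exact rule, so treat this as one plausible reading. `down_sample_by_cluster_size` is a hypothetical name, and down sampling by molecule count is not covered.

```python
import random
from collections import defaultdict


def down_sample_by_cluster_size(labels, seed=None):
    """Return sorted cell indices with an equal number of cells kept per cell type."""
    by_type = defaultdict(list)
    for index, label in enumerate(labels):
        by_type[label].append(index)
    # Assumption: every cell type is reduced to the size of the smallest one.
    smallest = min(len(members) for members in by_type.values())
    rng = random.Random(seed)
    kept = []
    for label in sorted(by_type):
        kept.extend(rng.sample(by_type[label], smallest))
    return sorted(kept)


# Toy annotations: cell types 1, 2, and 3 with 5, 3, and 10 cells.
toy_labels = [1] * 5 + [2] * 3 + [3] * 10
kept_indices = down_sample_by_cluster_size(toy_labels, seed=0)
kept_labels = [toy_labels[i] for i in kept_indices]
```

Balancing the classes this way prevents a large cell type from dominating training, at the cost of discarding data.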
| 144 | + |
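The class-level metrics listed for analysis.py can be derived from a confusion matrix as sketched below. This is an illustration using the standard one-vs-rest definitions, not the code in analysis.py: the function names are hypothetical, returning 0.0 when a denominator is zero is an assumption, and overall accuracy is shown as one possible global metric.

```python
import math


def confusion_matrix(true_labels, predicted_labels, classes):
    """matrix[i][j] counts cells of true class classes[i] predicted as classes[j]."""
    position = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, predicted in zip(true_labels, predicted_labels):
        matrix[position[truth]][position[predicted]] += 1
    return matrix


def safe_div(numerator, denominator):
    """Divide as floats; return 0.0 when the denominator is zero (an assumption)."""
    return float(numerator) / denominator if denominator else 0.0


def class_metrics(matrix, i):
    """One-vs-rest metrics for the class at row/column i."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[i][i]
    fn = sum(matrix[i]) - tp
    fp = sum(row[i] for row in matrix) - tp
    tn = total - tp - fn - fp
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": safe_div(tp + tn, total),
        "sensitivity": safe_div(tp, tp + fn),
        "specificity": safe_div(tn, tn + fp),
        "f1": safe_div(2 * tp, 2 * tp + fp + fn),
        "mcc": safe_div(tp * tn - fp * fn, mcc_denominator),
    }


def overall_accuracy(matrix):
    """Fraction of all cells on the diagonal, i.e. classified correctly."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return safe_div(correct, sum(sum(row) for row in matrix))


# Toy example with three cell types.
truth = [1, 1, 1, 2, 2, 3]
predicted = [1, 1, 2, 2, 2, 1]
matrix = confusion_matrix(truth, predicted, classes=[1, 2, 3])
type_1 = class_metrics(matrix, 0)
```

For cross validation, the same functions can be applied to each fold's confusion matrix and the results averaged across folds.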
| 145 | + |
| 146 | +## Bibliography: |
| 147 | + Project motivated by data from: |
| 148 | + |
| 149 | +- Zeisel, A. et al. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347, 1138-1142 (2015). |