# SimpleCudaNeuralNet
This is a project for studying both neural networks and CUDA.

I focused on simplicity and conciseness while coding, which means there is no error handling beyond assertions. It is a self-study result aimed at a better understanding of the back-propagation algorithm. I hope this C++ code fragment helps anyone interested in deep learning. [CS231n](http://cs231n.stanford.edu/2017/syllabus) from Stanford provides a good starting point for learning deep learning.

## Status
#### Weight layers
* 2D convolution
* Fully connected
* Batch normalization

#### Non-linearity
* ReLU

#### Regularization
* Max pooling
* Dropout

#### Loss
* Mean squared error
* Cross-entropy loss

#### Optimizer
* Adam
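
To give a feel for how such components look in CUDA, here is a minimal sketch of the ReLU non-linearity as forward and backward kernels. This is illustrative only, not the exact code in this repository.

```cuda
// Forward pass: out = max(0, in), one thread per element.
__global__ void reluForward(float* out, const float* in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] > 0.0f ? in[i] : 0.0f;
}

// Backward pass: the upstream gradient flows through only where the
// forward input was positive.
__global__ void reluBackward(float* dIn, const float* dOut, const float* in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dIn[i] = in[i] > 0.0f ? dOut[i] : 0.0f;
}
```

A typical launch covers all `n` elements with 256-thread blocks: `reluForward<<<(n + 255) / 256, 256>>>(out, in, n);`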

## Result
### Handwritten digit recognition

After implementing the basic components for deep learning, I built a handwritten digit recognizer using the [MNIST database](http://yann.lecun.com/exdb/mnist/). A simple 2-layer fully connected network with 1,000 hidden units achieved a 1.56% top-1 error rate after 14 epochs, which take less than 20 seconds of total training time on an RTX 2070 graphics card. (See [mnist.cpp](mnist.cpp))
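
For reference, the forward pass of such a 2-layer network is just two matrix-vector products with a ReLU in between. The sketch below is a plain CPU version with made-up names (784 -> 1000 -> 10); it does not mirror the repository's classes.

```cpp
#include <algorithm>
#include <vector>

// x: 784 pixels; w1/b1: hidden layer (1000 x 784); w2/b2: output layer (10 x 1000).
std::vector<float> forward(const std::vector<float>& x,
                           const std::vector<float>& w1, const std::vector<float>& b1,
                           const std::vector<float>& w2, const std::vector<float>& b2)
{
    std::vector<float> h(1000), y(10);
    for (int j = 0; j < 1000; ++j) {            // hidden layer + ReLU
        float s = b1[j];
        for (int i = 0; i < 784; ++i) s += w1[j * 784 + i] * x[i];
        h[j] = std::max(0.0f, s);
    }
    for (int k = 0; k < 10; ++k) {              // output logits, fed to softmax/cross-entropy
        float s = b2[k];
        for (int j = 0; j < 1000; ++j) s += w2[k * 1000 + j] * h[j];
        y[k] = s;
    }
    return y;
}
```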

### CIFAR-10 photo classification

In [cifar10.cpp](cifar10.cpp), you can find a VGG-like convolutional network with 8 weight layers, trained on the [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html) dataset. It achieves a 12.3% top-1 error rate after 31 epochs, at 26.5 seconds of training time per epoch on my RTX 2070. If you try a larger model and have enough time to train it, you can improve on this.
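
As a rough picture of what "VGG-like with 8 weight layers" means, a plausible stack is outlined below. The channel counts here are assumptions for illustration; see [cifar10.cpp](cifar10.cpp) for the actual configuration.

```cpp
// input: 3 x 32 x 32 image
// conv3x3(64)  -> ReLU -> conv3x3(64)  -> ReLU -> maxpool 2x2   // weight layers 1-2
// conv3x3(128) -> ReLU -> conv3x3(128) -> ReLU -> maxpool 2x2   // weight layers 3-4
// conv3x3(256) -> ReLU -> conv3x3(256) -> ReLU -> maxpool 2x2   // weight layers 5-6
// fully connected(512) -> ReLU -> dropout                       // weight layer  7
// fully connected(10)  -> softmax + cross-entropy loss          // weight layer  8
```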

### Notes
- Even a naive CUDA implementation easily achieves a speedup of more than 700x over a single-core, non-SIMD CPU version.
- Double-precision floating point in the CUDA kernels was 3~4x slower than single-precision operations.
- Training performance is not comparable to PyTorch, which trains the same model roughly 7x faster.
- Coding these kinds of numerical algorithms is tricky, and it can be hard even to tell whether a bug exists. Thorough unit testing of every function is strongly recommended if you try; a gradient-check sketch follows below.
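
One common way to unit-test back-propagation code is gradient checking: compare the analytic gradient against a central-difference numerical gradient. The sketch below is generic and uses made-up names, not code from this repository.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <functional>
#include <vector>

// Verifies analyticGrad against a numerical estimate of d(loss)/d(params[i]).
void checkGradient(const std::function<float(const std::vector<float>&)>& loss,
                   std::vector<float> params,
                   const std::vector<float>& analyticGrad,
                   float eps = 1e-3f, float tol = 1e-2f)
{
    for (size_t i = 0; i < params.size(); ++i) {
        float saved = params[i];
        params[i] = saved + eps; float lossPlus  = loss(params);
        params[i] = saved - eps; float lossMinus = loss(params);
        params[i] = saved;                       // restore the parameter
        float numeric = (lossPlus - lossMinus) / (2.0f * eps);
        // Relative error is robust to the scale of individual gradients.
        float relErr = std::fabs(numeric - analyticGrad[i]) /
                       std::max(1e-8f, std::fabs(numeric) + std::fabs(analyticGrad[i]));
        assert(relErr < tol);
    }
}
```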