A high-performance artificial neural network program that classifies handwritten digits, trained on image data obtained online and built on matrix multiplication
- Implemented a deep neural network in C++ that recognizes handwritten digits, running on a supercomputing cluster
- Trained on 50,000 images and tested on 10,000 images held in local temporary storage, achieving more than 50% accuracy
- Reduced CPU time and improved performance of the neural network classification program with memory caching and SIMD parallelism techniques
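
As a rough illustration of the memory-caching and SIMD-friendly idea in the last bullet, here is a minimal sketch of a matrix stored in one contiguous buffer. The class name `FlatMatrix` and its methods are hypothetical and are not taken from the project's source. The inner loop walks contiguous memory, which compilers can often auto-vectorize at higher optimization levels, though that depends on the compiler and flags.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major matrix backed by a single contiguous buffer.
class FlatMatrix {
public:
    FlatMatrix(std::size_t rows, std::size_t cols, double init = 0.0)
        : rows_(rows), cols_(cols), data_(rows * cols, init) {}

    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }

    // Element (r, c) lives at offset r * cols + c.
    double& at(std::size_t r, std::size_t c) {
        assert(r < rows_ && c < cols_);
        return data_[r * cols_ + c];
    }
    double at(std::size_t r, std::size_t c) const {
        assert(r < rows_ && c < cols_);
        return data_[r * cols_ + c];
    }

    // Matrix product: (rows x cols) * (cols x k) -> (rows x k).
    FlatMatrix dot(const FlatMatrix& rhs) const {
        assert(cols_ == rhs.rows_);
        FlatMatrix out(rows_, rhs.cols_);
        // i-k-j loop order: the innermost loop reads one row of rhs and
        // updates one row of out, both sequentially in memory.
        for (std::size_t i = 0; i < rows_; ++i) {
            double* o = out.data_.data() + i * rhs.cols_;
            for (std::size_t k = 0; k < cols_; ++k) {
                const double a = data_[i * cols_ + k];
                const double* b = rhs.data_.data() + k * rhs.cols_;
                for (std::size_t j = 0; j < rhs.cols_; ++j) {
                    o[j] += a * b[j];
                }
            }
        }
        return out;
    }

private:
    std::size_t rows_;
    std::size_t cols_;
    std::vector<double> data_;
};
```

With this layout an (n, 1) column vector is a single allocation of n doubles, so a weight-matrix-times-input product streams through memory rather than chasing one pointer per row.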
On the Ohio Supercomputer Center Pitzer cluster
| Component | Details |
|---|---|
| CPU Model | Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz |
| CPU/Core Speed | 2.40GHz |
| RAM | 200GB |
| Operating system used | Linux 3.10.0-1160.95.1.el7.x86_64 #1 SMP x86_64 x86_64 x86_64 GNU/Linux |
| Name and version of C++ compiler | gcc version 8.4.0 (GCC) |
| Name and version of other non-standard software tools & components | Linux perf |
Performance Statistics (reference version)
| Rep | User Time (s) | Elapsed Time (s) | Peak memory (KB) |
|---|---|---|---|
| 1 | 32.77 | 33.45 | 3176 |
| 2 | 31.88 | 32.48 | 3176 |
| 3 | 31.40 | 31.95 | 3176 |
| 4 | 32.24 | 32.91 | 3176 |
| 5 | 32.94 | 33.62 | 3176 |
Performance Statistics (improved version)
| Rep | User Time (s) | Elapsed Time (s) | Peak memory (KB) |
|---|---|---|---|
| 1 | 9.47 | 9.75 | 98192 |
| 2 | 9.70 | 9.97 | 98192 |
| 3 | 9.36 | 9.64 | 98192 |
| 4 | 9.47 | 9.77 | 98192 |
| 5 | 9.34 | 9.64 | 98192 |
Performance Statistics (elapsed time, reference vs. improved)
| Replicate# | Reference (s) | Improved (s) |
|---|---|---|
| 1 | 33.45 | 9.75 |
| 2 | 32.48 | 9.97 |
| 3 | 31.95 | 9.64 |
| 4 | 32.91 | 9.77 |
| 5 | 33.62 | 9.64 |
| Average | 32.882 | 9.754 |
| SD | 0.6888904122 | 0.1350185172 |
| 95% CI half-width | 0.8553704235 | 0.167647632 |
| Stats | 32.882 ± 0.86 | 9.754 ± 0.17 |
| T-Test p-value (H0: μ1=μ2) | 0.00000007633396889 | |
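
The average, SD, and 95% CI rows can be reproduced with a short sketch like the one below. It assumes the SD is the sample standard deviation (n - 1 denominator) and that the interval uses a two-sided Student's t critical value, 2.776445 for 4 degrees of freedom. The function names are illustrative and do not come from the project.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Arithmetic mean of the replicates.
double mean(const std::vector<double>& xs) {
    assert(!xs.empty());
    double sum = 0.0;
    for (double x : xs) sum += x;
    return sum / static_cast<double>(xs.size());
}

// Sample standard deviation (divides by n - 1).
double sample_sd(const std::vector<double>& xs) {
    assert(xs.size() > 1);
    const double m = mean(xs);
    double ss = 0.0;
    for (double x : xs) ss += (x - m) * (x - m);
    return std::sqrt(ss / static_cast<double>(xs.size() - 1));
}

// Half-width of the confidence interval: t_crit * SD / sqrt(n).
// t_crit is supplied by the caller, e.g. 2.776445 for 95% with n = 5.
double ci_half_width(const std::vector<double>& xs, double t_crit) {
    return t_crit * sample_sd(xs) / std::sqrt(static_cast<double>(xs.size()));
}
```

Calling these on the five reference elapsed times with `t_crit = 2.776445` gives values that agree with the table's 32.882 ± 0.86 after rounding. The p-value row is not reproduced here because it requires the t-distribution CDF, which the C++ standard library does not provide.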
There is a drastic change in elapsed runtime between the two versions: the average drops from roughly 32.882 to 9.754 seconds, about a 3.4x speedup. The t-test indicates that the change is statistically significant. Two changes account for it. First, changing the matrix representation from a 2D vector to a 1D vector stores the elements contiguously, which enables efficient caching, especially for matrices of size (n, 1). Second, the previous version reloaded the images from disk in every epoch. This version uses a map to cache the image files that are reused across epochs, which reduces I/O time. As a result, peak memory increases roughly 31-fold (from 3176 KB to 98192 KB). In conclusion, this version trades additional memory for a substantial runtime improvement.
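
The image-caching idea can be sketched as follows. This is a minimal illustration, not the project's code: `ImageCache`, `Pixels`, and the injected loader callback are hypothetical names, and `std::unordered_map` is an assumption, since the text only says that a map is used.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical pixel buffer for one image.
using Pixels = std::vector<double>;

// Caches decoded images by file path so each file is read only once,
// no matter how many epochs request it.
class ImageCache {
public:
    using Loader = std::function<Pixels(const std::string&)>;

    // The loader stands in for whatever routine reads an image from disk.
    explicit ImageCache(Loader loader) : loader_(std::move(loader)) {
        assert(static_cast<bool>(loader_));
    }

    // Returns the cached image, loading it on the first request only.
    const Pixels& get(const std::string& path) {
        auto it = cache_.find(path);
        if (it == cache_.end()) {
            ++misses_;
            it = cache_.emplace(path, loader_(path)).first;
        }
        return it->second;
    }

    std::size_t misses() const { return misses_; }  // number of disk loads
    std::size_t size() const { return cache_.size(); }

private:
    Loader loader_;
    std::unordered_map<std::string, Pixels> cache_;
    std::size_t misses_ = 0;
};
```

Every cached image stays resident for the whole run, which is consistent with the higher peak memory reported for the improved version.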