
Commit ca14eaa

update Knn

1 parent 7676d21 commit ca14eaa

File tree

7 files changed (+324, -2 lines)


README.md

Lines changed: 3 additions & 1 deletion
```diff
@@ -151,4 +151,6 @@ code_109 | [BLOB Feature Analysis](python/code_109) | ✔️
 code_110 | [KMeans Data Classification](python/code_110) | ✔️
 code_111 | [KMeans Image Segmentation](python/code_111) | ✔️
 code_112 | [KMeans Background Change](python/code_112) | ✔️
-code_113 | [KMeans Extract Image Color Card](python/code_113) | ✔️
+code_113 | [KMeans Extract Image Color Card](python/code_113) | ✔️
+code_114 | [KNN Classification](python/code_114) | ✔️
+code_115 | [KNN-Train Data Save and Load](python/code_115) | ✔️
```

README_CN.md

Lines changed: 3 additions & 1 deletion
```diff
@@ -150,4 +150,6 @@ code_109 | [BLOB Feature Analysis](python/code_109) | ✔️
 code_110 | [KMeans Data Classification](python/code_110) | ✔️
 code_111 | [KMeans Image Segmentation](python/code_111) | ✔️
 code_112 | [KMeans Image Replacement](python/code_112) | ✔️
-code_113 | [KMeans Image Color Card Extraction](python/code_113) | ✔️
+code_113 | [KMeans Image Color Card Extraction](python/code_113) | ✔️
+code_114 | [KNN Classification Model](python/code_114) | ✔️
+code_115 | [KNN Data Saving](python/code_115) | ✔️
```

python/code_114/README.md

Lines changed: 116 additions & 0 deletions
@@ -0,0 +1,116 @@

# OpenCV KNN Algorithm
✏️ ⛳️👍 ✔️

## Overview

**What is KNN?**

✔️ The k-Nearest Neighbor (kNN) algorithm is one of the simplest classification methods in data mining. The intuition is simply that a sample takes on the character of its neighbors: it is assigned to whichever class dominates among the samples closest to it.

**How KNN works**

✔️ To classify an unknown sample, all samples with known labels are used as a reference: compute the distance from the unknown sample to every known sample, select the K known samples closest to it, and assign the unknown sample to the class that holds the majority among those K nearest neighbors (majority voting).

**Intuition**

The figure below shows a binary classification problem in which the class of the red star has to be determined.

- With k = 3 there are two blue points and one green point, so the red star is assigned to the blue class;
- With k = 5 there are three green points and two blue points, so the red star is assigned to the green class.

<img src=https://i.loli.net/2019/09/19/eS9tI1buNmnCW8Y.png width=350>

> The choice of k is critical: a value that is too large tends to underfit, while one that is too small tends to overfit; cross-validation is usually used to select k.

**Algorithm steps**

1) Compute the distance between the test sample and every training sample;

2) Sort the distances in ascending order;

3) Select the K training samples with the smallest distances;

4) Count how often each class appears among those K samples;

5) Return the most frequent class among those K samples as the prediction for the test sample.
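The five steps above map directly onto a few lines of NumPy. The following is a minimal sketch, not part of the original example; the helper name `knn_predict` and the toy points are made up for illustration:

```python
import numpy as np

def knn_predict(train, train_labels, sample, k=3):
    # 1) distance from the test sample to every training sample
    dists = np.linalg.norm(train - sample, axis=1)
    # 2) + 3) indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # 4) count how often each class appears among those k neighbors
    classes, votes = np.unique(train_labels[nearest], return_counts=True)
    # 5) return the most frequent class
    return classes[np.argmax(votes)]

# Toy data: two 2-D clusters labelled 0 and 1
train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
train_labels = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(train, train_labels, np.array([1.1, 0.9]), k=3))  # -> 0
```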
**Advantages**

1. Simple, easy to understand and easy to implement; no parameters to estimate and no explicit training phase;

2. Suitable for classifying rare events;

3. Well suited to multi-class problems, where kNN can outperform SVM.

**Disadvantages**

1. When the training set is large, a lot of memory is needed, and the distance from the test sample to every training sample has to be computed, which is very time-consuming;

2. KNN performs poorly on class-imbalanced or randomly distributed data.

## Functions

1) Create: `cv2.ml.KNearest_create()`

2) Train: `knn.train(train, cv.ml.ROW_SAMPLE, train_labels)`

3) Predict: `ret, result, neighbours, dist = knn.findNearest(sample, k=5)`

- sample: the data samples to be classified;
- k: the number of nearest neighbors to use;
- result: the predicted labels;
- neighbours: the labels of the k nearest neighbors of each sample;
- dist: the distances to the k nearest neighbors of each sample.
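A minimal usage sketch of these three calls, on made-up 2-D points with random labels purely for illustration (only the `cv.ml.KNearest` API itself is taken from the list above):

```python
import numpy as np
import cv2 as cv

# 25 made-up 2-D points with random labels 0 or 1
train = np.random.randint(0, 100, (25, 2)).astype(np.float32)
train_labels = np.random.randint(0, 2, (25, 1)).astype(np.float32)

knn = cv.ml.KNearest_create()                      # 1) create
knn.train(train, cv.ml.ROW_SAMPLE, train_labels)   # 2) train

sample = np.random.randint(0, 100, (1, 2)).astype(np.float32)
ret, result, neighbours, dist = knn.findNearest(sample, k=5)   # 3) predict
print(result, neighbours, dist)
```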
## Example

The official OpenCV samples ship an image containing 5000 handwritten digits (digits.png). We use this image together with KNN to recognize handwritten digits:

<img src=https://i.loli.net/2019/09/19/kBlJrMeyRIbLASU.png width=800>

**Code**

```python
import numpy as np
import cv2 as cv

# Read the data
img = cv.imread('digits.png')
gray = cv.cvtColor(img, cv.COLOR_BGR2GRAY)
cells = [np.hsplit(row, 100) for row in np.vsplit(gray, 50)]
x = np.array(cells)

# Build the training and test sets
train = x[:, :50].reshape(-1, 400).astype(np.float32)    # Size = (2500,400)
test = x[:, 50:100].reshape(-1, 400).astype(np.float32)  # Size = (2500,400)
k = np.arange(10)
train_labels = np.repeat(k, 250)[:, np.newaxis]
test_labels = train_labels.copy()

# Create and train the KNN model
knn = cv.ml.KNearest_create()
knn.train(train, cv.ml.ROW_SAMPLE, train_labels)
ret, result, neighbours, dist = knn.findNearest(test, k=5)

# Compute the accuracy
matches = result == test_labels
correct = np.count_nonzero(matches)
accuracy = correct * 100.0 / result.size
print('acc = ', accuracy)
```

Output
> acc = 91.76
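As noted earlier, the choice of k matters. Since the example already keeps a labelled test set, one simple way to pick k (a sketch, not part of the original code) is to sweep a few values and compare accuracies:

```python
# Reuses knn, test and test_labels from the example above
for k_try in (1, 3, 5, 7, 9):
    _, res, _, _ = knn.findNearest(test, k=k_try)
    acc = np.count_nonzero(res == test_labels) * 100.0 / res.size
    print('k = %d, acc = %.2f' % (k_try, acc))
```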
One way to further improve accuracy is to add more training data, especially for the kinds of samples that are currently misclassified. It is also best to save the training data after each run so it can be reused next time.
```python
# Save the training data
np.savez('knn_data.npz', train=train, train_labels=train_labels)

# Load the training data
with np.load('knn_data.npz') as data:
    print(data.files)
    train = data['train']
    train_labels = data['train_labels']
```

python/code_114/digits.png

704 KB

python/code_114/opencv_114.py

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
```python
import numpy as np
import cv2 as cv

# Read the data
img = cv.imread('digits.png')
gray = cv.cvtColor(img, cv.COLOR_BGR2GRAY)
cells = [np.hsplit(row, 100) for row in np.vsplit(gray, 50)]
x = np.array(cells)

# Build the training and test sets
train = x[:, :50].reshape(-1, 400).astype(np.float32)    # Size = (2500,400)
test = x[:, 50:100].reshape(-1, 400).astype(np.float32)  # Size = (2500,400)
k = np.arange(10)
train_labels = np.repeat(k, 250)[:, np.newaxis]
test_labels = train_labels.copy()

# Create and train the KNN model
knn = cv.ml.KNearest_create()
knn.train(train, cv.ml.ROW_SAMPLE, train_labels)
ret, result, neighbours, dist = knn.findNearest(test, k=5)

# Compute the accuracy
matches = result == test_labels
correct = np.count_nonzero(matches)
accuracy = correct * 100.0 / result.size
print(accuracy)

'''
# Save the training data
np.savez('knn_data.npz', train=train, train_labels=train_labels)

# Load the training data
with np.load('knn_data.npz') as data:
    print(data.files)
    train = data['train']
    train_labels = data['train_labels']
'''
```

python/code_115/README.md

Lines changed: 116 additions & 0 deletions
@@ -0,0 +1,116 @@

# OpenCV KNN Algorithm
✏️ ⛳️👍 ✔️

## Overview

**What is KNN?**

✔️ The k-Nearest Neighbor (kNN) algorithm is one of the simplest classification methods in data mining. The intuition is simply that a sample takes on the character of its neighbors: it is assigned to whichever class dominates among the samples closest to it.

**How KNN works**

✔️ To classify an unknown sample, all samples with known labels are used as a reference: compute the distance from the unknown sample to every known sample, select the K known samples closest to it, and assign the unknown sample to the class that holds the majority among those K nearest neighbors (majority voting).

**Intuition**

The figure below shows a binary classification problem in which the class of the red star has to be determined.

- With k = 3 there are two blue points and one green point, so the red star is assigned to the blue class;
- With k = 5 there are three green points and two blue points, so the red star is assigned to the green class.

<img src=https://i.loli.net/2019/09/19/eS9tI1buNmnCW8Y.png width=350>

> The choice of k is critical: a value that is too large tends to underfit, while one that is too small tends to overfit; cross-validation is usually used to select k.

**Algorithm steps**

1) Compute the distance between the test sample and every training sample;

2) Sort the distances in ascending order;

3) Select the K training samples with the smallest distances;

4) Count how often each class appears among those K samples;

5) Return the most frequent class among those K samples as the prediction for the test sample.

**Advantages**

1. Simple, easy to understand and easy to implement; no parameters to estimate and no explicit training phase;

2. Suitable for classifying rare events;

3. Well suited to multi-class problems, where kNN can outperform SVM.

**Disadvantages**

1. When the training set is large, a lot of memory is needed, and the distance from the test sample to every training sample has to be computed, which is very time-consuming;

2. KNN performs poorly on class-imbalanced or randomly distributed data.

## Functions

1) Create: `cv2.ml.KNearest_create()`

2) Train: `knn.train(train, cv.ml.ROW_SAMPLE, train_labels)`

3) Predict: `ret, result, neighbours, dist = knn.findNearest(sample, k=5)`

- sample: the data samples to be classified;
- k: the number of nearest neighbors to use;
- result: the predicted labels;
- neighbours: the labels of the k nearest neighbors of each sample;
- dist: the distances to the k nearest neighbors of each sample.

## Example

The official OpenCV samples ship an image containing 5000 handwritten digits (digits.png). We use this image together with KNN to recognize handwritten digits:

<img src=https://i.loli.net/2019/09/19/kBlJrMeyRIbLASU.png width=800>

**Code**

```python
import numpy as np
import cv2 as cv

# Read the data
img = cv.imread('digits.png')
gray = cv.cvtColor(img, cv.COLOR_BGR2GRAY)
cells = [np.hsplit(row, 100) for row in np.vsplit(gray, 50)]
x = np.array(cells)

# Build the training and test sets
train = x[:, :50].reshape(-1, 400).astype(np.float32)    # Size = (2500,400)
test = x[:, 50:100].reshape(-1, 400).astype(np.float32)  # Size = (2500,400)
k = np.arange(10)
train_labels = np.repeat(k, 250)[:, np.newaxis]
test_labels = train_labels.copy()

# Create and train the KNN model
knn = cv.ml.KNearest_create()
knn.train(train, cv.ml.ROW_SAMPLE, train_labels)
ret, result, neighbours, dist = knn.findNearest(test, k=5)

# Compute the accuracy
matches = result == test_labels
correct = np.count_nonzero(matches)
accuracy = correct * 100.0 / result.size
print('acc = ', accuracy)
```

Output
> acc = 91.76

One way to further improve accuracy is to add more training data, especially for the kinds of samples that are currently misclassified. It is also best to save the training data after each run so it can be reused next time.
```python
# Save the training data
np.savez('knn_data.npz', train=train, train_labels=train_labels)

# Load the training data
with np.load('knn_data.npz') as data:
    print(data.files)
    train = data['train']
    train_labels = data['train_labels']
```
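Building on the save/load block above, newly labelled samples can be appended to the stored arrays before re-saving, so the training set grows from run to run. A minimal sketch; `new_samples` and `new_labels` are hypothetical placeholders for freshly labelled data:

```python
import numpy as np

# Load the previously saved training data
with np.load('knn_data.npz') as data:
    train = data['train']
    train_labels = data['train_labels']

# Hypothetical newly labelled digits: (n, 400) float32 images and (n, 1) labels
new_samples = np.zeros((1, 400), np.float32)   # placeholder for real data
new_labels = np.array([[0]])                   # placeholder label

# Append and save again, so the next run trains on the enlarged set
train = np.vstack([train, new_samples])
train_labels = np.vstack([train_labels, new_labels])
np.savez('knn_data.npz', train=train, train_labels=train_labels)
```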

python/code_115/opencv_115.py

Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
@Date: 2019-09-19 18:13:22

@author: JimmyHua
"""

import numpy as np
import cv2 as cv

# Read the data
img = cv.imread('../code_114/digits.png')
gray = cv.cvtColor(img, cv.COLOR_BGR2GRAY)
cells = [np.hsplit(row, 100) for row in np.vsplit(gray, 50)]
x = np.array(cells)

# Build the training and test sets
train = x[:, :50].reshape(-1, 400).astype(np.float32)    # Size = (2500,400)
test = x[:, 50:100].reshape(-1, 400).astype(np.float32)  # Size = (2500,400)
k = np.arange(10)
train_labels = np.repeat(k, 250)[:, np.newaxis]

# Save the training data
np.savez('knn_data.npz', train=train, train_labels=train_labels)

test_labels = train_labels.copy()

# Load the training data
with np.load('knn_data.npz') as data:
    print(data.files)
    train = data['train']
    train_labels = data['train_labels']

# Create the KNN model and train it on the loaded data
knn = cv.ml.KNearest_create()
knn.train(train, cv.ml.ROW_SAMPLE, train_labels)
ret, result, neighbours, dist = knn.findNearest(test, k=5)

# Compute the accuracy
matches = result == test_labels
correct = np.count_nonzero(matches)
accuracy = correct * 100.0 / result.size
print(accuracy)
```
