1 KNN recognizes handwritten digits 1.1 KNN(K-nearest neighbor algorithm) to recognize handwritten digits 1.1.1 Introduction to KNN 1.1.2 Taking the recognition of handwritten digits as an example to introduce the implementation of the KNN algorithm
In the feature space, if most of the k nearest(that is, the nearest neighbors in the feature space) samples near a sample belong to a certain category, the sample also belongs to this category.
In official words, the so-called K-nearest neighbor algorithm is to give a training data set, for a new input instance, find the K instances closest to the instance in the training data set(that is, the K neighbors mentioned above).), the majority of these K instances belong to a certain class, and the input instance is classified into this class.
There is a sample data set, also known as a training sample set, and each data in the sample set has a label, that is, we know the relationship between each data in the sample set and the category to which it belongs. After inputting data without labels, compare each feature in the new data with the features corresponding to the data in the sample set, and extract the classification labels of the most similar data(nearest neighbors) in the sample set. Generally speaking, we only select the top K most similar data in the sample data set, which is the origin of K in the K nearest neighbor algorithm, usually K is an integer not greater than 20. Finally, the classification with the most occurrences in the K most similar data is selected as the classification of the new data.
advantage
Flexible usage, convenient for small sample prediction, high precision, insensitive to outliers, no data input assumptions
shortcoming
Lack of training phase, unable to cope with multiple samples, high computational complexity and high space complexity
Calculate distance
Euclidean distance, that is
Sort by increasing distance
Select the K points with the smallest distance(generally no more than 20)
Determine the frequency of occurrence of the category in which the first K points belong, frequency = a category/k
Returns the category with the highest frequency in the top K points as the predicted classification of the test data
Training set
The training set is already classified data, you can refer to the directory ~/KNN/knn-digits/trainingDigits
Note: The training set has been trained. If the user wants to train by himself, he needs to back up the original training set first. The number of training sets is large.
test set
The test set is used to test the algorithm, you can refer to the directory ~/KNN/knn-digits/testDigits
Hold down ctrl and slide the mouse wheel up to enlarge the picture to the maximum. Pick the Paint Bucket Tool and choose a black color to fill the entire screen.
Pick the Pencil Tool, choose White for the Color, and the Maximum Width for the Line Thickness. As shown below:
After drawing, save it as a png image(this example takes 8.png as an example), and copy it to the project directory through the WinSCP tool
Code location: ~/KNN/img2file.py
xfrom PIL import Image
import numpy as np
def img2txt(img_path, txt_name):
im = Image.open(img_path). convert('1'). resize((32, 32)) # type:Image.Image
data = np.asarray(im)
np.savetxt(txt_name, data, fmt = '%d', delimiter = '')
img2txt("8.png", "./knn-digits/testDigits/8_1.txt")
img_path: image file name
txt_name: After conversion, save it in the ~/KNN/knn-digits/testDigits directory, the name is 8_1.txt
xxxxxxxxxx
cd KNN
python img2file.py
After the program runs, a file named 8_1.txt will be generated in the KNN directory. Double-click to open it and you can see that this is the case.
It can be seen that the part with the number 8 is roughly surrounded by a number 8.
xxxxxxxxxx
In the ~/KNN directory, open a terminal and run
python knn.py
Code location: ~/KNN/knn.py
xxxxxxxxxx
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Sun Apr 16 14:53:46 2017
@author: jiangkang
"""
import numpy
import operator
import os
import matplotlib.pyplot as plt
def img2vector(filename):
returnVect = numpy.zeros((1, 1024))
file = open(filename)
for i in range(32):
lineStr = file.readline()
for j in range(32):
returnVect [ 0, 32 * i + j ] = int(lineStr [ j ])
return returnVect
def classifier(inX, dataSet, labels, k):
dataSetSize = dataSet.shape [ 0 ]
diffMat = numpy.tile(inX,(dataSetSize, 1)) - dataSet
sqDiffMat = diffMat ** 2
sqDistances = sqDiffMat.sum(axis = 1)
distances = sqDistances ** 0.5
sortedDistIndicies = distances.argsort()
classCount = {}
for i in range(k):
voteIlabel = labels [ sortedDistIndicies [ i ]]
classCount [ voteIlabel ] = classCount.get(voteIlabel, 0) + 1
sortedClassCount = sorted(classCount.items(), key = operator.itemgetter(1), reverse = True)
return sortedClassCount [ 0 ][ 0 ]
#Train first, then test recognition, k represents the k value in the algorithm - select the top K most similar data in the sample data set, the k value is generally an integer not greater than 20
def handWritingClassTest(k):
hwLabels = []
trainingFileList = os.listdir('knn-digits/trainingDigits') training set data
m = len(trainingFileList)
trainingMat = numpy.zeros((m, 1024))
for i in range(m):
fileNameStr = trainingFileList [ i ]
fileStr = fileNameStr.split('.')[ 0 ]
classNumStr = int(fileStr.split('_')[ 0 ])
hwLabels.append(classNumStr)
trainingMat [ i, :] = img2vector("knn-digits/trainingDigits/%s" % fileNameStr)
testFileList = os.listdir('knn-digits/testDigits') test set data
errorCount = 0.0
mTest = len(testFileList)
for i in range(mTest):
fileNameStr = testFileList [ i ]
fileStr = fileNameStr.split('.')[ 0 ]
classNumStr = int(fileStr.split('_')[ 0 ])
vectorTest = img2vector("knn-digits/testDigits/%s" % fileNameStr)
result = classifier(vectorTest, trainingMat, hwLabels, k)
print("The classification result is: %d, the real result is: %d" %(result, classNumStr))
if result != classNumStr :
errorCount += 1.0
'''fileStr = "2.txt"
classNumStr = int(fileStr.split('.')[0])
vectorTest = img2vector("./2.txt")
result = classifier(vectorTest, trainingMat, hwLabels, 3)'''
print("Total number of errors: %d" % errorCount)
print("Error rate: %f" %(errorCount / mTest))
return errorCount
handWritingClassTest(1)
As shown in the figure, the identification is 8, and the identification is correct. If the recognition result is different from the real result, please copy the converted txt file to knn-digits/trainingDigits/ and name it as follows: 8_801.txt, and then retrain it to recognize it normally.
Program Description
The name of the text file converted from the picture, taking 8_1 as an example, knn.py will parse the file name in the program, the underscore is the limit, the front 8 represents the real number, and the latter part can be customized, see the code,
xxxxxxxxxx
classNumStr = int(fileStr.split('_')[0])
Train first, then identify.