OCR

Optical character recognition

Feature Extraction

[I reduced the number of features in the training data to the first 1150 so that I could store them in the model dictionary under the key '14395', while still leaving space for the stop list in model['stop']. Storing them carries the training data over to the function call where dimensionality reduction is applied to the dev data. The training data has to be accessible there because I use it to calculate the mean and the covariance matrix needed to find the principal components (PCA) of the test data. The model dictionary also has a key called 'best', which stores the indexes of the 10 principal components with the highest divergence. To find them, I calculated the divergence of each feature for every pairwise combination of classes, summed the divergences of each feature, and kept the indexes of the 10 features with the highest totals. I get the best results when I select the top 10 principal components out of 20.]
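The divergence-based selection described above can be sketched as follows. This is a minimal illustration, not the code in this repository: it assumes each feature is modelled per class as a 1-D Gaussian, and the names `divergence_1d` and `select_best_features` are hypothetical. The PCA projection itself is not shown; the input vectors stand in for data that has already been projected.

```python
from itertools import combinations
from statistics import fmean, pvariance


def divergence_1d(a, b, eps=1e-9):
    """Divergence between two classes for one feature (1-D Gaussian assumption)."""
    m1, m2 = fmean(a), fmean(b)
    # eps avoids division by zero when a feature is constant within a class
    v1, v2 = pvariance(a) + eps, pvariance(b) + eps
    return 0.5 * (v1 / v2 + v2 / v1 - 2) + 0.5 * (m1 - m2) ** 2 * (1 / v1 + 1 / v2)


def select_best_features(vectors, labels, n_best=10):
    """Return the indexes of the n_best features with the largest summed divergence."""
    # Group the feature vectors by class label
    by_class = {}
    for vec, label in zip(vectors, labels):
        by_class.setdefault(label, []).append(vec)

    n_features = len(vectors[0])
    scores = [0.0] * n_features
    # Sum each feature's divergence over every pair of classes
    for c1, c2 in combinations(sorted(by_class), 2):
        for f in range(n_features):
            a = [v[f] for v in by_class[c1]]
            b = [v[f] for v in by_class[c2]]
            scores[f] += divergence_1d(a, b)

    # Rank feature indexes from highest to lowest total divergence
    ranked = sorted(range(n_features), key=lambda f: scores[f], reverse=True)
    return ranked[:n_best]
```

With 20 principal components per sample, calling `select_best_features(vectors, labels, n_best=10)` would return a list of 10 indexes of the kind stored under model['best'].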

Classifier

[I used a k-nearest-neighbour classifier. As k increases, the accuracy on pages 1 and 2 decreases, but the accuracy on pages 3, 4, 5 and 6 increases significantly. The performance below is from a 5-nearest-neighbour classifier. I find k = 5 to be the point where the accuracy gain on pages 3, 4 and 5 is most significant without reducing the accuracy on pages 1 and 2 by too much. From k = 10 upwards, the accuracy on pages 3, 4 and 5 increases by only 1 or 2 percent.]
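A k-nearest-neighbour classifier of this kind can be sketched in a few lines. The version below is only an illustration under assumptions: it uses Euclidean distance (the distance measure is not stated above), and the function name `knn_classify` is hypothetical.

```python
import math
from collections import Counter


def knn_classify(train_vectors, train_labels, test_vector, k=5):
    """Label test_vector by majority vote among its k nearest training vectors."""
    # Sort (distance, label) pairs so the closest training samples come first
    distances = sorted(
        (math.dist(vec, test_vector), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    # Count the labels of the k closest samples; on a tied vote,
    # the label that was seen first (i.e. the nearer one) wins
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]
```

Larger values of k average over more neighbours, which tends to help on noisy pages at the cost of some accuracy on clean ones, in line with the trade-off described above.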

Error Correction

[I took the x values of the bounding boxes and stored them as tuples, because the distance between x2 of one character and x1 of the next is noticeably greater between words than within a word. I sliced the labels into words wherever x1 of the next tuple minus x2 of the current tuple is greater than 7. I then iterate through the resulting list of words, joining the letters of each word into a string. If the string is not in my stop list, I go through its letters in sequence: I convert a letter to its ASCII value, increase the value by 1, convert it back to a character, substitute it into the string, and check again whether the string is in my stop list. If it is, I replace the word in the word list with the new word; otherwise I keep iterating through the ASCII values of the letters until I get a word that is in the stop list. If no such word is found, I return the word as it is.]
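The two steps above (splitting labels into words by horizontal gap, then trying single-letter substitutions) can be sketched as below. This is a simplified illustration rather than the repository code. It assumes each bounding box is a tuple `(x1, y1, x2, y2)`, that labels are lowercase letters, and that line wraps need no special handling; the names `split_into_words` and `correct_word` are hypothetical.

```python
import string


def split_into_words(labels, boxes, gap=7):
    """Group character labels into words using the horizontal gap between boxes."""
    words, current = [], []
    for i, label in enumerate(labels):
        # A gap wider than `gap` between the previous x2 and this x1 starts a new word
        if current and boxes[i][0] - boxes[i - 1][2] > gap:
            words.append("".join(current))
            current = []
        current.append(label)
    if current:
        words.append("".join(current))
    return words


def correct_word(word, word_list):
    """Return the first single-letter substitution found in word_list, else the word."""
    if word in word_list:
        return word
    for pos, char in enumerate(word):
        if char not in string.ascii_lowercase:
            continue  # only lowercase letters are substituted in this sketch
        start = ord(char) - ord("a")
        # Step through the other 25 letters, starting with the next ASCII value
        for step in range(1, 26):
            new_char = chr(ord("a") + (start + step) % 26)
            candidate = word[:pos] + new_char + word[pos + 1:]
            if candidate in word_list:
                return candidate
    return word
```

Storing the word list as a Python `set` keeps each membership check fast, which matters because every unknown word triggers up to 25 lookups per letter.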

Performance

The percentages of correctly classified characters (to 1 decimal place) for the development data are as follows:

Page 1: [97.4%]
Page 2: [96.7%]
Page 3: [71.0%]
Page 4: [39.1%]
Page 5: [27.1%]
Page 6: [22.5%]

Other information (Optional, Max 100 words)

[Optional: I used the word list from http://www.mieliestronk.com/wordlist.html as my stop list and stored it in model['stop'] in the JSON file.]
