
Build classifier for typed vs handwritten text #108

Open
funkyvoong opened this issue Jun 6, 2023 · 6 comments
@funkyvoong
Contributor

No description provided.

@funkyvoong
Contributor Author

Herbarium of the Future Paper: https://www.cell.com/trends/ecology-evolution/fulltext/S0169-5347(22)00295-6
Detecting Handwritten and Printed Text from Doctor's Notes: https://www.proquest.com/docview/2505259735?pq-origsite=gscholar&fromopenview=true

@kabilanmohanraj
Collaborator

kabilanmohanraj commented Jun 27, 2023

Updates (27th June 2023):

Datasets in use:

  • Handwritten
  1. IAM HW
  2. IAM Online-HW
  3. IAM Washington
  • Machine Printed
  1. FUNSD
  2. SROIE2019 (typewriter-looking font)
  3. Synthetic data (generated with different fonts and styles)

Unused for now:

  1. COCO-Text (need to filter usable images strictly)
  2. CVL-HW (need to segment lines of text using CRAFT, working on this now)
  3. CVIT-HW (the text looks too organized, like printed text, and does not closely represent our data)
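Pulling these sources into one training set amounts to mapping each dataset's folder to a binary label. A minimal sketch, assuming a `<root>/<dataset>/*.png` layout (the folder names here are illustrative, not the repo's real structure):

```python
from pathlib import Path

# Dataset -> label map; 1 = handwritten, 0 = machine printed.
# Folder names are assumptions for illustration only.
LABELS = {"iam_hw": 1, "iam_online_hw": 1, "iam_washington": 1,
          "funsd": 0, "sroie2019": 0, "synthetic": 0}

def build_index(root):
    """Collect (image_path, label) pairs from <root>/<dataset>/*.png;
    datasets whose folders are absent are silently skipped."""
    return [(str(p), label)
            for name, label in LABELS.items()
            for p in sorted(Path(root, name).glob("*.png"))]
```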

Work done:

  1. Read through the literature on preprocessing steps for OCR. The aspect ratio of our images plays a major role, so I plotted its distribution and, based on that, wrote dataset-specific resizing and cropping transforms for each dataset we are currently using.
  2. Populated the test set with more images (working to increase the number of samples further).
  3. Testing the DenseNet model as an alternative to VGG16, because VGG16 overfits very quickly.
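The aspect-ratio handling in step 1 could be sketched as a resize-then-pad transform; the target size, padding value, and nearest-neighbour resampling here are illustrative choices, not the actual transforms in the repo:

```python
import numpy as np

def resize_keep_aspect(img, target_h=64, target_w=256, pad_value=255):
    """Resize a grayscale image to fit (target_h, target_w) while keeping
    its aspect ratio, then pad with white background on the right/bottom.
    Nearest-neighbour resampling keeps this sketch dependency-free."""
    h, w = img.shape
    scale = min(target_h / h, target_w / w)
    new_h, new_w = max(1, int(h * scale)), max(1, int(w * scale))
    # Index maps for nearest-neighbour sampling of the source image
    rows = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
    resized = img[rows[:, None], cols[None, :]]
    out = np.full((target_h, target_w), pad_value, dtype=img.dtype)
    out[:new_h, :new_w] = resized
    return out
```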

Model Performance:
Our pipeline accuracy currently tops out at 80%. I have identified the images the model misclassifies and am adding more suitable training images to counter this.

I think this is a good direction: the COCO-Text dataset was excluded this time (last week it was part of the training set), yet the model's performance barely dropped, which suggests the per-dataset preprocessing is effective.

Current Tasks:

  1. Add samples to test dataset
  2. Evaluate DenseNet model performance
  3. Preprocess CVL-HW dataset
  4. [Prof. Langdon] Add more samples with varied fonts to Typed text (Text+Font -> PDF -> Images)
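Task 4's Text+Font -> PDF -> Images chain can be approximated in a single step with Pillow for illustration; the font loading and canvas size below are assumptions, not the actual generation script:

```python
from PIL import Image, ImageDraw, ImageFont

def render_typed_sample(text, size=(256, 64), font=None):
    """Render a line of text on a white background, a one-step stand-in
    for the Text+Font -> PDF -> Images pipeline. Pass different TrueType
    fonts via `font` to vary the typed-text styles."""
    img = Image.new("L", size, color=255)
    draw = ImageDraw.Draw(img)
    font = font or ImageFont.load_default()
    draw.text((4, 4), text, fill=0, font=font)
    return img
```

Swapping in fonts loaded with `ImageFont.truetype(...)` would give the varied-font samples the task calls for.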

@kabilanmohanraj
Collaborator

kabilanmohanraj commented Jul 1, 2023

Updates (30th June 2023):

  1. Added synthetic font data with different font sizes and styles (LibreOffice UNO API -> ODT file -> PDF file -> JPG image -> CRAFT -> individual images). For sample data, please refer here [files] [images]. Adding the new data increased the accuracy score.
  2. Modifications to the preprocessing pipeline - added erosion (morphological operation)
  3. DenseNet121 model accuracy is over 88% (F1 score > 0.9). Unlocked more layers to fine-tune. Tuning the number of layers to unlock.
  4. Discussion with Freddie:
    a. Focus more on the data
    b. Metrics to classify plant sample images as handwritten or typed (looking into this)
    c. Pointers on DocAI models (Hugging Face, model distillation) (looking into this as well)
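The erosion step added in item 2 is a standard morphological operation (e.g. `cv2.erode` in OpenCV). A minimal dependency-free sketch of binary erosion with a square structuring element, kernel size illustrative:

```python
import numpy as np

def erode(binary, k=3):
    """Binary erosion with a k x k square structuring element: a pixel
    stays foreground (1) only if its entire k x k neighbourhood is
    foreground. Thins strokes and removes speckle noise after
    binarisation."""
    pad = k // 2
    padded = np.pad(binary, pad, mode="constant", constant_values=0)
    out = np.ones_like(binary)
    for dy in range(k):
        for dx in range(k):
            # AND together every shifted view of the neighbourhood
            out &= padded[dy:dy + binary.shape[0], dx:dx + binary.shape[1]]
    return out
```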

@kabilanmohanraj
Collaborator

Updates (3rd July 2023):

  1. Added more test images. Over 100 images were handpicked.
  2. Both DenseNet121 and a custom VGG16-style model (trained from scratch) reach about 88% accuracy, with DenseNet slightly lower.
  3. Tried out some Document AI models hosted on Hugging Face; out of the box they do not identify our labels well. Looking to label some samples to fine-tune such models.
  4. Working on the post-processing pipeline to classify plant samples based on the average confidence scores for each segmentation classification. Scripting is almost done.
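The post-processing step in item 4 reduces per-segment classifier outputs to one label per specimen image. A minimal sketch, assuming each segment yields a P(handwritten) score and a simple mean-plus-threshold rule (the threshold is an assumption):

```python
def classify_sample(segment_scores, threshold=0.5):
    """Classify a whole plant-specimen image from per-segment classifier
    outputs. `segment_scores` is a list of P(handwritten) values, one per
    detected text segment; the sample label is decided by the mean
    confidence across segments."""
    mean_conf = sum(segment_scores) / len(segment_scores)
    label = "handwritten" if mean_conf >= threshold else "typed"
    return label, mean_conf
```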

@kabilanmohanraj
Collaborator

kabilanmohanraj commented Jul 25, 2023

Updates (24th July 2023):

  1. Classification task:
    1.1 Implemented a Transformer classifier exhibiting a classification accuracy of about 96%.
    1.2 Attempted to implement an AdaBoost-type training with a simple neural network. The accuracy was only about 64%.
    1.3 Updated the post-processing (plant sample classification) step with the Transformer-based pipeline from 1.1. The classified images are in their respective folders.
    1.4 Did background reading on Transformers, the attention mechanism, multi-head attention, TrOCR, and the Hugging Face implementation of the TrOCR pipeline (17th-24th July).
    1.5 Commented the code.
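The core of the Transformer classifier in 1.1 is the attention mechanism from the readings in 1.4. A minimal NumPy sketch of single-head scaled dot-product self-attention with a mean-pooled linear head; the weights, dimensions, and pooling choice are illustrative, not the actual model:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence of
    embeddings X (seq_len x d): softmax(QK^T / sqrt(d)) V."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return scores @ V

def classify(X, Wq, Wk, Wv, w_out):
    """Mean-pool the attended sequence and apply a linear head; the sign
    of the logit picks handwritten (1) vs typed (0)."""
    pooled = self_attention(X, Wq, Wk, Wv).mean(axis=0)
    return int(pooled @ w_out > 0)
```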
