HME_Training

Training code for handwritten mathematical expression recognition using a bidirectionally trained transformer, plus a canvas GUI to write math, run inference, and copy the resulting LaTeX to the clipboard. See a quick demo here: https://youtu.be/-FE8Shs6JJw?feature=shared.

Background

This project evolved from my SPIS final project from a few months earlier, which was my first real coding experience apart from AP CSP. An ongoing attempt to deploy it as a web app can be found here.

Color Channel Embedding

Just as in text-to-text translation or image-captioning tasks, an encoder-decoder architecture is used here. The encoder is a pretrained DenseNet-121 CNN and the decoder is a standard implementation of the one presented in Attention Is All You Need. Because the pretrained DenseNet expects all three color channels as input, it is a bit of a waste to process only black-on-white images. For that reason, it becomes useful to embed time and distance information into the color channels.

This is done with online InkML-format data (online as opposed to offline image data), which records the coordinates and timestamps of points sampled along pen strokes. For each line segment, the red, green, and blue channels are computed as timestamp / duration of the entire stroke, x-displacement * scaling_x / (max x - min x), and y-displacement * scaling_y / (max y - min y), respectively. The displacement values were initially consistently less than 1/255 ≈ 0.004, so the scaling factors were added to ensure they survive quantization. Below are some examples of images modified in this way (and also with an applied erode/dilate transform).

(Figure: examples of strokes rendered with the time/distance color embedding, after an erode/dilate transform)
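As a rough illustration, here is a minimal sketch of that embedding, assuming strokes have already been parsed out of InkML into lists of (x, y, t) points in image coordinates. The function name, the PIL-based rendering, and the default scale factors are illustrative rather than this repo's actual code:

```python
# Minimal sketch of the color embedding described above. Strokes are
# assumed to be pre-parsed from InkML into lists of (x, y, t) points in
# image coordinates. Names and defaults are illustrative, not the repo's API.
from PIL import Image, ImageDraw

def render_colored_strokes(strokes, size=(256, 256), scale_x=20.0, scale_y=20.0):
    """Draw strokes so each segment's RGB encodes time and displacement."""
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)

    # Bounding box of all ink, used to normalize displacements
    xs = [p[0] for s in strokes for p in s]
    ys = [p[1] for s in strokes for p in s]
    x_range = (max(xs) - min(xs)) or 1.0
    y_range = (max(ys) - min(ys)) or 1.0

    for stroke in strokes:
        t0 = stroke[0][2]
        duration = (stroke[-1][2] - t0) or 1.0
        for (x0, y0, _), (x1, y1, t1) in zip(stroke, stroke[1:]):
            # Red: fraction of the stroke's total duration elapsed so far
            r = (t1 - t0) / duration
            # Green/blue: scaled x/y displacement over the ink's extent;
            # the scale factors lift values above the 1/255 quantization floor
            g = abs(x1 - x0) * scale_x / x_range
            b = abs(y1 - y0) * scale_y / y_range
            color = tuple(int(255 * min(c, 1.0)) for c in (r, g, b))
            draw.line([(x0, y0), (x1, y1)], fill=color, width=2)
    return img
```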

Bidirectionality

A single decoder is trained on both left-to-right and right-to-left sequences. During inference, beam search is run in each direction; each resulting beam is then compared against every beam from the reverse direction, and its probability is adjusted accordingly.
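A hedged sketch of that combination step is below, assuming each beam is a (tokens, log_prob) pair and using a simple interpolation rule when a hypothesis from one direction matches the reversal of one from the other; the exact adjustment used in this repo and in Zhao et al. may differ.

```python
# Hedged sketch of combining beams from both decoding directions. Each
# beam is assumed to be a (token_ids, log_prob) pair, with right-to-left
# beams stored in reversed token order. The interpolation rule here is
# illustrative; the exact adjustment in the repo/paper may differ.
def rerank_bidirectional(l2r_beams, r2l_beams, weight=0.5):
    def adjusted(beams, opposite):
        scored = []
        for tokens, lp in beams:
            # Log-probs of opposite-direction beams whose reversal matches
            matches = [olp for otoks, olp in opposite
                       if list(reversed(otoks)) == list(tokens)]
            # Interpolate with the best-matching reverse score, if any
            score = (1 - weight) * lp + weight * max(matches) if matches else lp
            scored.append((tokens, score))
        return scored

    candidates = adjusted(l2r_beams, r2l_beams)
    # Flip R2L hypotheses so every candidate reads left-to-right
    candidates += [(list(reversed(t)), s) for t, s in adjusted(r2l_beams, l2r_beams)]
    return max(candidates, key=lambda c: c[1])
```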

LaTeX, and math itself, is inherently somewhat palindromic (\begin{matrix} is always followed by \end{matrix}, every opening bracket or parenthesis is always matched by a closing one, binary operators are always sandwiched between operands, etc.), much more so than English. That structure is reflected in this simpler approach, as compared to the more complex and deeper bidirectionality within some language models such as BERT.

References

  • The model architecture and bidirectionality are from this paper by Zhao et al. (2021)
  • The idea to embed time and distance information into color channels came from this paper by Fadeeva et al. (2024)
  • Training samples come from the MathWriting dataset from Google Research
  • Timothy Leung's blog post was a great help in getting started with the PyTorch code
