The official PyTorch implementation of LevOCR (ECCV 2022).
LevOCR performs both text sequence generation and text sequence refinement using the cross-modal fusion features produced by a Vision-Language Transformer (VLT). The refinement process is accomplished via two basic character-level operations, Deletion and Insertion, which are learned with imitation learning and allow for parallel decoding, dynamic length change and good interpretability. LevOCR exhibits good interpretability and transparency in the inference phase, which could be crucial for diagnosing and improving text recognition models in the future.
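The refinement can be pictured as a few alternating deletion and insertion passes over the character sequence. The sketch below only illustrates this control flow; `predict_deletions` and `predict_insertions` are hypothetical helpers, not part of this repo's API:

```python
# Illustration of Levenshtein-style refinement; the helpers are hypothetical,
# not the actual LevOCR modules.
def refine(tokens, fusion_feature, predict_deletions, predict_insertions, max_iter=2):
    """Refine a character sequence with alternating delete / insert passes."""
    for _ in range(max_iter):
        # Deletion pass: keep only characters predicted to be correct.
        keep = predict_deletions(tokens, fusion_feature)      # one bool per character
        tokens = [t for t, k in zip(tokens, keep) if k]

        # Insertion pass: place new characters where gaps are predicted,
        # so the sequence length can change between iterations.
        new_tokens = predict_insertions(tokens, fusion_feature)

        if new_tokens == tokens:                              # converged: stop early
            break
        tokens = new_tokens
    return tokens
```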
- PyTorch version >= 1.8.0
- Python version >= 3.6
pip3 install -r requirements.txt
- For training new models, you need to install fairseq (parts of fairseq are used during training):
git clone https://github.com/pytorch/fairseq
cd fairseq
git checkout 0.12.2-release
pip install --editable ./
python setup.py build_ext --inplace
Download the LMDB datasets from Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition (ABINet).
- Training datasets
  - MJSynth (MJ):
    - Use `tools/create_lmdb_dataset.py` to convert images into an LMDB dataset
    - LMDB dataset BaiduNetdisk (passwd: n23k)
  - SynthText (ST):
    - Use `tools/crop_by_word_bb.py` to crop images from the original SynthText dataset, and convert the cropped images into an LMDB dataset by `tools/create_lmdb_dataset.py`
    - LMDB dataset BaiduNetdisk (passwd: n23k)
  - Train_language:
    - This text dataset is only used for the pre-training of the language model.
    - It contains words from WikiText103, MJSynth and SynthText.
- Evaluation datasets: LMDB datasets can be downloaded from BaiduNetdisk (passwd: 1dbv) or GoogleDrive.
  - ICDAR 2013 (IC13)
  - ICDAR 2015 (IC15)
  - IIIT5K Words (IIIT)
  - Street View Text (SVT)
  - Street View Text-Perspective (SVTP)
  - CUTE80 (CUTE)
The structure of the `data` folder is as below.
data
├── evaluation
│ ├── CUTE80
│ ├── IC13_857
│ ├── IC15_1811
│ ├── IIIT5k_3000
│ ├── SVT
│ └── SVTP
├── training
│ ├── MJ
│ │ ├── MJ_test
│ │ ├── MJ_train
│ │ └── MJ_valid
│ ├── ST
│ └── train_language.txt
At this time, both the training and evaluation datasets are LMDB datasets.
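For reference, the sketch below shows how such LMDB splits are typically written and sanity-checked. It assumes the `num-samples` / `image-%09d` / `label-%09d` key convention used by the CLOVA AI Deep Text Recognition Benchmark tooling that `tools/create_lmdb_dataset.py` appears to follow; check that script for the authoritative format.

```python
import lmdb

# Assumed key convention (num-samples, image-%09d, label-%09d); verify against
# tools/create_lmdb_dataset.py before relying on it.
def write_lmdb(lmdb_path, samples):
    """samples: iterable of (image_bytes, label_string) pairs."""
    env = lmdb.open(lmdb_path, map_size=1 << 30)   # 1 GB map size for this toy example
    n = 0
    with env.begin(write=True) as txn:
        for n, (image_bytes, label) in enumerate(samples, start=1):
            txn.put(f"image-{n:09d}".encode(), image_bytes)
            txn.put(f"label-{n:09d}".encode(), label.encode())
        txn.put(b"num-samples", str(n).encode())
    env.close()

def count_samples(lmdb_path):
    """Quick check that a split (e.g. data/training/MJ/MJ_train) is readable."""
    env = lmdb.open(lmdb_path, readonly=True, lock=False)
    with env.begin() as txn:
        n = int(txn.get(b"num-samples"))
    env.close()
    return n
```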
Available model weights:
Language | Vision | LevOCR |
---|---|---|
Pretrain-language-model | Pretrain-vision-model | LevOCR-model |
The performance of the reproduced pretrained models is summarized as follows:
Model | Iteration | IC13 | SVT | IIIT | IC15 | SVTP | CUTE | AVG |
---|---|---|---|---|---|---|---|---|
LevOCR-VP | - | 95.8 | 92.4 | 95.4 | 84.5 | 84.6 | 88.8 | 91.2 |
LevOCR | #1 | 96.7 | 94.2 | 96.5 | 86.1 | 88.6 | 90.6 | 92.8 |
 | #2 | 96.7 | 94.4 | 96.6 | 86.5 | 88.8 | 90.6 | 92.9 |
 | #3 | 96.7 | 94.4 | 96.6 | 86.5 | 88.8 | 90.6 | 92.9 |
- Download the pretrained model
- Add image files to test into `demo_imgs/`
- Run `demo_imgs.py`:
python3 demo_imgs.py --imgH 32 --imgW 128 --max_iter 2 --batch_size 16 --model_dir <path_to/model.pth> --rgb --th 0.5 --demo_imgs demo_imgs
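Judging by the flags above, demo images are read as RGB and resized to 32x128 before inference. A rough preprocessing sketch follows, assuming a standard resize-and-normalize transform and a hypothetical `demo_imgs/example.jpg`; the repo's dataset and demo code define the exact transform:

```python
from PIL import Image
from torchvision import transforms

# Assumed preprocessing to match --rgb --imgH 32 --imgW 128; the repo's own
# dataset/demo code is the authoritative source for these values.
preprocess = transforms.Compose([
    transforms.Resize((32, 128)),                         # (height, width)
    transforms.ToTensor(),                                # scale to [0, 1], shape (3, 32, 128)
    transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),  # roughly [-1, 1]
])

image = Image.open("demo_imgs/example.jpg").convert("RGB")   # hypothetical file name
batch = preprocess(image).unsqueeze(0)                       # shape (1, 3, 32, 128)
```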
- Pre-train language model
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 --master_port 29501 train_language_dist.py --train_data data/training/train_language.txt \
--valInterval 5000 --lr 0.3 --saved_path <path/to/save/dir> --exp_name levocr_pretrain_language --batch_size 512 --num_iter 2400000
- Train LevOCR
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 --master_port 29501 train_final_dist.py --train_data data/training \
--valid_data data/evaluation --select_data MJ-ST --batch_ratio 0.5-0.5 --valInterval 5000 --lr 0.3 --rgb \
--saved_path <path/to/save/dir> --exp_name levocr_32_128 --batch_size 32 --manualSeed 21223 --seed 223 --num_iter 2400000 \
--vis_model <path/to/pretrain-vision-model.pth> --levt_model <path/to/pretrain-language-model.pth>
Find the path to the `best_accuracy.pth` checkpoint file (usually in the `saved_path` folder).
python3 eval.py --eval_data data/evaluation --data_filtering_off --fast_acc --imgH 32 --imgW 128 --batch_size 128 --rgb --th 0.5 --max_iter 2 --model_dir <path_to/best_accuracy.pth>
The detailed iterative process of LevOCR with different initial sequences on 6 public benchmarks.
This implementation is based on the following repositories: fairseq, CLOVA AI Deep Text Recognition Benchmark, and ABINet.
If you find this work useful, please cite:
@inproceedings{ECCV2022LevOCR,
title={Levenshtein OCR},
author={Cheng Da and Peng Wang and Cong Yao},
booktitle = {ECCV},
year={2022}
}
LevOCR is released under the terms of the Apache License, Version 2.0.
LevOCR is an algorithm for scene text recognition; the code and models herein, created by the authors from Alibaba, can only be used for research purposes.
Copyright (C) 1999-2022 Alibaba Group Holding Ltd.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.