Given a product name in text, classify its category using a transformer-based categorization model.
- The dataset contains more than 1.2 million product names with their labels.
- There are 3,827 category labels.
An overview of the model structure is given below.
This model uses BPE (Byte Pair Encoding) tokenization via sentencepiece, an open-source (unofficial) package from Google. BPE splits text into subword tokens based on their frequency in the corpus and builds a vocabulary of a specified size (vocab_size); this project uses a vocab_size of 40,000.
For a more detailed explanation of BPE and WPM (Word Piece Model), see this post (in Korean): https://lovit.github.io/nlp/2018/04/02/wpm/
This repo has 4 main parts used to build the categorization model:
1) Preprocessor
2) SPM trainer
3) Deep learning model structure (transformer using Keras)
4) Prediction
textPreprocessor.py
This module preprocesses text data into the same clean form as the data used in the training session. It includes removing punctuation, separating words written in Korean and English, etc.
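A hedged sketch of the kind of cleanup this step performs; the function name and exact rules here are illustrative assumptions, not the module's actual implementation:

```python
import re

def preprocess_text(text: str) -> str:
    """Illustrative cleanup; the actual rules live in textPreprocessor.py."""
    # Replace punctuation/special characters with spaces,
    # keeping Korean (Hangul), Latin letters, and digits.
    text = re.sub(r"[^0-9A-Za-z가-힣\s]", " ", text)
    # Insert a space at Korean/English boundaries, e.g. "삼성Galaxy" -> "삼성 Galaxy".
    text = re.sub(r"([가-힣])([A-Za-z])", r"\1 \2", text)
    text = re.sub(r"([A-Za-z])([가-힣])", r"\1 \2", text)
    # Collapse repeated whitespace.
    return re.sub(r"\s+", " ", text).strip()

print(preprocess_text("[특가]삼성Galaxy북!!"))  # -> "특가 삼성 Galaxy 북"
```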
2_spm_training.ipynb
The SPM trainer contains the code used to train the BPE model. The hyperparameters used are listed below:
```python
templates = ("--input={} "
             "--pad_id={} "
             "--vocab_size={} "
             "--model_prefix={} "
             "--bos_id={} "
             "--eos_id={} "
             "--unk_id={} "
             "--character_coverage={} "
             "--model_type={}")

train_input_file = "sentencepiece_trsfm.txt"
pad_id = 0
vocab_size = "40000"
prefix = "sentencepiece"
bos_id = 1
eos_id = 2
unk_id = 3
character_coverage = "0.9995"
model_type = "bpe"
```
textClassfierModel.py
The model structure is defined in this module.
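As a rough orientation, a minimal sketch of a transformer encoder block followed by a CNN layer in Keras; all layer sizes and hyperparameters here are illustrative assumptions, not the repo's actual values:

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE, MAX_LEN, NUM_CLASSES = 40000, 32, 3827   # vocab/label counts from above
EMBED_DIM, NUM_HEADS, FF_DIM = 128, 4, 256           # illustrative sizes

inputs = layers.Input(shape=(MAX_LEN,), dtype="int32")
x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)
# (Positional embeddings omitted for brevity; a real transformer would add them.)

# Transformer encoder block: self-attention + feed-forward,
# each followed by a residual connection and layer normalization.
attn = layers.MultiHeadAttention(num_heads=NUM_HEADS, key_dim=EMBED_DIM)(x, x)
x = layers.LayerNormalization()(x + attn)
ff = layers.Dense(FF_DIM, activation="relu")(x)
ff = layers.Dense(EMBED_DIM)(ff)
x = layers.LayerNormalization()(x + ff)

# CNN layer on top of the transformer outputs, as noted in the results below.
x = layers.Conv1D(128, kernel_size=3, activation="relu")(x)
x = layers.GlobalMaxPooling1D()(x)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```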
1_executePrediction.ipynb
Preprocessing and SPM encoding are executed in this part. The trained neural network weights are loaded and used to make predictions on the tokens of the given texts.
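A hedged sketch of that prediction flow; the model file name, the max length, and the reuse of the illustrative preprocess_text from above are assumptions, not the notebook's actual code:

```python
import numpy as np
import sentencepiece as spm
import tensorflow as tf

sp = spm.SentencePieceProcessor()
sp.Load("sentencepiece.model")                           # trained BPE model
model = tf.keras.models.load_model("category_model.h5")  # hypothetical file name

def predict_category(text: str, max_len: int = 32) -> int:
    ids = sp.EncodeAsIds(preprocess_text(text))          # clean, then SPM-encode
    ids = (ids + [0] * max_len)[:max_len]                # pad/truncate with pad_id=0
    probs = model.predict(np.array([ids]), verbose=0)[0]
    return int(np.argmax(probs))                         # predicted label index
```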
The transformer-based model outperforms the baseline model, which uses a bidirectional LSTM and a convolutional network. Note that a CNN layer is applied to the transformer-based model as well.
Method | Accuracy (%) | Micro F1 (%) | Macro F1 (%)
---|---|---|---
Transformer-based model | 92.14 | 78.81 | 82.21
BiLSTM+CNN | 89.47 | 76.42 | 77.91
- Micro F1 score: calculate metrics globally by counting the total true positives, false negatives, and false positives.
- Macro F1 score: calculate metrics for each label, then take their unweighted mean. This does not take label imbalance into account.
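A small example of the micro vs. macro distinction, using scikit-learn's f1_score:

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 1, 1, 1]

print(f1_score(y_true, y_pred, average="micro"))  # ~0.667: global TP/FP/FN counts
print(f1_score(y_true, y_pred, average="macro"))  # ~0.452: unweighted mean over labels
```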