TODO: [] fine-tuning to find the best parameters [] tensorboardX [] Prediction GUI
===================================================================== Jul 10, 2023
- With recent versions of Pandas, `read_csv` should be called with the parameter `keep_default_na=False` to prevent reading `None` as `NaN`, since 'None' is an ordinary English word.
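A minimal sketch of the difference (the tiny in-memory TSV and its column names are made up for illustration):

```python
import io
import pandas as pd

# A tiny TSV where the literal word "None" appears as a token.
tsv = "token\tlabel\nNone\tother\nhola\tes\n"

# Default behavior: pandas treats the string "None" as a missing value.
df_default = pd.read_csv(io.StringIO(tsv), sep="\t")

# keep_default_na=False preserves "None" as an ordinary string.
df_fixed = pd.read_csv(io.StringIO(tsv), sep="\t", keep_default_na=False)

print(df_default["token"][0])  # NaN
print(df_fixed["token"][0])    # "None"
```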
===================================================================== Oct 12, 2022
- Made the output reproducible by fixing the seeds for torch and numpy
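A sketch of the seeding, assuming the usual torch/numpy knobs (the helper name and seed value are arbitrary):

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    """Fix all RNG seeds so runs are reproducible (a sketch, not the repo's code)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    # Make cuDNN deterministic as well when running on GPU.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(0)
a = torch.rand(3)
set_seed(0)
b = torch.rand(3)
assert torch.equal(a, b)  # same seed, same numbers
```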
===================================================================== July 28, 2022
- Adding labels for the Confusion Matrix axes
- Revision of the confusion matrix's labels
===================================================================== July 26, 2022
- `ignore_index=0` for `CrossEntropyLoss` to ignore the padding index. This option specifies a target value that is ignored and does not contribute to the input gradient, which lowers computation cost and gives a more accurate `loss` and `f1-score`; without `ignore_index`, all the padding positions are also included in the criterion.
- Revision of the `flatten` function to apply the sentences' lengths, so the loss is computed more accurately.
- With these two modifications, the issue with padding seems solved.
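A small sketch of what `ignore_index` buys: the padded position drops out of the mean, so the loss equals the loss over the real tokens only (the logits and the pad-at-index-0 convention are illustrative assumptions):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Flattened logits for 4 token positions over 4 classes
# (class 0 reserved for padding, matching padding_idx=0 -- an assumption).
logits = torch.randn(4, 4)
targets = torch.tensor([2, 1, 3, 0])  # the last position is a padding token

loss_all = nn.CrossEntropyLoss()(logits, targets)
loss_no_pad = nn.CrossEntropyLoss(ignore_index=0)(logits, targets)

# ignore_index=0 averages over only the 3 real tokens; the padded
# position contributes neither to the loss nor to the gradient.
manual = nn.CrossEntropyLoss()(logits[:3], targets[:3])
assert torch.isclose(loss_no_pad, manual)
```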
- Now, the prediction of the first token is almost always correct.
- `target_size` is now 3 instead of 4 (the padding token is no longer counted as a class); accordingly, the loss function needs to be fed `ignore_index=target_size`.
- Classification Report
- Confusion Matrix
===================================================================== July 25, 2022
- The `list` in `list(zip(*batch))` of the `collate_fn` function was not necessary and only increased the running time.
- `.to(device)` inside the `collate_fn` to get rid of device migration inside the training phase.
- Some retouches in char2vec
===================================================================== July 20, 2022
- Solved an issue with prediction. Using `set` to remove duplicate characters resulted in a different ordering on each run. To get rid of this issue, the `sorted` function guarantees a unique order. Another solution to this problem would be to save the `chr2id` dictionary with the model and reload it during prediction.
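A sketch of why `sorted` stabilizes the vocabulary: set iteration order depends on hashing and can differ between runs, while sorting is deterministic (the corpus and the pad-at-0 convention are illustrative):

```python
# Build a character vocabulary with a run-independent ordering;
# id 0 is kept free for padding, matching padding_idx=0.
corpus = ["hola", "hello", "audio"]
chars = set("".join(corpus))                # order not guaranteed across runs
chr2id = {c: i + 1 for i, c in enumerate(sorted(chars))}  # order guaranteed
```

Rebuilding the dictionary from the same corpus, even in a different order, now always yields identical ids.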
===================================================================== July 19, 2022
- `padding_idx=0` for the `nn.Embedding` layer.
- GPU support; run and tested on Google Colab
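For reference, `padding_idx` pins the pad embedding to the zero vector and keeps its gradient frozen; the sizes below mirror the `Embedding(298, 9)` in the printed architecture:

```python
import torch
import torch.nn as nn

# padding_idx=0 initializes row 0 to zeros and excludes it from updates.
emb = nn.Embedding(num_embeddings=298, embedding_dim=9, padding_idx=0)
pad_vec = emb(torch.tensor([0]))
assert torch.all(pad_vec == 0)  # the pad token embeds to the zero vector
```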
- Increasing dropout, out_ch1, out_ch2 to 0.3, 37, 35, respectively, doesn't help much (`f1-score` of 0.926 compared to 0.924!), so I reverted them to the smaller sizes.
===================================================================== July 18, 2022
- `predict.py` for ordinary use:
python predict.py --text [sample text] --model [pretrained model]
- Since I've used the F1 score with micro averaging, I should mention that micro-F1 = micro-precision = micro-recall = accuracy, i.e. in all previous cases I've actually reported `accuracy`, not `f1-score`. From now on, I use the `macro` average for `f1-score`, so the results will be more realistic.
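A quick check of that identity with scikit-learn (the label arrays are made-up toy data): in single-label multiclass classification, micro-F1 collapses to accuracy, while macro-F1 averages per-class F1 and so exposes weak minority classes:

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 2, 0]

# micro-F1 equals accuracy for single-label multiclass predictions...
assert f1_score(y_true, y_pred, average="micro") == accuracy_score(y_true, y_pred)

# ...while macro-F1 is the unweighted mean of per-class F1 scores.
print(f1_score(y_true, y_pred, average="macro"))  # lower than accuracy here
```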
- Saving the best model for prediction
- multiple-width filter bank in the second layer of the Char2Vec --> better result and less overfitting.
BiLSTMtagger(
(word_embeddings): Char2Vec(
(embeds): Embedding(298, 9)
(conv1): Sequential(
(0): Conv1d(9, 12, kernel_size=(3,), stride=(1,))
(1): ReLU()
(2): Dropout(p=0.25, inplace=False)
)
(convs2): ModuleList(
(0): Sequential(
(0): Conv1d(12, 5, kernel_size=(3,), stride=(1,))
(1): ReLU()
)
(1): Sequential(
(0): Conv1d(12, 5, kernel_size=(4,), stride=(1,))
(1): ReLU()
)
(2): Sequential(
(0): Conv1d(12, 5, kernel_size=(5,), stride=(1,))
(1): ReLU()
)
)
(linear): Sequential(
(0): Linear(in_features=15, out_features=15, bias=True)
(1): ReLU()
)
)
(lstm): LSTM(15, 128, num_layers=2, batch_first=True, dropout=0.25, bidirectional=True)
(hidden2tag): Linear(in_features=256, out_features=4, bias=True)
)
===================================================================== July 17, 2022
- F1 score with `weighted` average instead of `micro`.
- Char2Vec class
- Collapsing character repetitions longer than 4 in a token and truncating any word to at most 20 characters ==> a slightly better result
- Char2Vec+BiLSTM finished, with f1=0.9549, val_f1=0.9443; another slight improvement in the model
BiLSTMtagger(
(word_embeddings): Char2Vec(
(embeds): Embedding(298, 9)
(conv1): Sequential(
(0): Conv1d(9, 12, kernel_size=(3,), stride=(1,))
(1): ReLU()
(2): Dropout(p=0.1, inplace=False)
)
(conv2): Sequential(
(0): Conv1d(12, 15, kernel_size=(3,), stride=(1,))
(1): ReLU()
)
(linear): Sequential(
(0): Linear(in_features=15, out_features=15, bias=True)
(1): ReLU()
)
)
(lstm): LSTM(15, 128, num_layers=2, batch_first=True, dropout=0.25, bidirectional=True)
(hidden2tag): Linear(in_features=256, out_features=4, bias=True)
)
===================================================================== July 16, 2022
- Decipher the text/label from the output of the network
- Tokens should be considered in context, not as a collection of single tokens: in the following tweet (Spanish for "yes, I promise to make an audio :)"), `audio` is a Spanish token, not an English one.
  @andres_romero17 si , prometo hacer un audio :)
  other es other es es es es other
- loss/f1_score plot
- data analysis around tweets and their tokens/chars
- code sanitization
===================================================================== July 15, 2022
- Printing the loss for the train/val set on the screen
- Computation of `f1_score` for both the training and validation sets shows the network's convergence
- SGD, lr=0.1, hidden_dim=64
Epoch 1/40, loss=0.9072, val_loss=0.8901 ,train_f1=0.5998, val_f1=0.5462
Epoch 2/40, loss=0.6987, val_loss=0.7863 ,train_f1=0.7165, val_f1=0.6602
Epoch 3/40, loss=0.5788, val_loss=0.7573 ,train_f1=0.7714, val_f1=0.7342
Epoch 4/40, loss=0.4912, val_loss=0.7454 ,train_f1=0.8088, val_f1=0.7589
Epoch 5/40, loss=0.4221, val_loss=0.7322 ,train_f1=0.8367, val_f1=0.7747
Epoch 10/40, loss=0.2226, val_loss=0.6976 ,train_f1=0.9123, val_f1=0.7897
Epoch 15/40, loss=0.1427, val_loss=0.7406 ,train_f1=0.9431, val_f1=0.8072
Epoch 20/40, loss=0.1083, val_loss=0.6276 ,train_f1=0.9577, val_f1=0.8133
Epoch 25/40, loss=0.0925, val_loss=0.6425 ,train_f1=0.9648, val_f1=0.8163
Epoch 30/40, loss=0.0842, val_loss=0.6611 ,train_f1=0.9683, val_f1=0.8171
Epoch 35/40, loss=0.0792, val_loss=0.6735 ,train_f1=0.9701, val_f1=0.8178
Epoch 40/40, loss=0.0763, val_loss=0.6753 ,train_f1=0.9711, val_f1=0.8180
- Adam+ReduceLROnPlateau, lr=1e-3, wd=1e-5, hidden_dim=128
Epoch 1/7, loss=0.5991, val_loss=0.5572 ,train_f1=0.7311, val_f1=0.7483
Epoch 2/7, loss=0.2947, val_loss=0.4787 ,train_f1=0.9005, val_f1=0.8266
Epoch 3/7, loss=0.1783, val_loss=0.4336 ,train_f1=0.9485, val_f1=0.8379
Epoch 4/7, loss=0.1256, val_loss=0.4124 ,train_f1=0.9653, val_f1=0.8494
Epoch 5/7, loss=0.1049, val_loss=0.3998 ,train_f1=0.9698, val_f1=0.8512
Epoch 6/7, loss=0.0977, val_loss=0.3884 ,train_f1=0.9714, val_f1=0.8512
Epoch 7/7, loss=0.0940, val_loss=0.3817 ,train_f1=0.9725, val_f1=0.8529
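The optimizer/scheduler wiring for the run above can be sketched as follows; the `nn.Linear` stand-in replaces the real BiLSTMtagger, and the placeholder validation loss is made up:

```python
import torch
import torch.nn as nn

# Stand-in model; the log's run used the BiLSTMtagger instead.
model = nn.Linear(15, 4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min")

for epoch in range(7):
    val_loss = 1.0 / (epoch + 1)  # placeholder for the real validation loss
    scheduler.step(val_loss)      # LR is reduced only when val_loss plateaus
```

Because the (fake) validation loss keeps improving, the learning rate stays at 1e-3; the scheduler only cuts it after `patience` epochs without improvement.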
- Minibatches made a great leap: train_f1=0.97, val_f1=0.94
Epoch 1/40, loss=0.5998, val_loss=0.4027 ,train_f1=0.7502, val_f1=0.7768
Epoch 2/40, loss=0.3764, val_loss=0.3790 ,train_f1=0.8179, val_f1=0.7971
Epoch 3/40, loss=0.3242, val_loss=0.3561 ,train_f1=0.8501, val_f1=0.8307
Epoch 4/40, loss=0.2618, val_loss=0.2922 ,train_f1=0.8861, val_f1=0.8741
Epoch 5/40, loss=0.2209, val_loss=0.2553 ,train_f1=0.9065, val_f1=0.8931
Epoch 10/40, loss=0.1291, val_loss=0.1723 ,train_f1=0.9460, val_f1=0.9291
Epoch 15/40, loss=0.0892, val_loss=0.1429 ,train_f1=0.9616, val_f1=0.9419
Epoch 20/40, loss=0.0665, val_loss=0.1471 ,train_f1=0.9675, val_f1=0.9409
Epoch 25/40, loss=0.0510, val_loss=0.1481 ,train_f1=0.9715, val_f1=0.9397
Epoch 30/40, loss=0.0420, val_loss=0.1676 ,train_f1=0.9742, val_f1=0.9397
Epoch 35/40, loss=0.0359, val_loss=0.1756 ,train_f1=0.9755, val_f1=0.9386
Epoch 40/40, loss=0.0323, val_loss=0.1934 ,train_f1=0.9765, val_f1=0.9403
- BiLSTM with 2 layers and dropout prevents overfitting to some extent
===================================================================== July 14, 2022
- Data Class improvement
  - several dictionaries to convert token/label/char to id and vice versa
  - making the coded sentences and their counterpart labels
- LSTM class
- `CodeSwitchDataset` as well as a customized `DataLoader`
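A minimal sketch of such a dataset plus a customized collate function; the vocabularies and sample sentences are toy assumptions, and only the names (`CodeSwitchDataset`, `collate_fn`) mirror the log:

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset

# Toy vocabularies; id 0 is reserved for padding in both.
tok2id = {"<pad>": 0, "si": 1, "prometo": 2, "audio": 3}
tag2id = {"<pad>": 0, "es": 1, "other": 2}

class CodeSwitchDataset(Dataset):
    def __init__(self, sents, tags):
        self.sents = [torch.tensor([tok2id[t] for t in s]) for s in sents]
        self.tags = [torch.tensor([tag2id[t] for t in s]) for s in tags]
    def __len__(self):
        return len(self.sents)
    def __getitem__(self, i):
        return self.sents[i], self.tags[i]

def collate_fn(batch):
    sents, tags = zip(*batch)
    # Pad variable-length sentences with id 0 (the padding index).
    return (pad_sequence(sents, batch_first=True, padding_value=0),
            pad_sequence(tags, batch_first=True, padding_value=0))

ds = CodeSwitchDataset([["si", "prometo"], ["audio"]], [["es", "es"], ["es"]])
loader = DataLoader(ds, batch_size=2, collate_fn=collate_fn)
x, y = next(iter(loader))  # x, y: (2, 2) tensors, shorter row padded with 0
```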
===================================================================== July 13, 2022
- github repo initialization
- reading the paper
- starting the code with Data Class
- an issue with quoting in reading `tsv` files