Sub-tokenization with certain transformers #150
Comments
Hello! As far as I know, RoBERTa and its BPE tokenizer work well in my tests with transformers 4.25.1, but apparently not anymore with version 4.15.0 (although it used to work with this version at some point in the past :). I changed the version in 389eb3d, but only in setup.py... I forgot requirements.txt, sorry. 4.25.1 changed the behavior of the BPE tokenization, in a good way I think. I try to explain below how it works and how I added support for RoBERTa-style tokenizers.

We start with a pre-tokenized input. According to the Tokenizers library doc: "If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences)." The difference with a "traditional" BERT tokenizer is that in this pre-tokenized case the BPE tokenizer has no leading spaces to perform a proper "tokenization" - note that, to be clear, it is really a sub-tokenization here (we tokenize the tokens...). So the Tokenizers developers introduced a trick: a leading space is added to every token when "sub-tokenizing". To open a parenthesis here: this creates a problem for BPE with the first token, which has no space before it. The trick adds a leading space before the tokens so that the sub-tokenization is similar to the one obtained from a complete sentence string including spaces. This trick remains, I think, an approximation in general, because we can of course have tokens without a space prefix, depending on the pre-tokenization pipeline.
I don't know why they also add a space for the token immediately after - closing the parenthesis! The resulting tokenization is a list of subtokens whose offsets, in the case of BPE, do NOT refer to the original string (which is not what is passed as input), but to each individual token. To illustrate what is returned, I change the example a bit to something more complicated:
The offsets are all relative to each token: each token is treated as an input which starts at 0 (special tokens, and the first token in the case of RoBERTa at least) or at 1 (the following tokens, because of the fake space added by the above-mentioned BPE trick - the exact encoding symbol for this space depends on the transformer implementation). But this offset behavior changed with version 4.25.1, because it was confusing to be honest, and we now have:
Note that the offsets now all start at 0 even when there is a leading space, which is good because it is now the same behavior as a BERT tokenizer, for instance. Should we rather use `self.tokenizer("".join(text_tokens), add_special_tokens=True, is_split_into_words=False, max_length=max_seq_length, truncation=True, return_offsets_mapping=True)`?
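For illustration, here is a minimal sketch of the pre-tokenized path, assuming a recent transformers (>= 4.25) and the public roberta-base checkpoint (not the DeLFT code itself); the exact offsets printed depend on the installed tokenizers version:

```python
from transformers import AutoTokenizer

# Assumes transformers >= 4.25 and access to the public "roberta-base" checkpoint.
# RobertaTokenizerFast needs add_prefix_space=True to accept pre-tokenized input.
tokenizer = AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True)

text_tokens = ["We", "are", "studying", "La", "3", "Ir", "2", "Ge", "2", "."]
encoding = tokenizer(
    text_tokens,
    is_split_into_words=True,
    add_special_tokens=True,
    return_offsets_mapping=True,
)

# Each offset pair is relative to the word it comes from, not to a full sentence string;
# word_ids() tells which original token each sub-token belongs to.
for subtok, offset, word_id in zip(
    tokenizer.convert_ids_to_tokens(encoding["input_ids"]),
    encoding["offset_mapping"],
    encoding.word_ids(),
):
    print(f"{subtok!r:>12} offset={offset} word_id={word_id}")
```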
I tested with the current DeLFT version and Transformers.
Great result for a base model, by the way!
We assign an empty label to the extra subtokens added by the tokenizers. This works better than repeating the label of the token on all its subtokens, and it is the original BERT approach ("We use the representation of the first sub-token as the input to the token-level classifier over the NER label set.").
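A minimal sketch of this labeling scheme, assuming a public roberta-base checkpoint and an empty string as the continuation label (the label names are illustrative, not the DeLFT internals):

```python
from transformers import AutoTokenizer

# Illustrative only: "roberta-base" and the label values are assumptions, not the DeLFT code.
tokenizer = AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True)

text_tokens = ["Superconductivity", "in", "La3Ir2Ge2"]
token_labels = ["O", "O", "B-material"]

encoding = tokenizer(text_tokens, is_split_into_words=True, add_special_tokens=True)

aligned_labels = []
previous_word_id = None
for word_id in encoding.word_ids():
    if word_id is None:
        aligned_labels.append("")                     # special tokens get an empty label
    elif word_id != previous_word_id:
        aligned_labels.append(token_labels[word_id])  # first sub-token keeps the real label
    else:
        aligned_labels.append("")                     # continuation sub-tokens get an empty label
    previous_word_id = word_id

print(list(zip(tokenizer.convert_ids_to_tokens(encoding["input_ids"]), aligned_labels)))
```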
Actually I just realized this is exactly the purpose of issue #128 :D
Thanks @kermitt2 for the extended answer. Indeed, I did not think of checking setup.py 😭. The RoBERTa-based model I'm referring to is the following: https://kdrive.infomaniak.com/app/share/104844/41053dc4-5398-4841-939d-c67583de96d6 With the previous version there were no errors, but the evaluation results were very low, indicating tokenization problems.
I have this issue only on Linux 😭 and I'm using CUDA 11.2.
@lfoppiano This RoBERTa model raises encoding issues because its tokenizer is not loaded properly. I could reproduce the error with

```
(env) lopez@trainer:~/delft$ python3 delft/applications/grobidTagger.py citation train_eval --architecture BERT_CRF --transformer /media/lopez/T51/embeddings/matbert-pedro-scicorpus-20000-vocab_100k/ --input data/sequenceLabelling/grobid/citation/citation-231022.train
```

because the input text includes non-US characters. The error is due to an encoding problem in the created BPE tokenizer (RobertaTokenizer), which adds wrong extra tokens and shifts everything. For example:
"É" get encoded as 2 tokens "Ã" and "ī". However in principle the right token exists in the vocab (
From this, the decoded string is then:
So the alignment is wrong, the encoding is wrong, and this is not recoverable afaik. From what I see, the source of the problem is that the local tokenizer file of this local RoBERTa model is not loaded (see https://github.com/kermitt2/delft/blob/master/delft/sequenceLabelling/wrapper.py#L168). The transformer tokenizer will be initialized without local_path: https://github.com/kermitt2/delft/blob/master/delft/sequenceLabelling/models.py#L253 So what is initialized, I think, is a default RobertaTokenizer without a vocabulary matching the actual model/input, obtained via HuggingFace only. To load the local transformer tokenizer via the current method, the model path needs to be registered in the configuration (see below).
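A minimal sketch of the intended loading, assuming the local directory contains the saved HuggingFace tokenizer files; the path is the example one from the command above:

```python
from transformers import AutoTokenizer

# Assumption: the local directory holds the saved tokenizer files of the "pedro" model.
local_model_dir = "/media/lopez/T51/embeddings/matbert-pedro-scicorpus-20000-vocab_100k/"

# Loading from the local path picks up the model's own vocab/merges instead of a
# default RobertaTokenizer fetched from the HuggingFace hub.
tokenizer = AutoTokenizer.from_pretrained(local_model_dir)

print(type(tokenizer).__name__, "vocab size:", tokenizer.vocab_size)
```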
By adding the model path info in the "transformers" section of the DeLFT configuration:

```
"transformers": [
    {
        "name": "matbert-pedro-scicorpus-20000-vocab_100k",
        "model_dir": "/media/lopez/T51/embeddings/matbert-pedro-scicorpus-20000-vocab_100k"
    }
],
```

I have the model loaded correctly, I think, with

```
(env) lopez@trainer:~/delft$ python3 delft/applications/grobidTagger.py citation train_eval --architecture BERT_CRF --transformer matbert-pedro-scicorpus-20000-vocab_100k --input data/sequenceLabelling/grobid/citation/citation-231022.train
```

But I still have the encoding error, unfortunately. If "É" and "ĠÉ" are in the vocabulary, I don't understand why "É" is not parsed as one character encoded over several bytes, but instead as 2 distinct tokens (both with offsets (0, 1), (0, 1) - so there is no way to know it was originally one single token).
So to summarize, there are still 2 problems:

1. the encoding problem of this model's saved BPE tokenizer (characters like "É" are not encoded with the expected vocabulary tokens);
2. the alignment of pre-tokenized input with BPE offsets.
I have normally fixed problem 2 above with #154. But there is still issue 1, the problem with the encoding of this model's tokenizer.
In addition, regarding the first problem, I think the tokenizer seems correctly loaded. However, if I replace it, the result is wrong, with another wrong character. @pjox any thoughts?
Indeed it seems correctly loaded, I also have the same output. But then, when tokenizing, I still get the wrong encoding.
Well here, this should be expected because
OK. I think I get it now. However, we have only:
However, if we change the list as follows:
And the reconstructed tokens are:
Am I right to say that this is not correct?
If we pass that input, then this is correct BPE behavior afaik: the number of characters before encoding and after decoding is not something fixed with BPE; what is fixed is the total number of bytes.
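A small sketch of why the byte count, not the character count, is what byte-level BPE preserves; this re-implements the usual GPT-2-style byte-to-unicode mapping purely for illustration (it is not this model's tokenizer code):

```python
# Illustration of the GPT-2-style byte-level fallback (not the model's actual tokenizer code).
def bytes_to_unicode():
    """Map every byte 0-255 to a printable unicode character, as byte-level BPE does."""
    kept = list(range(ord("!"), ord("~") + 1)) + \
           list(range(ord("¡"), ord("¬") + 1)) + \
           list(range(ord("®"), ord("ÿ") + 1))
    chars = kept[:]
    n = 0
    for b in range(256):
        if b not in kept:
            kept.append(b)
            chars.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(kept, chars)}

byte_encoder = bytes_to_unicode()

text = "É"                        # a single character...
raw_bytes = text.encode("utf-8")  # ...but two bytes in UTF-8: 0xC3 0x89
symbols = [byte_encoder[b] for b in raw_bytes]
print(raw_bytes, symbols)         # b'\xc3\x89' ['Ã', 'ī']
# Without a merge rule joining these two symbols (e.g. a merges.txt line "Ã ī"),
# the tokenizer falls back to emitting them as two separate tokens.
```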
Also note that the merges of the model's tokenizer include a rule which - if I have understood correctly - means to merge this sequence of symbols into the expected single token.
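A possible way to check this directly is to inspect the merges file; a sketch assuming the model directory contains a standard merges.txt (the path is the example one from above):

```python
# Sketch: list the merge rules that involve the byte-symbols of "É" (0xC3 -> 'Ã', 0x89 -> 'ī').
# The path is an assumption; merges.txt is the standard file name for a HuggingFace BPE tokenizer.
merges_path = "/media/lopez/T51/embeddings/matbert-pedro-scicorpus-20000-vocab_100k/merges.txt"

with open(merges_path, encoding="utf-8") as f:
    for rank, line in enumerate(f):
        if not line.strip() or line.startswith("#"):   # skip blanks and the "#version" header
            continue
        left, right = line.rstrip("\n").split(" ")
        # Each rule "left right" means: merge adjacent symbols left+right into one token.
        if "Ã" in left or "Ã" in right:
            print(rank, repr(left), repr(right), "->", repr(left + right))
```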
I've tested the "pedro" model using the branch of PR #154. There are two good news here:
for point 1, one of the 5 evaluation results were:
which become:
|
Fix issue for alignment of pre-tokenized input with BPE, see #150
Compared results here:
I think they make sense since this model was trained only on 20k iterations.
In any case I think we can merge this PR, since the problem seems to be more related to this specific model, as you said, @kermitt2.
Thanks @kermitt2 for all the useful insight on the BPE tokenizer! 😄 I have been looking into this problem with @lfoppiano for the last couple of weeks, but we cannot seem to find a solution/explanation for problem 1. I looked a bit into the model and I will continue to look as soon as I have more time; one clue might be in the flair library, as I have used Zeldarose-trained models with flair and never encountered a problem with the token alignment. I don't want to bother them too much (as I know they are very busy these days), but I'm also tagging @LoicGrobol, as they might have encountered this problem earlier (maybe in hopsparser).
Thanks for tagging, @pjox! Actually I'm not sure I understand everything here: is it an issue that only concerns the character offsets, or something bigger?
Oh, thanks a lot @LoicGrobol for taking the time to comment here! @kermitt2 can correct me if I'm wrong, but I think the problem is more about the encoding of some special characters even when they are in the vocabulary, and especially when the input is pre-tokenized. I am almost sure this is coming from HF, but I tagged you just in case. @kermitt2 found a solution last week in #154 (which I haven't been able to check), but apparently it was more of a hack to forcefully realign the offsets after encountering certain characters, from what @lfoppiano told me (please do correct me if I'm wrong).
One issue I had trouble with was that certain tokenizers (FlauBERT does, iirc, so possibly it's because of XLM) skip some characters altogether, leading to inconsistencies in combination with the input being split into words, but I don't think that's your problem here. If it is, though, I'm happy to dig into my archives, and in any case I'm curious about your issue; maybe it'll help me avoid trouble later 👀
To try to clarify: the remaining problem is neither the offsets (that was due to an update in the huggingface library) nor the re-alignment - there are indeed some tokens in a pre-tokenized input which are added with weird offsets and with different behavior from one BPE/model to another, but it's easy to skip these tokens just by looking at them (I tested RoBERTa models, CamemBERT, bart-base, albert-base-v2, and an XLM model). The problem is that the BPE tokenizer saved with the "Pedro" model (sorry to associate you with the issue, Pedro :D) is not working as expected, I think. To reproduce, see the example in #150 (comment). Basically, the vocab contains tokens that are apparently well loaded:
with merges in the tokenizer as expected for these tokens. However, when present in the input text sequence, the token 'ĠÉ' is not encoded as expected; it is encoded with 2 "sub-bytes" as a fallback:
So there is apparently something going wrong in the BPE tokenizer as initialized from the saved tokenizer files.
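A hedged diagnostic sketch for this kind of check, assuming the saved tokenizer loads from the local model directory used above (the path is only an example from earlier in the thread):

```python
from transformers import AutoTokenizer

# Path taken from the commands earlier in the thread; adjust as needed.
model_dir = "/media/lopez/T51/embeddings/matbert-pedro-scicorpus-20000-vocab_100k/"
tokenizer = AutoTokenizer.from_pretrained(model_dir)

vocab = tokenizer.get_vocab()
# Check whether the single-token forms are really in the loaded vocabulary.
for tok in ("É", "ĠÉ"):
    print(repr(tok), "in vocab:", tok in vocab)

# Then check how the character is actually encoded in context.
enc = tokenizer(" École", add_special_tokens=False)
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
```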
@pjox and I are working on a model trained with RoBERTa and using the BPE tokenizer, in particular with zeldarose, which uses slightly different special tokens.
We have some problems when the data is tokenized.
In particular, the sub-tokenization from the tokenizers somehow gets messed up when `is_split_into_words=True` and with the transformers library version 4.15.0 (tokenizers library version 0.10.3). The code here (preprocess.py:304):
```python
text_tokens = ['We', 'are', 'studying', 'the', 'material', 'La', '3', 'A', '2', 'Ge', '2', '(', 'A', '=', 'Ir', ',', 'Rh', ')', '.', 'The', 'critical', 'temperature', 'T', 'C', '=', '4', '.', '7', 'K', 'discovered', 'for', 'La', '3', 'Ir', '2', 'Ge', '2', 'in', 'this', 'work', 'is', 'by', 'about', '1', '.', '2', 'K', 'higher', 'than', 'that', 'found', 'for', 'La', '3', 'Rh', '2', 'Ge', '2', '.']
```
the output offsets are as follows:
```
[(0, 0), (0, 2), (1, 3), (1, 8), (1, 3), (1, 8), (1, 2), (1, 1), (1, 1), (1, 1), (1, 2), (1, 1), (1, 1), (1, 1), (1, 1), (1, 2), (1, 1), (1, 2), (1, 1), (1, 1), (1, 3), (1, 8), (1, 11), (1, 1), (1, 1), (1, 1), (1, 1), (1, 1), (1, 1), (1, 1), (1, 10), (1, 3), (1, 2), (1, 1), (1, 2), (1, 1), (1, 2), (1, 1), (1, 2), (1, 4), (1, 4), (1, 2), (1, 2), (1, 5), (1, 1), (1, 1), (1, 1), (1, 1), (1, 6), (1, 4), (1, 4), (1, 5), (1, 3), (1, 2), (1, 1), (1, 2), (1, 1), (1, 2), (1, 1), (1, 1), (0, 0)]
```
The first two items are correct; from the third, the sequence gets messed up (the third should be (0, 3), then (0, 8), etc.), and this gets wrongly reconstructed by the DeLFT code afterwards. If the pair does not start as expected, the code behavior is unclear to me: I don't understand why a `<PAD>` is added. If I pass the string and set `is_split_into_words=False`, I obtain the correct result:
```
[(0, 0), (0, 2), (2, 7), (7, 10), (10, 13), (13, 14), (14, 17), (17, 24), (24, 26), (26, 27), (27, 28), (28, 29), (29, 31), (31, 32), (32, 33), (33, 34), (34, 35), (35, 37), (37, 38), (38, 40), (40, 42), (42, 45), (45, 53), (53, 64), (64, 66), (66, 67), (67, 68), (68, 69), (69, 70), (70, 71), (71, 74), (74, 81), (81, 84), (84, 86), (86, 87), (87, 89), (89, 90), (90, 92), (92, 93), (93, 95), (95, 99), (99, 103), (103, 105), (105, 107), (107, 112), (112, 113), (113, 114), (114, 115), (115, 116), (116, 122), (122, 124), (124, 128), (128, 130), (130, 135), (135, 138), (138, 140), (140, 141), (141, 143), (143, 144), (144, 146), (146, 147), (147, 148), (0, 0)]
```
The option `is_split_into_words` was thought only for input split by spaces, which is not the case for most of our use cases. Here there is an explanation, but I did not understand it well: huggingface/transformers#8217
(In any case it works only with the Python tokenizers.)
Probably, we should consider an alternative which returns a list of lists, one for each token, and then, with some additional work, we should be able to reconstruct the output correctly.
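One possible direction, sketched under the assumptions that each token is encoded separately with a roberta-base stand-in tokenizer and that tokens are joined by single spaces; whether per-token offsets start at 0 or 1 depends on the tokenizers version, as discussed above:

```python
from transformers import AutoTokenizer

# Sketch only: "roberta-base" stands in for the actual model, and joining tokens with a
# single space is an assumption about the pre-tokenization.
tokenizer = AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True)

text_tokens = ["La", "3", "Ir", "2", "Ge", "2", "."]

sub_tokens, abs_offsets = [], []
cursor = 0
for tok in text_tokens:
    enc = tokenizer(tok, add_special_tokens=False, return_offsets_mapping=True)
    for sub_id, (start, end) in zip(enc["input_ids"], enc["offset_mapping"]):
        sub_tokens.append(tokenizer.convert_ids_to_tokens(sub_id))
        # Re-base the per-token offsets onto the position of the token in the joined string.
        # NB: whether `start` is 0 or 1 for the first sub-token depends on the tokenizers version.
        abs_offsets.append((cursor + start, cursor + end))
    cursor += len(tok) + 1   # +1 for the single space assumed between tokens

print(list(zip(sub_tokens, abs_offsets)))
```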
I've also found that updating the transformers library to 4.25.1 solves the problem on my M1 Mac, but opens up new problems on Linux.