Non-NllbTokenizers (non-fast) aren't able to save and load new language codes #44

I was working on some tests with the stas/tiny-m2m_100 model, and I ran into an issue where new language codes added to the tokenizer do not save properly. When the tokenizer is loaded from the saved tokenizer files, it does not recognize the new language codes and crashes. This is the error message I'm getting:
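The error output itself is not reproduced here, but a minimal sketch of the failing round trip might look like the following. It assumes the new code is registered as an additional special token; how the tests actually added the code is not shown in this issue, so treat "xy"/"__xy__" as hypothetical placeholders:

```python
from transformers import M2M100Tokenizer

# Register a hypothetical new language code as an additional special token.
tokenizer = M2M100Tokenizer.from_pretrained(
    "stas/tiny-m2m_100", additional_special_tokens=["__xy__"]
)
tokenizer.save_pretrained("tiny-m2m_100-with-xy")

# Reloading succeeds, but the new code is unusable as a language code:
# the slow tokenizer rebuilds its language-code maps from a hard-coded
# language list that does not include "xy", so this raises a KeyError.
reloaded = M2M100Tokenizer.from_pretrained("tiny-m2m_100-with-xy")
reloaded.src_lang = "xy"
```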
It's failing for MBartTokenizer and M2M100Tokenizer. It's succeeding for NllbTokenizer, NllbTokenizerFast, and MBartTokenizerFast. The split does not fall evenly along the PreTrainedTokenizer/PreTrainedTokenizerFast line, so I'm not sure what the differentiator is yet.
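A sketch of the round-trip check behind this comparison (the checkpoints and placeholder codes are assumptions for illustration; each tokenizer family uses its own language-code format):

```python
import tempfile

from transformers import (
    M2M100Tokenizer,
    MBartTokenizer,
    MBartTokenizerFast,
    NllbTokenizer,
    NllbTokenizerFast,
)

# (tokenizer class, checkpoint, special token to add, language code to set)
cases = [
    (MBartTokenizer, "facebook/mbart-large-cc25", "xy_XY", "xy_XY"),
    (MBartTokenizerFast, "facebook/mbart-large-cc25", "xy_XY", "xy_XY"),
    (M2M100Tokenizer, "stas/tiny-m2m_100", "__xy__", "xy"),
    (NllbTokenizer, "facebook/nllb-200-distilled-600M", "xyz_Latn", "xyz_Latn"),
    (NllbTokenizerFast, "facebook/nllb-200-distilled-600M", "xyz_Latn", "xyz_Latn"),
]

for cls, checkpoint, token, code in cases:
    tok = cls.from_pretrained(checkpoint, additional_special_tokens=[token])
    with tempfile.TemporaryDirectory() as tmp:
        tok.save_pretrained(tmp)
        reloaded = cls.from_pretrained(tmp)
    try:
        # Using the new code after the save/load round trip is the step
        # that crashes for the failing tokenizer types.
        reloaded.src_lang = code
        print(f"{cls.__name__}: ok")
    except Exception as exc:
        print(f"{cls.__name__}: fails ({exc!r})")
```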
This sounds like a bug in Huggingface transformers or tokenizers. If we can figure out exactly what the issue is, we could submit an issue to Huggingface. Until then, we should just work around it.
Since this appears to affect only a subset of PreTrainedTokenizer types and none of the Fast tokenizers, it might just be an issue with certain slow tokenizer types not having been updated in a while to stay compatible with later Huggingface releases. This shouldn't be an issue for us as long as we use Fast tokenizers. @ddaspit Should we still submit an issue to Huggingface for this?
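For comparison, the fast-tokenizer round trip that works might look like this (the checkpoint and the xyz_Latn code are assumptions for illustration):

```python
from transformers import NllbTokenizerFast

tokenizer = NllbTokenizerFast.from_pretrained(
    "facebook/nllb-200-distilled-600M",
    additional_special_tokens=["xyz_Latn"],  # hypothetical new code
)
tokenizer.save_pretrained("nllb-with-xyz")

# The fast tokenizer resolves language codes through the saved vocabulary
# rather than a fixed language list, so the reloaded code still works.
reloaded = NllbTokenizerFast.from_pretrained("nllb-with-xyz")
reloaded.src_lang = "xyz_Latn"
```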
I wouldn't worry about submitting an issue to Huggingface. We should leave this issue open to capture the inability to work with non-fast tokenizers. Once we put in a workaround for these tokenizers, we can close this issue.
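One possible shape for such a workaround is to re-register the code in the tokenizer's language maps after every load. This is only a sketch: it touches private attributes of the slow M2M100Tokenizer, and the attribute names are assumptions based on the transformers source at the time:

```python
from transformers import M2M100Tokenizer


def register_lang_code(tokenizer: M2M100Tokenizer, code: str, token: str) -> None:
    """Re-register a custom language code on a reloaded slow tokenizer.

    Patches the private lookup tables that M2M100Tokenizer builds from its
    hard-coded language list (attribute names are assumptions).
    """
    token_id = tokenizer.convert_tokens_to_ids(token)
    tokenizer.lang_code_to_token[code] = token
    tokenizer.lang_token_to_id[token] = token_id
    tokenizer.lang_code_to_id[code] = token_id
    tokenizer.id_to_lang_token[token_id] = token


reloaded = M2M100Tokenizer.from_pretrained("tiny-m2m_100-with-xy")
register_lang_code(reloaded, "xy", "__xy__")
reloaded.src_lang = "xy"  # now resolves through the patched maps
```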
I don't know if this is an issue anymore. I will close it for now.