
T5 Tokenizer fails to load with AttributeError: add_special_tokens conflicts with the method add_special_tokens in T5Tokenizer #36032

Open
QuantumStaticFR opened this issue Feb 4, 2025 · 4 comments · May be fixed by #36070

QuantumStaticFR commented Feb 4, 2025

System Info

  • transformers version: 4.48.2
  • Platform: Linux-5.10.0-33-cloud-amd64-x86_64-with-glibc2.31
  • Python version: 3.12.0
  • Huggingface_hub version: 0.28.1
  • Safetensors version: 0.5.2
  • Accelerate version: not installed
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.6.0+cu118 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: No
  • Using GPU in script?: No
  • GPU type: NVIDIA A100-SXM4-40GB

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Running this script reproduces the error:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "StonyBrookNLP/teabreac-preasm-large-drop"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

It fails with:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[6], line 2
      1 model_name = "StonyBrookNLP/teabreac-preasm-large-drop"
----> 2 tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
      3 model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
      4 # enable_digit_tokenization(tokenizer)

File ~/my/path/lib/python3.12/site-packages/transformers/models/auto/tokenization_auto.py:934, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    930     if tokenizer_class is None:
    931         raise ValueError(
    932             f"Tokenizer class {tokenizer_class_candidate} does not exist or is not currently imported."
    933         )
--> 934     return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    936 # Otherwise we have to be creative.
    937 # if model is an encoder decoder, the encoder tokenizer class is used by default
    938 if isinstance(config, EncoderDecoderConfig):

File ~/my/path/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:2036, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, cache_dir, force_download, local_files_only, token, revision, trust_remote_code, *init_inputs, **kwargs)
   2033     else:
   2034         logger.info(f"loading file {file_path} from cache at {resolved_vocab_files[file_id]}")
-> 2036 return cls._from_pretrained(
   2037     resolved_vocab_files,
   2038     pretrained_model_name_or_path,
   2039     init_configuration,
   2040     *init_inputs,
   2041     token=token,
   2042     cache_dir=cache_dir,
   2043     local_files_only=local_files_only,
   2044     _commit_hash=commit_hash,
   2045     _is_local=is_local,
   2046     trust_remote_code=trust_remote_code,
   2047     **kwargs,
   2048 )

File ~/my/path/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:2276, in PreTrainedTokenizerBase._from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, token, cache_dir, local_files_only, _commit_hash, _is_local, trust_remote_code, *init_inputs, **kwargs)
   2274 # Instantiate the tokenizer.
   2275 try:
-> 2276     tokenizer = cls(*init_inputs, **init_kwargs)
   2277 except import_protobuf_decode_error():
   2278     logger.info(
   2279         "Unable to load tokenizer model from SPM, loading from TikToken will be attempted instead."
   2280         "(Google protobuf error: Tried to load SPM model with non-SPM vocab file).",
   2281     )

File ~/my/path/lib/python3.12/site-packages/transformers/models/t5/tokenization_t5.py:189, in T5Tokenizer.__init__(self, vocab_file, eos_token, unk_token, pad_token, extra_ids, additional_special_tokens, sp_model_kwargs, legacy, add_prefix_space, **kwargs)
    186 self._extra_ids = extra_ids
    187 self.add_prefix_space = add_prefix_space
--> 189 super().__init__(
    190     eos_token=eos_token,
    191     unk_token=unk_token,
    192     pad_token=pad_token,
    193     extra_ids=extra_ids,
    194     additional_special_tokens=additional_special_tokens,
    195     sp_model_kwargs=self.sp_model_kwargs,
    196     legacy=legacy,
    197     add_prefix_space=add_prefix_space,
    198     **kwargs,
    199 )

File ~/my/path/lib/python3.12/site-packages/transformers/tokenization_utils.py:435, in PreTrainedTokenizer.__init__(self, **kwargs)
    432 self._added_tokens_encoder: Dict[str, int] = {k.content: v for v, k in self._added_tokens_decoder.items()}
    434 # 4 init the parent class
--> 435 super().__init__(**kwargs)
    437 # 4. If some of the special tokens are not part of the vocab, we add them, at the end.
    438 # the order of addition is the same as self.SPECIAL_TOKENS_ATTRIBUTES following `tokenizers`
    439 self._add_tokens(
    440     [token for token in self.all_special_tokens_extended if token not in self._added_tokens_encoder],
    441     special_tokens=True,
    442 )

File ~/my/path/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1407, in PreTrainedTokenizerBase.__init__(self, **kwargs)
   1405 for key in kwargs:
   1406     if hasattr(self, key) and callable(getattr(self, key)):
-> 1407         raise AttributeError(f"{key} conflicts with the method {key} in {self.__class__.__name__}")
   1409 self.init_kwargs = copy.deepcopy(kwargs)
   1410 self.name_or_path = kwargs.pop("name_or_path", "")

AttributeError: add_special_tokens conflicts with the method add_special_tokens in T5Tokenizer

Expected behavior

I expected the tokenizer to load. Two similar issues were raised before, but they did not resolve my problem:

  1. Regression in tokenizer loading #33453
  2. [BUG] CodeGen 2.5 Tokenizer cannot be initialized anymore salesforce/CodeGen#94
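
A possible stopgap, sketched under the assumption that the failure comes from an add_special_tokens entry stored in this model's tokenizer_config.json (as the traceback suggests), is to download the files locally, drop the offending key, and load from the local copy:

import json
from pathlib import Path

from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

# Download the repo to an editable local directory (the path name is arbitrary).
local_dir = Path(snapshot_download("StonyBrookNLP/teabreac-preasm-large-drop", local_dir="teabreac-local"))

# Drop the stored kwarg that shadows the add_special_tokens method.
config_path = local_dir / "tokenizer_config.json"
config = json.loads(config_path.read_text())
config.pop("add_special_tokens", None)
config_path.write_text(json.dumps(config, indent=2))

tokenizer = AutoTokenizer.from_pretrained(local_dir, use_fast=False)
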
Rocketknight1 (Member) commented:

Hi @QuantumStaticFR, I believe this is because of a key in the tokenizer_config.json for that model that is no longer supported. Please raise the issue with the repo owners!
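
For reference, here is a quick, hypothetical check that mirrors the guard in PreTrainedTokenizerBase.__init__ and shows which stored keys collide with tokenizer methods:

import json

from huggingface_hub import hf_hub_download
from transformers import T5Tokenizer

path = hf_hub_download("StonyBrookNLP/teabreac-preasm-large-drop", "tokenizer_config.json")
with open(path) as f:
    config = json.load(f)

# Same test as the guard that raises: a saved kwarg whose name is also a
# callable attribute of the tokenizer class conflicts with that method.
conflicts = [key for key in config if callable(getattr(T5Tokenizer, key, None))]
print(conflicts)  # expected to include "add_special_tokens"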

QuantumStaticFR (Author) commented:

I'll tell them their config is no longer supported by Hugging Face and request that they update it?

Rocketknight1 (Member) commented:

Yeah, I think that's the right strategy. It's also possible that this is caused by the tokenizer not popping that kwarg correctly - cc @ArthurZucker?

ArthurZucker (Collaborator) commented:

Yeah, but I think we should support serializing and de-serializing this argument. It's a little bit annoying because we have to check the arg all the time, but let me open a PR (not the first time this happens).
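
For context, a rough sketch of that direction (hypothetical, not the actual change in the linked PR): pop the serialized value out of kwargs before the conflict check runs, so it no longer shadows the method, and keep it around for re-serialization.

# Hypothetical sketch, not the real transformers code.
class TokenizerBaseSketch:
    def __init__(self, **kwargs):
        # Consume the stored flag before the conflict check below sees it,
        # keeping it so save_pretrained could write it back out later.
        self._default_add_special_tokens = kwargs.pop("add_special_tokens", None)
        for key in kwargs:
            if hasattr(self, key) and callable(getattr(self, key)):
                raise AttributeError(f"{key} conflicts with the method {key} in {self.__class__.__name__}")
        self.init_kwargs = dict(kwargs)

    def add_special_tokens(self, special_tokens_dict):
        # Stand-in for the real method whose name the config entry collides with.
        pass

# With the pop in place, the config-style kwarg no longer raises:
tok = TokenizerBaseSketch(add_special_tokens=True, model_max_length=512)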
