
T5 Tokenizer does not load with AttributeError: add_special_tokens conflicts with the method add_special_tokens in T5Tokenizer #36032

Closed
@QuantumStaticFR

Description

System Info

  • transformers version: 4.48.2
  • Platform: Linux-5.10.0-33-cloud-amd64-x86_64-with-glibc2.31
  • Python version: 3.12.0
  • Huggingface_hub version: 0.28.1
  • Safetensors version: 0.5.2
  • Accelerate version: not installed
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.6.0+cu118 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: No
  • Using GPU in script?: No
  • GPU type: NVIDIA A100-SXM4-40GB

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Running this script should reproduce the error:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "StonyBrookNLP/teabreac-preasm-large-drop"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

It fails with:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[6], line 2
      1 model_name = "StonyBrookNLP/teabreac-preasm-large-drop"
----> 2 tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
      3 model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
      4 # enable_digit_tokenization(tokenizer)

File ~/my/path/lib/python3.12/site-packages/transformers/models/auto/tokenization_auto.py:934, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    930     if tokenizer_class is None:
    931         raise ValueError(
    932             f"Tokenizer class {tokenizer_class_candidate} does not exist or is not currently imported."
    933         )
--> 934     return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    936 # Otherwise we have to be creative.
    937 # if model is an encoder decoder, the encoder tokenizer class is used by default
    938 if isinstance(config, EncoderDecoderConfig):

File ~/my/path/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:2036, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, cache_dir, force_download, local_files_only, token, revision, trust_remote_code, *init_inputs, **kwargs)
   2033     else:
   2034         logger.info(f"loading file {file_path} from cache at {resolved_vocab_files[file_id]}")
-> 2036 return cls._from_pretrained(
   2037     resolved_vocab_files,
   2038     pretrained_model_name_or_path,
   2039     init_configuration,
   2040     *init_inputs,
   2041     token=token,
   2042     cache_dir=cache_dir,
   2043     local_files_only=local_files_only,
   2044     _commit_hash=commit_hash,
   2045     _is_local=is_local,
   2046     trust_remote_code=trust_remote_code,
   2047     **kwargs,
   2048 )

File ~/my/path/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:2276, in PreTrainedTokenizerBase._from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, token, cache_dir, local_files_only, _commit_hash, _is_local, trust_remote_code, *init_inputs, **kwargs)
   2274 # Instantiate the tokenizer.
   2275 try:
-> 2276     tokenizer = cls(*init_inputs, **init_kwargs)
   2277 except import_protobuf_decode_error():
   2278     logger.info(
   2279         "Unable to load tokenizer model from SPM, loading from TikToken will be attempted instead."
   2280         "(Google protobuf error: Tried to load SPM model with non-SPM vocab file).",
   2281     )

File ~/my/path/lib/python3.12/site-packages/transformers/models/t5/tokenization_t5.py:189, in T5Tokenizer.__init__(self, vocab_file, eos_token, unk_token, pad_token, extra_ids, additional_special_tokens, sp_model_kwargs, legacy, add_prefix_space, **kwargs)
    186 self._extra_ids = extra_ids
    187 self.add_prefix_space = add_prefix_space
--> 189 super().__init__(
    190     eos_token=eos_token,
    191     unk_token=unk_token,
    192     pad_token=pad_token,
    193     extra_ids=extra_ids,
    194     additional_special_tokens=additional_special_tokens,
    195     sp_model_kwargs=self.sp_model_kwargs,
    196     legacy=legacy,
    197     add_prefix_space=add_prefix_space,
    198     **kwargs,
    199 )

File ~/my/path/lib/python3.12/site-packages/transformers/tokenization_utils.py:435, in PreTrainedTokenizer.__init__(self, **kwargs)
    432 self._added_tokens_encoder: Dict[str, int] = {k.content: v for v, k in self._added_tokens_decoder.items()}
    434 # 4 init the parent class
--> 435 super().__init__(**kwargs)
    437 # 4. If some of the special tokens are not part of the vocab, we add them, at the end.
    438 # the order of addition is the same as self.SPECIAL_TOKENS_ATTRIBUTES following `tokenizers`
    439 self._add_tokens(
    440     [token for token in self.all_special_tokens_extended if token not in self._added_tokens_encoder],
    441     special_tokens=True,
    442 )

File ~/my/path/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1407, in PreTrainedTokenizerBase.__init__(self, **kwargs)
   1405 for key in kwargs:
   1406     if hasattr(self, key) and callable(getattr(self, key)):
-> 1407         raise AttributeError(f"{key} conflicts with the method {key} in {self.__class__.__name__}")
   1409 self.init_kwargs = copy.deepcopy(kwargs)
   1410 self.name_or_path = kwargs.pop("name_or_path", "")

AttributeError: add_special_tokens conflicts with the method add_special_tokens in T5Tokenizer
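
For context, the check that raises is the generic kwarg validation in PreTrainedTokenizerBase.__init__ (tokenization_utils_base.py:1407 above): any init kwarg whose name matches a callable attribute of the tokenizer class is rejected. A minimal self-contained sketch of the same pattern (not transformers code) showing why a stray add_special_tokens entry collides with the method of the same name:

class TokenizerLike:
    # stand-in for the real add_special_tokens method every tokenizer defines
    def add_special_tokens(self, special_tokens_dict):
        pass

    def __init__(self, **kwargs):
        # same guard as PreTrainedTokenizerBase.__init__: reject any kwarg
        # whose name shadows a callable attribute of the class
        for key in kwargs:
            if hasattr(self, key) and callable(getattr(self, key)):
                raise AttributeError(f"{key} conflicts with the method {key} in {self.__class__.__name__}")

# if a saved tokenizer config feeds an "add_special_tokens" entry back into
# __init__, the guard trips:
TokenizerLike(add_special_tokens=True)
# AttributeError: add_special_tokens conflicts with the method add_special_tokens in TokenizerLike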

Expected behavior

I expected the tokenizer to load. Two similar issues were raised before, but they did not resolve my problem:

  1. Regression in tokenizer loading #33453
  2. [BUG] CodeGen 2.5 Tokenizer cannot be initialized anymore salesforce/CodeGen#94
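
In the meantime, a possible workaround sketch, assuming the conflicting add_special_tokens entry is coming from the model repo's tokenizer_config.json rather than from AutoTokenizer itself (I have not confirmed this); the local directory name below is arbitrary:

import json
from pathlib import Path

from huggingface_hub import snapshot_download
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "StonyBrookNLP/teabreac-preasm-large-drop"

# download the repo to a local folder so the tokenizer config can be edited
local_dir = Path(snapshot_download(model_name, local_dir="teabreac-local"))

config_path = local_dir / "tokenizer_config.json"
if config_path.exists():
    config = json.loads(config_path.read_text())
    # drop the entry that shadows the add_special_tokens method, if present
    config.pop("add_special_tokens", None)
    config_path.write_text(json.dumps(config, indent=2))

tokenizer = AutoTokenizer.from_pretrained(local_dir, use_fast=False)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)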
