Description
System Info
- transformers version: 4.48.2
- Platform: Linux-5.10.0-33-cloud-amd64-x86_64-with-glibc2.31
- Python version: 3.12.0
- Huggingface_hub version: 0.28.1
- Safetensors version: 0.5.2
- Accelerate version: not installed
- Accelerate config: not found
- PyTorch version (GPU?): 2.6.0+cu118 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: No
- Using GPU in script?: No
- GPU type: NVIDIA A100-SXM4-40GB
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Running this script should reproduce the error:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "StonyBrookNLP/teabreac-preasm-large-drop"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
```

It fails with:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[6], line 2
1 model_name = "StonyBrookNLP/teabreac-preasm-large-drop"
----> 2 tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
3 model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
4 # enable_digit_tokenization(tokenizer)
File ~/my/path/lib/python3.12/site-packages/transformers/models/auto/tokenization_auto.py:934, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
930 if tokenizer_class is None:
931 raise ValueError(
932 f"Tokenizer class {tokenizer_class_candidate} does not exist or is not currently imported."
933 )
--> 934 return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
936 # Otherwise we have to be creative.
937 # if model is an encoder decoder, the encoder tokenizer class is used by default
938 if isinstance(config, EncoderDecoderConfig):
File ~/my/path/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:2036, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, cache_dir, force_download, local_files_only, token, revision, trust_remote_code, *init_inputs, **kwargs)
2033 else:
2034 logger.info(f"loading file {file_path} from cache at {resolved_vocab_files[file_id]}")
-> 2036 return cls._from_pretrained(
2037 resolved_vocab_files,
2038 pretrained_model_name_or_path,
2039 init_configuration,
2040 *init_inputs,
2041 token=token,
2042 cache_dir=cache_dir,
2043 local_files_only=local_files_only,
2044 _commit_hash=commit_hash,
2045 _is_local=is_local,
2046 trust_remote_code=trust_remote_code,
2047 **kwargs,
2048 )
File ~/my/path/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:2276, in PreTrainedTokenizerBase._from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, token, cache_dir, local_files_only, _commit_hash, _is_local, trust_remote_code, *init_inputs, **kwargs)
2274 # Instantiate the tokenizer.
2275 try:
-> 2276 tokenizer = cls(*init_inputs, **init_kwargs)
2277 except import_protobuf_decode_error():
2278 logger.info(
2279 "Unable to load tokenizer model from SPM, loading from TikToken will be attempted instead."
2280 "(Google protobuf error: Tried to load SPM model with non-SPM vocab file).",
2281 )
File ~/my/path/lib/python3.12/site-packages/transformers/models/t5/tokenization_t5.py:189, in T5Tokenizer.__init__(self, vocab_file, eos_token, unk_token, pad_token, extra_ids, additional_special_tokens, sp_model_kwargs, legacy, add_prefix_space, **kwargs)
186 self._extra_ids = extra_ids
187 self.add_prefix_space = add_prefix_space
--> 189 super().__init__(
190 eos_token=eos_token,
191 unk_token=unk_token,
192 pad_token=pad_token,
193 extra_ids=extra_ids,
194 additional_special_tokens=additional_special_tokens,
195 sp_model_kwargs=self.sp_model_kwargs,
196 legacy=legacy,
197 add_prefix_space=add_prefix_space,
198 **kwargs,
199 )
File ~/my/path/lib/python3.12/site-packages/transformers/tokenization_utils.py:435, in PreTrainedTokenizer.__init__(self, **kwargs)
432 self._added_tokens_encoder: Dict[str, int] = {k.content: v for v, k in self._added_tokens_decoder.items()}
434 # 4 init the parent class
--> 435 super().__init__(**kwargs)
437 # 4. If some of the special tokens are not part of the vocab, we add them, at the end.
438 # the order of addition is the same as self.SPECIAL_TOKENS_ATTRIBUTES following `tokenizers`
439 self._add_tokens(
440 [token for token in self.all_special_tokens_extended if token not in self._added_tokens_encoder],
441 special_tokens=True,
442 )
File ~/my/path/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1407, in PreTrainedTokenizerBase.__init__(self, **kwargs)
1405 for key in kwargs:
1406 if hasattr(self, key) and callable(getattr(self, key)):
-> 1407 raise AttributeError(f"{key} conflicts with the method {key} in {self.__class__.__name__}")
1409 self.init_kwargs = copy.deepcopy(kwargs)
1410 self.name_or_path = kwargs.pop("name_or_path", "")
AttributeError: add_special_tokens conflicts with the method add_special_tokens in T5Tokenizer
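For context, the final frame in the traceback (`tokenization_utils_base.py:1407`) rejects any leftover init kwarg whose name matches a callable attribute on the tokenizer class. The snippet below is a minimal stand-alone sketch of that check; `FakeTokenizer` is a stand-in class of my own, not a transformers class. It suggests the repo's `tokenizer_config.json` contains an `add_special_tokens` entry, which collides with the `add_special_tokens` method every slow tokenizer defines.

```python
# Minimal sketch of the conflict check shown in the traceback: any kwarg
# whose name shadows a callable attribute on the class raises AttributeError.
class FakeTokenizer:
    def add_special_tokens(self, tokens):  # method with the conflicting name
        pass

    def __init__(self, **kwargs):
        for key in kwargs:
            if hasattr(self, key) and callable(getattr(self, key)):
                raise AttributeError(
                    f"{key} conflicts with the method {key} in {type(self).__name__}"
                )

try:
    FakeTokenizer(add_special_tokens=True)
except AttributeError as e:
    print(e)  # add_special_tokens conflicts with the method add_special_tokens in FakeTokenizer
```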
Expected behavior
I expected the tokenizer to load successfully. Two similar issues were raised before, but neither resolved my problem.
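As a possible stopgap until the underlying issue is resolved: assuming the error really does come from an `add_special_tokens` entry in the repo's `tokenizer_config.json`, one can strip that key from a local copy of the repo and load from disk. The helper below is my own sketch (the function name and `CONFLICTING_KEYS` set are hypothetical, not part of transformers), and whether dropping the entry changes tokenizer behavior would need to be verified.

```python
import json
from pathlib import Path

# Hypothetical workaround: remove config keys that shadow tokenizer methods,
# which is what triggers the AttributeError during __init__.
CONFLICTING_KEYS = {"add_special_tokens"}  # extend if other keys clash

def sanitize_tokenizer_config(config_path):
    """Rewrite a tokenizer_config.json without method-shadowing keys."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    cleaned = {k: v for k, v in config.items() if k not in CONFLICTING_KEYS}
    path.write_text(json.dumps(cleaned, indent=2))
    return cleaned
```

Usage would be something like: download the repo locally (e.g. `huggingface_hub.snapshot_download(model_name, local_dir="teabreac")`), run `sanitize_tokenizer_config("teabreac/tokenizer_config.json")`, then `AutoTokenizer.from_pretrained("teabreac", use_fast=False)`.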