
All Albanian models are broken #5637

@tuwid

Description


🐛 Bug

✓ Model loaded successfully!

============================================================
ENGLISH TO ALBANIAN TESTS
============================================================

English:  Hello, how are you?
Albanian: •••• How are you?•••How are you? •••• ••• How do you feel?•• • ••• • How are you feeling?•• How did you feel? •• •• • • • •• •

English:  Good morning!
Albanian: ️️🏻🏻️🏼🏼️♀️🏼 🏼🏻🏼♀️ ️♂️🏼♀️️♀️🏻 🏼 ♀️

English:  Where is the library?
Albanian: •••• •••• Where is the library?•••■ Where is the Library?••■•••

English:  Thank you very much.
Albanian: Thank you very much  Thank you so much

English:  The weather is beautiful today.
Albanian: ️️🏻️🏼️︎️🇪️♀️️ ️♂️🏼 🏼🏼♀️ 🏼 ♀️🏼 ?? ♀️ ♂️


To Reproduce

def translate_nllb(text, src_lang='eng_Latn', tgt_lang='sqi_Latn', model_size='3.3B'):
    """Use local NLLB model (doesn't work well for Albanian)"""
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
    import torch

    model_name = f"facebook/nllb-200-{model_size}"
    device = "mps" if torch.backends.mps.is_available() else "cpu"

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

    tokenizer.src_lang = src_lang
    tokenizer.tgt_lang = tgt_lang

    inputs = tokenizer(text, return_tensors="pt").input_ids.to(device)

    translated_tokens = model.generate(
        input_ids=inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_length=len(inputs[0]) + 50,
        num_beams=5,
        num_return_sequences=1,
        no_repeat_ngram_size=4,
        renormalize_logits=True
    )

    return tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]

(They don't work even with MosesPunctNormalizer and sentence splitting.)
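One failure mode worth ruling out before blaming the checkpoints themselves (my assumption, not something verified in this report): if `tgt_lang` is not an actual token in the tokenizer's vocabulary, `convert_tokens_to_ids` silently returns the `<unk>` id, and forcing generation to start from `<unk>` via `forced_bos_token_id` produces exactly this kind of garbage. Note that the FLORES-200 code list used by NLLB-200 spells Albanian as `als_Latn` (Tosk Albanian), not `sqi_Latn`. A minimal guard, written against any tokenizer-like object:

```python
def resolve_lang_token_id(tokenizer, lang_code):
    """Return the vocab id for a language code, raising if it is unknown.

    model.generate() happily accepts the <unk> id as forced_bos_token_id,
    so an unrecognized code degrades output quality instead of failing
    loudly. This check turns that silent failure into an error.
    """
    token_id = tokenizer.convert_tokens_to_ids(lang_code)
    if token_id == tokenizer.unk_token_id:
        raise ValueError(f"{lang_code!r} is not a known language token")
    return token_id
```

With the NLLB tokenizer above, this would be called as `resolve_lang_token_id(tokenizer, tgt_lang)` before `model.generate(...)`, and would immediately reveal whether `sqi_Latn` resolves to a real language token.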

> cat test_unesco_exact.py
#!/usr/bin/env python3
"""Test with UNESCO's EXACT implementation"""

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from sacremoses import MosesPunctNormalizer
import torch

# Initialize
print("Loading 3.3B model with UNESCO's exact approach...")
model_name = "facebook/nllb-200-3.3B"
device = "mps" if torch.backends.mps.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
punct_normalizer = MosesPunctNormalizer(lang="en")

print(f"Model loaded on {device}\n")

def translate_unesco_style(text, src_code='eng_Latn', tgt_code='sqi_Latn'):
    """Exact UNESCO implementation"""
    # Set languages
    tokenizer.src_lang = src_code
    tokenizer.tgt_lang = tgt_code

    # Normalize punctuation
    text = punct_normalizer.normalize(text)

    # Tokenize (UNESCO style - convert to list and back)
    input_tokens = tokenizer(text, return_tensors="pt").input_ids[0].cpu().numpy().tolist()

    # Generate
    translated_chunk = model.generate(
        input_ids=torch.tensor([input_tokens]).to(device),
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_code),
        max_length=len(input_tokens) + 50,
        num_return_sequences=1,
        num_beams=5,
        no_repeat_ngram_size=4,
        renormalize_logits=True
    )

    # Decode
    return tokenizer.batch_decode(translated_chunk, skip_special_tokens=True)[0]

# Test
print("="*60)
print("TESTING WITH UNESCO'S EXACT CODE")
print("="*60 + "\n")

tests = [
    "Hello, how are you?",
    "Good morning!",
    "Where is the library?",
    "Thank you very much.",
]

for text in tests:
    print(f"English:  {text}")
    translation = translate_unesco_style(text)
    print(f"Albanian: {translation}\n")

print("="*60)
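Since the exact splitter tried is not shown in the report, "sentence splitting" here presumably means something along these lines (a naive regex sketch; NLLB-200 is trained on sentence-level data, so multi-sentence inputs are normally split and translated one sentence at a time):

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!', or '?' followed by
    whitespace. A real pipeline would use a proper segmenter, but this is
    enough to feed NLLB one sentence at a time."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]
```

Each piece would then be passed through `translate_unesco_style` separately and the results joined, though per the report this does not fix the Albanian output either.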

Code sample

Expected behavior

Environment

  • fairseq Version (e.g., 1.0 or main):
  • PyTorch Version (e.g., 1.0)
  • OS (e.g., Linux):
  • How you installed fairseq (pip, source):
  • Build command you used (if compiling from source):
  • Python version:
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • Any other relevant information:

Additional context
