🐛 Bug
✓ Model loaded successfully!
============================================================
ENGLISH TO ALBANIAN TESTS
============================================================
English: Hello, how are you?
Albanian: •••• How are you?•••How are you? •••• ••• How do you feel?•• • ••• • How are you feeling?•• How did you feel? •• •• • • • •• •
English: Good morning!
Albanian: ️️🏻🏻️🏼🏼️♀️🏼 🏼🏻🏼♀️ ️♂️🏼♀️️♀️🏻 🏼 ♀️
English: Where is the library?
Albanian: •••• •••• Where is the library?•••■ Where is the Library?••■•••
English: Thank you very much.
Albanian: Thank you very much Thank you so much
English: The weather is beautiful today.
Albanian: ️️🏻️🏼️︎️🇪️♀️️ ️♂️🏼 🏼🏼♀️ 🏼 ♀️🏼 ?? ♀️ ♂️
English: Hello, how are you?
Albanian: •••• How are you?•••How are you? •••• ••• How do you feel?•• • ••• • How are you feeling?•• How did you feel? •• •• • • • •• •
English: Good morning!
Albanian: ️️🏻🏻️🏼🏼️♀️🏼 🏼🏻🏼♀️ ️♂️🏼♀️️♀️🏻 🏼 ♀️
English: Where is the library?
Albanian: •••• •••• Where is the library?•••■ Where is the Library?••■•••
English: Thank you very much.
Albanian: Thank you very much Thank you so much
To Reproduce
```python
def translate_nllb(text, src_lang='eng_Latn', tgt_lang='sqi_Latn', model_size='3.3B'):
    """Use local NLLB model (doesn't work well for Albanian)"""
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
    import torch

    model_name = f"facebook/nllb-200-{model_size}"
    device = "mps" if torch.backends.mps.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
    tokenizer.src_lang = src_lang
    tokenizer.tgt_lang = tgt_lang
    inputs = tokenizer(text, return_tensors="pt").input_ids.to(device)
    translated_tokens = model.generate(
        input_ids=inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_length=len(inputs[0]) + 50,
        num_beams=5,
        num_return_sequences=1,
        no_repeat_ngram_size=4,
        renormalize_logits=True,
    )
    return tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
```
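One diagnostic worth running before blaming the model (a hedged sketch; `check_lang_code` is a helper name introduced here, not part of transformers): if the language code passed as `tgt_lang` is not in the tokenizer's vocabulary, `convert_tokens_to_ids` silently returns the `<unk>` id, so `forced_bos_token_id` forces generation to start from `<unk>`, which can yield exactly this kind of garbage. NLLB-200's published FLORES-200 list uses `als_Latn` for Tosk Albanian rather than `sqi_Latn`, so this seems worth verifying:

```python
def check_lang_code(tokenizer, code):
    """Return True if `code` maps to a real token in the tokenizer's
    vocabulary, False if it silently falls back to the <unk> id."""
    return tokenizer.convert_tokens_to_ids(code) != tokenizer.unk_token_id

# Intended usage (requires downloading the tokenizer):
#   tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-3.3B")
#   check_lang_code(tokenizer, "sqi_Latn")  # suspect code used above
#   check_lang_code(tokenizer, "als_Latn")  # FLORES-200 code for Tosk Albanian
```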
(They don't work even with MosesPunctNormalizer and sentence splitting.)
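For reference, the sentence-splitting workaround mentioned above can be sketched like this (a minimal illustration with a naive regex splitter and a pluggable `translate_fn`; the actual pipeline presumably combines this with MosesPunctNormalizer):

```python
import re

def split_sentences(text):
    # Naive splitter on sentence-final punctuation; enough to show the
    # shape of the workaround, not production-grade segmentation.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def translate_by_sentence(text, translate_fn):
    # Translate each sentence independently, then rejoin -- this is the
    # per-sentence workaround that still produced garbage output.
    return " ".join(translate_fn(s) for s in split_sentences(text))
```

For example, `translate_by_sentence("Hello. How are you?", translate_nllb)` feeds the model one sentence at a time.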
> cat test_unesco_exact.py

```python
#!/usr/bin/env python3
"""Test with UNESCO's EXACT implementation"""
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from sacremoses import MosesPunctNormalizer
import torch

# Initialize
print("Loading 3.3B model with UNESCO's exact approach...")
model_name = "facebook/nllb-200-3.3B"
device = "mps" if torch.backends.mps.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
punct_normalizer = MosesPunctNormalizer(lang="en")
print(f"Model loaded on {device}\n")

def translate_unesco_style(text, src_code='eng_Latn', tgt_code='sqi_Latn'):
    """Exact UNESCO implementation"""
    # Set languages
    tokenizer.src_lang = src_code
    tokenizer.tgt_lang = tgt_code
    # Normalize punctuation
    text = punct_normalizer.normalize(text)
    # Tokenize (UNESCO style - convert to list and back)
    input_tokens = tokenizer(text, return_tensors="pt").input_ids[0].cpu().numpy().tolist()
    # Generate
    translated_chunk = model.generate(
        input_ids=torch.tensor([input_tokens]).to(device),
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_code),
        max_length=len(input_tokens) + 50,
        num_return_sequences=1,
        num_beams=5,
        no_repeat_ngram_size=4,
        renormalize_logits=True,
    )
    # Decode
    return tokenizer.batch_decode(translated_chunk, skip_special_tokens=True)[0]

# Test
print("=" * 60)
print("TESTING WITH UNESCO'S EXACT CODE")
print("=" * 60 + "\n")
tests = [
    "Hello, how are you?",
    "Good morning!",
    "Where is the library?",
    "Thank you very much.",
]
for text in tests:
    print(f"English: {text}")
    translation = translate_unesco_style(text)
    print(f"Albanian: {translation}\n")
print("=" * 60)
```
Code sample
Expected behavior
Coherent Albanian translations, rather than the bullet characters, emoji, and English echoes shown above.
Environment
- fairseq Version (e.g., 1.0 or main):
- PyTorch Version (e.g., 1.0)
- OS (e.g., Linux):
- How you installed fairseq (pip, source):
- Build command you used (if compiling from source):
- Python version:
- CUDA/cuDNN version:
- GPU models and configuration:
- Any other relevant information:
Additional context