Improve NMT quality by adding 10-100 new tokens with SentencePiece #79

Open
johnml1135 opened this issue Dec 4, 2023 · 0 comments
Labels
enhancement New feature or request

Comments

@johnml1135
Collaborator

The working assumption is that, for at least some languages, adding 10-100 new compiled tokens (the right number potentially depending on how many instances occur, the size of the corpus, etc.) may improve overall training quality. This issue tracks implementing that once the research is complete (see sillsdev/silnlp#196).
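To make the idea concrete, here is a minimal sketch of one way the candidate tokens might be chosen. It is not the approach from the linked research, which is still open. It assumes candidates are whitespace-separated words missing from an existing vocabulary, ranked by frequency; the function name `select_new_tokens` and the parameters `max_new` and `min_count` are hypothetical.

```python
from collections import Counter


def select_new_tokens(corpus_lines, existing_vocab, max_new=100, min_count=2):
    """Pick up to max_new frequent words that the existing vocab lacks.

    Hypothetical helper: the selection criterion (whole-word frequency)
    is an assumption, not something the issue specifies.
    """
    counts = Counter()
    for line in corpus_lines:
        for word in line.split():
            if word not in existing_vocab:
                counts[word] += 1
    # Most frequent first; ties broken alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, count in ranked if count >= min_count][:max_new]


corpus = ["ŋa ŋa kɔ", "kɔ ŋa the", "the rare"]
vocab = {"the"}
new_tokens = select_new_tokens(corpus, vocab)
print(new_tokens)
```

If the tokenizer is retrained, a list like this could in principle be supplied through SentencePiece's `user_defined_symbols` training option. Whether Serval would retrain or extend an existing model is not decided in this issue.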

@johnml1135 johnml1135 added the enhancement New feature or request label Dec 4, 2023
@johnml1135 johnml1135 added this to Serval Dec 4, 2023
@github-project-automation github-project-automation bot moved this to 🆕 New in Serval Dec 4, 2023
@johnml1135 johnml1135 added this to the Serval API 1.3 milestone Dec 4, 2023
@johnml1135 johnml1135 removed this from the Serval API 1.3 milestone Jan 3, 2024
@johnml1135 johnml1135 changed the title from "Add SentencePiece tokenizer (with options)" to "Improve NMT quality by adding 10-100 new tokens with SentencePiece" Apr 4, 2025
@johnml1135 johnml1135 moved this from 🆕 New to 📋 Backlog in Serval Apr 4, 2025
Projects
Status: 📋 Backlog
Development

No branches or pull requests

2 participants