normalize Ð and Đ into d #257
@ngdbao thank you for the PR
Aren't d and D with stroke different letters? I think that may negatively affect downstream tokenization in an n-gram language model.
impl CharNormalizer for VietnameseNormalizer {
    fn normalize_char(&self, c: char) -> Option<CharOrStr> {
        match c {
            'Ð' | 'Đ' | 'đ' => Some("d".to_string().into()), // not only Vietnamese, but also many European countries use these letters
This should say:
'Ð' | 'Đ' | 'đ' => Some("đ".to_string().into()),
since "d" is a different letter, no?
I think you are right, I would even prefer something like:
'Ð' => Some("Đ".into()),
'ð' => Some("đ".into()),
I'm not sure how Slovenian, Croatian, or other languages handle this, but Vietnamese people use a US-layout keyboard with separately installed software for typing Unicode, and not everyone installs it.
People would be happy if typing "Da Lat" produced the same result as "Đà Lạt", with "D" treated the same as "Đ" in digital text.
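To make the two positions in this thread concrete, here is a minimal standalone sketch. It does not use the crate's `CharNormalizer` trait or `CharOrStr` type; the helper names `fold_to_d`, `unify_eth`, and `map_chars` are hypothetical and exist only for illustration. The first function mirrors the mapping in the diff, the second mirrors the alternative suggested in review.

```rust
/// Mapping from the diff: the three listed characters become plain "d".
/// Hypothetical helper, not the crate's API.
fn fold_to_d(c: char) -> char {
    match c {
        // U+00D0 Ð (Eth), U+0110 Đ, U+0111 đ
        '\u{00D0}' | '\u{0110}' | '\u{0111}' => 'd',
        other => other,
    }
}

/// Alternative raised in review: only unify the look-alike Eth
/// with D-with-stroke, keeping the stroke.
fn unify_eth(c: char) -> char {
    match c {
        '\u{00D0}' => '\u{0110}', // Ð -> Đ
        '\u{00F0}' => '\u{0111}', // ð -> đ
        other => other,
    }
}

/// Apply a per-character mapping to a whole string.
fn map_chars(input: &str, f: fn(char) -> char) -> String {
    input.chars().map(f).collect()
}
```

With `fold_to_d`, a query typed without the Unicode input software can land on the same characters as the stroked form, whereas `unify_eth` keeps "Đ" distinct from "D". The sketch neither lowercases nor strips tone marks, so matching "Da Lat" against "Đà Lạt" would still depend on other normalization steps.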
Understood, let's keep your implementation then. Could you make the CI happy? That way I will be able to merge your PR :)
@ManyTheFish Doesn't the normalizer come before the tokenizer? Given that Vietnamese is an n-gram language, I would have thought that throwing away the d-with-stroke information might hurt the n-gram part of the code downstream. Also, if you normalize it here to just d, don't you also need to wait for all indexed documents to be reindexed for this to work?
Anyway...
I think the real question is whether this solution can actually meet the users' needs. It's possible it does.
@jzabroski yeah, what I expected may not be quite the same as what the PR does, and I'm not sure how it impacts things overall.
Hello @jzabroski and @ngdbao, the normalizers are processed after the word segmentation, so we already have the token at this step.
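As a rough illustration of that ordering, the sketch below segments first and normalizes each resulting token afterwards. It is not the crate's pipeline: `segment`, `normalize_token`, and `tokenize` are hypothetical stand-ins, and whitespace splitting is only a placeholder for real word segmentation.

```rust
/// Stand-in for word segmentation: split on whitespace.
/// (Placeholder only; real segmentation is assumed to be more involved.)
fn segment(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}

/// Stand-in for a per-token normalizer, using the mapping from this PR.
fn normalize_token(token: &str) -> String {
    token
        .chars()
        .map(|c| match c {
            // U+00D0 Ð (Eth), U+0110 Đ, U+0111 đ
            '\u{00D0}' | '\u{0110}' | '\u{0111}' => 'd',
            other => other,
        })
        .collect()
}

/// Segment first, then normalize each token that segmentation produced.
fn tokenize(text: &str) -> Vec<String> {
    segment(text).into_iter().map(normalize_token).collect()
}
```

Because token boundaries are fixed before `normalize_token` runs, the mapping changes characters inside tokens but not where the tokens are cut.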
@ManyTheFish sounds great now, can't wait to see it get merged :)
(@jzabroski, a detail: there is still the issue with the Rustfmt CI 😇)
Sorry, my bad. I'm starting with zero knowledge of Rust and am still setting up a local Rust environment.
Let's merge this version and try it.
About the reindexing, each new version of Meilisearch needs to reindex the data, so this change will not impact it more than usual.
Bors merge
Build succeeded.
Pull Request
Related issue
Fixes issue #<245>
What does this PR do?
PR checklist
Please check if your PR fulfills the following requirements: