normalize Ð and Đ into d #257

ngdbao · 2024-01-16T17:49:26Z

Pull Request

Related issue

Fixes #245

What does this PR do?

Add Vietnamese normalizer

PR checklist

Please check if your PR fulfills the following requirements:

[x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
[x] Have you read the contributing guidelines?
[x] Have you made sure that the title is accurate and descriptive of the changes?

curquiza · 2024-01-16T18:12:33Z

@ngdbao thank you for the PR!
Can you fix the Rustfmt CI before we review it, please? 😊

jzabroski · 2024-01-16T20:21:13Z

Aren't d with stroke and D with stroke (đ, Đ) different letters from plain d? I think this change may negatively affect downstream tokenization in an n-gram language model.

jzabroski · 2024-01-16T20:22:47Z

charabia/src/normalizer/vietnamese.rs

+impl CharNormalizer for VietnameseNormalizer {
+    fn normalize_char(&self, c: char) -> Option<CharOrStr> {
+        match c {
+            'Ð' | 'Đ' | 'đ' => Some("d".to_string().into()), // not only Vietnamese, but also many European countries use these letters


This should say:

'Ð' | 'Đ' | 'đ' => Some("đ".to_string().into()),

since "d" is a different letter, no?

ManyTheFish · 2024-01-17

I think you are right. I would even prefer something like:

'Ð' => Some("Đ".into()), 'ð' => Some("đ".into()),

https://www.compart.com/en/unicode/U+0111

ngdbao · 2024-01-18

I'm not sure how Slovenian, Croatian, or other languages handle this, but Vietnamese users type on a US-layout keyboard and rely on input software for Unicode characters. That software has to be installed manually, and not everyone has it.

People would be happy if typing "Da Lat" returned the same results as "Đà Lạt", with "D" treated the same as "Đ" in digital text.

This is from Airbnb

This is from Skyscanner
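
For readers comparing the two proposals in this thread, here is a minimal sketch of both mappings. It is not charabia's code: the function names are hypothetical, the helpers return a plain `char` instead of `Option<CharOrStr>`, handling of lowercase eth ('ð') in the first function is an assumption, and tone marks are assumed to be stripped by a separate step (hence "Đa Lat" rather than "Đà Lạt" in the usage note). A dedicated mapping is needed because U+0110 and U+0111 have no Unicode decomposition, so decomposition-based diacritic removal leaves the stroke in place.

```rust
/// Hypothetical helper mirroring the mapping in the diff above:
/// eth and d-with-stroke are folded to a plain ASCII `d`.
/// Lowercase eth ('ð') is included here as an assumption.
fn fold_to_plain_d(c: char) -> char {
    match c {
        // U+00D0 / U+00F0 (eth) and U+0110 / U+0111 (d with stroke)
        'Ð' | 'ð' | 'Đ' | 'đ' => 'd',
        other => other,
    }
}

/// The alternative raised in review: unify eth with d-with-stroke but
/// keep the stroke, so the result stays distinct from a plain `d`.
fn unify_keeping_stroke(c: char) -> char {
    match c {
        'Ð' => 'Đ',
        'ð' => 'đ',
        other => other,
    }
}

/// Lowercases the text, then applies a per-character mapping.
fn normalize_with(text: &str, map: fn(char) -> char) -> String {
    text.chars().flat_map(char::to_lowercase).map(map).collect()
}
```

With `fold_to_plain_d`, `normalize_with("Đa Lat", ...)` and `normalize_with("Da Lat", ...)` produce the same string, which is the behavior requested here. With `unify_keeping_stroke` the two stay different, so a user typing a plain "D" would not match.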

ManyTheFish · 2024-01-22

Understood, let's keep your implementation then. Could you make the CI happy? This way I will be able to merge your PR :)

jzabroski · 2024-01-22

@ManyTheFish Doesn't the normalizer come before the tokenizer? Given Vietnamese is an n-gram language, I would have thought throwing away the d-with-stroke metadata might hurt the n-gram part of the code (downstream). Also, if you normalize it here to just d, don't you also need to wait for all indexed documents to be reindexed for this to work?

Anyway, I think the real question is whether this solution actually meets the users' needs. It's possible it does.

ngdbao · 2024-01-22

@jzabroski yeah, what I expected may not be exactly what this PR does, and I'm not sure how it impacts things overall.

ManyTheFish · 2024-01-22

Hello @jzabroski and @ngdbao, the normalizers are processed after the word segmentation, so we already have the token at this step.
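
The ordering described above can be sketched as follows. This is only an illustration under stated assumptions: `segment`, `normalize_token`, and `tokenize` are hypothetical names, and whitespace splitting stands in for the real language-specific segmenter. The point is that the token boundaries are fixed before any character is folded, so the normalizer cannot change how the text is split.

```rust
/// Stand-in for word segmentation. Whitespace splitting is only an
/// illustration; the real segmenter is language-specific.
fn segment(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}

/// Stand-in for a per-character normalizer (hypothetical, simplified):
/// lowercases, then folds eth and d-with-stroke to a plain `d`.
fn normalize_token(token: &str) -> String {
    token
        .chars()
        .flat_map(char::to_lowercase)
        .map(|c| if c == 'đ' || c == 'ð' { 'd' } else { c })
        .collect()
}

/// Segment first, then normalize each token, so normalization
/// cannot move token boundaries.
fn tokenize(text: &str) -> Vec<String> {
    segment(text).into_iter().map(normalize_token).collect()
}
```

For an input such as "Đi Đa Lat", `tokenize` yields the same number of tokens as `segment`; only the characters inside each token differ.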

ngdbao · 2024-01-24

@ManyTheFish sounds great now, can't wait to see it get merged :)

curquiza · 2024-01-17T14:32:58Z

(@jzabroski, a detail, there is still the issue with Rustfmt CI 😇)

ngdbao · 2024-01-18T15:49:49Z

> (@jzabroski, a detail, there is still the issue with Rustfmt CI 😇)

Sorry, my bad. I'm starting with zero knowledge of Rust and am still setting up a local Rust environment.

ManyTheFish · 2024-01-24

Let's merge this version and try it.
About the reindexing, each new version of Meilisearch needs to reindex the data, so this change will not impact it more than usual.

Bors merge
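
The reindexing question raised in the review thread can be illustrated with a toy inverted index. This is a sketch under assumptions, not how Meilisearch stores data: all names are hypothetical, and the "old" and "new" normalizers are simplified stand-ins. It shows why documents indexed with the old normalizer would not match queries folded by the new one until they are indexed again.

```rust
use std::collections::HashMap;

/// Simplified stand-in for the normalizer before this change.
fn old_normalize(token: &str) -> String {
    token.to_lowercase()
}

/// Simplified stand-in for the normalizer after this change.
fn new_normalize(token: &str) -> String {
    token.to_lowercase().replace('đ', "d")
}

/// Builds a toy inverted index: normalized token -> document ids.
fn build_index(docs: &[&str], normalize: fn(&str) -> String) -> HashMap<String, Vec<usize>> {
    let mut index: HashMap<String, Vec<usize>> = HashMap::new();
    for (doc_id, doc) in docs.iter().enumerate() {
        for token in doc.split_whitespace() {
            index.entry(normalize(token)).or_default().push(doc_id);
        }
    }
    index
}

/// Looks up a single-word query after normalizing it.
fn search(
    index: &HashMap<String, Vec<usize>>,
    query: &str,
    normalize: fn(&str) -> String,
) -> Vec<usize> {
    index.get(&normalize(query)).cloned().unwrap_or_default()
}
```

An index built with `old_normalize` stores the key "đa" for the document "Đa Lat", so a query "Da" normalized with `new_normalize` finds nothing. Rebuilding the index with `new_normalize` stores "da" and the query matches.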

meili-bors · 2024-01-24T10:03:57Z

Build succeeded:

tests

normalize Ð and Đ

4189dad

ngdbao mentioned this pull request Jan 16, 2024

Ð vs Đ differentiate #245

Closed

format code

14be492

jzabroski reviewed Jan 16, 2024

fix format comply CI

4103a73

ManyTheFish approved these changes Jan 24, 2024

meili-bors bot merged commit 286ef64 into meilisearch:main Jan 24, 2024
4 checks passed
