Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

normalize Ð and Đ into d #257

Merged
merged 3 commits into from
Jan 24, 2024

Conversation

ngdbao
Copy link
Contributor

@ngdbao ngdbao commented Jan 16, 2024

Pull Request

Related issue

Fixes issue #<245>

What does this PR do?

  • Add Vietnamese normalizer

PR checklist

Please check if your PR fulfills the following requirements:

  • [ x ] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
  • [ x ] Have you read the contributing guidelines?
  • [ x ] Have you made sure that the title is accurate and descriptive of the changes?

@ngdbao ngdbao mentioned this pull request Jan 16, 2024
@curquiza
Copy link
Member

@ngdbao thank you for the PR
can you fix the Rustfmt CI before we review it please? 😊

@jzabroski
Copy link

Isn't d and D with stroke a different letter? I think that may negatively affect downstream tokenization in an n-gram language model.

impl CharNormalizer for VietnameseNormalizer {
fn normalize_char(&self, c: char) -> Option<CharOrStr> {
match c {
'Ð' | 'Đ' | 'đ' => Some("d".to_string().into()), // not only Vietnamese, but also many European countries use these letters

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should say:

'Ð' | 'Đ' | 'đ' => Some("đ".to_string().into()),

since "d" is a different letter, no?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are right, I would even prefer something like:

'Ð' => Some("Đ".into()),
'ð' => Some("đ".into()),

https://www.compart.com/en/unicode/U+0111

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about how Slovenian, Croatian, or other countries handle this, but Vietnamese people use a US-layout keyboard with software for typing Unicode, which is manually installed, but not by everyone.

People would be happy if typing "Da Lat" would produce the same result as "Đà Lạt", "D" as same as "Đ" in digital letters.

This is from Airbnb
Screenshot 2024-01-18 at 22 42 43

This is from Skyscanner
Screenshot 2024-01-18 at 22 46 53

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood, lets keep your implementation then, could you make the CI happy? This way I will be able to merge your PR :)

Copy link

@jzabroski jzabroski Jan 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ManyTheFish Doesn’t the normalizer come before the tokenizer. Given Vietnamese is an n-gram language, I would have thought throwing away the d with stroke Metadata might hurt the n-gram part of the code (downstream). Also, If you normalize it here to just d, don't you also need to wait for all indexed documents to be reindexed for this to work?

Anyway....
I think the real question is can this solution actually meet the users needs.... It's possible it does.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jzabroski yeah, what I expected somehow may not be as same as what PR is, and not sure about how It impacts overall

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @jzabroski and @ngdbao, the normalizers are processed after the word segmentation, so we already have the token at this step.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ManyTheFish sounds great now, can't wait to see It get merged :)

@curquiza
Copy link
Member

curquiza commented Jan 17, 2024

(@jzabroski, a detail, there is still the issue with Rustfmt CI 😇)

@ngdbao
Copy link
Contributor Author

ngdbao commented Jan 18, 2024

(@jzabroski, a detail, there is still the issue with Rustfmt CI 😇)

sorry my bad, I'm starting with zero-knowledge in Rust, trying to arrange Rust local-environment

Copy link
Member

@ManyTheFish ManyTheFish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's merge this version and try it.
About the reindexing, each new version of Meilisearch needs to reindex the data, so this change will not impact it more than usual.

Bors merge

Copy link
Contributor

meili-bors bot commented Jan 24, 2024

Build succeeded:

  • tests

@meili-bors meili-bors bot merged commit 286ef64 into meilisearch:main Jan 24, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants