normalize Ð and Đ into d #257

Merged · 3 commits · Jan 24, 2024
Changes from 1 commit
5 changes: 4 additions & 1 deletion charabia/Cargo.toml
@@ -37,7 +37,7 @@ litemap = "0.6.1"
zerovec = "0.9.3"

[features]
default = ["chinese", "hebrew", "japanese", "thai", "korean", "greek", "latin-camelcase", "latin-snakecase", "khmer"]
default = ["chinese", "hebrew", "japanese", "thai", "korean", "greek", "latin-camelcase", "latin-snakecase", "khmer", "vietnamese"]

# allow chinese specialized tokenization
chinese = ["dep:pinyin", "dep:jieba-rs"]
@@ -65,6 +65,9 @@ latin-camelcase = ["dep:finl_unicode"]

khmer = []

# allow vietnamese specialized tokenization
vietnamese = []

# allow splitting snake_case latin words
latin-snakecase = ["dep:finl_unicode"]

5 changes: 5 additions & 0 deletions charabia/src/normalizer/mod.rs
@@ -13,6 +13,7 @@ use self::greek::GreekNormalizer;
#[cfg(feature = "japanese-transliteration")]
pub use self::japanese::JapaneseNormalizer;
pub use self::lowercase::LowercaseNormalizer;
#[cfg(feature = "vietnamese")]
pub use self::vietnamese::VietnameseNormalizer;
use self::nonspacing_mark::NonspacingMarkNormalizer;
use self::quote::QuoteNormalizer;
use crate::segmenter::SegmentedTokenIter;
@@ -31,6 +32,8 @@ mod japanese;
mod lowercase;
mod nonspacing_mark;
mod quote;
#[cfg(feature = "vietnamese")]
mod vietnamese;

/// List of [`Normalizer`]s used by [`Normalize::normalize`] that are not considered lossy.
pub static NORMALIZERS: Lazy<Vec<Box<dyn Normalizer>>> = Lazy::new(|| {
@@ -54,6 +57,8 @@ pub static LOSSY_NORMALIZERS: Lazy<Vec<Box<dyn Normalizer>>> = Lazy::new(|| {
Box::new(GreekNormalizer),
Box::new(ArabicNormalizer),
Box::new(NonspacingMarkNormalizer),
#[cfg(feature = "vietnamese")]
Box::new(VietnameseNormalizer),
]
});

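The mod.rs diff above registers the normalizer behind a Cargo feature flag. As a self-contained sketch of that pattern (the `Normalizer` trait and the type names below are placeholders for illustration, not charabia's actual API), a feature can gate an entry in the normalizer list like this:

```rust
// Sketch of feature-gated registration. `Normalizer`, `LowercaseNorm`, and
// `VietnameseNorm` are stand-in names, not charabia's real types.
trait Normalizer {
    fn name(&self) -> &'static str;
}

struct LowercaseNorm;
impl Normalizer for LowercaseNorm {
    fn name(&self) -> &'static str {
        "lowercase"
    }
}

#[cfg(feature = "vietnamese")]
struct VietnameseNorm;
#[cfg(feature = "vietnamese")]
impl Normalizer for VietnameseNorm {
    fn name(&self) -> &'static str {
        "vietnamese"
    }
}

fn normalizers() -> Vec<Box<dyn Normalizer>> {
    let mut list: Vec<Box<dyn Normalizer>> = vec![Box::new(LowercaseNorm)];
    // This entry only exists when the crate is built with `--features vietnamese`;
    // otherwise the statement is removed at compile time.
    #[cfg(feature = "vietnamese")]
    list.push(Box::new(VietnameseNorm));
    list
}
```

Because the gating happens at compile time, users who don't enable the feature pay no runtime or binary-size cost for the extra normalizer.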
23 changes: 23 additions & 0 deletions charabia/src/normalizer/vietnamese.rs
@@ -0,0 +1,23 @@
use super::{CharNormalizer, CharOrStr};
use crate::Script;
use crate::Token;

pub struct VietnameseNormalizer;

impl CharNormalizer for VietnameseNormalizer {
    fn normalize_char(&self, c: char) -> Option<CharOrStr> {
        match c {
            // Not only Vietnamese: several other European languages use these letters too.
            'Ð' | 'Đ' | 'đ' => Some("d".to_string().into()),

[Inline review thread on the line above]

Reviewer:
This should say:

    'Ð' | 'Đ' | 'đ' => Some("đ".to_string().into()),

since "d" is a different letter, no?

Member:
I think you are right. I would even prefer something like:

    'Ð' => Some("Đ".into()),
    'ð' => Some("đ".into()),

https://www.compart.com/en/unicode/U+0111

Contributor (author):
I'm not sure how Slovenian, Croatian, or other languages handle this, but Vietnamese people type on US-layout keyboards with Unicode input software that has to be installed manually, and not everyone has it.

People would be happy if typing "Da Lat" produced the same results as "Đà Lạt", with "D" treated the same as "Đ" in digital text.

This is from Airbnb:
[Screenshot, 2024-01-18 22:42]

This is from Skyscanner:
[Screenshot, 2024-01-18 22:46]

Member:
Understood, let's keep your implementation then. Could you make the CI happy? That way I will be able to merge your PR :)

@jzabroski (Jan 22, 2024):
@ManyTheFish Doesn't the normalizer come before the tokenizer? Given that Vietnamese is an n-gram language, I would have thought throwing away the d-with-stroke metadata might hurt the n-gram part of the code downstream. Also, if you normalize it here to just "d", don't you also need to wait for all indexed documents to be reindexed for this to work?

Anyway, I think the real question is whether this solution actually meets users' needs. It's possible it does.

Contributor (author):
@jzabroski Yeah, what I expected may not be quite the same as what the PR does, and I'm not sure how it impacts things overall.

Member:
Hello @jzabroski and @ngdbao, the normalizers are processed after the word segmentation, so we already have the token at this step.

Contributor (author):
@ManyTheFish Sounds great now, can't wait to see it get merged :)
            _ => None,
        }
    }

    fn should_normalize(&self, token: &Token) -> bool {
        token.script == Script::Latin && token.lemma.chars().any(should_normalize_char)
    }
}

fn should_normalize_char(c: char) -> bool {
    matches!(c, 'Ð' | 'Đ' | 'đ')
}
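As a standalone illustration of the folding rule this PR implements (independent of charabia's `CharNormalizer` trait; the helper names here are made up for the sketch), the character mapping can be expressed as:

```rust
/// Fold the Latin letters Eth (Ð) and D-with-stroke (Đ/đ) to plain "d",
/// mirroring the mapping in VietnameseNormalizer. This is lossy on purpose:
/// it lets a query like "Da Lat" match the indexed form of "Đà Lạt".
fn fold_d_with_stroke(c: char) -> char {
    match c {
        'Ð' | 'Đ' | 'đ' => 'd',
        _ => c,
    }
}

/// Apply the fold to every character of a word (hypothetical helper;
/// the other diacritics are left for charabia's separate normalizers).
fn fold_word(word: &str) -> String {
    word.chars().map(fold_d_with_stroke).collect()
}
```

Note that uppercase Ð and Đ fold to lowercase "d": in the real pipeline a lowercase normalizer runs as well, so case is folded elsewhere.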