Merge #317
317: Latin camelcase wrong segmentation r=ManyTheFish a=PedroTurik

# Pull Request
All interesting changes are in charabia/src/segmenter/latin/camel_case.rs
other changes are import reordering caused by cargo fmt.

This change does appear to introduce a small speed regression in some benchmarks, but I'm opening it for review nonetheless.

This is my first contribution to any meilisearch repo.

## Related issue
Fixes [#289](#289)

## What does this PR do?
This PR changes the Latin camelCase segmentation to address the following cases:
- openSSL should be segmented as ['open', 'SSL']
- openSSLError should be segmented as ['open', 'SSL', 'Error']

This is done by introducing a helper iterator that lets the main iterator "peek" at the next char. As far as I know, this is needed with the current `StrGroupBy::linear_group_by` solution, since the segmentation sometimes depends on whether a lowercase letter follows the character currently being analyzed (as in openSSLError). See the sketch below.
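To illustrate the intended behavior, here is a minimal sketch using charabia's public `Segment` trait (not part of the diff; it assumes `segment_str` is available and that the camelCase segmenter is the only split that applies to these inputs):

```rust
use charabia::Segment;

fn main() {
    // A run of consecutive uppercase letters stays together; the last uppercase
    // letter only starts a new word when a lowercase letter follows it.
    let segments: Vec<&str> = "openSSL".segment_str().collect();
    assert_eq!(segments, ["open", "SSL"]);

    let segments: Vec<&str> = "openSSLError".segment_str().collect();
    assert_eq!(segments, ["open", "SSL", "Error"]);
}
```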

## Benchmarks

The benchmarks were run once WITH the changes, and once more after commenting the changes out. The output shown here is from the run after removing the changes.

```bash 
cargo bench
```
What I believe to be the relevant output is the following:

```bash
segment/132/Latin/Eng   time:   [2.5651 µs 2.5714 µs 2.5786 µs]
                        change: [+0.6549% +0.8593% +1.0487%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  3 (3.00%) high severe

segment/132/Latin/Fra   time:   [2.3521 µs 2.3547 µs 2.3572 µs]
                        change: [-0.3826% -0.0443% +0.2662%] (p = 0.79 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) low mild

segment/131/Latin/Deu   time:   [2.1346 µs 2.1372 µs 2.1403 µs]
                        change: [+0.4418% +0.5896% +0.7584%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

segment/363/Latin/Eng   time:   [6.6601 µs 6.6749 µs 6.6911 µs]
                        change: [+3.1214% +3.4133% +3.7267%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 18 outliers among 100 measurements (18.00%)
  8 (8.00%) low mild
  3 (3.00%) high mild
  7 (7.00%) high severe

segment/363/Latin/Fra   time:   [6.5405 µs 6.5481 µs 6.5566 µs]
                        change: [-0.4942% -0.2417% +0.0139%] (p = 0.07 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

segment/365/Latin/Vie   time:   [5.6049 µs 5.6109 µs 5.6170 µs]
                        change: [-4.2406% -3.9811% -3.7043%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild

tokenize/132/Latin/Eng  time:   [6.1291 µs 6.1349 µs 6.1406 µs]
                        change: [-1.4925% -1.3344% -1.1740%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe
tokenize/132/Latin/Fra  time:   [6.9543 µs 6.9617 µs 6.9696 µs]
                        change: [-1.5821% -1.3067% -1.0380%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) low mild
  4 (4.00%) high mild
  1 (1.00%) high severe

tokenize/131/Latin/Deu  time:   [6.4793 µs 6.4916 µs 6.5050 µs]
                        change: [-3.1861% -2.9351% -2.6852%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild

tokenize/363/Latin/Eng  time:   [15.510 µs 15.523 µs 15.535 µs]
                        change: [-8.2776% -7.9652% -7.6477%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  9 (9.00%) high mild
tokenize/363/Latin/Fra  time:   [18.026 µs 18.044 µs 18.064 µs]
                        change: [-2.4157% -1.5634% -1.0380%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild

tokenize/365/Latin/Vie  time:   [27.604 µs 27.626 µs 27.651 µs]
                        change: [-2.5536% -1.9732% -1.4007%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe

tokenize/354/Latin/Deu  time:   [17.874 µs 17.904 µs 17.936 µs]
                        change: [+2.2411% +2.4528% +2.6719%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
```

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?


Co-authored-by: Pedro Turik Firmino <[email protected]>
meili-bors[bot] and PedroTurik authored Nov 6, 2024
2 parents 4a16cd3 + 6d8ae94 commit 6fd2fcc
Showing 5 changed files with 25 additions and 11 deletions.
charabia/src/lib.rs (3 changes: 1 addition & 2 deletions)

```diff
@@ -59,9 +59,8 @@ mod tokenizer;
 pub use detection::{Language, Script, StrDetection};
 pub use normalizer::Normalize;
 pub use segmenter::Segment;
-pub use token::{SeparatorKind, Token, TokenKind};
-
 #[cfg(test)]
 pub use token::StaticToken;
+pub use token::{SeparatorKind, Token, TokenKind};
 
 pub use crate::tokenizer::{ReconstructedTokenIter, Tokenize, Tokenizer, TokenizerBuilder};
```

charabia/src/normalizer/mod.rs (3 changes: 1 addition & 2 deletions)

```diff
@@ -2,6 +2,7 @@ use std::borrow::Cow;
 
 use once_cell::sync::Lazy;
 
+pub use self::ae_oe_normalizer::AeOeNormalizer;
 pub use self::arabic::ArabicNormalizer;
 #[cfg(feature = "chinese-normalization")]
 pub use self::chinese::ChineseNormalizer;
@@ -24,8 +25,6 @@ pub use self::vietnamese::VietnameseNormalizer;
 use crate::segmenter::SegmentedTokenIter;
 use crate::Token;
 
-pub use self::ae_oe_normalizer::AeOeNormalizer;
-
 mod arabic;
 #[cfg(feature = "chinese-normalization")]
 mod chinese;
```

charabia/src/segmenter/latin/camel_case.rs (22 changes: 17 additions & 5 deletions)

```diff
@@ -5,20 +5,26 @@ use slice_group_by::StrGroupBy;
 /// For instance, "camelCase" is split into ["camel", "Case"].
 /// A camelCase boundary constitutes a lowercase letter directly followed by an uppercase letter
 /// optionally with any number of non-spacing marks in between.
+/// Two consecutive uppercase letters constitute a boundary only if the following letter is lowercase
+/// (eg., "MongoDBError" is split into ["Mongo", "DB", "Error"])
 pub(crate) fn split_camel_case_bounds(str: &str) -> impl Iterator<Item = &str> {
-    let mut last_char_was_lowercase = str.chars().next().map_or(false, |c| c.is_lowercase());
+    let mut peek_char = str.chars().map(|c| c.is_lowercase());
+    let mut last_char_was_lowercase: bool = peek_char.next().unwrap_or_default();
+
+    peek_char.next();
 
     str.linear_group_by(move |_, char| {
+        let peek_char_is_lowercase: bool = peek_char.next().unwrap_or_default();
+
         if char.is_mark_nonspacing() {
             return true;
         }
 
-        if last_char_was_lowercase && char.is_letter_uppercase() {
-            return false;
-        }
+        let should_group =
+            !((last_char_was_lowercase || peek_char_is_lowercase) && char.is_letter_uppercase());
 
         last_char_was_lowercase = char.is_letter_lowercase();
-        true
+        should_group
     })
 }
 
@@ -48,4 +54,10 @@ mod test {
         non_spacing_marks_are_respected
     );
     test_segmentation!("a\u{0301}B", ["a\u{0301}", "B"], non_spacing_mark_after_first_letter);
+    test_segmentation!("openSSL", ["open", "SSL"], consecutive_uppercase_is_not_split);
+    test_segmentation!(
+        "MongoDBDatabase",
+        ["Mongo", "DB", "Database"],
+        last_uppercase_from_non_final_sequence
+    );
 }
```

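As an aside (not part of the diff), here is a minimal sketch of the `slice-group-by` primitive this segmenter is built on, under the assumption that `StrGroupBy::linear_group_by` keeps two adjacent chars in the same group while the predicate returns `true`:

```rust
use slice_group_by::StrGroupBy;

fn main() {
    // The predicate receives the previous and current chars; returning `false`
    // closes the current group and starts a new one at the current char.
    let groups: Vec<&str> = "camelCase"
        .linear_group_by(|previous, current| !(previous.is_lowercase() && current.is_uppercase()))
        .collect();
    assert_eq!(groups, ["camel", "Case"]);
}
```

The peek-ahead iterator in the diff above feeds this same predicate one extra piece of information: whether the character after the current one is lowercase.
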
irg-kvariants/build.rs (5 changes: 4 additions & 1 deletion)

```diff
@@ -1,5 +1,8 @@
+use std::env;
+use std::error::Error;
+use std::path::Path;
+
 use serde::Deserialize;
-use std::{env, error::Error, path::Path};
 
 #[derive(Deserialize)]
 pub struct TsvRow {
```

irg-kvariants/src/lib.rs (3 changes: 2 additions & 1 deletion)

```diff
@@ -1,6 +1,7 @@
+use std::collections::HashMap;
+
 use once_cell::sync::Lazy;
 use serde::Deserialize;
-use std::collections::HashMap;
 
 #[derive(Debug, Eq, PartialEq)]
 pub enum KVariantClass {
```
