Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deduplicate unescape_unicode implementation #307

Merged
merged 2 commits into from
May 5, 2024

Conversation

JasperDeSutter
Copy link
Contributor

Shares the implementation of unescape_unicode and unescape_unicode_to_string. I've added a check to the corresponding test to make sure values that don't need unescaping are returned as Cow::Borrowed.

@gregtatum gregtatum self-requested a review December 21, 2022 15:12
@zbraniecki
Copy link
Collaborator

This PR causes significant perf (26%) regression on unescape_to_str as it removes a bunch of carefully planted optimizations:

/Users/zibi/projects/fluent-rs/fluent-bundle〉cargo criterion unescape --features all-benchmarks
   Compiling fluent-syntax v0.11.0 (/Users/zibi/projects/fluent-rs/fluent-syntax)
   Compiling fluent-bundle v0.15.2 (/Users/zibi/projects/fluent-rs/fluent-bundle)
    Finished bench [optimized] target(s) in 2.24s

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 2 filtered out; finished in 0.00s

Gnuplot not found, using plotters backend
construct/unescape      time:   [497.02 ns 498.05 ns 499.27 ns]
                        change: [-1.3500% -0.9283% -0.4818%] (p = 0.00 < 0.05)
                        Change within noise threshold.

resolve/unescape        time:   [396.24 ns 397.15 ns 398.17 ns]
                        change: [+3.1358% +3.6066% +4.0831%] (p = 0.00 < 0.05)
                        Performance has regressed.

resolve_to_str/unescape time:   [501.87 ns 502.93 ns 503.89 ns]
                        change: [+25.846% +26.437% +27.027%] (p = 0.00 < 0.05)
                        Performance has regressed.

@JasperDeSutter
Copy link
Contributor Author

Thanks for checking the perf impact, I do run the benchmarks but they don't seem too reliable on my laptop. I'm guessing the system load/cpu scaling has more impact than my changes 😄
I've added back the check on line 92, not sure what else could have caused regressions.

@zbraniecki
Copy link
Collaborator

Does't help. It's still a ~25% regression. I skimmed through the code and I don't see a clear reason, but can reproduce it very reliably with cargo-criterion.

@JasperDeSutter
Copy link
Contributor Author

I've optimized the implementation a bit to work away a bounds check from char boundary checking.
It does change the approach quite a bit, by changing the string slice instead of holding a start pointer. Sadly the following is not possible without bringing back the bounds check:
while let Some(ptr) = input.as_bytes().iter().position(|b| *b == b'\\')

@zbraniecki
Copy link
Collaborator

It didn't help. I was able to accomplish what you're proposing here, without perf impact:

//! A set of helper functions for unescaping Fluent unicode escape sequences.
//!
//! # Unicode
//!
//! Fluent supports UTF-8 in all FTL resources, but it also allows
//! unicode sequences to be escaped in [`String
//! Literals`](super::ast::InlineExpression::StringLiteral).
//!
//! Four byte sequences are encoded with `\u` and six byte
//! sqeuences using `\U`.
//! ## Example
//!
//! ```
//! use fluent_syntax::unicode::unescape_unicode_to_string;
//!
//! assert_eq!(
//!     unescape_unicode_to_string("Foo \\u5bd2 Bar"),
//!     "Foo 寒 Bar"
//! );
//!
//! assert_eq!(
//!     unescape_unicode_to_string("Foo \\U01F68A Bar"),
//!     "Foo 🚊 Bar"
//! );
//! ```
//!
//! # Other unescapes
//!
//! This also allows for a char `"` to be present inside an FTL string literal,
//! and for `\` itself to be escaped.
//!
//! ## Example
//!
//! ```
//! use fluent_syntax::unicode::unescape_unicode_to_string;
//!
//! assert_eq!(
//!     unescape_unicode_to_string("Foo \\\" Bar"),
//!     "Foo \" Bar"
//! );
//! assert_eq!(
//!     unescape_unicode_to_string("Foo \\\\ Bar"),
//!     "Foo \\ Bar"
//! );
//! ```
use std::borrow::Cow;
use std::char;
use std::fmt;

const UNKNOWN_CHAR: char = '�';

fn encode_unicode(s: Option<&str>) -> char {
    s.and_then(|s| u32::from_str_radix(s, 16).ok().and_then(char::from_u32))
        .unwrap_or(UNKNOWN_CHAR)
}

fn unescape<W>(w: &mut W, input: &str) -> Result<bool, std::fmt::Error>
where
    W: fmt::Write,
{
    let bytes = input.as_bytes();

    let mut start = 0;
    let mut ptr = 0;

    while let Some(b) = bytes.get(ptr) {
        if b != &b'\\' {
            ptr += 1;
            continue;
        }
        if start != ptr {
            w.write_str(&input[start..ptr])?;
        }

        ptr += 1;

        let new_char = match bytes.get(ptr) {
            Some(b'\\') => '\\',
            Some(b'"') => '"',
            Some(u @ b'u') | Some(u @ b'U') => {
                let seq_start = ptr + 1;
                let len = if u == &b'u' { 4 } else { 6 };
                ptr += len;
                encode_unicode(input.get(seq_start..seq_start + len))
            }
            _ => UNKNOWN_CHAR,
        };
        ptr += 1;
        w.write_char(new_char)?;
        start = ptr;
    }
    if start == 0 {
        return Ok(false);
    }

    if start != ptr {
        w.write_str(&input[start..ptr])?;
    }
    Ok(true)
}

/// Unescapes to a writer without allocating.
///
/// ## Example
///
/// ```
/// use fluent_syntax::unicode::unescape_unicode;
///
/// let mut s = String::new();
/// unescape_unicode(&mut s, "Foo \\U01F60A Bar");
/// assert_eq!(s, "Foo 😊 Bar");
/// ```
pub fn unescape_unicode<W>(w: &mut W, input: &str) -> fmt::Result
where
    W: fmt::Write,
{
    if unescape(w, input)? {
        return Ok(());
    }
    w.write_str(input)
}

/// Unescapes to a `Cow<str>` optionally allocating.
///
/// ## Example
///
/// ```
/// use fluent_syntax::unicode::unescape_unicode_to_string;
///
/// assert_eq!(
///     unescape_unicode_to_string("Foo \\U01F60A Bar"),
///     "Foo 😊 Bar"
/// );
/// ```
pub fn unescape_unicode_to_string(input: &str) -> Cow<str> {
    let mut result = String::new();

    let owned = unescape(&mut result, input).expect("String write methods don't Err");

    if owned {
        Cow::Owned(result)
    } else {
        Cow::Borrowed(input)
    }
}

@JasperDeSutter
Copy link
Contributor Author

Last try on my method! Shouldn't be possible to optimize much further..
I'm not able to spot the differences in the code you posted with my second commit, is it just moving the functions?

Also, I'm not too sure we're benchmarking correctly with unescape.ftl being very small and not being optimized for the more common case of having no escape sequences.

@zbraniecki
Copy link
Collaborator

Your code vs my snippet:

/Users/zibi/projects/fluent-rs/fluent-bundle〉cargo criterion unescape --features all-benchmarks
   Compiling fluent-syntax v0.11.0 (/Users/zibi/projects/fluent-rs/fluent-syntax)
   Compiling fluent-bundle v0.15.2 (/Users/zibi/projects/fluent-rs/fluent-bundle)
    Finished bench [optimized] target(s) in 2.66s

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 2 filtered out; finished in 0.00s

Gnuplot not found, using plotters backend
construct/unescape      time:   [498.08 ns 499.65 ns 501.28 ns]
                        change: [+0.2029% +0.6591% +1.1052%] (p = 0.01 < 0.05)
                        Change within noise threshold.

resolve/unescape        time:   [384.18 ns 385.82 ns 387.91 ns]
                        change: [-0.3062% +0.1708% +0.6949%] (p = 0.51 > 0.05)
                        No change in performance detected.

resolve_to_str/unescape time:   [484.81 ns 486.70 ns 488.78 ns]
                        change: [+20.621% +21.180% +21.755%] (p = 0.00 < 0.05)
                        Performance has regressed.

Unexpected error while launching valgrind. Error: No such file or directory (os error 2)

What is your goal here? You filed a PR to "deduplicate unescape_unicode implementation" and I provided you with deduplicated implementations that preserve performance.

What else are you trying to accomplish in this PR and can you separate it out?

Here's the diff between your version and mine that causes 25% regression:

diff --git a/fluent-syntax/src/unicode.rs b/fluent-syntax/src/unicode.rs
index 137e17e..33f3c95 100644
--- a/fluent-syntax/src/unicode.rs
+++ b/fluent-syntax/src/unicode.rs
@@ -49,10 +49,8 @@ use std::fmt;
 
 const UNKNOWN_CHAR: char = '�';
 
-fn encode_unicode(s: &str) -> char {
-    u32::from_str_radix(s, 16)
-        .ok()
-        .and_then(char::from_u32)
+fn encode_unicode(s: Option<&str>) -> char {
+    s.and_then(|s| u32::from_str_radix(s, 16).ok().and_then(char::from_u32))
         .unwrap_or(UNKNOWN_CHAR)
 }
 
@@ -77,64 +75,47 @@ where
     w.write_str(input)
 }
 
-fn find_escape_sequence<'a, W>(
-    input: &'a str,
-    w: &mut W,
-) -> Result<Option<&'a str>, std::fmt::Error>
+fn unescape<W>(w: &mut W, input: &str) -> Result<bool, std::fmt::Error>
 where
     W: fmt::Write,
 {
-    if input.is_empty() {
-        return Ok(None);
-    }
+    let bytes = input.as_bytes();
 
+    let mut start = 0;
     let mut ptr = 0;
-    while input.as_bytes()[ptr] != b'\\' {
-        ptr += 1;
-        if ptr >= input.len() {
-            return Ok(None);
-        }
-    }
 
-    if ptr > 0 {
-        let (a, b) = input.split_at(ptr);
-        w.write_str(a)?;
-        return Ok(Some(b));
-    }
-    Ok(Some(input))
-}
-
-fn unescape<W>(w: &mut W, mut input: &str) -> Result<bool, std::fmt::Error>
-where
-    W: fmt::Write,
-{
-    let start_len = input.len();
+    while let Some(b) = bytes.get(ptr) {
+        if b != &b'\\' {
+            ptr += 1;
+            continue;
+        }
+        if start != ptr {
+            w.write_str(&input[start..ptr])?;
+        }
 
-    while let Some(sequence) = find_escape_sequence(input, w)? {
-        let mut offset = 2;
+        ptr += 1;
 
-        let new_char = match sequence.as_bytes().get(1) {
+        let new_char = match bytes.get(ptr) {
             Some(b'\\') => '\\',
             Some(b'"') => '"',
             Some(u @ b'u') | Some(u @ b'U') => {
-                let len = if *u == b'u' { 4 } else { 6 };
-                offset += len;
-                sequence.get(2..offset).map_or(UNKNOWN_CHAR, encode_unicode)
+                let seq_start = ptr + 1;
+                let len = if u == &b'u' { 4 } else { 6 };
+                ptr += len;
+                encode_unicode(input.get(seq_start..seq_start + len))
             }
             _ => UNKNOWN_CHAR,
         };
+        ptr += 1;
         w.write_char(new_char)?;
-
-        input = sequence.get(offset..).unwrap_or_default();
+        start = ptr;
     }
-
-    if input.len() == start_len {
-        // nothing was unescaped
+    if start == 0 {
         return Ok(false);
     }
 
-    if !input.is_empty() {
-        w.write_str(input)?;
+    if start != ptr {
+        w.write_str(&input[start..ptr])?;
     }
     Ok(true)
 }

I'm not able to spot the differences in the code you posted with my second commit, is it just moving the functions?

I don't think it is. I suspect some parts of your logic are falling off the cliff. We can keep debugging, but I'd like to understand what is the goal.

Also, I'm not too sure we're benchmarking correctly with unescape.ftl being very small and not being optimized for the more common case of having no escape sequences.

I strongly encourage you to not mix algo changes with benchmark evaluation. It's tempting to adjust the benchmark to show what you want, but that's precisely against what it's here for.
I also happen to disagree with you - the benchmark is testing exactly what we want to test with heavier weight on unescaping because that's costly, while pass-through should be as close to no-op as possible.

To sum it up, I recommend:

  1. Take my snippet, update PR, let's land this.
  2. Start a new PR where you can fiddle with additional changes and justify them please.
  3. If you believe the benchmark is wrong, file an issue/PR and discuss it there.

@zbraniecki
Copy link
Collaborator

zbraniecki commented Dec 22, 2022

I kept digging a bit more into it - it's actually a red-herring.

The changes between your code and mine are not impacting perf, but your branch is not up to date with main:

diff --git a/fluent-bundle/src/bundle.rs b/fluent-bundle/src/bundle.rs
index 63b00b2..0b74c22 100644
--- a/fluent-bundle/src/bundle.rs
+++ b/fluent-bundle/src/bundle.rs
@@ -494,7 +494,7 @@ impl<R, M> FluentBundle<R, M> {
     {
         let mut scope = Scope::new(self, args, Some(errors));
         let value = pattern.resolve(&mut scope);
-        value.as_string(&scope)
+        value.into_string(&scope)
     }
     /// Makes the provided rust function available to messages with the name `id`. See

This change is 100% responsible for the resolve_to_str/unescape regression.

Please, rebase on top of main!

I still think it's better to separate changes into smaller PRs so recommendation stands, but now you don't have to worry about your further changes affecting perf anymore!

@JasperDeSutter
Copy link
Contributor Author

That explains why I wasn't seeing the same performance impact, I was comparing to main without that into_string commit included.. I've removed the optimization commit.

Copy link
Collaborator

@zbraniecki zbraniecki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! This looks good to me and I confirmed that it doesn't affect performance!

Let's wait for @gregtatum to review and decide on the PR.

@gregtatum
Copy link
Member

This is still on my list to look at, I only had time today to look at some smaller PRs.

@gregtatum gregtatum removed their request for review August 4, 2023 15:44
@alerque alerque force-pushed the dedupe-unescape branch from f7d7057 to b17e47f Compare May 5, 2024 21:49
Copy link
Collaborator

@alerque alerque left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rebase on this was just to tweak commit messages so I can use conventional-commit tooling to generate changelogs for future releases.

Thanks for the contribution and especially tests!

@alerque alerque merged commit 466e100 into projectfluent:main May 5, 2024
5 checks passed
@JasperDeSutter JasperDeSutter deleted the dedupe-unescape branch May 6, 2024 04:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants