Deduplicate unescape_unicode implementation #307

JasperDeSutter · 2022-12-21T14:46:23Z

Shares the implementation of unescape_unicode and unescape_unicode_to_string. I've added a check to the corresponding test to make sure values that don't need unescaping are returned as Cow::Borrowed.

zbraniecki · 2022-12-21T15:49:14Z

This PR causes significant perf (26%) regression on unescape_to_str as it removes a bunch of carefully planted optimizations:

/Users/zibi/projects/fluent-rs/fluent-bundle〉cargo criterion unescape --features all-benchmarks
   Compiling fluent-syntax v0.11.0 (/Users/zibi/projects/fluent-rs/fluent-syntax)
   Compiling fluent-bundle v0.15.2 (/Users/zibi/projects/fluent-rs/fluent-bundle)
    Finished bench [optimized] target(s) in 2.24s

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 2 filtered out; finished in 0.00s

Gnuplot not found, using plotters backend
construct/unescape      time:   [497.02 ns 498.05 ns 499.27 ns]
                        change: [-1.3500% -0.9283% -0.4818%] (p = 0.00 < 0.05)
                        Change within noise threshold.

resolve/unescape        time:   [396.24 ns 397.15 ns 398.17 ns]
                        change: [+3.1358% +3.6066% +4.0831%] (p = 0.00 < 0.05)
                        Performance has regressed.

resolve_to_str/unescape time:   [501.87 ns 502.93 ns 503.89 ns]
                        change: [+25.846% +26.437% +27.027%] (p = 0.00 < 0.05)
                        Performance has regressed.

JasperDeSutter · 2022-12-21T16:25:59Z

Thanks for checking the perf impact, I do run the benchmarks but they don't seem too reliable on my laptop. I'm guessing the system load/cpu scaling has more impact than my changes 😄
I've added back the check on line 92, not sure what else could have caused regressions.

zbraniecki · 2022-12-21T16:41:57Z

Does't help. It's still a ~25% regression. I skimmed through the code and I don't see a clear reason, but can reproduce it very reliably with cargo-criterion.

JasperDeSutter · 2022-12-21T19:27:22Z

I've optimized the implementation a bit to work away a bounds check from char boundary checking.
It does change the approach quite a bit, by changing the string slice instead of holding a start pointer. Sadly the following is not possible without bringing back the bounds check:
while let Some(ptr) = input.as_bytes().iter().position(|b| *b == b'\\')

zbraniecki · 2022-12-21T22:02:03Z

It didn't help. I was able to accomplish what you're proposing here, without perf impact:

//! A set of helper functions for unescaping Fluent unicode escape sequences.
//!
//! # Unicode
//!
//! Fluent supports UTF-8 in all FTL resources, but it also allows
//! unicode sequences to be escaped in [`String
//! Literals`](super::ast::InlineExpression::StringLiteral).
//!
//! Four byte sequences are encoded with `\u` and six byte
//! sqeuences using `\U`.
//! ## Example
//!
//! ```
//! use fluent_syntax::unicode::unescape_unicode_to_string;
//!
//! assert_eq!(
//!     unescape_unicode_to_string("Foo \\u5bd2 Bar"),
//!     "Foo 寒 Bar"
//! );
//!
//! assert_eq!(
//!     unescape_unicode_to_string("Foo \\U01F68A Bar"),
//!     "Foo 🚊 Bar"
//! );
//! ```
//!
//! # Other unescapes
//!
//! This also allows for a char `"` to be present inside an FTL string literal,
//! and for `\` itself to be escaped.
//!
//! ## Example
//!
//! ```
//! use fluent_syntax::unicode::unescape_unicode_to_string;
//!
//! assert_eq!(
//!     unescape_unicode_to_string("Foo \\\" Bar"),
//!     "Foo \" Bar"
//! );
//! assert_eq!(
//!     unescape_unicode_to_string("Foo \\\\ Bar"),
//!     "Foo \\ Bar"
//! );
//! ```
use std::borrow::Cow;
use std::char;
use std::fmt;

const UNKNOWN_CHAR: char = '�';

fn encode_unicode(s: Option<&str>) -> char {
    s.and_then(|s| u32::from_str_radix(s, 16).ok().and_then(char::from_u32))
        .unwrap_or(UNKNOWN_CHAR)
}

fn unescape<W>(w: &mut W, input: &str) -> Result<bool, std::fmt::Error>
where
    W: fmt::Write,
{
    let bytes = input.as_bytes();

    let mut start = 0;
    let mut ptr = 0;

    while let Some(b) = bytes.get(ptr) {
        if b != &b'\\' {
            ptr += 1;
            continue;
        }
        if start != ptr {
            w.write_str(&input[start..ptr])?;
        }

        ptr += 1;

        let new_char = match bytes.get(ptr) {
            Some(b'\\') => '\\',
            Some(b'"') => '"',
            Some(u @ b'u') | Some(u @ b'U') => {
                let seq_start = ptr + 1;
                let len = if u == &b'u' { 4 } else { 6 };
                ptr += len;
                encode_unicode(input.get(seq_start..seq_start + len))
            }
            _ => UNKNOWN_CHAR,
        };
        ptr += 1;
        w.write_char(new_char)?;
        start = ptr;
    }
    if start == 0 {
        return Ok(false);
    }

    if start != ptr {
        w.write_str(&input[start..ptr])?;
    }
    Ok(true)
}

/// Unescapes to a writer without allocating.
///
/// ## Example
///
/// ```
/// use fluent_syntax::unicode::unescape_unicode;
///
/// let mut s = String::new();
/// unescape_unicode(&mut s, "Foo \\U01F60A Bar");
/// assert_eq!(s, "Foo 😊 Bar");
/// ```
pub fn unescape_unicode<W>(w: &mut W, input: &str) -> fmt::Result
where
    W: fmt::Write,
{
    if unescape(w, input)? {
        return Ok(());
    }
    w.write_str(input)
}

/// Unescapes to a `Cow<str>` optionally allocating.
///
/// ## Example
///
/// ```
/// use fluent_syntax::unicode::unescape_unicode_to_string;
///
/// assert_eq!(
///     unescape_unicode_to_string("Foo \\U01F60A Bar"),
///     "Foo 😊 Bar"
/// );
/// ```
pub fn unescape_unicode_to_string(input: &str) -> Cow<str> {
    let mut result = String::new();

    let owned = unescape(&mut result, input).expect("String write methods don't Err");

    if owned {
        Cow::Owned(result)
    } else {
        Cow::Borrowed(input)
    }
}

JasperDeSutter · 2022-12-22T11:23:58Z

Last try on my method! Shouldn't be possible to optimize much further..
I'm not able to spot the differences in the code you posted with my second commit, is it just moving the functions?

Also, I'm not too sure we're benchmarking correctly with unescape.ftl being very small and not being optimized for the more common case of having no escape sequences.

zbraniecki · 2022-12-22T12:49:14Z

Your code vs my snippet:

/Users/zibi/projects/fluent-rs/fluent-bundle〉cargo criterion unescape --features all-benchmarks
   Compiling fluent-syntax v0.11.0 (/Users/zibi/projects/fluent-rs/fluent-syntax)
   Compiling fluent-bundle v0.15.2 (/Users/zibi/projects/fluent-rs/fluent-bundle)
    Finished bench [optimized] target(s) in 2.66s

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 2 filtered out; finished in 0.00s

Gnuplot not found, using plotters backend
construct/unescape      time:   [498.08 ns 499.65 ns 501.28 ns]
                        change: [+0.2029% +0.6591% +1.1052%] (p = 0.01 < 0.05)
                        Change within noise threshold.

resolve/unescape        time:   [384.18 ns 385.82 ns 387.91 ns]
                        change: [-0.3062% +0.1708% +0.6949%] (p = 0.51 > 0.05)
                        No change in performance detected.

resolve_to_str/unescape time:   [484.81 ns 486.70 ns 488.78 ns]
                        change: [+20.621% +21.180% +21.755%] (p = 0.00 < 0.05)
                        Performance has regressed.

Unexpected error while launching valgrind. Error: No such file or directory (os error 2)

What is your goal here? You filed a PR to "deduplicate unescape_unicode implementation" and I provided you with deduplicated implementations that preserve performance.

What else are you trying to accomplish in this PR and can you separate it out?

Here's the diff between your version and mine that causes 25% regression:

diff --git a/fluent-syntax/src/unicode.rs b/fluent-syntax/src/unicode.rs
index 137e17e..33f3c95 100644
--- a/fluent-syntax/src/unicode.rs
+++ b/fluent-syntax/src/unicode.rs
@@ -49,10 +49,8 @@ use std::fmt;
 
 const UNKNOWN_CHAR: char = '�';
 
-fn encode_unicode(s: &str) -> char {
-    u32::from_str_radix(s, 16)
-        .ok()
-        .and_then(char::from_u32)
+fn encode_unicode(s: Option<&str>) -> char {
+    s.and_then(|s| u32::from_str_radix(s, 16).ok().and_then(char::from_u32))
         .unwrap_or(UNKNOWN_CHAR)
 }
 
@@ -77,64 +75,47 @@ where
     w.write_str(input)
 }
 
-fn find_escape_sequence<'a, W>(
-    input: &'a str,
-    w: &mut W,
-) -> Result<Option<&'a str>, std::fmt::Error>
+fn unescape<W>(w: &mut W, input: &str) -> Result<bool, std::fmt::Error>
 where
     W: fmt::Write,
 {
-    if input.is_empty() {
-        return Ok(None);
-    }
+    let bytes = input.as_bytes();
 
+    let mut start = 0;
     let mut ptr = 0;
-    while input.as_bytes()[ptr] != b'\\' {
-        ptr += 1;
-        if ptr >= input.len() {
-            return Ok(None);
-        }
-    }
 
-    if ptr > 0 {
-        let (a, b) = input.split_at(ptr);
-        w.write_str(a)?;
-        return Ok(Some(b));
-    }
-    Ok(Some(input))
-}
-
-fn unescape<W>(w: &mut W, mut input: &str) -> Result<bool, std::fmt::Error>
-where
-    W: fmt::Write,
-{
-    let start_len = input.len();
+    while let Some(b) = bytes.get(ptr) {
+        if b != &b'\\' {
+            ptr += 1;
+            continue;
+        }
+        if start != ptr {
+            w.write_str(&input[start..ptr])?;
+        }
 
-    while let Some(sequence) = find_escape_sequence(input, w)? {
-        let mut offset = 2;
+        ptr += 1;
 
-        let new_char = match sequence.as_bytes().get(1) {
+        let new_char = match bytes.get(ptr) {
             Some(b'\\') => '\\',
             Some(b'"') => '"',
             Some(u @ b'u') | Some(u @ b'U') => {
-                let len = if *u == b'u' { 4 } else { 6 };
-                offset += len;
-                sequence.get(2..offset).map_or(UNKNOWN_CHAR, encode_unicode)
+                let seq_start = ptr + 1;
+                let len = if u == &b'u' { 4 } else { 6 };
+                ptr += len;
+                encode_unicode(input.get(seq_start..seq_start + len))
             }
             _ => UNKNOWN_CHAR,
         };
+        ptr += 1;
         w.write_char(new_char)?;
-
-        input = sequence.get(offset..).unwrap_or_default();
+        start = ptr;
     }
-
-    if input.len() == start_len {
-        // nothing was unescaped
+    if start == 0 {
         return Ok(false);
     }
 
-    if !input.is_empty() {
-        w.write_str(input)?;
+    if start != ptr {
+        w.write_str(&input[start..ptr])?;
     }
     Ok(true)
 }

I'm not able to spot the differences in the code you posted with my second commit, is it just moving the functions?

I don't think it is. I suspect some parts of your logic are falling off the cliff. We can keep debugging, but I'd like to understand what is the goal.

Also, I'm not too sure we're benchmarking correctly with unescape.ftl being very small and not being optimized for the more common case of having no escape sequences.

I strongly encourage you to not mix algo changes with benchmark evaluation. It's tempting to adjust the benchmark to show what you want, but that's precisely against what it's here for.
I also happen to disagree with you - the benchmark is testing exactly what we want to test with heavier weight on unescaping because that's costly, while pass-through should be as close to no-op as possible.

To sum it up, I recommend:

Take my snippet, update PR, let's land this.
Start a new PR where you can fiddle with additional changes and justify them please.
If you believe the benchmark is wrong, file an issue/PR and discuss it there.

zbraniecki · 2022-12-22T13:14:08Z

I kept digging a bit more into it - it's actually a red-herring.

The changes between your code and mine are not impacting perf, but your branch is not up to date with main:

diff --git a/fluent-bundle/src/bundle.rs b/fluent-bundle/src/bundle.rs
index 63b00b2..0b74c22 100644
--- a/fluent-bundle/src/bundle.rs
+++ b/fluent-bundle/src/bundle.rs
@@ -494,7 +494,7 @@ impl<R, M> FluentBundle<R, M> {
     {
         let mut scope = Scope::new(self, args, Some(errors));
         let value = pattern.resolve(&mut scope);
-        value.as_string(&scope)
+        value.into_string(&scope)
     }
     /// Makes the provided rust function available to messages with the name `id`. See

This change is 100% responsible for the resolve_to_str/unescape regression.

Please, rebase on top of main!

I still think it's better to separate changes into smaller PRs so recommendation stands, but now you don't have to worry about your further changes affecting perf anymore!

JasperDeSutter · 2022-12-22T13:27:04Z

That explains why I wasn't seeing the same performance impact, I was comparing to main without that into_string commit included.. I've removed the optimization commit.

zbraniecki

thanks! This looks good to me and I confirmed that it doesn't affect performance!

Let's wait for @gregtatum to review and decide on the PR.

gregtatum · 2023-01-25T21:24:36Z

This is still on my list to look at, I only had time today to look at some smaller PRs.

alerque

The rebase on this was just to tweak commit messages so I can use conventional-commit tooling to generate changelogs for future releases.

Thanks for the contribution and especially tests!

gregtatum self-requested a review December 21, 2022 15:12

JasperDeSutter force-pushed the dedupe-unescape branch from 8820caf to bc8f258 Compare December 21, 2022 16:24

JasperDeSutter force-pushed the dedupe-unescape branch from 8bf33fe to 0e09127 Compare December 22, 2022 10:59

JasperDeSutter force-pushed the dedupe-unescape branch from 0e09127 to f7d7057 Compare December 22, 2022 13:21

zbraniecki approved these changes Dec 22, 2022

View reviewed changes

gregtatum removed their request for review August 4, 2023 15:44

JasperDeSutter added 2 commits May 6, 2024 00:47

test: Test ownership in unescape_unicode_test

0b13310

refator: Deduplicate implementation of unescape_unicode

b17e47f

alerque force-pushed the dedupe-unescape branch from f7d7057 to b17e47f Compare May 5, 2024 21:49

alerque approved these changes May 5, 2024

View reviewed changes

alerque merged commit 466e100 into projectfluent:main May 5, 2024
5 checks passed

JasperDeSutter deleted the dedupe-unescape branch May 6, 2024 04:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Deduplicate unescape_unicode implementation #307

Deduplicate unescape_unicode implementation #307

JasperDeSutter commented Dec 21, 2022

zbraniecki commented Dec 21, 2022

JasperDeSutter commented Dec 21, 2022

zbraniecki commented Dec 21, 2022

JasperDeSutter commented Dec 21, 2022

zbraniecki commented Dec 21, 2022

JasperDeSutter commented Dec 22, 2022

zbraniecki commented Dec 22, 2022

zbraniecki commented Dec 22, 2022 •

edited

Loading

JasperDeSutter commented Dec 22, 2022

zbraniecki left a comment

gregtatum commented Jan 25, 2023

alerque left a comment

Deduplicate unescape_unicode implementation #307

Deduplicate unescape_unicode implementation #307

Conversation

JasperDeSutter commented Dec 21, 2022

zbraniecki commented Dec 21, 2022

JasperDeSutter commented Dec 21, 2022

zbraniecki commented Dec 21, 2022

JasperDeSutter commented Dec 21, 2022

zbraniecki commented Dec 21, 2022

JasperDeSutter commented Dec 22, 2022

zbraniecki commented Dec 22, 2022

zbraniecki commented Dec 22, 2022 • edited Loading

JasperDeSutter commented Dec 22, 2022

zbraniecki left a comment

Choose a reason for hiding this comment

gregtatum commented Jan 25, 2023

alerque left a comment

Choose a reason for hiding this comment

zbraniecki commented Dec 22, 2022 •

edited

Loading