-
Notifications
You must be signed in to change notification settings - Fork 98
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deduplicate unescape_unicode implementation #307
Conversation
This PR causes significant perf (26%) regression on
|
8820caf
to
bc8f258
Compare
Thanks for checking the perf impact, I do run the benchmarks but they don't seem too reliable on my laptop. I'm guessing the system load/cpu scaling has more impact than my changes 😄 |
Does't help. It's still a ~25% regression. I skimmed through the code and I don't see a clear reason, but can reproduce it very reliably with cargo-criterion. |
I've optimized the implementation a bit to work away a bounds check from char boundary checking. |
It didn't help. I was able to accomplish what you're proposing here, without perf impact: //! A set of helper functions for unescaping Fluent unicode escape sequences.
//!
//! # Unicode
//!
//! Fluent supports UTF-8 in all FTL resources, but it also allows
//! unicode sequences to be escaped in [`String
//! Literals`](super::ast::InlineExpression::StringLiteral).
//!
//! Four byte sequences are encoded with `\u` and six byte
//! sqeuences using `\U`.
//! ## Example
//!
//! ```
//! use fluent_syntax::unicode::unescape_unicode_to_string;
//!
//! assert_eq!(
//! unescape_unicode_to_string("Foo \\u5bd2 Bar"),
//! "Foo 寒 Bar"
//! );
//!
//! assert_eq!(
//! unescape_unicode_to_string("Foo \\U01F68A Bar"),
//! "Foo 🚊 Bar"
//! );
//! ```
//!
//! # Other unescapes
//!
//! This also allows for a char `"` to be present inside an FTL string literal,
//! and for `\` itself to be escaped.
//!
//! ## Example
//!
//! ```
//! use fluent_syntax::unicode::unescape_unicode_to_string;
//!
//! assert_eq!(
//! unescape_unicode_to_string("Foo \\\" Bar"),
//! "Foo \" Bar"
//! );
//! assert_eq!(
//! unescape_unicode_to_string("Foo \\\\ Bar"),
//! "Foo \\ Bar"
//! );
//! ```
use std::borrow::Cow;
use std::char;
use std::fmt;
const UNKNOWN_CHAR: char = '�';
fn encode_unicode(s: Option<&str>) -> char {
s.and_then(|s| u32::from_str_radix(s, 16).ok().and_then(char::from_u32))
.unwrap_or(UNKNOWN_CHAR)
}
fn unescape<W>(w: &mut W, input: &str) -> Result<bool, std::fmt::Error>
where
W: fmt::Write,
{
let bytes = input.as_bytes();
let mut start = 0;
let mut ptr = 0;
while let Some(b) = bytes.get(ptr) {
if b != &b'\\' {
ptr += 1;
continue;
}
if start != ptr {
w.write_str(&input[start..ptr])?;
}
ptr += 1;
let new_char = match bytes.get(ptr) {
Some(b'\\') => '\\',
Some(b'"') => '"',
Some(u @ b'u') | Some(u @ b'U') => {
let seq_start = ptr + 1;
let len = if u == &b'u' { 4 } else { 6 };
ptr += len;
encode_unicode(input.get(seq_start..seq_start + len))
}
_ => UNKNOWN_CHAR,
};
ptr += 1;
w.write_char(new_char)?;
start = ptr;
}
if start == 0 {
return Ok(false);
}
if start != ptr {
w.write_str(&input[start..ptr])?;
}
Ok(true)
}
/// Unescapes to a writer without allocating.
///
/// ## Example
///
/// ```
/// use fluent_syntax::unicode::unescape_unicode;
///
/// let mut s = String::new();
/// unescape_unicode(&mut s, "Foo \\U01F60A Bar");
/// assert_eq!(s, "Foo 😊 Bar");
/// ```
pub fn unescape_unicode<W>(w: &mut W, input: &str) -> fmt::Result
where
W: fmt::Write,
{
if unescape(w, input)? {
return Ok(());
}
w.write_str(input)
}
/// Unescapes to a `Cow<str>` optionally allocating.
///
/// ## Example
///
/// ```
/// use fluent_syntax::unicode::unescape_unicode_to_string;
///
/// assert_eq!(
/// unescape_unicode_to_string("Foo \\U01F60A Bar"),
/// "Foo 😊 Bar"
/// );
/// ```
pub fn unescape_unicode_to_string(input: &str) -> Cow<str> {
let mut result = String::new();
let owned = unescape(&mut result, input).expect("String write methods don't Err");
if owned {
Cow::Owned(result)
} else {
Cow::Borrowed(input)
}
} |
8bf33fe
to
0e09127
Compare
Last try on my method! Shouldn't be possible to optimize much further.. Also, I'm not too sure we're benchmarking correctly with unescape.ftl being very small and not being optimized for the more common case of having no escape sequences. |
Your code vs my snippet:
What is your goal here? You filed a PR to "deduplicate unescape_unicode implementation" and I provided you with deduplicated implementations that preserve performance. What else are you trying to accomplish in this PR and can you separate it out? Here's the diff between your version and mine that causes 25% regression: diff --git a/fluent-syntax/src/unicode.rs b/fluent-syntax/src/unicode.rs
index 137e17e..33f3c95 100644
--- a/fluent-syntax/src/unicode.rs
+++ b/fluent-syntax/src/unicode.rs
@@ -49,10 +49,8 @@ use std::fmt;
const UNKNOWN_CHAR: char = '�';
-fn encode_unicode(s: &str) -> char {
- u32::from_str_radix(s, 16)
- .ok()
- .and_then(char::from_u32)
+fn encode_unicode(s: Option<&str>) -> char {
+ s.and_then(|s| u32::from_str_radix(s, 16).ok().and_then(char::from_u32))
.unwrap_or(UNKNOWN_CHAR)
}
@@ -77,64 +75,47 @@ where
w.write_str(input)
}
-fn find_escape_sequence<'a, W>(
- input: &'a str,
- w: &mut W,
-) -> Result<Option<&'a str>, std::fmt::Error>
+fn unescape<W>(w: &mut W, input: &str) -> Result<bool, std::fmt::Error>
where
W: fmt::Write,
{
- if input.is_empty() {
- return Ok(None);
- }
+ let bytes = input.as_bytes();
+ let mut start = 0;
let mut ptr = 0;
- while input.as_bytes()[ptr] != b'\\' {
- ptr += 1;
- if ptr >= input.len() {
- return Ok(None);
- }
- }
- if ptr > 0 {
- let (a, b) = input.split_at(ptr);
- w.write_str(a)?;
- return Ok(Some(b));
- }
- Ok(Some(input))
-}
-
-fn unescape<W>(w: &mut W, mut input: &str) -> Result<bool, std::fmt::Error>
-where
- W: fmt::Write,
-{
- let start_len = input.len();
+ while let Some(b) = bytes.get(ptr) {
+ if b != &b'\\' {
+ ptr += 1;
+ continue;
+ }
+ if start != ptr {
+ w.write_str(&input[start..ptr])?;
+ }
- while let Some(sequence) = find_escape_sequence(input, w)? {
- let mut offset = 2;
+ ptr += 1;
- let new_char = match sequence.as_bytes().get(1) {
+ let new_char = match bytes.get(ptr) {
Some(b'\\') => '\\',
Some(b'"') => '"',
Some(u @ b'u') | Some(u @ b'U') => {
- let len = if *u == b'u' { 4 } else { 6 };
- offset += len;
- sequence.get(2..offset).map_or(UNKNOWN_CHAR, encode_unicode)
+ let seq_start = ptr + 1;
+ let len = if u == &b'u' { 4 } else { 6 };
+ ptr += len;
+ encode_unicode(input.get(seq_start..seq_start + len))
}
_ => UNKNOWN_CHAR,
};
+ ptr += 1;
w.write_char(new_char)?;
-
- input = sequence.get(offset..).unwrap_or_default();
+ start = ptr;
}
-
- if input.len() == start_len {
- // nothing was unescaped
+ if start == 0 {
return Ok(false);
}
- if !input.is_empty() {
- w.write_str(input)?;
+ if start != ptr {
+ w.write_str(&input[start..ptr])?;
}
Ok(true)
}
I don't think it is. I suspect some parts of your logic are falling off the cliff. We can keep debugging, but I'd like to understand what is the goal.
I strongly encourage you to not mix algo changes with benchmark evaluation. It's tempting to adjust the benchmark to show what you want, but that's precisely against what it's here for. To sum it up, I recommend:
|
I kept digging a bit more into it - it's actually a red-herring. The changes between your code and mine are not impacting perf, but your branch is not up to date with diff --git a/fluent-bundle/src/bundle.rs b/fluent-bundle/src/bundle.rs
index 63b00b2..0b74c22 100644
--- a/fluent-bundle/src/bundle.rs
+++ b/fluent-bundle/src/bundle.rs
@@ -494,7 +494,7 @@ impl<R, M> FluentBundle<R, M> {
{
let mut scope = Scope::new(self, args, Some(errors));
let value = pattern.resolve(&mut scope);
- value.as_string(&scope)
+ value.into_string(&scope)
}
/// Makes the provided rust function available to messages with the name `id`. See This change is 100% responsible for the Please, rebase on top of main! I still think it's better to separate changes into smaller PRs so recommendation stands, but now you don't have to worry about your further changes affecting perf anymore! |
0e09127
to
f7d7057
Compare
That explains why I wasn't seeing the same performance impact, I was comparing to main without that into_string commit included.. I've removed the optimization commit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks! This looks good to me and I confirmed that it doesn't affect performance!
Let's wait for @gregtatum to review and decide on the PR.
This is still on my list to look at, I only had time today to look at some smaller PRs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rebase on this was just to tweak commit messages so I can use conventional-commit tooling to generate changelogs for future releases.
Thanks for the contribution and especially tests!
Shares the implementation of unescape_unicode and unescape_unicode_to_string. I've added a check to the corresponding test to make sure values that don't need unescaping are returned as Cow::Borrowed.