This repository has been archived by the owner on Feb 16, 2024. It is now read-only.

Unicode-aware \w, \d, and \b ? #16

Closed
markusicu opened this issue Mar 18, 2021 · 20 comments


@markusicu
Collaborator

@mathiasbynens wrote in #2 (comment) :

One other thing we could include in this new flag is Unicode-aware \w, \d, and \b. I originally proposed this to be part of the u flag but it was rejected out of fear it would hurt adoption of the u flag. tc39/proposal-regexp-unicode-property-escapes#22 (comment)

We also could take it one step at a time, and ban \w, \d, and \b with the new flag for now, and then decide on their behavior later.

@markusicu
Collaborator Author

markusicu commented Mar 18, 2021

One other thing we could include in this new flag is Unicode-aware \w, \d, and \b.

sgtm

UTS #18 has long had recommendations for these: https://www.unicode.org/reports/tr18/#Compatibility_Properties

We also could take it one step at a time, and ban \w, \d, and \b with the new flag for now, and then decide on their behavior later.

Either works for me.

@markusicu
Collaborator Author

markusicu commented Jul 8, 2021

Cc stage 3 reviewers @waldemarhorwat @gibson042 @msaboff

One other thing we could include in this new flag is Unicode-aware \w, \d, and \b.

sgtm

UTS #18 has long had recommendations for these: https://www.unicode.org/reports/tr18/#Compatibility_Properties

For the ES regex definitions, see https://tc39.es/ecma262/multipage/text-processing.html#sec-characterclassescape

Details

\d

  • ES regex: [0-9]
  • UTS #18: \p{gc=Decimal_Number}
  • Unicode comment: Non-decimal numbers (like Roman numerals) are normally excluded.

\w

  • ES regex: [0-9a-zA-Z_] plus, if IgnoreCase, characters whose Simple_Case_Folding is in that set
  • UTS #18: \p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}
  • Unicode comment: This is only an approximation to Word Boundaries (see b below). The Connector Punctuation is added in for programming language identifiers, thus adding "_" and similar characters.
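As a rough illustration of the gap between the ES and UTS #18 definitions listed above, the sketch below compares the current shorthands with hand-written classes built from property escapes. The names `uts18Digit`, `uts18Word`, and `classify` are illustrative only, and the classes are an approximation of the UTS #18 definitions written with today's `u` flag, not anything this proposal specifies.

```javascript
// Hypothetical stand-ins for the UTS #18 definitions (illustrative names).
const uts18Digit = /^\p{Decimal_Number}$/u;
const uts18Word =
  /^[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]$/u;

// Current ES shorthands, which are ASCII-only.
const asciiDigit = /^\d$/u;
const asciiWord = /^\w$/u;

// Report how a single character is classified under each definition.
function classify(ch) {
  return {
    d: asciiDigit.test(ch),
    uts18d: uts18Digit.test(ch),
    w: asciiWord.test(ch),
    uts18w: uts18Word.test(ch),
  };
}

const arabicThree = classify("\u0663"); // ARABIC-INDIC DIGIT THREE (gc=Nd)
const romanFour = classify("\u2163");   // ROMAN NUMERAL FOUR (gc=Nl, not a decimal digit)
const eAcute = classify("\u00E9");      // LATIN SMALL LETTER E WITH ACUTE

console.log({ arabicThree, romanFour, eAcute });
```

Under these assumptions, the Arabic-Indic digit and the accented letter are only recognized by the property-escape classes, while the Roman numeral stays outside the digit class in both.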

\b

We also could take it one step at a time, and ban \w, \d, and \b with the new flag for now, and then decide on their behavior later.

Either works for me.

I suggest that we use the Unicode definitions for \d and \w, and forbid \b for now. It seems like \b would need a lot more thought and discussion than the others.

@mathiasbynens
Member

I suggest that we use the Unicode definitions for \d and \w, and forbid \b for now. It seems like \b would need a lot more thought and discussion than the others.

SGTM

@macchiati
Collaborator

macchiati commented Jul 13, 2021 via email

@markusicu
Collaborator Author

What about having it aligned with \w, but allow for \b curly brace...

I am ok with that. It looks like that might be what Java does:
https://docs.oracle.com/en/java/javase/15/docs/api/java.base/java/util/regex/Pattern.html

Construct Matches
\b A word boundary
\b{g} A Unicode extended grapheme cluster boundary

Probably related to the syntax suggested by https://www.unicode.org/reports/tr18/#Default_Grapheme_Clusters

We should check that \b{g} throws a SyntaxError so that it can be added later.

@macchiati
Collaborator

macchiati commented Jul 13, 2021 via email

@gibson042

This seems like a further expansion of scope, specifically relating to https://www.unicode.org/reports/tr18/#RL1.4 ... do you want to recharacterize the proposal as something like UTS 18 regular expressions and primarily ensure that "provision should be made for the syntax to be extended in the future to encompass [support for higher-level features]"?

@markusicu
Collaborator Author

This seems like a further expansion of scope

It is an expansion of scope, which I like to resist, but Mathias is right to point out a backlog of things that would require incompatible changes, and while we are at it with a new flag, we should discuss what makes sense (and is sufficiently low-risk) to include, vs. where we can and should grease the skids with SyntaxErrors to make reasonable future extensions possible without requiring yet another new flag.

I don't think we want to advertise this proposal as everything that UTS18 suggests, but I welcome the discussion. Let's just remember that at some point soon we do want to produce a real spec and go to stage 3 and into implementations :-)

@gibson042

gibson042 commented Jul 14, 2021

I think we are in agreement. I don't want to expand this proposal to "all of UTS 18", but I do want to ensure that sufficient accommodations are made for future proposals to do so with minimal friction (ideally without introducing any more flags) and also to be clear about that in the description of this proposal (which seems to be transforming from "set notation" into something like "UTS18 compatibility").

And I'll further note that it is actually close to achieving UTS 18 Level 1, although I do see gaps relating to both RL1.4 (this issue) and RL1.6 (recognition of line boundaries, e.g. /./v.test("\u0085") would need to be false and /^\n/vm.test("\r\n") might need to be as well).

EDIT: see also #37
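For reference, the line-boundary examples above can be probed as follows. This is a minimal sketch that uses the existing `u` flag as a stand-in, on the assumption that the new flag leaves the set of line terminators unchanged; the variable names are illustrative.

```javascript
// ECMAScript treats only LF, CR, U+2028 and U+2029 as line terminators.
// U+0085 (NEXT LINE) is not in that set, so `.` currently matches it.
const dotMatchesNel = /./u.test("\u0085");

// U+2028 (LINE SEPARATOR) is a line terminator, so `.` does not match it.
const dotMatchesLs = /./u.test("\u2028");

// In multiline mode, `^` matches right after the CR of a CRLF pair,
// so the pattern currently finds a "line" that starts between CR and LF.
const caretBetweenCrLf = /^\n/mu.test("\r\n");

console.log({ dotMatchesNel, dotMatchesLs, caretBetweenCrLf });
```

Full RL1.6 conformance would need the first and third results to change, which is the gap described above.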

@markusicu
Collaborator Author

It looks to me like \b{g} is currently a SyntaxError because the curly braces are SyntaxCharacter and thus not PatternCharacter, which means that they have to be escaped except in constructs where they are part of the syntax. So we could punt on that.
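The reasoning above can be checked with a small sketch. The helper name `compilesWithU` is illustrative, and the `u` flag is used here as a stand-in on the assumption that the new flag keeps the same rule for unescaped braces.

```javascript
// Returns true if the pattern source compiles under the u flag,
// false if the engine rejects it with a SyntaxError.
function compilesWithU(source) {
  try {
    new RegExp(source, "u");
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err; // anything else is unexpected
  }
}

const results = {
  wordBoundary: compilesWithU("\\b"),        // plain \b is fine
  graphemeBoundary: compilesWithU("\\b{g}"), // braces after \b are not valid syntax
  bareBrace: compilesWithU("{"),             // a lone SyntaxCharacter must be escaped
  escapedBraces: compilesWithU("\\{g\\}"),   // escaped braces are literal characters
  quantifier: compilesWithU("a{2}"),         // braces as part of a quantifier are fine
};

console.log(results);
```

As long as `graphemeBoundary` stays `false`, the syntax remains available for a later extension.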

@markusicu
Collaborator Author

Updated proposal:

  • Expand \d and \w from their current ASCII definitions to the UTS 18 definitions.
  • Keep \b as before, but now consistent with “the new \w”.
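To make "consistent with the new \w" concrete, here is a sketch that emulates such a boundary with lookarounds. The names `W` and `B` are illustrative, and the word class is an approximation of the UTS 18 definition written with today's `u` flag; it is not the proposal's specification text.

```javascript
// Approximation of the UTS 18 word class as a character class source string.
const W =
  "[\\p{Alphabetic}\\p{Mark}\\p{Decimal_Number}\\p{Connector_Punctuation}\\p{Join_Control}]";

// A boundary is a position where exactly one neighbour is a word character.
const B = `(?:(?<=${W})(?!${W})|(?<!${W})(?=${W}))`;

const asciiBoundary = /\bcaf\b/u;
const unicodeBoundary = new RegExp(`${B}caf${B}`, "u");

const accented = "caf\u00E9"; // "café" with a precomposed é
const plain = "caf au lait";

console.log(asciiBoundary.test(accented));   // ASCII \b sees a boundary before é
console.log(unicodeBoundary.test(accented)); // é is a word character here, so no boundary
console.log(unicodeBoundary.test(plain));
```

With the ASCII definition, "caf" is found as a whole word inside "café"; with the emulated Unicode-aware boundary it is not.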

FYI: See also issue #37 for a small change in \s.

@RunDevelopment

Quick question: Under the new proposed definition, is \d still a subset of \w?

@markusicu
Collaborator Author

Quick question: Under the new proposed definition, is \d still a subset of \w?

Yes, of course.

In UTS 18

  • POSIX digit = \d = \p{gc=Decimal_Number}
  • POSIX word = \w = \p{alpha}\p{gc=Mark}\p{digit}\p{gc=Connector_Punctuation}\p{Join_Control}
    • Note how this includes \p{digit} == \p{gc=Decimal_Number}
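The subset relationship can also be confirmed by brute force over all code points. This is a sketch only: `digit` and `word` are illustrative names for property-escape approximations of the two definitions, and the result depends on the Unicode version the engine ships.

```javascript
const digit = /^\p{Decimal_Number}$/u;
const word =
  /^[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]$/u;

let digitCount = 0;      // code points matched by the digit class
let digitsNotInWord = 0; // digits that the word class fails to match

for (let cp = 0; cp <= 0x10ffff; cp++) {
  const ch = String.fromCodePoint(cp);
  if (digit.test(ch)) {
    digitCount++;
    if (!word.test(ch)) digitsNotInWord++;
  }
}

console.log({ digitCount, digitsNotInWord });
```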

@RunDevelopment

Thank you @markusicu! The \p{digit} == \p{gc=Decimal_Number} wasn't clear to me.

@mathiasbynens
Member

mathiasbynens commented Jul 16, 2021

I think we are in agreement. I don't want to expand this proposal to "all of UTS 18", but I do want to ensure that sufficient accommodations are made for future proposals to do so with minimal friction (ideally without introducing any more flags) and also to be clear about that in the description of this proposal (which seems to be transforming from "set notation" into something like "UTS18 compatibility").

And I'll further note that it is actually close to achieving UTS 18 Level 1, although I do see gaps relating to both RL1.4 (this issue) and RL1.6 (recognition of line boundaries, e.g. /./v.test("\u0085") would need to be false and /^\n/vm.test("\r\n") might need to be as well).

We should be careful in how we phrase this. UTS 18 recommends “loose matching” (ignoring casing, hyphens, spaces, and underscores in \p{…}), which was explicitly decided against for ECMAScript’s implementation of \p{…}. I’m on board with saying something like “increased UTS 18 Level 1 compatibility” as long as we don’t overpromise.

@gibson042

Agreed on the general point, but I think you're mistaken about UTS 18 requiring that: https://www.unicode.org/reports/tr18/#property_syntax

The recommended names (identifiers) for UCD properties and property values are in PropertyAliases.txt and PropertyValueAliases.txt. There are both abbreviated names and longer, more descriptive names. It is strongly recommended that both names be recognized, and that loose matching of property names be used, whereby the case distinctions, whitespace, hyphens, and underbar are ignored.
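The difference between that recommendation and ECMAScript's strict matching of property names can be seen in a short sketch. The helper name `accepts` is illustrative.

```javascript
// Returns true if the engine accepts the pattern under the u flag.
function accepts(source) {
  try {
    new RegExp(source, "u");
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err;
  }
}

const exactLong = accepts("\\p{Script=Greek}");     // canonical long names
const exactShort = accepts("\\p{sc=Grek}");         // abbreviated aliases are recognized
const lowerCased = accepts("\\p{script=greek}");    // case is not ignored
const withSpaces = accepts("\\p{Script = Greek}");  // whitespace is not ignored

console.log({ exactLong, exactShort, lowerCased, withSpaces });
```

Both long and short names are accepted, but only in their exact spelling; the loosely written forms are rejected.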

@markusicu
Collaborator Author

Agreed on the general point, but I think you're mistaken about UTS 18 requiring that

Richard and Mark agreed to work on a “census” of what UTS 18 recommends vs. what ES regex /u has, /v is proposed to have, could be added compatibly, or could be added under /v or another new flag.

I suggest listing “loose property name matching” in one row of the spreadsheet, and it could have a note that this has been discussed and rejected before.

karenetheridge added a commit to karenetheridge/JSON-Schema-Modern that referenced this issue Aug 28, 2021
ECMA 262 specifies that \d and \w should match ascii characters only, but \s
matches more characters than just in the ascii range. Both these criteria
cannot be met simultaneously by either /u or /a regular expression semantics;
consequently as it is the decision of this author that \s matching more
whitespace is more useful than \d and \w being more limited (where [0-9] and
[a-zA-Z] can be used explicitly when intended), unicode (/u) semantics will be
maintained when matching with pattern and patternProperties.

Plus, it has been proposed that ECMA regexes treat \d and \w as unicode-aware:
tc39/proposal-regexp-v-flag#16

See https://perldoc.perl.org/perlre#/u
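
The asymmetry that this commit message describes can be observed directly in an ECMAScript engine. The following is a minimal sketch with illustrative variable names; it only shows current behaviour and says nothing about how the referencing Perl modules handle it.

```javascript
// \s already covers non-ASCII whitespace in ECMAScript...
const nbspIsSpace = /\s/.test("\u00A0");    // NO-BREAK SPACE
const emSpaceIsSpace = /\s/.test("\u2003"); // EM SPACE

// ...while \d and \w stay ASCII-only, even with the u flag.
const arabicDigitIsDigit = /\d/u.test("\u0663"); // ARABIC-INDIC DIGIT THREE
const accentedIsWord = /\w/u.test("\u00E9");     // LATIN SMALL LETTER E WITH ACUTE

console.log({ nbspIsSpace, emSpaceIsSpace, arabicDigitIsDigit, accentedIsWord });
```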
karenetheridge added a commit to karenetheridge/JSON-Schema-Tiny that referenced this issue Sep 11, 2021
@markusicu
Collaborator Author

This is no longer a goal for this proposal, but we will ask TC39 whether we can mention that these shorthands might change in the future.

@pygy

pygy commented Apr 23, 2022

FWIW, compose-regexp supports /\b/-like assertions for arbitrary character classes (or even patterns).

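import {atomic, bound, charSet, flags, sequence, suffix} from 'compose-regexp'

// Lowercase letters that are also in the Greek script
const LcGrekLetter = charSet.intersection(/\p{Lowercase}/u, /\p{Script=Greek}/u)

// \b-like assertion: a boundary between Greek and non-Greek characters
const b = bound(/\p{Script=Greek}/u)

// One or more lowercase Greek letters between two such boundaries, matched globally
const LcGrekWord = flags.add('g', [b, suffix("+", LcGrekLetter), b])

for (let lc of `Θεωρείται ως ο σημαντικότερος θεμελιωτής ...`.matchAll(LcGrekWord)) {
  console.log(lc[0]) // 'ως', 'ο', 'σημαντικότερος', 'θεμελιωτής'
}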

live here

@markusicu markusicu closed this as not planned Apr 28, 2022