Unicode-aware \w, \d, and \b? #16
Comments
sgtm
Either works for me. |
Cc stage 3 reviewers @waldemarhorwat @gibson042 @msaboff
For the ES regex definitions, see https://tc39.es/ecma262/multipage/text-processing.html#sec-characterclassescape
|
SGTM |
Hmmm. I don't think we could ship without \b. What about having it aligned
with \w, but allow for \b curly brace...
|
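I am ok with that. It looks like that might be what Java does:
https://docs.oracle.com/en/java/javase/15/docs/api/java.base/java/util/regex/Pattern.html
Construct Matches
\b A word boundary
\b{g} A Unicode extended grapheme cluster boundary
Probably related to the syntax suggested by https://www.unicode.org/reports/tr18/#Default_Grapheme_Clusters
We should check that \b{g} throws a SyntaxError so that it can be added later. |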
+1
Mark
…On Tue, Jul 13, 2021 at 2:34 PM Markus Scherer ***@***.***> wrote:
What about having it aligned with \w, but allow for \b curly brace...
I am ok with that. It looks like that might be what Java does:
https://docs.oracle.com/en/java/javase/15/docs/api/java.base/java/util/regex/Pattern.html
Construct Matches
\b A word boundary
\b{g} A Unicode extended grapheme cluster boundary
Probably related to the syntax suggested by
https://www.unicode.org/reports/tr18/#Default_Grapheme_Clusters
We should check that \b{g} throws a SyntaxError so that it can be added
later.
|
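The check suggested above ("\b{g} throws a SyntaxError") can be sketched as follows. This is only an illustration for the existing /u flag and for flagless patterns; the helper name compileError is made up for this example, and behavior under a future flag is not asserted here.

```javascript
// Hypothetical helper: returns the error name thrown while compiling
// the pattern, or null if the pattern compiles.
function compileError(source, flags) {
  try {
    new RegExp(source, flags);
    return null;
  } catch (e) {
    return e.name;
  }
}

// With /u, a lone "{" is not a valid pattern character, so \b{g} is rejected.
const withU = compileError('\\b{g}', 'u');

// Without /u, the Annex B grammar treats "{g}" as literal text after \b.
const legacy = compileError('\\b{g}', '');

console.log({ withU, legacy });
```

Because the pattern is already an error under /u, giving \b{g} a meaning later would not change the behavior of any pattern that compiles today with that flag. The flagless case shows why the same extension could not be made compatibly without a flag.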
This seems like a further expansion of scope, specifically relating to https://www.unicode.org/reports/tr18/#RL1.4 ... do you want to recharacterize the proposal as something like UTS 18 regular expressions and primarily ensure that "provision should be made for the syntax to be extended in the future to encompass [support for higher-level features]"? |
It is an expansion of scope, which I like to resist, but Mathias is right to point out a backlog of things that would require incompatible changes, and while we are at it with a new flag, we should discuss what makes sense (and is sufficiently low-risk) to include, vs. where we can and should grease the skids with SyntaxErrors to make reasonable future extensions possible without requiring yet another new flag. I don't think we want to advertise this proposal as everything that UTS18 suggests, but I welcome the discussion. Let's just remember that at some point soon we do want to produce a real spec and go to stage 3 and into implementations :-) |
I think we are in agreement. I don't want to expand this proposal to "all of UTS 18", but I do want to ensure that sufficient accommodations are made for future proposals to do so with minimal friction (ideally without introducing any more flags) and also to be clear about that in the description of this proposal (which seems to be transforming from "set notation" into something like "UTS18 compatibility"). And I'll further note that it is actually close to achieving UTS 18 Level 1, although I do see gaps relating to both RL1.4 (this issue) and RL1.6 (recognition of line boundaries, e.g. EDIT: see also #37 |
It looks to me like |
Updated proposal:
FYI: See also issue #37 for a small change in |
Quick question: Under the new proposed definition, is |
Yes, of course. In UTS 18
|
Thank you @markusicu! The |
We should be careful in how we phrase this. UTS 18 |
Agreed on the general point, but I think you're mistaken about UTS 18 requiring that: https://www.unicode.org/reports/tr18/#property_syntax
|
Richard and Mark agreed to work on a “census” of what UTS 18 recommends vs. what ES regex /u has, /v is proposed to have, could be added compatibly, or could be added under /v or another new flag. I suggest to list “loose property name matching” in one row of the spreadsheet, and it could have a note that this has been discussed and rejected before. |
ECMA 262 specifies that \d and \w should match ASCII characters only, but \s matches more characters than just those in the ASCII range. Neither /u nor /a regular expression semantics can meet both criteria at once. This author has decided that having \s match more whitespace is more useful than keeping \d and \w limited (since [0-9] and [a-zA-Z] can be used explicitly when intended), so Unicode (/u) semantics will be maintained when matching with pattern and patternProperties. Plus, it has been proposed that ECMA regexes treat \d and \w as Unicode-aware: tc39/proposal-regexp-v-flag#16 See https://perldoc.perl.org/perlre#/u
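The asymmetry described above can be observed directly in ECMAScript. The following is a small sketch using the existing /u flag; the sample characters and the property names in the results object are chosen here for illustration only.

```javascript
const arabicThree = '\u0663'; // ARABIC-INDIC DIGIT THREE (General_Category=Nd)
const eAcute = '\u00E9';      // LATIN SMALL LETTER E WITH ACUTE
const nbsp = '\u00A0';        // NO-BREAK SPACE

const results = {
  // \d and \w stay ASCII-only, even with /u.
  digitShorthand: /^\d$/u.test(arabicThree),
  wordShorthand: /^\w$/u.test(eAcute),
  // \b is defined in terms of \w, so a lone non-ASCII letter has no boundary.
  boundaryShorthand: /\b/u.test(eAcute),
  // \s already covers non-ASCII whitespace.
  spaceShorthand: /^\s$/u.test(nbsp),
  // Property escapes are the explicit, Unicode-aware alternative.
  digitProperty: /^\p{Nd}$/u.test(arabicThree),
  letterProperty: /^\p{L}$/u.test(eAcute),
};

console.log(results);
```

The three shorthand checks for \d, \w, and \b come out false while the \s check and both property-escape checks come out true, which is the mismatch this issue was opened to discuss.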
This is no longer a goal for this proposal, but we will ask TC39 whether we can mention that these shorthands might change in the future. |
FWIW,
import {atomic, bound, charSet, flags, sequence, suffix} from '[email protected]'
const LcGrekLetter = charSet.intersection(/\p{Lowercase}/u, /\p{Script=Greek}/u)
const b = bound(/\p{Script=Greek}/u)
const LcGrekWord = flags.add('g', [b, suffix("+", LcGrekLetter), b])
for (let lc of `Θεωρείται ως ο σημαντικότερος θεμελιωτής ...`.matchAll(LcGrekWord)) {
  console.log(lc) //'ως', 'ο', 'σημαντικότερος', 'θεμελιωτής'
} |
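For comparison, a similar result can be approximated with built-in syntax only, using lookarounds in place of a Unicode-aware \b. This is a sketch under assumptions: it guesses at what the snippet above intends (runs of lowercase Greek letters delimited by non-Greek characters) and is not a description of how the library implements bound(). The names text, lowercaseGreekWord, and words are made up for this example.

```javascript
// Normalize so accented letters are single precomposed Greek code points.
const text = 'Θεωρείται ως ο σημαντικότερος θεμελιωτής ...'.normalize('NFC');

// Emulated Greek "word boundary": no Greek character immediately before
// the run and none immediately after it. Each character in the run must be
// both Lowercase and Script=Greek (intersection via a lookahead).
const lowercaseGreekWord =
  /(?<!\p{Script=Greek})(?:(?=\p{Lowercase})\p{Script=Greek})+(?!\p{Script=Greek})/gu;

const words = Array.from(text.matchAll(lowercaseGreekWord), (m) => m[0]);
console.log(words);
```

The first word is skipped because it starts with an uppercase letter and every later position inside it is preceded by a Greek character. The lookahead trick stands in for the set intersection that the proposal discussed in this repository would make expressible inside a character class.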
@mathiasbynens wrote in #2 (comment) :