Commit 757e15b

html: sync changes from std
Before golang/go@324513b (2012-01-04) std "html" and what is now
"golang.org/x/net/html" were the same. Ever since then (well, since
golang/go@4e0749a (2012-05-29)) the escape/unescape code that they
share has been drifting apart, each receiving separate improvements.
This CL cherry-picks over all of the changes that std "html" has seen.

When applying golang/go@5b92028 (https://golang.org/cl/10172) I had to
get a touch creative. That commit inlined `unescape()` into
`UnescapeString()`, removing the original `unescape()`. However, over
here in x/net, we have other callers of `unescape()` so we can't remove
it... but duplicating it is also bad. Simply wrapping it instead of
duplicating it would repeat the first call to `IndexByte()` (first as
`strings.IndexByte()`, then as `bytes.IndexByte()`); as minor as that
performance regression would be, I don't want anything to go backward.
So, I've pulled out an `unescapeInner()` function that takes the
initial `i` as an argument, and that both `unescape()` and
`UnescapeString()` call.

This is the counterpart to https://golang.org/cl/580896, and so also
includes the doc-fix for `UnescapeString()` requested at
https://go-review.googlesource.com/c/go/+/580896/comment/cc8b5704_b1899241/

golang/go@a025e1c :

    Author: Shawn Smith <[email protected]>
    Date:   Wed Dec 18 10:20:25 2013 -0800

    html: add tests for UnescapeString edge cases

    R=golang-dev, gobot, bradfitz
    CC=golang-dev
    https://golang.org/cl/40810044

golang/go@2d9a50b :

    Author: Didier Spezia <[email protected]>
    Date:   Fri May 8 16:38:08 2015 +0000

    html: simplify and optimize escape/unescape

    The html package uses some specific code to escape special characters.
    Actually, the strings.Replacer can be used instead, and is much more
    efficient. The converse operation is more complex but can still be
    slightly optimized.

    Credits to Ken Bloom ([email protected]), who first submitted a
    similar patch at https://codereview.appspot.com/141930043

    Added benchmarks and slightly optimized UnescapeString.

    benchmark                  old ns/op     new ns/op     delta
    BenchmarkEscape-4          118713        19825         -83.30%
    BenchmarkEscapeNone-4      87653         3784          -95.68%
    BenchmarkUnescape-4        24888         23417         -5.91%
    BenchmarkUnescapeNone-4    14423         157           -98.91%

    benchmark                  old allocs     new allocs     delta
    BenchmarkEscape-4          9              2              -77.78%
    BenchmarkEscapeNone-4      0              0              +0.00%
    BenchmarkUnescape-4        2              2              +0.00%
    BenchmarkUnescapeNone-4    0              0              +0.00%

    benchmark                  old bytes     new bytes     delta
    BenchmarkEscape-4          24800         12288         -50.45%
    BenchmarkEscapeNone-4      0             0             +0.00%
    BenchmarkUnescape-4        10240         10240         +0.00%
    BenchmarkUnescapeNone-4    0             0             +0.00%

    Fixes #8697

    Change-Id: I208261ed7cbe9b3dee6317851f8c0cf15528bce4
    Reviewed-on: https://go-review.googlesource.com/9808
    Run-TryBot: Brad Fitzpatrick <[email protected]>
    Reviewed-by: Brad Fitzpatrick <[email protected]>
    TryBot-Result: Gobot Gobot <[email protected]>

golang/go@a3c0730 :

    Author: Carlos C <[email protected]>
    Date:   Wed Jun 17 23:51:54 2015 +0200

    html: add examples to the functions

    Change-Id: I129d70304ae4e4694d9217826b18b341e3834d3c
    Reviewed-on: https://go-review.googlesource.com/11201
    Reviewed-by: Andrew Gerrand <[email protected]>

golang/go@5b92028 :

    Author: Ingo Oeser <[email protected]>
    Date:   Sat May 9 17:55:05 2015 +0200

    html: speed up UnescapeString

    Add benchmarks for sparsely escaped and densely escaped strings.
    Then speed up the sparse unescaping part heavily by using IndexByte and
    copy to skip the parts containing no escaping very fast.

    Unescaping densely escaped strings is slower because of the new
    function call overhead. But sparsely encoded strings are seen more
    often in the utf8 enabled web.

    We win part of the speed back by looking up entityName differently.

    benchmark                  old ns/op     new ns/op     delta
    BenchmarkEscape            31680         31396         -0.90%
    BenchmarkEscapeNone        6507          6872          +5.61%
    BenchmarkUnescape          36481         48298         +32.39%
    BenchmarkUnescapeNone      332           325           -2.11%
    BenchmarkUnescapeSparse    8836          3221          -63.55%
    BenchmarkUnescapeDense     30639         32224         +5.17%

    Change-Id: If606cb01897a40eefe35ba98f2ff23bb25251606
    Reviewed-on: https://go-review.googlesource.com/10172
    Reviewed-by: Brad Fitzpatrick <[email protected]>
    Run-TryBot: Brad Fitzpatrick <[email protected]>
    TryBot-Result: Gobot Gobot <[email protected]>

golang/go@a44c425 :

    Author: Brad Fitzpatrick <[email protected]>
    Date:   Sun Apr 10 14:51:07 2016 +0000

    html: fix typo in UnescapeString string docs

    Fixes #15221

    Change-Id: I9e927a2f604213338b4572f1a32d0247c58bdc60
    Reviewed-on: https://go-review.googlesource.com/21798
    Reviewed-by: Ian Lance Taylor <[email protected]>

golang/go@6dae588 :

    Author: Seiji Takahashi <[email protected]>
    Date:   Thu Aug 3 22:08:55 2017 +0900

    html: updated entity spec link

    Fixes #21194

    Change-Id: Iac5187335df67f90f0f47c7ef6574de147c2ac9b
    Reviewed-on: https://go-review.googlesource.com/52970
    Reviewed-by: Avelino <[email protected]>
    Reviewed-by: Brad Fitzpatrick <[email protected]>

golang/go@740e589 :

    Author: Brad Fitzpatrick <[email protected]>
    Date:   Tue Jul 31 21:37:35 2018 +0000

    html: lazily populate Unescape tables

    Saves ~105KB of heap for callers who don't use html.UnescapeString.
    (EscapeString is much more common).

    Also saves 70KB of binary size, because now the linker can do dead
    code elimination. (because #2559 is still open and global maps always
    generate init code)

    Fixes #26727
    Updates #6853

    Change-Id: I18fe9a273097e2c7e0cb7f88205cae1bb60fa89b
    Reviewed-on: https://go-review.googlesource.com/127075
    Run-TryBot: Brad Fitzpatrick <[email protected]>
    Reviewed-by: Emmanuel Odeke <[email protected]>
    Reviewed-by: Ian Lance Taylor <[email protected]>
    TryBot-Result: Gobot Gobot <[email protected]>

golang/go@4ad1355 :

    Author: Romain Baugue <[email protected]>
    Date:   Tue Apr 30 13:51:05 2019 +0200

    html: add a Fuzz function

    Adds a sample Fuzz test function to package html based on
    https://github.com/dvyukov/go-fuzz-corpus/blob/master/stdhtml/main.go

    Updates #19109
    Updates #31309

    Change-Id: I8c49fff8f70fc8a8813daf1abf0044752003adbb
    Reviewed-on: https://go-review.googlesource.com/c/go/+/174301
    Reviewed-by: Brad Fitzpatrick <[email protected]>
    Run-TryBot: Brad Fitzpatrick <[email protected]>
    TryBot-Result: Gobot Gobot <[email protected]>

golang/go@52c4488 :

    Author: fujimoto kyosuke <[email protected]>
    Date:   Sun Jan 12 06:49:19 2020 +0000

    html: update URL in comment

    The comment contained a link that had a file name and ID that no
    longer existed, so change to the URL of the corresponding part of the
    latest page.

    Change-Id: I74e0885aabf470facc39b84035f7a83fef9c6a8e
    GitHub-Last-Rev: 5681c84d9f1029449da6860c65a1d9a128296e85
    GitHub-Pull-Request: golang/go#36514
    Reviewed-on: https://go-review.googlesource.com/c/go/+/214181
    Run-TryBot: Ian Lance Taylor <[email protected]>
    TryBot-Result: Gobot Gobot <[email protected]>
    Reviewed-by: Ian Lance Taylor <[email protected]>

golang/go@d4b2638 :

    Author: Russ Cox <[email protected]>
    Date:   Fri Feb 19 18:35:10 2021 -0500

    all: go fmt std cmd (but revert vendor)

    Make all our package sources use Go 1.17 gofmt format
    (adding //go:build lines).

    Part of //go:build change (#41184).
    See https://golang.org/design/draft-gobuild

    Change-Id: Ia0534360e4957e58cd9a18429c39d0e32a6addb4
    Reviewed-on: https://go-review.googlesource.com/c/go/+/294430
    Trust: Russ Cox <[email protected]>
    Run-TryBot: Russ Cox <[email protected]>
    TryBot-Result: Go Bot <[email protected]>
    Reviewed-by: Jason A. Donenfeld <[email protected]>
    Reviewed-by: Ian Lance Taylor <[email protected]>

golang/go@f229e70 :

    Author: Russ Cox <[email protected]>
    Date:   Wed Aug 25 12:48:26 2021 -0400

    all: go fix -fix=buildtag std cmd (except for bootstrap deps, vendor)

    When these packages are released as part of Go 1.18,
    Go 1.16 will no longer be supported, so we can remove
    the +build tags in these files.

    Ran go fix -fix=buildtag std cmd and then reverted the bootstrapDirs
    as defined in src/cmd/dist/buildtool.go, which need to continue
    to build with Go 1.4 for now.

    Also reverted src/vendor and src/cmd/vendor, which will need
    to be updated in their own repos first.

    Manual changes in runtime/pprof/mprof_test.go to adjust line numbers.

    For #41184.

    Change-Id: Ic0f93f7091295b6abc76ed5cd6e6746e1280861e
    Reviewed-on: https://go-review.googlesource.com/c/go/+/344955
    Trust: Russ Cox <[email protected]>
    Run-TryBot: Russ Cox <[email protected]>
    TryBot-Result: Go Bot <[email protected]>
    Reviewed-by: Bryan C. Mills <[email protected]>

golang/go@200a01f :

    Author: Tobias Klauser <[email protected]>
    Date:   Wed May 10 17:08:59 2023 +0200

    html: convert fuzz test to native Go fuzzing

    Convert the existing gofuzz based fuzz test to a testing.F based fuzz
    test.

    Change-Id: Ieae69ba7fb17bd54d95c7bb2f4ed04c323c9f15f
    Reviewed-on: https://go-review.googlesource.com/c/go/+/494195
    TryBot-Result: Gopher Robot <[email protected]>
    Reviewed-by: Ian Lance Taylor <[email protected]>
    Reviewed-by: Cherry Mui <[email protected]>
    Auto-Submit: Tobias Klauser <[email protected]>
    Run-TryBot: Tobias Klauser <[email protected]>
1 parent e2310ae commit 757e15b
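
The commit message describes two entry points (one for strings, one for byte slices) that each locate the first `&` exactly once and then hand that index to a shared helper. A minimal sketch of that shape follows. The names `decodeInner`, `decodeBytes`, and `decodeString` are hypothetical, and the toy helper only understands `&amp;`, unlike the real `unescapeInner()`.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
)

// decodeInner is a toy stand-in for unescapeInner: it rewrites b in place,
// replacing each "&amp;" with "&". i must be the index of the first '&' in b,
// so the bytes before i are already in their final position.
func decodeInner(b []byte, i int) []byte {
	dst, src := i, i
	for src < len(b) {
		if bytes.HasPrefix(b[src:], []byte("&amp;")) {
			b[dst] = '&'
			dst, src = dst+1, src+len("&amp;")
			continue
		}
		b[dst] = b[src]
		dst, src = dst+1, src+1
	}
	return b[:dst]
}

// decodeBytes is the []byte entry point; it scans once with bytes.IndexByte.
func decodeBytes(b []byte) []byte {
	if i := bytes.IndexByte(b, '&'); i >= 0 {
		return decodeInner(b, i)
	}
	return b
}

// decodeString is the string entry point; it scans once with
// strings.IndexByte and converts to []byte only when there is work to do.
func decodeString(s string) string {
	if i := strings.IndexByte(s, '&'); i >= 0 {
		return string(decodeInner([]byte(s), i))
	}
	return s
}

func main() {
	fmt.Println(decodeString("a &amp; b"))                    // a & b
	fmt.Println(string(decodeBytes([]byte("x&amp;&amp;y")))) // x&&y
	fmt.Println(decodeString("no entities"))                  // no entities
}
```

Had `decodeString` simply called `decodeBytes`, the input would be scanned for `&` twice; passing `i` through avoids that second scan.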

6 files changed

+2418
-2301
lines changed

html/entity.go

+2,248-2,236
Large diffs are not rendered by default.

html/entity_test.go

+8
@@ -9,7 +9,15 @@ import (
 	"unicode/utf8"
 )
 
+func init() {
+	UnescapeString("") // force load of entity maps
+}
+
 func TestEntityLength(t *testing.T) {
+	if len(entity) == 0 || len(entity2) == 0 {
+		t.Fatal("maps not loaded")
+	}
+
 	// We verify that the length of UTF-8 encoding of each value is <= 1 + len(key).
 	// The +1 comes from the leading "&". This property implies that the length of
 	// unescaped text is <= the length of escaped text.
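
The `init` hook in this test is needed because the entity maps are now filled lazily, on the first call that needs them. The sketch below shows that general pattern with `sync.Once`; `table`, `loadOnce`, `load`, and `lookup` are hypothetical stand-ins, since the real `populateMaps` lives in the unrendered `entity.go` diff.

```go
package main

import (
	"fmt"
	"sync"
)

// table and loadOnce are hypothetical stand-ins for the package's entity
// maps and its populateMapsOnce guard.
var (
	table    map[string]rune
	loadOnce sync.Once
)

// load fills the table. Building the map inside a function, rather than in a
// package-level map literal, means nothing is allocated until first use.
func load() {
	table = map[string]rune{
		"amp;": '&',
		"lt;":  '<',
		"gt;":  '>',
	}
}

// lookup populates the table on first use and then reads from it.
func lookup(name string) (rune, bool) {
	loadOnce.Do(load)
	r, ok := table[name]
	return r, ok
}

func main() {
	r, ok := lookup("lt;")
	fmt.Println(string(r), ok) // < true
	fmt.Println(len(table))    // 3
}
```

A test that inspects the map directly, as `TestEntityLength` does, has to trigger the load first, which is what the `UnescapeString("")` call achieves.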

html/escape.go

+45-64
@@ -12,7 +12,7 @@ import (
 
 // These replacements permit compatibility with old numeric entities that
 // assumed Windows-1252 encoding.
-// https://html.spec.whatwg.org/multipage/syntax.html#consume-a-character-reference
+// https://html.spec.whatwg.org/multipage/parsing.html#numeric-character-reference-end-state
 var replacementTable = [...]rune{
 	'\u20AC', // First entry is what 0x80 should be replaced with.
 	'\u0081',
@@ -135,14 +135,14 @@ func unescapeEntity(b []byte, dst, src int, attribute bool) (dst1, src1 int) {
 		break
 	}
 
-	entityName := string(s[1:i])
-	if entityName == "" {
+	entityName := s[1:i]
+	if len(entityName) == 0 {
 		// No-op.
 	} else if attribute && entityName[len(entityName)-1] != ';' && len(s) > i && s[i] == '=' {
 		// No-op.
-	} else if x := entity[entityName]; x != 0 {
+	} else if x := entity[string(entityName)]; x != 0 {
 		return dst + utf8.EncodeRune(b[dst:], x), src + i
-	} else if x := entity2[entityName]; x[0] != 0 {
+	} else if x := entity2[string(entityName)]; x[0] != 0 {
 		dst1 := dst + utf8.EncodeRune(b[dst:], x[0])
 		return dst1 + utf8.EncodeRune(b[dst1:], x[1]), src + i
 	} else if !attribute {
@@ -151,7 +151,7 @@ func unescapeEntity(b []byte, dst, src int, attribute bool) (dst1, src1 int) {
 			maxLen = longestEntityWithoutSemicolon
 		}
 		for j := maxLen; j > 1; j-- {
-			if x := entity[entityName[:j]]; x != 0 {
+			if x := entity[string(entityName[:j])]; x != 0 {
 				return dst + utf8.EncodeRune(b[dst:], x), src + j + 1
 			}
 		}
@@ -165,24 +165,34 @@ func unescapeEntity(b []byte, dst, src int, attribute bool) (dst1, src1 int) {
 // unescape unescapes b's entities in-place, so that "a&lt;b" becomes "a<b".
 // attribute should be true if parsing an attribute value.
 func unescape(b []byte, attribute bool) []byte {
-	for i, c := range b {
-		if c == '&' {
-			dst, src := unescapeEntity(b, i, i, attribute)
-			for src < len(b) {
-				c := b[src]
-				if c == '&' {
-					dst, src = unescapeEntity(b, dst, src, attribute)
-				} else {
-					b[dst] = c
-					dst, src = dst+1, src+1
-				}
-			}
-			return b[0:dst]
-		}
+	populateMapsOnce.Do(populateMaps)
+	if i := bytes.IndexByte(b, '&'); i >= 0 {
+		return unescapeInner(b, i, attribute)
 	}
 	return b
 }
 
+func unescapeInner(b []byte, i int, attribute bool) []byte {
+	dst, src := unescapeEntity(b, i, i, attribute)
+	for len(b[src:]) > 0 {
+		if b[src] == '&' {
+			i = 0
+		} else {
+			i = bytes.IndexByte(b[src:], '&')
+		}
+		if i < 0 {
+			dst += copy(b[dst:], b[src:])
+			break
+		}
+
+		if i > 0 {
+			copy(b[dst:], b[src:src+i])
+		}
+		dst, src = unescapeEntity(b, dst+i, src+i, attribute)
+	}
+	return b[:dst]
+}
+
 // lower lower-cases the A-Z bytes in b in-place, so that "aBc" becomes "abc".
 func lower(b []byte) []byte {
 	for i, c := range b {
@@ -274,66 +284,37 @@ func escapeCommentString(s string) string {
 	return buf.String()
 }
 
-const escapedChars = "&'<>\"\r"
+var htmlEscaper = strings.NewReplacer(
+	`&`, "&amp;",
+	`'`, "&#39;", // "&#39;" is shorter than "&apos;" and apos was not in HTML until HTML5.
+	`<`, "&lt;",
+	`>`, "&gt;",
+	`"`, "&#34;", // "&#34;" is shorter than "&quot;".
+	"\r", "&#13;",
+)
 
 func escape(w writer, s string) error {
-	i := strings.IndexAny(s, escapedChars)
-	for i != -1 {
-		if _, err := w.WriteString(s[:i]); err != nil {
-			return err
-		}
-		var esc string
-		switch s[i] {
-		case '&':
-			esc = "&amp;"
-		case '\'':
-			// "&#39;" is shorter than "&apos;" and apos was not in HTML until HTML5.
-			esc = "&#39;"
-		case '<':
-			esc = "&lt;"
-		case '>':
-			esc = "&gt;"
-		case '"':
-			// "&#34;" is shorter than "&quot;".
-			esc = "&#34;"
-		case '\r':
-			esc = "&#13;"
-		default:
-			panic("unrecognized escape character")
-		}
-		s = s[i+1:]
-		if _, err := w.WriteString(esc); err != nil {
-			return err
-		}
-		i = strings.IndexAny(s, escapedChars)
-	}
-	_, err := w.WriteString(s)
+	_, err := htmlEscaper.WriteString(w, s)
 	return err
 }
 
 // EscapeString escapes special characters like "<" to become "&lt;". It
-// escapes only five such characters: <, >, &, ' and ".
+// escapes only six such characters: <, >, &, ', ", and \r.
 // UnescapeString(EscapeString(s)) == s always holds, but the converse isn't
 // always true.
 func EscapeString(s string) string {
-	if strings.IndexAny(s, escapedChars) == -1 {
-		return s
-	}
-	var buf bytes.Buffer
-	escape(&buf, s)
-	return buf.String()
+	return htmlEscaper.Replace(s)
 }
 
 // UnescapeString unescapes entities like "&lt;" to become "<". It unescapes a
 // larger range of entities than EscapeString escapes. For example, "&aacute;"
-// unescapes to "á", as does "&#225;" and "&xE1;".
+// unescapes to "á", as does "&#225;" and "&#xE1;".
 // UnescapeString(EscapeString(s)) == s always holds, but the converse isn't
 // always true.
 func UnescapeString(s string) string {
-	for _, c := range s {
-		if c == '&' {
-			return string(unescape([]byte(s), false))
-		}
+	populateMapsOnce.Do(populateMaps)
+	if i := strings.IndexByte(s, '&'); i >= 0 {
+		return string(unescapeInner([]byte(s), i, false))
 	}
 	return s
 }
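
The last hunk above replaces the hand-written escape loop with a `strings.Replacer`. As a standalone illustration of how such a replacer behaves, the sketch below builds one from the same six pairs; the variable name `escaper` and the sample inputs are made up for this example.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// escaper mirrors the six old/new pairs listed in the diff. A Replacer is
// built once and is safe for concurrent use afterwards.
var escaper = strings.NewReplacer(
	`&`, "&amp;",
	`'`, "&#39;",
	`<`, "&lt;",
	`>`, "&gt;",
	`"`, "&#34;",
	"\r", "&#13;",
)

func main() {
	// Replace returns a new string. Replacement text is not rescanned, so
	// the '&' inside "&lt;" is not escaped a second time.
	fmt.Println(escaper.Replace(`<b>Tom & Jerry's</b>`))
	// &lt;b&gt;Tom &amp; Jerry&#39;s&lt;/b&gt;

	// WriteString streams the result to an io.Writer instead of building
	// an intermediate string.
	if _, err := escaper.WriteString(os.Stdout, "1 < 2\n"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```

Input that is already escaped is escaped again (`&amp;` becomes `&amp;amp;`), which matches the behavior of the loop being removed.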

html/escape_example_test.go

+22
@@ -0,0 +1,22 @@
+// Copyright 2015 The Go Authors. All rights reserved.
+// Use of this source code is governed by a BSD-style
+// license that can be found in the LICENSE file.
+
+package html_test
+
+import (
+	"fmt"
+	"html"
+)
+
+func ExampleEscapeString() {
+	const s = `"Fran & Freddie's Diner" <[email protected]>`
+	fmt.Println(html.EscapeString(s))
+	// Output: &#34;Fran &amp; Freddie&#39;s Diner&#34; &lt;[email protected]&gt;
+}
+
+func ExampleUnescapeString() {
+	const s = `&quot;Fran &amp; Freddie&#39;s Diner&quot; &lt;[email protected]&gt;`
+	fmt.Println(html.UnescapeString(s))
+	// Output: "Fran & Freddie's Diner" <[email protected]>
+}

html/escape_test.go

+73-1
@@ -4,7 +4,10 @@
 
 package html
 
-import "testing"
+import (
+	"strings"
+	"testing"
+)
 
 type unescapeTest struct {
 	// A short description of the test case.
@@ -64,6 +67,24 @@ var unescapeTests = []unescapeTest{
 		"Footnote&#x87;",
 		"Footnote‡",
 	},
+	// Handle single ampersand.
+	{
+		"copySingleAmpersand",
+		"&",
+		"&",
+	},
+	// Handle ampersand followed by non-entity.
+	{
+		"copyAmpersandNonEntity",
+		"text &test",
+		"text &test",
+	},
+	// Handle "&#".
+	{
+		"copyAmpersandHash",
+		"text &#",
+		"text &#",
+	},
 }
 
 func TestUnescape(t *testing.T) {
@@ -95,3 +116,54 @@ func TestUnescapeEscape(t *testing.T) {
 		}
 	}
 }
+
+var (
+	benchEscapeData     = strings.Repeat("AAAAA < BBBBB > CCCCC & DDDDD ' EEEEE \" ", 100)
+	benchEscapeNone     = strings.Repeat("AAAAA x BBBBB x CCCCC x DDDDD x EEEEE x ", 100)
+	benchUnescapeSparse = strings.Repeat(strings.Repeat("AAAAA x BBBBB x CCCCC x DDDDD x EEEEE x ", 10)+"&amp;", 10)
+	benchUnescapeDense  = strings.Repeat("&amp;&lt; &amp; &lt;", 100)
+)
+
+func BenchmarkEscape(b *testing.B) {
+	n := 0
+	for i := 0; i < b.N; i++ {
+		n += len(EscapeString(benchEscapeData))
+	}
+}
+
+func BenchmarkEscapeNone(b *testing.B) {
+	n := 0
+	for i := 0; i < b.N; i++ {
+		n += len(EscapeString(benchEscapeNone))
+	}
+}
+
+func BenchmarkUnescape(b *testing.B) {
+	s := EscapeString(benchEscapeData)
+	n := 0
+	for i := 0; i < b.N; i++ {
+		n += len(UnescapeString(s))
+	}
+}
+
+func BenchmarkUnescapeNone(b *testing.B) {
+	s := EscapeString(benchEscapeNone)
+	n := 0
+	for i := 0; i < b.N; i++ {
+		n += len(UnescapeString(s))
+	}
+}
+
+func BenchmarkUnescapeSparse(b *testing.B) {
+	n := 0
+	for i := 0; i < b.N; i++ {
+		n += len(UnescapeString(benchUnescapeSparse))
+	}
+}
+
+func BenchmarkUnescapeDense(b *testing.B) {
+	n := 0
+	for i := 0; i < b.N; i++ {
+		n += len(UnescapeString(benchUnescapeDense))
+	}
+}

html/fuzz_test.go

+22
@@ -0,0 +1,22 @@
+// Copyright 2019 The Go Authors. All rights reserved.
+// Use of this source code is governed by a BSD-style
+// license that can be found in the LICENSE file.
+
+package html
+
+import "testing"
+
+func FuzzEscapeUnescape(f *testing.F) {
+	f.Fuzz(func(t *testing.T, v string) {
+		e := EscapeString(v)
+		u := UnescapeString(e)
+		if u != v {
+			t.Errorf("EscapeString(%q) = %q, UnescapeString(%q) = %q, want %q", v, e, e, u, v)
+		}
+
+		// As per the documentation, this isn't always equal to v, so it makes
+		// no sense to check for equality. It can still be interesting to find
+		// panics in it though.
+		EscapeString(UnescapeString(v))
+	})
+}
