Commit 72fa60e

Move Parser.pm to lib/HTML. Cleanup some of the spacing in examples and misspellings
1 parent 3c48387 commit 72fa60e

6 files changed (+185, -131 lines changed)

Changes

+5
@@ -3,6 +3,11 @@ Change history for HTML-Parser
 {{$NEXT}}
 * Cleanup the prereqs a bit
 * Mark HTML::Filter as deprecated as the docs point out
+* Move Parser.pm into the lib directory with the others. This will help
+with everything from auto version bumps after releases, to scanning for
+prerequisites and spelling errors.
+* Fix a few spelling errors in the POD for HTML::Parser
+* Clean up the spacing on many examples in HTML::Parser

 3.74 2020-08-30
 * Fix the order of date and version in this change log. (Thanks, haarg)

META.json

+5-1
@@ -1,5 +1,5 @@
 {
-"abstract" : "Filter HTML text through the parser",
+"abstract" : "HTML parser class",
 "author" : [
 "Gisle Aas <gaas@cpan.org>"
 ],
@@ -107,6 +107,10 @@
 "file" : "lib/HTML/LinkExtor.pm",
 "version" : "3.75"
 },
+"HTML::Parser" : {
+"file" : "lib/HTML/Parser.pm",
+"version" : "3.75"
+},
 "HTML::PullParser" : {
 "file" : "lib/HTML/PullParser.pm",
 "version" : "3.75"

Makefile.PL

+1-1
@@ -8,7 +8,7 @@ use warnings;
 use ExtUtils::MakeMaker;

 my %WriteMakefileArgs = (
-"ABSTRACT" => "Filter HTML text through the parser",
+"ABSTRACT" => "HTML parser class",
 "AUTHOR" => "Gisle Aas <gaas\@cpan.org>",
 "CONFIGURE_REQUIRES" => {
 "ExtUtils::MakeMaker" => "6.52"

README.md

+74-55
@@ -9,20 +9,24 @@ HTML::Parser - HTML parser class
 # SYNOPSIS

 ```perl
+use strict;
+use warnings;
 use HTML::Parser ();

 # Create parser object
-$p = HTML::Parser->new( api_version => 3,
-start_h => [\&start, "tagname, attr"],
-end_h => [\&end, "tagname"],
-marked_sections => 1,
-);
+my $p = HTML::Parser->new(
+api_version => 3,
+start_h => [\&start, "tagname, attr"],
+end_h => [\&end, "tagname"],
+marked_sections => 1,
+);

 # Parse document text chunk by chunk
 $p->parse($chunk1);
 $p->parse($chunk2);
-#...
-$p->eof; # signal end of document
+# ...
+# signal end of document
+$p->eof;

 # Parse directly from file
 $p->parse_file("foo.html");
@@ -83,26 +87,33 @@ The following method is used to construct a new `HTML::Parser` object:
 Examples:

 ```perl
-$p = HTML::Parser->new(api_version => 3,
-text_h => [ sub {...}, "dtext" ]);
+$p = HTML::Parser->new(
+api_version => 3,
+text_h => [ sub {...}, "dtext" ]
+);
 ```

 This creates a new parser object with a text event handler subroutine
 that receives the original text with general entities decoded.

 ```perl
-$p = HTML::Parser->new(api_version => 3,
-start_h => [ 'my_start', "self,tokens" ]);
+$p = HTML::Parser->new(
+api_version => 3,
+start_h => [ 'my_start', "self,tokens" ]
+);
 ```

 This creates a new parser object with a start event handler method
 that receives the $p and the tokens array.

 ```perl
-$p = HTML::Parser->new(api_version => 3,
-handlers => { text => [\@array, "event,text"],
-comment => [\@array, "event,text"],
-});
+$p = HTML::Parser->new(
+api_version => 3,
+handlers => {
+text => [\@array, "event,text"],
+comment => [\@array, "event,text"],
+}
+);
 ```

 This creates a new parser object that stores the event type and the
@@ -133,12 +144,12 @@ to the `HTML::Parser` object:

 ```perl
 while (1) {
-my $chunk = &$code_ref();
-if (!defined($chunk) || !length($chunk)) {
-$p->eof;
-return $p;
-}
-$p->parse($chunk) || return undef;
+my $chunk = &$code_ref();
+if (!defined($chunk) || !length($chunk)) {
+$p->eof;
+return $p;
+}
+$p->parse($chunk) || return undef;
 }
 ```

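Reviewer note: the loop in this hunk is the README's description of what passing a code reference to `parse` amounts to. A minimal sketch of that call, assuming HTML::Parser is installed, is shown here; the `@chunks` list and the `$code_ref` closure are hypothetical stand-ins for a real chunk source such as a socket or file reader.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Collect decoded text events so the result is easy to inspect.
my @text;
my $p = HTML::Parser->new(
    api_version => 3,
    text_h      => [ sub { push @text, shift }, "dtext" ],
);

# Hypothetical chunk source: returns one piece per call, then undef.
my @chunks   = ("<p>Hello, ", "<b>wor", "ld</b>!</p>");
my $code_ref = sub { shift @chunks };

# parse() calls $code_ref repeatedly until it returns undef or an
# empty string, and then signals eof on its own.
$p->parse($code_ref);

# Text may arrive split across several events, so join before use.
my $joined = join "", @text;
print "$joined\n";
```

Joining the collected pieces matters because a text run can be reported in more than one event when it straddles a chunk boundary.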
@@ -214,16 +225,16 @@ Methods that can be used to get and/or set parser options are:
 - $p->case\_sensitive
 - $p->case\_sensitive( $bool )

-By default, tagnames and attribute names are down-cased. Enabling this
+By default, tag names and attribute names are down-cased. Enabling this
 attribute leaves them as found in the HTML source document.

 - $p->closing\_plaintext
 - $p->closing\_plaintext( $bool )

-By default, "plaintext" element can never be closed. Everything up to
+By default, `plaintext` element can never be closed. Everything up to
 the end of the document is parsed in CDATA mode. This historical
 behaviour is what at least MSIE does. Enabling this attribute makes
-closing "&lt;/plaintext>" tag effective and the parsing process will resume
+closing ` </plaintext` > tag effective and the parsing process will resume
 after seeing this tag. This emulates early gecko-based browsers.

 - $p->empty\_element\_tags
@@ -405,8 +416,8 @@ method is used to set up handlers for different events:
 $p->handler(start => "start", 'self, attr, attrseq, text' );
 ```

-This causes the "start" method of object $p to be called for 'start' events.
-The callback signature is $p->start(\\%attr, \\@attr\_seq, $text).
+This causes the "start" method of object `$p` to be called for 'start' events.
+The callback signature is `$p->start(\%attr, \@attr_seq, $text)`.

 ```perl
 $p->handler(start => \&start, 'attr, attrseq, text' );
@@ -857,24 +868,28 @@ $p->handler(start => "start", "self, tagname, attr, attrseq, text");
 $p->handler(end => "end", "self, tagname, text");
 $p->handler(text => "text", "self, text, is_cdata");
 $p->handler(process => "process", "self, token0, text");
-$p->handler(comment =>
-sub {
-my($self, $tokens) = @_;
-for (@$tokens) {$self->comment($_);}},
-"self, tokens");
-$p->handler(declaration =>
-sub {
-my $self = shift;
-$self->declaration(substr($_[0], 2, -1));},
-"self, text");
+$p->handler(
+comment => sub {
+my($self, $tokens) = @_;
+for (@$tokens) {$self->comment($_);}
+},
+"self, tokens"
+);
+$p->handler(
+declaration => sub {
+my $self = shift;
+$self->declaration(substr($_[0], 2, -1));
+},
+"self, text"
+);
 ```

 Setting up these handlers can also be requested with the "api\_version =>
 2" constructor option.

 # SUBCLASSING

-The `HTML::Parser` class is subclassable. Parser objects are plain
+The `HTML::Parser` class is able to be subclassed. Parser objects are plain
 hashes and `HTML::Parser` reserves only hash keys that start with
 "\_hparser". The parser state can be set up by invoking the init()
 method, which takes the same arguments as new().
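
Reviewer note: a minimal sketch of the subclassing pattern this paragraph describes, assuming HTML::Parser is installed. The package name `My::LinkCounter`, the `count_start` method, and the `link_count` hash key are hypothetical; the key was chosen only because it does not start with `_hparser`. The sketch goes through `new`, which hands its arguments on to `init`.

```perl
package My::LinkCounter;    # hypothetical subclass, not part of the distribution
use strict;
use warnings;
use parent 'HTML::Parser';

sub new {
    my ($class, %args) = @_;
    # Register a method-name handler; "self" in the argspec supplies
    # the invocant, followed by the tag name.
    my $self = $class->SUPER::new(
        api_version => 3,
        start_h     => [ "count_start", "self, tagname" ],
        %args,
    );
    # Subclass state lives in the same hash as the parser's own keys.
    $self->{link_count} = 0;
    return $self;
}

sub count_start {
    my ($self, $tagname) = @_;
    $self->{link_count}++ if $tagname eq "a";
}

package main;

my $p = My::LinkCounter->new;
$p->parse('<a href="x">x</a> <b>y</b> <a href="z">z</a>');
$p->eof;
print $p->{link_count}, "\n";
```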
@@ -887,19 +902,20 @@ does nothing and a default handler that will print out anything else:

 ```perl
 use HTML::Parser;
-HTML::Parser->new(default_h => [sub { print shift }, 'text'],
-comment_h => [""],
-)->parse_file(shift || die) || die $!;
+HTML::Parser->new(
+default_h => [sub { print shift }, 'text'],
+comment_h => [""],
+)->parse_file(shift || die) || die $!;
 ```

 An alternative implementation is:

 ```perl
 use HTML::Parser;
-HTML::Parser->new(end_document_h => [sub { print shift },
-'skipped_text'],
-comment_h => [""],
-)->parse_file(shift || die) || die $!;
+HTML::Parser->new(
+end_document_h => [sub { print shift }, 'skipped_text'],
+comment_h => [""],
+)->parse_file(shift || die) || die $!;
 ```

 This will in most cases be much more efficient since only a single
@@ -914,17 +930,20 @@ parsing as soon as the title end tag is seen:
 ```perl
 use HTML::Parser ();

-sub start_handler
-{
+sub start_handler {
 return if shift ne "title";
 my $self = shift;
 $self->handler(text => sub { print shift }, "dtext");
-$self->handler(end => sub { shift->eof if shift eq "title"; },
-"tagname,self");
+$self->handler(
+end => sub {
+shift->eof if shift eq "title";
+},
+"tagname,self"
+);
 }

 my $p = HTML::Parser->new(api_version => 3);
-$p->handler( start => \&start_handler, "tagname,self");
+$p->handler(start => \&start_handler, "tagname,self");
 $p->parse_file(shift || die) || die $!;
 print "\n";
 ```
@@ -962,7 +981,7 @@ respectively.
 NET tags, e.g. "code/.../" are not recognized. This is SGML
 shorthand for "&lt;code>...&lt;/code>".

-Unclosed start or end tags, e.g. "&lt;tt&lt;b>...&lt;/b&lt;/tt>" are not
+Incomplete start or end tags, e.g. "&lt;tt&lt;b>...&lt;/b&lt;/tt>" are not
 recognized.

 # DIAGNOSTICS
@@ -1070,23 +1089,23 @@ in this listing is the same as used in [perldiag](https://metacpan.org/pod/perld

 The alternative solution is to enable the `utf8_mode` and not decode before
 passing strings to $p->parse(). The parser can process raw undecoded UTF-8
-sanely if the `utf8_mode` is enabled, or if the "attr", "@attr" or "dtext"
+sanely if the `utf8_mode` is enabled, or if the `attr`, `@attr` or `dtext`
 argspecs are avoided.

-- Parsing string decoded with wrong endianness
+- Parsing string decoded with wrong endian selection

 (W) The first character in the document is U+FFFE. This is not a
-legal Unicode character but a byte swapped BOM. The result of parsing
+legal Unicode character but a byte swapped `BOM`. The result of parsing
 will likely be garbage.

 - Parsing of undecoded UTF-32

-(W) The parser found the Unicode UTF-32 BOM signature at the start
+(W) The parser found the Unicode UTF-32 `BOM` signature at the start
 of the document. The result of parsing will likely be garbage.

 - Parsing of undecoded UTF-16

-(W) The parser found the Unicode UTF-16 BOM signature at the start of
+(W) The parser found the Unicode UTF-16 `BOM` signature at the start of
 the document. The result of parsing will likely be garbage.

 # SEE ALSO

dist.ini

+9-2
@@ -7,7 +7,7 @@ copyright_year = 1996

 [ReadmeAnyFromPod / Markdown_Readme]
 type = gfm
-source_filename = Parser.pm
+source_filename = lib/HTML/Parser.pm
 filename = README.md
 location = root

@@ -54,7 +54,7 @@ badges = github_actions/windows
 [Test::Kwalitee]
 skiptest = no_symlinks
 [Test::Version]
-filename_match = qr/^Parser\.pm$/
+filename_match = qr/Parser\.pm$/
 [Test::Pod::Coverage::Configurable]
 trustme = HTML::Entities => qr/^(?:UNICODE_SUPPORT|decode|encode|encode_numeric|encode_numerically|num_entity)$/
 trustme = HTML::Filter => qr/^(?:output)$/
@@ -73,3 +73,10 @@ stopword = undecoded
 stopword = IMG
 stopword = textified
 stopword = Textification
+stopword = argspecs
+stopword = Attr
+stopword = Attrseq
+stopword = Dtext
+stopword = Tokenpos
+stopword = Unterminated
+stopword = CDATA
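
Reviewer note on the `filename_match` change in dist.ini: the sketch below compares the old and new patterns against a few paths, under the assumption (not confirmed by this diff) that the plugin matches the pattern against a relative file path rather than a bare file name. The path list is illustrative and taken from the file names that appear in this commit.

```perl
use strict;
use warnings;

# Patterns copied from the old and new filename_match lines.
my $old = qr/^Parser\.pm$/;
my $new = qr/Parser\.pm$/;

# Illustrative paths; the first is the pre-move location.
my @paths = (
    "Parser.pm",
    "lib/HTML/Parser.pm",
    "lib/HTML/PullParser.pm",
    "lib/HTML/LinkExtor.pm",
);

for my $path (@paths) {
    printf "%-24s old=%-3s new=%s\n", $path,
        ($path =~ $old ? "yes" : "no"),
        ($path =~ $new ? "yes" : "no");
}
```

Dropping the `^` anchor lets the pattern match the moved file, but it also matches any path that merely ends in `Parser.pm`, such as `lib/HTML/PullParser.pm`. If only the one module is intended, a pattern anchored on the directory separator may be worth considering.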
