PHPHtmlParser is a simple, flexible HTML parser that lets you select tags using any CSS selector, like jQuery. The goal is to assist in the development of tools that require a quick, easy way to scrape HTML, whether it's valid or not!
This package can be found on [Packagist](https://packagist.org/packages/paquettg/php-html-parser) and is best loaded using [Composer](http://getcomposer.org/). We support PHP 7.2, 7.3, and 7.4.

- Basic Usage
- -----
+ ## Basic Usage

You can find many examples of how to use the DOM parser and any of its parts (which you will most likely never touch) in the tests directory. The tests are done using PHPUnit and are very small, a few lines each, and are a great place to start. Given that, I'll still be showing a few examples of how the package should be used. The following example is a very simplistic usage of the package.
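
A minimal sketch of that basic usage, assuming the package was installed with Composer and that the string loader is the `loadStr` method (as in the 3.x line of the package). The HTML snippet and the autoload path are illustrative assumptions:

```php
<?php
// Assumes a Composer install; adjust the autoload path to your project.
require 'vendor/autoload.php';

use PHPHtmlParser\Dom;

$dom = new Dom();

// Illustrative markup, not taken from a real page.
$dom->loadStr('<div class="all"><p>Hey bro, <a href="google.com">click here</a><br /> :)</p></div>');

// find() takes a CSS selector and returns the matching nodes.
$a = $dom->find('a')[0];
echo $a->text;
```
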
The above will output "click here". Simple, no? There are many ways to get the same result from the DOM, such as `$dom->getElementsbyTag('a')[0]` or `$dom->find('a', 0)`, which can all be found in the tests or in the code itself.
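
As a rough illustration of those alternatives, the sketch below fetches the same anchor three ways. It assumes the same illustrative markup and Composer autoload path as above, and that `loadStr` is the string loader:

```php
<?php
// Assumes a Composer install; adjust the autoload path to your project.
require 'vendor/autoload.php';

use PHPHtmlParser\Dom;

$dom = new Dom();
$dom->loadStr('<div class="all"><p>Hey bro, <a href="google.com">click here</a><br /> :)</p></div>');

// Three ways to reach the first <a> tag.
$viaFind = $dom->find('a')[0];             // index into the result collection
$viaNth  = $dom->find('a', 0);             // ask find() for the nth match directly
$viaTag  = $dom->getElementsbyTag('a')[0]; // look the tag up by name

foreach ([$viaFind, $viaNth, $viaTag] as $node) {
    echo $node->text, PHP_EOL;
}
```
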

- Support PHP Html Parser Financially
- --------------
+ ## Support PHP Html Parser Financially

Get supported PHP Html Parser and help fund the project with the [Tidelift Subscription](https://tidelift.com/subscription/pkg/packagist-paquettg-php-html-parser?utm_source=packagist-paquettg-php-html-parser&utm_medium=referral&utm_campaign=enterprise).

Tidelift delivers commercial support and maintenance for the open source dependencies you use to build your applications. Save time, reduce risk, and improve code health, while paying the maintainers of the exact dependencies you use.

- Loading Files
- ------------------
+ ## Loading Files

You may also seamlessly load a file into the DOM instead of a string, which is much more convenient and is how I expect most developers will be loading the HTML. The following example is taken from our test and uses the "big.html" file found there.

@@ -62,7 +57,7 @@ foreach ($contents as $content)
{
    // get the class attr
    $class = $content->getAttribute('class');
-
+
    // do something with the html
    $html = $content->innerHtml;

@@ -74,10 +69,9 @@ foreach ($contents as $content)

This example loads the html from big.html, a real page found online, and gets all the content-border classes to process. It also shows a few things you can do with a node but it is not an exhaustive list of the methods that a node has available.

- Loading URLs
- ----------------
+ ## Loading URLs

- Loading a URL is very similar to the way you would load the HTML from a file.
+ Loading a URL is very similar to the way you would load the HTML from a file.

```php
// Assuming you installed from Composer:
@@ -108,8 +102,7 @@ $html = $dom->outerHtml;

As long as the client object implements the interface properly, it will use that object to get the content of the url.
You can also set parsing options that affect the behavior of the parsing engine. You can set a global option using the `setOptions` method on the `Dom` object, or an instance-specific option by passing it to the `load` method as an extra (optional) parameter.
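
The two levels can be sketched as follows. This is a hedged example, not the project's own: it assumes the 3.x `Options` class, that `loadStr` accepts an `Options` object as its optional second argument, and that the strict-mode exception class is `PHPHtmlParser\Exceptions\StrictException`. The markup is made up for illustration.

```php
<?php
// Assumes a Composer install; adjust the autoload path to your project.
require 'vendor/autoload.php';

use PHPHtmlParser\Dom;
use PHPHtmlParser\Exceptions\StrictException;
use PHPHtmlParser\Options;

// Illustrative markup: the <br> tag is not written as self closing.
$html = '<div><p id="hey">Hey you</p><br><p id="ya">Ya you!</p></div>';

// Global option: set once on the Dom object.
$strictDom = new Dom();
$strictDom->setOptions((new Options())->setStrict(true));

$strictFailed = false;
try {
    // Strict mode is expected to reject the unclosed <br>.
    $strictDom->loadStr($html);
} catch (StrictException $e) {
    $strictFailed = true;
    echo 'Strict parsing failed: ', $e->getMessage(), PHP_EOL;
}

// Per-load option: passed as the optional second argument of this one call.
$dom = new Dom();
$dom->loadStr($html, (new Options())->setWhitespaceTextNode(false));
$ya = $dom->find('#ya')[0];
echo $ya->text, PHP_EOL;
```
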

@@ -141,7 +133,7 @@ $dom->setOptions(
        ->setStrict(true)
);

- $dom->loadFromUrl('http://google.com',
+ $dom->loadFromUrl('http://google.com',
    (new Options())->setWhitespaceTextNode(false) // only applies to this load.
);

@@ -150,56 +142,55 @@ $dom->loadFromUrl('http://gmail.com'); // will not have whitespaceTextNode set t

At the moment we support 12 options.

- **Strict**
+ **setStrict**

- Strict, by default false, will throw a `StrickException` if it find that the html is not strictly compliant (all tags must have a closing tag, no attribute with out a value, etc.).
+ `setStrict`, by default `false`, will throw a `StrictException` if it finds that the HTML is not strictly compliant (all tags must have a closing tag, no attribute without a value, etc.).

- **whitespaceTextNode**
+ **setWhitespaceTextNode**

- The whitespaceTextNode, by default true, option tells the parser to save textnodes even if the content of the node is empty (only whitespace). Setting it to false will ignore all whitespace only text node found in the document.
+ The `setWhitespaceTextNode` option, by default `true`, tells the parser to save text nodes even if the content of the node is empty (only whitespace). Setting it to `false` will ignore all whitespace-only text nodes found in the document.

- **enforceEncoding**
+ **setEnforceEncoding**

- The enforceEncoding, by default null, option will enforce an character set to be used for reading the content and returning the content in that encoding. Setting it to null will trigger an attempt to figure out the encoding from within the content of the string given instead.
+ The `setEnforceEncoding` option, by default `null`, enforces a character set to be used for reading the content and returning the content in that encoding. Setting it to `null` will trigger an attempt to figure out the encoding from within the content of the string given instead.

- **cleanupInput**
+ **setCleanupInput**

- Set this to `false` to skip the entire clean up phase of the parser. If this is set to true the next 3 options will be ignored. Defaults to `true`.
+ Set `setCleanupInput` to `false` to skip the entire clean up phase of the parser. If the clean up phase is skipped, the next 3 options will be ignored. Defaults to `true`.

- **removeScripts**
+ **setRemoveScripts**

- Set this to `false` to skip removing the script tags from the document body. This might have adverse effects. Defaults to `true`.
+ Set `setRemoveScripts` to `false` to skip removing the script tags from the document body. This might have adverse effects. Defaults to `true`.

- **removeStyles**
+ **setRemoveStyles**

- Set this to `false` to skip removing of style tags from the document body. This might have adverse effects. Defaults to `true`.
+ Set `setRemoveStyles` to `false` to skip removing of style tags from the document body. This might have adverse effects. Defaults to `true`.

- **preserveLineBreaks**
+ **setPreserveLineBreaks**

- Preserves Line Breaks if set to `true`. If set to `false` line breaks are cleaned up as part of the input clean up process. Defaults to `false`.
+ `setPreserveLineBreaks` preserves line breaks if set to `true`. If set to `false`, line breaks are cleaned up as part of the input clean up process. Defaults to `false`.

- **removeDoubleSpace**
+ **setRemoveDoubleSpace**

- Set this to `false` if you want to preserve whitespace inside of text nodes. It is set to `true` by default.
+ Set `setRemoveDoubleSpace` to `false` if you want to preserve whitespace inside of text nodes. It is set to `true` by default.

- **removeSmartyScripts**
+ **setRemoveSmartyScripts**

- Set this to `false` if you want to preserve smarty script found in the html content. It is set to `true` by default.
+ Set `setRemoveSmartyScripts` to `false` if you want to preserve smarty script found in the html content. It is set to `true` by default.

- **htmlSpecialCharsDecode**
+ **setHtmlSpecialCharsDecode**

- By default this is set to `false`. Setting this to `true` will apply the php function `htmlspecialchars_decode` too all attribute values and text nodes.
+ By default `setHtmlSpecialCharsDecode` is set to `false`. Setting this to `true` will apply the PHP function `htmlspecialchars_decode` to all attribute values and text nodes.

- **selfClosing**
+ **setSelfClosing**

This option contains an array of all self closing tags. These tags must be self closing and the parser will force them to be so if you have strict turned on. You can update this list with any additional tags that can be used as a self closing tag when using strict. You can also remove tags from this array or clear it out completely.

- **noSlash**
+ **setNoSlash**

This option contains an array of all tags that cannot be self closing. The list starts off empty, but you can add elements as you wish.

- Static Facade
- -------------
+ ## Static Facade

You can also mount a static facade for the Dom object.
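
A minimal sketch of mounting the facade, assuming the facade class is `PHPHtmlParser\StaticDom`, that `mount()` with no arguments registers a global `Dom` alias, and that the string loader is `loadStr`. The markup and autoload path are illustrative:

```php
<?php
// Assumes a Composer install; adjust the autoload path to your project.
require 'vendor/autoload.php';

// Register the facade under the global name `Dom`.
// Note there is no `use PHPHtmlParser\Dom;` here: that import would make
// `Dom` refer to the regular class instead of the facade alias.
PHPHtmlParser\StaticDom::mount();

// Illustrative markup, not taken from a real page.
Dom::loadStr('<div class="all"><p>Hey bro, <a href="google.com">click here</a></p></div>');

$a = Dom::find('a')[0];
echo $a->text;
```
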
The above php block does the same find and load as the first example but it is done using the static facade, which supports all public methods found in the Dom object.

- Modifying The Dom
- -----------------
+ ## Modifying The Dom

You can always modify the dom that was created from any loading method. To change the attribute of any node you can just call the `setAttribute` method.
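
For instance, a short sketch of changing an attribute on a parsed node, assuming `loadStr` as the string loader and illustrative markup (the class name `foo` is a placeholder):

```php
<?php
// Assumes a Composer install; adjust the autoload path to your project.
require 'vendor/autoload.php';

use PHPHtmlParser\Dom;

$dom = new Dom();
$dom->loadStr('<div class="all"><p>Hey bro, <a href="google.com">click here</a><br /> :)</p></div>');

$a = $dom->find('a')[0];

// Add (or overwrite) the class attribute on the anchor.
$a->setAttribute('class', 'foo');

// Read the attribute back to confirm the change.
echo $a->getAttribute('class');
```
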