@@ -55,6 +55,8 @@ Also, be sure to [install the `asyncio`-based Twisted reactor](https://docs.scra
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```

+ ### Settings
+
`scrapy-playwright` accepts the following settings:

* `PLAYWRIGHT_BROWSER_TYPE` (type `str`, default `chromium`)
@@ -67,7 +69,28 @@ TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

* `PLAYWRIGHT_CONTEXT_ARGS` (type `dict`, default `{}`)

- A dictionary with keyword arguments to be passed when creating the default Browser context.
+ A dictionary with default keyword arguments to be passed when creating the
+ "default" Browser context.
+
+ **Deprecated: use `PLAYWRIGHT_CONTEXTS` instead**
+
+ * `PLAYWRIGHT_CONTEXTS` (type `dict[str, dict]`, default `{}`)
+
+ A dictionary which defines Browser contexts to be created on startup.
+ It should be a mapping of (name, keyword arguments). For instance:
+ ```python
+ {
+     "first": {
+         "context_arg1": "value",
+         "context_arg2": "value",
+     },
+     "second": {
+         "context_arg1": "value",
+     },
+ }
+ ```
+ If no contexts are defined, a default context (called `default`) is created.
+ The arguments passed here take precedence over the ones defined in `PLAYWRIGHT_CONTEXT_ARGS`.

See the docs for [`Browser.new_context`](https://playwright.dev/python/docs/api/class-browser#browsernew_contextkwargs).

* `PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT` (type `Optional[int]`, default `None`)
@@ -104,42 +127,7 @@ class AwesomeSpider(scrapy.Spider):
```


- ## Page coroutines
-
- A sorted iterable (`list`, `tuple` or `dict`, for instance) could be passed
- in the `playwright_page_coroutines`
- [Request.meta](https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request.meta)
- key to request coroutines to be awaited on the `Page` before returning the final
- `Response` to the callback.
-
- This is useful when you need to perform certain actions on a page, like scrolling
- down or clicking links, and you want everything to count as a single Scrapy
- Response, containing the final result.
-
- ### Supported actions
-
- * `scrapy_playwright.page.PageCoroutine(method: str, *args, **kwargs)`:
-
-     _Represents a coroutine to be awaited on a `playwright.page.Page` object,
-     such as "click", "screenshot", "evaluate", etc.
-     `method` should be the name of the coroutine, `*args` and `**kwargs`
-     are passed to the function call._
-
-     _The coroutine result will be stored in the `PageCoroutine.result` attribute_
-
-     For instance,
-     ```python
-     PageCoroutine("screenshot", path="quotes.png", fullPage=True)
-     ```
-
-     produces the same effect as:
-     ```python
-     # 'page' is a playwright.async_api.Page object
-     await page.screenshot(path="quotes.png", fullPage=True)
-     ```
-
-
- ### Receiving the Page object in the callback
+ ## Receiving the Page object in the callback

Specifying a non-False value for the `playwright_include_page` `meta` key for a
request will result in the corresponding `playwright.async_api.Page` object
@@ -176,6 +164,109 @@ class AwesomeSpiderWithPage(scrapy.Spider):
Scrapy request workflow (Scheduler, Middlewares, etc).


+ ## Multiple browser contexts
+
+ Multiple [browser contexts](https://playwright.dev/python/docs/core-concepts/#browser-contexts)
+ to be launched at startup can be defined via the `PLAYWRIGHT_CONTEXTS` [setting](#settings).
+
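+ For instance, a minimal sketch of how this could look in the project's
+ `settings.py` (the context names match the example above; the keyword
+ arguments are hypothetical and can be anything accepted by `Browser.new_context`):
+
+ ```python
+ # settings.py
+ PLAYWRIGHT_CONTEXTS = {
+     "first": {"ignore_https_errors": True},  # kwargs for Browser.new_context
+     "second": {"java_script_enabled": False},
+ }
+ ```
+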
+ ### Choosing a specific context for a request
+
+ Pass the name of the desired context in the `playwright_context` meta key:
+
+ ```python
+ yield scrapy.Request(
+     url="https://example.org",
+     meta={"playwright": True, "playwright_context": "first"},
+ )
+ ```
+
+ ### Creating a context during a crawl
+
+ If the context specified in the `playwright_context` meta key does not exist, it will be created.
+ You can specify keyword arguments to be passed to
+ [`Browser.new_context`](https://playwright.dev/python/docs/api/class-browser#browsernew_contextkwargs)
+ in the `playwright_context_kwargs` meta key:
+
+ ```python
+ yield scrapy.Request(
+     url="https://example.org",
+     meta={
+         "playwright": True,
+         "playwright_context": "new",
+         "playwright_context_kwargs": {
+             "java_script_enabled": False,
+             "ignore_https_errors": True,
+             "proxy": {
+                 "server": "http://myproxy.com:3128",
+                 "username": "user",
+                 "password": "pass",
+             },
+         },
+     },
+ )
+ ```
+
+ Please note that if a context with the specified name already exists,
+ that context is used and `playwright_context_kwargs` are ignored.
+
+ ### Closing a context during a crawl
+
+ After [receiving the Page object in your callback](#receiving-the-page-object-in-the-callback),
+ you can access a context through the corresponding [`Page.context`](https://playwright.dev/python/docs/api/class-page#page-context)
+ attribute, and await [`close`](https://playwright.dev/python/docs/api/class-browsercontext#browser-context-close) on it.
+
+ ```python
+ def parse(self, response):
+     yield scrapy.Request(
+         url="https://example.org",
+         callback=self.parse_in_new_context,
+         meta={"playwright": True, "playwright_context": "new", "playwright_include_page": True},
+     )
+
+ async def parse_in_new_context(self, response):
+     page = response.meta["playwright_page"]
+     title = await page.title()
+     await page.context.close()  # close the context
+     await page.close()
+     return {"title": title}
+ ```
+
+
+ ## Page coroutines
+
+ A sorted iterable (`list`, `tuple` or `dict`, for instance) could be passed
+ in the `playwright_page_coroutines`
+ [Request.meta](https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request.meta)
+ key to request coroutines to be awaited on the `Page` before returning the final
+ `Response` to the callback.
+
+ This is useful when you need to perform certain actions on a page, like scrolling
+ down or clicking links, and you want everything to count as a single Scrapy
+ Response, containing the final result.
+
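+ As a hypothetical sketch (the URL and the scrolling script are illustrative),
+ a request combining several such actions could look like this:
+
+ ```python
+ # inside a spider callback
+ from scrapy_playwright.page import PageCoroutine
+
+ yield scrapy.Request(
+     url="https://example.org",
+     meta={
+         "playwright": True,
+         "playwright_page_coroutines": [
+             # scroll to the bottom of the page, then capture it
+             PageCoroutine("evaluate", "window.scrollBy(0, document.body.scrollHeight)"),
+             PageCoroutine("screenshot", path="full_page.png", fullPage=True),
+         ],
+     },
+ )
+ ```
+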
+ ### Supported actions
+
+ * `scrapy_playwright.page.PageCoroutine(method: str, *args, **kwargs)`:
+
+     _Represents a coroutine to be awaited on a `playwright.page.Page` object,
+     such as "click", "screenshot", "evaluate", etc.
+     `method` should be the name of the coroutine, `*args` and `**kwargs`
+     are passed to the function call._
+
+     _The coroutine result will be stored in the `PageCoroutine.result` attribute_
+
+     For instance,
+     ```python
+     PageCoroutine("screenshot", path="quotes.png", fullPage=True)
+     ```
+
+     produces the same effect as:
+     ```python
+     # 'page' is a playwright.async_api.Page object
+     await page.screenshot(path="quotes.png", fullPage=True)
+     ```
+
+
## Examples

**Click on a link, save the resulting page as PDF**