
Commit 77b6721

Merge remote-tracking branch 'origin/main' into close-inactive-contexts

2 parents 1b22c81 + 62ddc0c

File tree: 11 files changed, +257 −10 lines

.bumpversion.cfg

Lines changed: 1 addition & 1 deletion

@@ -1,5 +1,5 @@
 [bumpversion]
-current_version = 0.0.33
+current_version = 0.0.35
 commit = True
 tag = True
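For context on how this file is consumed, a sketch of the release flow, assuming the `bumpversion` CLI is installed (the version numbers are illustrative). With `commit = True` and `tag = True`, a single invocation rewrites the tracked version strings, commits the change, and tags it:

```
$ bumpversion patch   # e.g. 0.0.35 -> 0.0.36, then git commit + git tag
```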

README.md

Lines changed: 88 additions & 0 deletions

@@ -845,6 +845,32 @@ for a list of the accepted events and the arguments passed to their handlers.
 images, scripts, stylesheets, etc are not seen by Scrapy.
 
 
+## Memory usage extension
+
+The default Scrapy memory usage extension
+(`scrapy.extensions.memusage.MemoryUsage`) does not include the memory used by
+Playwright because the browser is launched as a separate process. The
+scrapy-playwright package provides a replacement extension which also considers
+the memory used by Playwright. This extension needs the
+[`psutil`](https://pypi.org/project/psutil/) package to work.
+
+Update the [EXTENSIONS](https://docs.scrapy.org/en/latest/topics/settings.html#std-setting-EXTENSIONS)
+setting to disable the built-in Scrapy extension and replace it with the one
+from the scrapy-playwright package:
+
+```python
+# settings.py
+EXTENSIONS = {
+    "scrapy.extensions.memusage.MemoryUsage": None,
+    "scrapy_playwright.memusage.ScrapyPlaywrightMemoryUsageExtension": 0,
+}
+```
+
+Refer to the
+[upstream docs](https://docs.scrapy.org/en/latest/topics/extensions.html#module-scrapy.extensions.memusage)
+for more information about supported settings.
+
+
 ## Examples
 
 **Click on a link, save the resulting page as PDF**
@@ -975,6 +1001,68 @@ async def main():
 asyncio.run(main())
 ```
 
+### Software versions
+
+Be sure to include which versions of Scrapy and scrapy-playwright you are using:
+
+```
+$ python -c "import scrapy_playwright; print(scrapy_playwright.__version__)"
+0.0.34
+```
+
+```
+$ scrapy version -v
+Scrapy       : 2.11.1
+lxml         : 5.1.0.0
+libxml2      : 2.12.3
+cssselect    : 1.2.0
+parsel       : 1.8.1
+w3lib        : 2.1.2
+Twisted      : 23.10.0
+Python       : 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
+pyOpenSSL    : 24.0.0 (OpenSSL 3.2.1 30 Jan 2024)
+cryptography : 42.0.5
+Platform     : Linux-6.5.0-35-generic-x86_64-with-glibc2.35
+```
+
+### Reproducible code example
+
+When opening an issue please include a
+[Minimal, Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example)
+that shows the reported behavior. In addition, please make the code as self-contained as possible
+so an active Scrapy project is not required and the spider can be executed directly from a file with
+[`scrapy runspider`](https://docs.scrapy.org/en/latest/topics/commands.html#std-command-runspider).
+This usually means including the relevant settings in the spider's
+[`custom_settings`](https://docs.scrapy.org/en/latest/topics/settings.html#settings-per-spider)
+attribute:
+
+```python
+import scrapy
+
+class ExampleSpider(scrapy.Spider):
+    name = "example"
+    custom_settings = {
+        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
+        "DOWNLOAD_HANDLERS": {
+            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
+            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
+        },
+    }
+
+    def start_requests(self):
+        yield scrapy.Request(
+            url="https://example.org",
+            meta={"playwright": True},
+        )
+```
+
+### Logs and stats
+
+Logs for spider jobs displaying the issue in detail are extremely useful
+for understanding possible bugs. Include lines before and after the problem,
+not just isolated tracebacks. Job stats displayed at the end of the job
+are also important.
+
 
 ## Frequently Asked Questions
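As an aside on the "Reproducible code example" guidance added above: because the settings live in `custom_settings`, the spider needs no surrounding Scrapy project. Assuming the snippet is saved as `example.py` (a hypothetical file name), it can be executed directly:

```
$ scrapy runspider example.py
```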

docs/changelog.md

Lines changed: 6 additions & 0 deletions

@@ -1,5 +1,11 @@
 # scrapy-playwright changelog
 
+### [v0.0.34](https://github.com/scrapy-plugins/scrapy-playwright/releases/tag/v0.0.34) (2024-01-01)
+
+* Update dev status classifier to 4 - beta
+* Official Python 3.12 support (#254)
+* Custom memusage extension (#257)
+
 
 ### [v0.0.33](https://github.com/scrapy-plugins/scrapy-playwright/releases/tag/v0.0.33) (2023-10-19)

scrapy_playwright/__init__.py

Lines changed: 1 addition & 1 deletion

@@ -1 +1 @@
-__version__ = "0.0.33"
+__version__ = "0.0.35"

scrapy_playwright/_utils.py

Lines changed: 1 addition & 1 deletion

@@ -65,7 +65,7 @@ async def _get_page_content(
     try:
         return await page.content()
     except Error as err:
-        if err.message == _NAVIGATION_ERROR_MSG:
+        if _NAVIGATION_ERROR_MSG in err.message:
            logger.debug(
                "Retrying to get content from page '%s', error: '%s'",
                page.url,
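A minimal sketch of why the switch from `==` to `in` matters. The message text and the constant's value here are illustrative assumptions, not taken from the real module:

```python
# Hypothetical value; the real constant lives elsewhere in scrapy_playwright/_utils.py.
_NAVIGATION_ERROR_MSG = "Unable to retrieve content because the page is navigating"

# Playwright can append extra context to the base error text, so an exact
# comparison misses messages that a substring check still catches.
err_message = _NAVIGATION_ERROR_MSG + " and changing the content."

assert err_message != _NAVIGATION_ERROR_MSG  # old check: retry never triggered
assert _NAVIGATION_ERROR_MSG in err_message  # new check: retry triggered
```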

scrapy_playwright/handler.py

Lines changed: 8 additions & 2 deletions

@@ -12,6 +12,7 @@
     Download,
     Error as PlaywrightError,
     Page,
+    Playwright as AsyncPlaywright,
     PlaywrightContextManager,
     Request as PlaywrightRequest,
     Response as PlaywrightResponse,
@@ -102,6 +103,9 @@ def from_settings(cls, settings: Settings) -> "Config":
 
 
 class ScrapyPlaywrightDownloadHandler(HTTPDownloadHandler):
+    playwright_context_manager: Optional[PlaywrightContextManager] = None
+    playwright: Optional[AsyncPlaywright] = None
+
     def __init__(self, crawler: Crawler) -> None:
         super().__init__(settings=crawler.settings, crawler=crawler)
         verify_installed_reactor("twisted.internet.asyncioreactor.AsyncioSelectorReactor")
@@ -326,8 +330,10 @@ async def _close(self) -> None:
         if hasattr(self, "browser"):
             logger.info("Closing browser")
             await self.browser.close()
-        await self.playwright_context_manager.__aexit__()
-        await self.playwright.stop()
+        if self.playwright_context_manager:
+            await self.playwright_context_manager.__aexit__()
+        if self.playwright:
+            await self.playwright.stop()
 
     def download_request(self, request: Request, spider: Spider) -> Deferred:
         if request.meta.get("playwright"):
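The `_close` hunk above applies a defensive-cleanup pattern: declaring the attributes as class-level `Optional[...] = None` guarantees they exist even when startup failed before assigning them, and the `if` guards keep cleanup from awaiting on `None`. A simplified self-contained sketch of the same idea, with stand-in classes rather than the real handler:

```python
import asyncio
from typing import Optional


class FakeResource:
    """Stands in for the Playwright context manager / driver."""

    async def stop(self) -> None:
        print("resource stopped")


class Handler:
    # Class-level default: the attribute exists even if start() never ran.
    resource: Optional[FakeResource] = None

    async def start(self) -> None:
        self.resource = FakeResource()

    async def close(self) -> None:
        if self.resource:  # guard avoids awaiting on None
            await self.resource.stop()


asyncio.run(Handler().close())  # safe even though start() was never called
```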

scrapy_playwright/headers.py

Lines changed: 1 addition & 0 deletions

@@ -2,6 +2,7 @@
 This module includes functions to process request headers.
 Refer to the PLAYWRIGHT_PROCESS_REQUEST_HEADERS setting for more information.
 """
+
 from urllib.parse import urlparse
 
 from playwright.async_api import Request as PlaywrightRequest

scrapy_playwright/memusage.py

Lines changed: 56 additions & 0 deletions

@@ -0,0 +1,56 @@
+from contextlib import suppress
+from importlib import import_module
+from typing import List
+
+from scrapy.exceptions import NotConfigured
+from scrapy.extensions.memusage import MemoryUsage
+
+from scrapy_playwright.handler import ScrapyPlaywrightDownloadHandler, logger
+
+
+_MIB_FACTOR = 1024**2
+
+
+class ScrapyPlaywrightMemoryUsageExtension(MemoryUsage):
+    def __init__(self, *args, **kwargs) -> None:
+        super().__init__(*args, **kwargs)
+        try:
+            self.psutil = import_module("psutil")
+        except ImportError as exc:
+            raise NotConfigured("The psutil module is not available") from exc
+
+    def _get_main_process_ids(self) -> List[int]:
+        try:
+            return [
+                handler.playwright_context_manager._connection._transport._proc.pid
+                for handler in self.crawler.engine.downloader.handlers._handlers.values()
+                if isinstance(handler, ScrapyPlaywrightDownloadHandler)
+                and handler.playwright_context_manager
+            ]
+        except Exception:
+            return []
+
+    def _get_descendant_processes(self, process) -> list:
+        children = process.children()
+        result = children.copy()
+        for child in children:
+            result.extend(self._get_descendant_processes(child))
+        return result
+
+    def _get_total_playwright_process_memory(self) -> int:
+        process_list = [self.psutil.Process(pid) for pid in self._get_main_process_ids()]
+        for proc in process_list.copy():
+            process_list.extend(self._get_descendant_processes(proc))
+        total_process_size = 0
+        for proc in process_list:
+            with suppress(Exception):  # might fail if the process exited in the meantime
+                total_process_size += proc.memory_info().rss
+        logger.debug(
+            "Total Playwright process memory: %i Bytes (%i MiB)",
+            total_process_size,
+            total_process_size / _MIB_FACTOR,
+        )
+        return total_process_size
+
+    def get_virtual_size(self) -> int:
+        return super().get_virtual_size() + self._get_total_playwright_process_memory()
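The measurement above can be reproduced standalone with psutil; here is a sketch run against the current process instead of the Playwright driver processes. Note that psutil's own `children(recursive=True)` performs the same descendant walk that `_get_descendant_processes` implements by hand:

```python
import psutil

main = psutil.Process()  # current process; the extension starts from driver PIDs
procs = [main] + main.children(recursive=True)

total_rss = 0
for proc in procs:
    try:
        total_rss += proc.memory_info().rss
    except psutil.Error:  # a process may have exited in the meantime
        pass

print(f"{total_rss} bytes ({total_rss / 1024 ** 2:.1f} MiB) across {len(procs)} processes")
```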

setup.py

Lines changed: 1 addition & 1 deletion

@@ -19,7 +19,7 @@
     url="https://github.com/scrapy-plugins/scrapy-playwright",
     packages=["scrapy_playwright"],
     classifiers=[
-        "Development Status :: 3 - Alpha",
+        "Development Status :: 4 - Beta",
         "License :: OSI Approved :: BSD License",
         "Programming Language :: Python",
         "Programming Language :: Python :: 3.8",
Lines changed: 89 additions & 0 deletions

@@ -0,0 +1,89 @@
+from asyncio.subprocess import Process as AsyncioProcess
+from unittest import IsolatedAsyncioTestCase
+from unittest.mock import MagicMock, patch
+
+import pytest
+from playwright.async_api import PlaywrightContextManager
+from scrapy.exceptions import NotConfigured
+from scrapy.extensions.memusage import MemoryUsage
+
+from scrapy_playwright.memusage import ScrapyPlaywrightMemoryUsageExtension
+from scrapy_playwright.handler import ScrapyPlaywrightDownloadHandler
+
+
+SCHEMA_PID_MAP = {"http": 123, "https": 456}
+
+
+def mock_crawler_with_handlers() -> dict:
+    handlers = {}
+    for schema, pid in SCHEMA_PID_MAP.items():
+        process = MagicMock()
+        process.pid = pid
+        handlers[schema] = MagicMock(spec=ScrapyPlaywrightDownloadHandler)
+        handlers[schema].playwright_context_manager._connection._transport._proc = process
+    crawler = MagicMock()
+    crawler.engine.downloader.handlers._handlers = handlers
+    return crawler
+
+
+def raise_import_error(*args, **kwargs):
+    raise ImportError
+
+
+class MockMemoryInfo:
+    rss = 999
+
+
+@patch("scrapy.extensions.memusage.MailSender")
+class TestMemoryUsageExtension(IsolatedAsyncioTestCase):
+    async def test_process_availability(self, _MailSender):
+        """The main node process should be accessible from the context manager"""
+        ctx_manager = PlaywrightContextManager()
+        await ctx_manager.start()
+        assert isinstance(ctx_manager._connection._transport._proc, AsyncioProcess)
+        await ctx_manager.__aexit__()
+
+    @patch("scrapy_playwright.memusage.import_module", side_effect=raise_import_error)
+    async def test_psutil_not_available_extension_disabled(self, _import_module, _MailSender):
+        crawler = MagicMock()
+        with pytest.raises(NotConfigured):
+            ScrapyPlaywrightMemoryUsageExtension(crawler)
+
+    async def test_get_process_ids_ok(self, _MailSender):
+        crawler = mock_crawler_with_handlers()
+        extension = ScrapyPlaywrightMemoryUsageExtension(crawler)
+        assert extension._get_main_process_ids() == list(SCHEMA_PID_MAP.values())
+
+    async def test_get_process_ids_error(self, _MailSender):
+        crawler = mock_crawler_with_handlers()
+        crawler.engine.downloader.handlers._handlers = MagicMock()
+        crawler.engine.downloader.handlers._handlers.values.side_effect = raise_import_error
+        extension = ScrapyPlaywrightMemoryUsageExtension(crawler)
+        assert extension._get_main_process_ids() == []
+
+    async def test_get_descendant_processes(self, _MailSender):
+        p1 = MagicMock()
+        p2 = MagicMock()
+        p3 = MagicMock()
+        p4 = MagicMock()
+        p2.children.return_value = [p3, p4]
+        p1.children.return_value = [p2]
+        crawler = MagicMock()
+        extension = ScrapyPlaywrightMemoryUsageExtension(crawler)
+        assert extension._get_descendant_processes(p1) == [p2, p3, p4]
+
+    async def test_get_total_process_size(self, _MailSender):
+        crawler = MagicMock()
+        extension = ScrapyPlaywrightMemoryUsageExtension(crawler)
+        extension.psutil = MagicMock()
+        extension.psutil.Process.return_value.memory_info.return_value = MockMemoryInfo()
+        extension._get_main_process_ids = MagicMock(return_value=[1, 2, 3])
+        expected_size = MockMemoryInfo().rss * len(extension._get_main_process_ids())
+        assert extension._get_total_playwright_process_memory() == expected_size
+
+    async def test_get_virtual_size_sum(self, _MailSender):
+        crawler = MagicMock()
+        extension = ScrapyPlaywrightMemoryUsageExtension(crawler)
+        parent_cls_extension = MemoryUsage(crawler)
+        extension._get_total_playwright_process_memory = MagicMock(return_value=123)
+        assert extension.get_virtual_size() == parent_cls_extension.get_virtual_size() + 123
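The new test module's file name is not shown in this view. Assuming it is saved as e.g. `tests/test_memusage.py` (a hypothetical path), the cases run under plain pytest, which collects `IsolatedAsyncioTestCase` subclasses like any unittest class; note that `test_process_availability` starts a real Playwright driver, so Playwright must be installed:

```
$ pytest tests/test_memusage.py -v
```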
