Hi,
This issue is related to #18.
The error still occurs with scrapy-playwright 0.0.4. The Scrapy script crawled about 2,500 of the 10k domains from the Majestic list and then crashed with the error JavaScript heap out of memory, so I think this is a bug.
My main code:
    domain = self.get_domain(url=url)
    context_name = domain.replace('.', '_')
    yield scrapy.Request(
        url=url,
        meta={
            "playwright": True,
            "playwright_page_coroutines": {
                "screenshot": PageCoroutine("screenshot", domain + ".png"),
            },
            # Create a new context for each domain
            "playwright_context": context_name,
        },
    )
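For reference, here is a minimal sketch of how a per-domain context name like the one above could be derived. The real `get_domain` helper is not shown in this report, so the version below is an assumption (it simply takes the URL's hostname), and `context_name_for` is a hypothetical name used only for illustration.

```python
from urllib.parse import urlparse


def get_domain(url: str) -> str:
    """Return the hostname of a URL (assumed behaviour, not the original helper)."""
    return urlparse(url).hostname or ""


def context_name_for(url: str) -> str:
    """Build a context name by replacing dots with underscores, as in the snippet above."""
    return get_domain(url).replace(".", "_")


if __name__ == "__main__":
    print(context_name_for("https://www.costco.com/"))
```

With this naming, every distinct hostname gets its own context, so a 10k-domain crawl can create up to 10k contexts unless they are closed along the way.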
My env:
Python 3.8.10
Scrapy 2.5.0
playwright 1.12.1
scrapy-playwright 0.0.4
Error details:
2021-07-17 14:47:48 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.costco.com/>: HTTP status code is not handled or not allowed
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
1: 0xa18150 node::Abort() [/home/ubuntu/.local/lib/python3.8/site-packages/playwright/driver/node]
2: 0xa1855c node::OnFatalError(char const*, char const*) [/home/ubuntu/.local/lib/python3.8/site-packages/playwright/driver/node]
3: 0xb9715e v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [/home/ubuntu/.local/lib/python3.8/site-packages/playwright/driver/node]
4: 0xb974d9 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/home/ubuntu/.local/lib/python3.8/site-packages/playwright/driver/node]
5: 0xd54755 [/home/ubuntu/.local/lib/python3.8/site-packages/playwright/driver/node]
6: 0xd650a8 v8::internal::Heap::AllocateRawWithRetryOrFail(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [/home/ubuntu/.local/lib/python3.8/site-packages/playwright/driver/node]
7: 0xd2bd9d v8::internal::Factory::NewFixedArrayWithFiller(v8::internal::RootIndex, int, v8::internal::Object, v8::internal::AllocationType) [/home/ubuntu/.local/lib/python3.8/site-packages/playwright/driver/node]
8: 0xd2be90 v8::internal::Handle<v8::internal::FixedArray> v8::internal::Factory::NewFixedArrayWithMap<v8::internal::FixedArray>(v8::internal::RootIndex, int, v8::internal::AllocationType) [/home/ubuntu/.local/lib/python3.8/site-packages/playwright/driver/node]
9: 0xf5abd0 v8::internal::OrderedHashTable<v8::internal::OrderedHashMap, 2>::Allocate(v8::internal::Isolate*, int, v8::internal::AllocationType) [/home/ubuntu/.local/lib/python3.8/site-packages/playwright/driver/node]
10: 0xf5ac81 v8::internal::OrderedHashTable<v8::internal::OrderedHashMap, 2>::Rehash(v8::internal::Isolate*, v8::internal::Handle<v8::internal::OrderedHashMap>, int) [/home/ubuntu/.local/lib/python3.8/site-packages/playwright/driver/node]
11: 0xf5b2cb v8::internal::OrderedHashTable<v8::internal::OrderedHashMap, 2>::EnsureGrowable(v8::internal::Isolate*, v8::internal::Handle<v8::internal::OrderedHashMap>) [/home/ubuntu/.local/lib/python3.8/site-packages/playwright/driver/node]
12: 0x1051b38 v8::internal::Runtime_MapGrow(int, unsigned long*, v8::internal::Isolate*) [/home/ubuntu/.local/lib/python3.8/site-packages/playwright/driver/node]
13: 0x140a8f9 [/home/ubuntu/.local/lib/python3.8/site-packages/playwright/driver/node]
Aborted (core dumped)
2021-07-17 14:48:34 [scrapy.extensions.logstats] INFO: Crawled 2533 pages (at 15 pages/min), scraped 2362 items (at 12 items/min)
Temporary fix: in handler.py, I replaced line 166 with await page.context.close() so that the current context is closed, since my script uses one context per domain. This fixes the Allocation failed - JavaScript heap out of memory error, and the Scrapy script crawled all 10k domains. However, the success rate dropped to about 72%, compared with about 85% without the change. The change also introduced new errors:
2021-07-17 15:04:59 [scrapy.core.scraper] ERROR: Error downloading <GET http://usatoday.com>
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/home/ubuntu/.local/lib/python3.8/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
return (yield download_func(request=request, spider=spider))
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 824, in adapt
extracted = result.result()
File "/home/ubuntu/python/scrapy-playwright/scrapy_playwright/handler.py", line 138, in _download_request
result = await self._download_request_with_page(request, page)
File "/home/ubuntu/python/scrapy-playwright/scrapy_playwright/handler.py", line 149, in _download_request_with_page
response = await page.goto(request.url)
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py", line 6006, in goto
await self._async(
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_page.py", line 429, in goto
return await self._main_frame.goto(**locals_to_params(locals()))
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_frame.py", line 117, in goto
await self._channel.send("goto", locals_to_params(locals()))
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 36, in send
return await self.inner_send(method, params, False)
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 54, in inner_send
result = next(iter(done)).result()
playwright._impl._api_types.Error: Navigation failed because page was closed!
...
2021-07-17 19:31:15 [asyncio] ERROR: Task exception was never retrieved
future: <Task finished name='Task-38926' coro=<Route.continue_() done, defined at /home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py:544> exception=Error('Target page, context or browser has been closed')>
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py", line 582, in continue_
await self._async(
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_network.py", line 207, in continue_
await self._channel.send("continue", cast(Any, overrides))
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 36, in send
return await self.inner_send(method, params, False)
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 54, in inner_send
result = next(iter(done)).result()
playwright._impl._api_types.Error: Target page, context or browser has been closed
....
2021-07-18 03:51:34 [scrapy.core.scraper] ERROR: Error downloading <GET http://bbc.co.uk>
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/home/ubuntu/.local/lib/python3.8/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
return (yield download_func(request=request, spider=spider))
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 824, in adapt
extracted = result.result()
File "/home/ubuntu/python/scrapy-playwright/scrapy_playwright/handler.py", line 138, in _download_request
result = await self._download_request_with_page(request, page)
File "/home/ubuntu/python/scrapy-playwright/scrapy_playwright/handler.py", line 165, in _download_request_with_page
body = (await page.content()).encode("utf8")
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py", line 5914, in content
await self._async("page.content", self._impl_obj.content())
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_page.py", line 412, in content
return await self._main_frame.content()
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_frame.py", line 325, in content
return await self._channel.send("content")
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 36, in send
return await self.inner_send(method, params, False)
File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 54, in inner_send
result = next(iter(done)).result()
playwright._impl._api_types.Error: Execution context was destroyed, most likely because of a navigation.
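One possible explanation for the new errors (an assumption, not confirmed by these logs) is that a context is closed while another page or route handler is still using it. The sketch below shows the general idea of closing a named context only after its last user is done. It is not scrapy-playwright code: `FakeContext` and `ContextTracker` are hypothetical stand-ins, and no real browser is involved.

```python
import asyncio
from typing import Dict, Tuple


class FakeContext:
    """Stand-in for a browser context; only records whether close() was called."""

    def __init__(self) -> None:
        self.closed = False

    async def close(self) -> None:
        self.closed = True


class ContextTracker:
    """Close a named context only when no request is still using it."""

    def __init__(self) -> None:
        self._contexts: Dict[str, FakeContext] = {}
        self._in_use: Dict[str, int] = {}

    def acquire(self, name: str) -> FakeContext:
        # Create the context on first use and count one more active request.
        if name not in self._contexts:
            self._contexts[name] = FakeContext()
        self._in_use[name] = self._in_use.get(name, 0) + 1
        return self._contexts[name]

    async def release(self, name: str) -> None:
        # Drop one user; close and forget the context once nobody is left.
        self._in_use[name] -= 1
        if self._in_use[name] == 0:
            del self._in_use[name]
            await self._contexts.pop(name).close()


async def demo() -> Tuple[bool, bool]:
    tracker = ContextTracker()
    context = tracker.acquire("bbc_co_uk")  # first request on this context
    tracker.acquire("bbc_co_uk")            # second request sharing the context
    await tracker.release("bbc_co_uk")
    still_open = not context.closed         # one request is still active
    await tracker.release("bbc_co_uk")
    return still_open, context.closed


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Compared with calling close() unconditionally after each page, this keeps the context alive while anything still depends on it, at the cost of some extra bookkeeping.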