Allocation failed - JavaScript heap out of memory #19
Comments
There is no need to patch the handler code; closing a context can be done using the existing API. I understand it might seem a bit verbose, but I don't want to create a whole DSL around this to handle context/page creation/deletion.
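For readers landing here, a minimal sketch of what "the existing API" allows, using the meta keys documented in the scrapy-playwright README (playwright, playwright_context, playwright_include_page, playwright_page). The spider name, context name and URL are placeholders, and the project settings are assumed to already enable scrapy-playwright.

import scrapy


class ContextClosingSpider(scrapy.Spider):
    name = "context_closing_example"  # hypothetical spider, for illustration only

    def start_requests(self):
        yield scrapy.Request(
            "https://example.org",  # placeholder URL
            meta={
                "playwright": True,
                "playwright_context": "disposable",  # a named, non-default context
                "playwright_include_page": True,     # expose the Playwright Page object
            },
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        title = await page.title()
        await page.close()
        # Closing the page's context releases the memory it accumulated; only do
        # this once no other in-flight request still uses the same context name.
        await page.context.close()
        return {"url": response.url, "title": title}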
Hmm, I got the same error after a few hours when scraping just a single domain. Could it be related to error #15, which pops up a fair bit?
Context '1': new page created, page count is 1 (1 for all contexts)
Are you using a single context for this domain? If so, you're falling into microsoft/playwright#6319. This seems like an issue on the Node.js side of things. I'm no JS developer, so take the following with a grain of salt, but from what I've found you should be able to increase the memory limit by setting NODE_OPTIONS. Sources and further reading:
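For concreteness, a hedged illustration of that suggestion: NODE_OPTIONS is read by the Node.js process that Playwright launches, so it can be set in the shell, in docker-compose, or from Python before the crawl starts. The --max-old-space-size flag is the standard way to raise the Node.js heap limit; the 4096 MB value below is an assumption, not something taken from this thread.

import os

# Must run before scrapy-playwright launches the browser, e.g. at the top of the
# project's settings.py; child processes (the Playwright driver) inherit it.
# 4096 MB is an illustrative value; tune it to the memory available to the container.
os.environ.setdefault("NODE_OPTIONS", "--max-old-space-size=4096")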
Thank you, setting NODE_OPTIONS seems to have solved the memory issue and it can run for 24h+ without crashing in a single context.
Hi @xanrag, how did you fix the memory error?
Just the memory setting. I added this to my docker-compose and it seems to work:
Thanks @xanrag, I will try to test my script with the new env setting.
@xanrag Hi, did you get the "Aborted (core dumped)" error anymore? I added the NODE_OPTIONS env setting as well.
@phongtnit I'm not sure. When I run scrapy in celery as a separate process it doesn't log to the file when it crashes. There is something still going on though, because occasionally it stops and keeps putting out the same page/item count indefinitely without finishing, and I have another issue where it doesn't kill the chrome process correctly, but I'll investigate more and start another issue for that if I find anything. (A week of use spawned a quarter of a million zombie processes...)
@elacuesta Hey, I'm having this problem where my computer starts freezing after 1-2 hours of running my crawler. I'm pretty sure it's due to the Playwright issue you linked (microsoft/playwright#6319), where it takes up more and more memory. It seems like a workaround is to recreate the page every x minutes, but I'm not sure how to do this. I'm already doing all requests through Playwright. I'm new to this, so can you give me pointers on how I can create a new page or context (?) every x minutes? I'm currently unable to figure this out from the documentation on my own. I've added my spider below in case you're interested:

import logging
from typing import Optional
import bs4
import scrapy
from scrapy_playwright.page import PageMethod
from jobscraper import storage
from jobscraper.items import CybercodersJob


class CybercodersSpider(scrapy.Spider):
    name = 'cybercoders'
    allowed_domains = ['cybercoders.com']
    loading_delay = 2500
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/109.0'}
    request_meta = dict(
        playwright=True,
        playwright_context="new",
        # You can define page actions (https://playwright.dev/python/docs/api/class-page)
        playwright_page_methods=[
            PageMethod("wait_for_timeout", loading_delay)
            # TODO instead of waiting, wait for the page to load (look for a specific element)
        ]
    )

    def get_search_url(self, page: Optional[int] = 1) -> str:
        page_string = f"page={page}&" if page else ""
        return f"https://www.cybercoders.com/jobs/?{page_string}&worklocationtypeid=3"

    def start_requests(self):
        yield scrapy.http.Request(
            self.get_search_url(),
            headers=self.headers,
            cb_kwargs={'page': 1},
            meta=self.request_meta,
            callback=self.parse
        )

    def parse(self, response, **kwargs):
        """
        Parses the job-search page
        """
        # get all job_links
        job_links = response.css('div.job-title a::attr(href)').getall()
        # If there are no job links on the page, the page is empty so we can stop
        if not job_links:
            return
        # Go to the next search page
        yield scrapy.http.Request(
            self.get_search_url(kwargs['page'] + 1),
            headers=self.headers,
            cb_kwargs={'page': kwargs['page'] + 1},
            meta=self.request_meta,
            callback=self.parse
        )
        # Go to each job page
        for link in job_links:
            job_id = link.split('/')[-1]
            if job_id and storage.has_job_been_scraped(CybercodersJob, job_id):
                continue
            yield response.follow("https://www.cybercoders.com" + link, callback=self.parse_job,
                                  headers=self.headers, meta=self.request_meta)

    def parse_job(self, response, **kwargs):
        """
        Parses a job page
        """
        try:
            soup = bs4.BeautifulSoup(response.body, 'html.parser')
            details = dict(
                id=response.url.split('/')[-1],
                url=response.url,
                description=soup.find('div', class_='job-details-content').find('div', class_='job-details')
                if soup.find('div', class_='job-details-content') else None,
                title=response.css('div.job-title h1::text').get()
                if response.css('div.job-title h1::text') else None,
                skills=response.css('div.skills span.skill-name::text').getall()
                if response.css('div.skills span.skill-name::text') else None,
                location=response.css('div.job-info-main div.location span::text').get()
                if response.css('div.job-info-main div.location span::text') else None,
                compensation=response.css('div.job-info-main div.wage span::text').get()
                if response.css('div.job-info-main div.wage span::text') else None,
                posted_date=response.css('div.job-info-main div.posted span::text').get()
                if response.css('div.job-info-main div.posted span::text') else None,
            )
            for key in ['title', 'description', 'url']:
                if details[key] is None:
                    logging.warning(f"Missing value for {key} in {response.url}")
            yield CybercodersJob(
                **details
            )
        except Exception as e:
            logging.error(f"Something went wrong parsing {response.url}: {e}")
Passing
@elacuesta Oh ok, good idea. Thanks!
https://github.com/scrapy-plugins/scrapy-playwright#closing-a-context-during-a-crawl |
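To connect that section with the "new page or context every x minutes" question above, one hedged variation is to hand out a fresh playwright_context name per request (or per batch of requests) so no single context can grow indefinitely. The helper below is hypothetical; only the meta keys come from the README.

def fresh_context_meta(request_index: int) -> dict:
    # Hypothetical helper: build request meta with a per-request context name.
    # Using one name per batch instead (e.g. request_index // 100) also works,
    # as long as the batch size keeps each context's lifetime short.
    return {
        "playwright": True,
        "playwright_context": f"ctx-{request_index}",
        "playwright_include_page": True,
    }

Each named context still has to be closed from one of its pages (await page.context.close(), as in the earlier sketch and the linked README section), and only once no other in-flight request shares that context name; otherwise rotating names just spreads the memory across many idle contexts.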
Hi,
This issue is related to #18.
The error still occurred with scrapy-playwright 0.0.4. The Scrapy script crawled about 2500 of 10k domains from the Majestic list and crashed with the last error being JavaScript heap out of memory, so I think this is a bug.
My main code:
My env:
The detail of the error:
Temporary fix: I replaced line 166 of handler.py with await page.context.close() to close the current context, because my script uses one context per domain. This fixes the Allocation failed - JavaScript heap out of memory error, and the Scrapy script crawled all 10k domains, but the success rate was about 72%, compared with about 85% without the added code. Also, when I added the new code, the new error was: