Task 1 - Otodom scraper #4
Generally fine. Try to get acquainted with pre-commit.
otodom/task_1/wk/fin.py
Outdated
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
}
We should define constants at the beginning of the file, in capital letters.
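A minimal sketch of what this could look like, assuming the usual PEP 8 convention of UPPER_SNAKE_CASE for module-level constants. The names `USER_AGENT` and `HEADERS` are suggestions, not names taken from the PR.

```python
# Module-level constants go right after the imports, in UPPER_SNAKE_CASE.
# USER_AGENT and HEADERS are suggested names, not ones used in the PR.
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/109.0.0.0 Safari/537.36"
)
HEADERS = {"User-Agent": USER_AGENT}
```

The rest of the module can then pass `HEADERS` wherever `headers` was used before.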
otodom/task_1/wk/fin.py
Outdated
regex = re.compile("Idź do strony [0-9]+$")
result = doc.find_all("a", {"aria-label": regex})
pages_n = max(list(map(lambda x: int(x.string), result)))
otodom/task_1/wk/fin.py
Outdated
if not response.ok:
    with open("db.json", "w", encoding="utf-8") as file:
        json.dump(scrapped_data, file, ensure_ascii=False, indent=4)
    print("Already scrapped data saved in db.json")
    print("Error occured on page " + str(i) + ". Aborting.")
    sys.exit(1)
It may happen that, for some reason, 1 request in 100 simply fails on their side (5XX code). Should we really exit in such circumstances?
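One possible alternative to exiting immediately is to retry a few times before giving up. This is only a sketch: `fetch_with_retry` and its parameters are hypothetical names, and `fetch` stands for any callable that returns a response object with `.ok` and `.status_code` (as `requests` responses do).

```python
import time


def fetch_with_retry(fetch, url, max_attempts=3, delay_s=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on 5XX responses.

    Returns the last response received; the caller still has to check .ok.
    All names here are hypothetical, not part of the PR.
    """
    response = None
    for attempt in range(1, max_attempts + 1):
        response = fetch(url)
        if response.ok:
            return response
        # Only server-side (5XX) errors are treated as transient here;
        # a 4XX response is returned at once without retrying.
        if response.status_code < 500:
            break
        if attempt < max_attempts:
            # Wait a little longer after each failed attempt.
            sleep(delay_s * attempt)
    return response
```

In the scraper this could be called as `fetch_with_retry(lambda u: requests.get(u, headers=headers), url)`, saving the partial data and exiting only if the returned response is still not OK.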
otodom/task_1/wk/fin.py
Outdated
tags = record.find_all("p")
for tag in tags:
    if tag.has_attr("title"):
Just find the tag that has a title attribute instead of fetching all tags and filtering them.
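With BeautifulSoup this can be done in a single call, because passing `True` as an attribute filter matches any tag that has that attribute. A small sketch follows; the HTML snippet and the `title_text` name are made up for illustration and do not reflect the real Otodom markup.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only; the real page structure may differ.
html = (
    "<article>"
    "<p>Podbite</p>"
    '<p title="Example title">Example title</p>'
    "</article>"
)
record = BeautifulSoup(html, "html.parser")

# title=True matches the first <p> that carries a title attribute at all.
tag = record.find("p", title=True)
title_text = tag["title"] if tag is not None else ""
```

`find` returns `None` when nothing matches, so the fallback to an empty string replaces the loop and the `has_attr` check.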
otodom/task_1/wk/fin.py
Outdated
record_dict["promoted"] = (
    True if len(record.find_all("p", string="Podbite")) > 0 else False
)
otodom/task_1/wk/fin.py
Outdated
if len(zl) > 1:
    record_dict["price"] = (zl[0].string + ", " + zl[1].string).replace("\xa0", " ")
else:
    record_dict["price"] = ""
No need to scrape the price per m^2, since we can simply divide the price by the area to get it.
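A sketch of the derived value, assuming the price and area have already been parsed into numbers. `price_per_m2` and its parameter names are hypothetical, and rounding to two decimals is an arbitrary choice.

```python
def price_per_m2(price_zl, area_m2):
    """Return the price per square metre, or None if it cannot be computed.

    Hypothetical helper: assumes both inputs are already numeric.
    """
    # A missing price, or a missing or zero area, gives no meaningful result.
    if price_zl is None or not area_m2:
        return None
    return round(price_zl / area_m2, 2)
```

Returning `None` for a missing price or a zero area avoids a `ZeroDivisionError` on incomplete listings.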
otodom/task_1/wk/fin.py
Outdated
record_dict["estate_agency"] = (
    False if len(record.find_all("p", string="Oferta prywatna")) > 0 else True
)
If there is an estate agency, we would like to get its name.
otodom/task_1/wk/generate_link.py
Outdated
distance = settings["distance_radius"]
if distance != "None":
    url += "?distanceRadius=" + str(distance)

# price
min_p = settings["price_min"]
max_p = settings["price_max"]
if min_p != "None":
    url += "&priceMin=" + str(min_p)
if max_p != "None":
    url += "&priceMax=" + str(max_p)
See #1 (comment).
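The linked comment is not reproduced here, so the following is not a restatement of it. One thing visible in the snippet itself: when `distance_radius` is "None", the `?` is never added and the price parameters are appended with `&` directly to the path. A possible way around this, sketched with the standard library's `urlencode`; `build_url` and `PARAM_NAMES` are hypothetical names, and only the three settings keys shown in the diff are covered.

```python
from urllib.parse import urlencode

# Maps settings keys (from the diff) to query parameter names (from the diff).
PARAM_NAMES = {
    "distance_radius": "distanceRadius",
    "price_min": "priceMin",
    "price_max": "priceMax",
}


def build_url(base_url, settings):
    """Append only the settings that are set, with correct ?/& separators."""
    params = {
        query_name: settings[key]
        for key, query_name in PARAM_NAMES.items()
        # The settings file appears to use the string "None" for unset values.
        if settings.get(key) not in (None, "None")
    }
    if not params:
        return base_url
    return base_url + "?" + urlencode(params)
```

Letting `urlencode` join the parameters also takes care of escaping any values that contain special characters.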
No description provided.