
Task 1 - Otodom scraper #4

Merged: 12 commits merged into master on Dec 30, 2023
Conversation

@detker (Contributor) commented Dec 5, 2023

No description provided.

@detker detker changed the title Task 1 - Otodom scrapper Task 1 - Otodom scraper Dec 6, 2023
@TheRealSeber (Contributor) left a comment


Generally fine; try to get acquainted with pre-commit.

Comment on lines 70 to 72
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
}

We should define constants at the beginning of the file, with capital letters.
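A minimal sketch of that suggestion. The header value is taken from the diff above; the constant names are assumptions, not the PR's actual code:

```python
# Module-level constants in CAPITALS, defined once near the top of the file.
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
)
HEADERS = {"User-Agent": USER_AGENT}
```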

Comment on lines 77 to 79
regex = re.compile("Idź do strony [0-9]+$")
result = doc.find_all("a", {"aria-label": regex})
pages_n = max(list(map(lambda x: int(x.string), result)))

Well, there is definitely a better approach.

[image]

You could just find the nav and take the last element from its children. The best way is always to go for the most general solution when looking for elements.
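A sketch of the suggested approach, using hypothetical markup that loosely resembles Otodom's pagination (the real page structure may differ):

```python
from bs4 import BeautifulSoup

# Hypothetical pagination markup; the idea is to locate the <nav> and take
# the last page link from its children, rather than regex-matching the
# Polish "Idź do strony ..." aria-label text.
html = """
<nav data-cy="pagination">
  <a aria-label="Idź do strony 1">1</a>
  <a aria-label="Idź do strony 2">2</a>
  <a aria-label="Idź do strony 12">12</a>
</nav>
"""
doc = BeautifulSoup(html, "html.parser")
nav = doc.find("nav")
pages_n = int(nav.find_all("a")[-1].string)  # last child holds the page count
```

This survives a language switch on the site, since it relies on structure rather than on translated label text.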

Comment on lines 86 to 91
if not response.ok:
    with open("db.json", "w", encoding="utf-8") as file:
        json.dump(scrapped_data, file, ensure_ascii=False, indent=4)
    print("Already scrapped data saved in db.json")
    print("Error occured on page " + str(i) + ". Aborting.")
    sys.exit(1)

It may happen that, for whatever reason, 1 time out of 100 the server response simply fails on their side (a 5XX code). Should we really exit under such circumstances?
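One way to act on this comment is to retry transient 5XX responses a few times before giving up. This is a sketch, not the PR's code; the helper name and the injected `get` callable (which could be `requests.get`) are assumptions:

```python
import time

def fetch_with_retries(get, url, retries=3, backoff=1.0):
    """Retry on 5XX server errors; return immediately on success or client errors."""
    response = None
    for attempt in range(retries):
        response = get(url)
        if response.ok or response.status_code < 500:
            return response  # success, or a client error not worth retrying
        time.sleep(backoff * attempt)  # simple linear backoff between attempts
    return response  # still failing; the caller decides whether to save and abort
```

The caller can then save the partial results and exit only after all retries are exhausted, instead of on the first failed request.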

Comment on lines 18 to 20
tags = record.find_all("p")
for tag in tags:
    if tag.has_attr("title"):

Just find the one tag with a title attribute instead of finding all of them.
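A sketch of that suggestion, with hypothetical markup. BeautifulSoup's `attrs={"title": True}` matches any tag that merely has the attribute, so the loop becomes a single lookup:

```python
from bs4 import BeautifulSoup

# Hypothetical listing fragment; only one <p> carries a title attribute.
html = '<div><p>no title</p><p title="Warszawa, Mokotów">Warszawa</p></div>'
record = BeautifulSoup(html, "html.parser")

# Ask directly for the <p> that has a title attribute.
tag = record.find("p", attrs={"title": True})
```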

Comment on lines 35 to 37
record_dict["promoted"] = (
    True if len(record.find_all("p", string="Podbite")) > 0 else False
)

I dived into the HTML code. What if the site is loaded in English, or we would like to scrape in English? Then it might not work.

[image]

Finding the single p element whose parent is a span would, I think, be a more robust approach.
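A sketch of that structural lookup, again with hypothetical markup. The badge is found by its position in the tree rather than by the Polish label text:

```python
from bs4 import BeautifulSoup

# Hypothetical listing card: the "promoted" badge is a <p> inside a <span>,
# while ordinary text paragraphs sit directly under the card.
html = '<article><span class="badge"><p>Podbite</p></span><p>Other text</p></article>'
record = BeautifulSoup(html, "html.parser")

# Match by structure (a <p> whose parent is a <span>), not by label text,
# so a language switch on the site would not break the check.
badge = record.find(lambda t: t.name == "p" and t.parent.name == "span")
promoted = badge is not None
```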

Comment on lines 41 to 44
if len(zl) > 1:
    record_dict["price"] = (zl[0].string + ", " + zl[1].string).replace("\xa0", " ")
else:
    record_dict["price"] = ""

No need to scrape the price per m², since we can simply divide price by area to get it.
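For example, with hypothetical parsed values:

```python
# Hypothetical values already parsed from a listing.
price = 850_000.0  # PLN
area = 62.5        # m^2

# Derive price per square metre instead of scraping it as a separate field.
price_per_m2 = price / area
```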

Comment on lines 61 to 63
record_dict["estate_agency"] = (
    False if len(record.find_all("p", string="Oferta prywatna")) > 0 else True
)

If there is an estate agency, we would like to get its name as well.

Comment on lines 58 to 68
distance = settings["distance_radius"]
if distance != "None":
    url += "?distanceRadius=" + str(distance)

# price
min_p = settings["price_min"]
max_p = settings["price_max"]
if min_p != "None":
    url += "&priceMin=" + str(min_p)
if max_p != "None":
    url += "&priceMax=" + str(max_p)
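The manual concatenation above could also be written with `urllib.parse.urlencode`. This is a sketch, not part of the PR; the base URL and settings values are hypothetical:

```python
from urllib.parse import urlencode

# Hypothetical settings; None-valued filters are simply skipped.
settings = {"distance_radius": 15, "price_min": None, "price_max": 800_000}
params = {
    "distanceRadius": settings["distance_radius"],
    "priceMin": settings["price_min"],
    "priceMax": settings["price_max"],
}

# urlencode handles the ?/& separators and escaping in one place.
query = urlencode({k: v for k, v in params.items() if v is not None})
url = "https://www.otodom.pl/pl/wyniki?" + query
```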
@detker detker merged commit e96c6ad into master Dec 30, 2023
1 check passed