This repository contains the early stages of the pracuj / otodom scrapers. Here we build the fundamentals for one big future project; the finished pieces will be merged into it.
The tasks can be found in the corresponding directories.

### Task 1

Let the user provide a link to a page of listings, for example [this one](https://www.otodom.pl/pl/wyniki/sprzedaz/mieszkanie/mazowieckie/warszawa/warszawa). You should be able to collect the information from the page and build a JSON (dict) from it as follows:
```json
{
    "url": "str",
    "otodom_id": "str",
    "title": "str",
    "localization": {
        "province": "str",
        "city": "str",
        "district": "str",
        "street": "str"
    },
    "promoted": "bool",
    "price": "int",
    "rooms": "int",
    "area": "int",
    "estate_agency": "str"
}
```
If a value is missing, you can leave it as an empty string.
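As a sketch of the target structure, the dict can be assembled with empty-string defaults for the missing fields. The field names come from the schema above; `raw` is a hypothetical intermediate dict holding whatever the scraper managed to extract from one listing card:

```python
# Sketch: assemble a listing dict in the agreed schema, filling any
# missing value with an empty string, as the task requires.
LOCALIZATION_FIELDS = ("province", "city", "district", "street")
TOP_FIELDS = ("url", "otodom_id", "title", "promoted",
              "price", "rooms", "area", "estate_agency")

def build_listing(raw: dict) -> dict:
    listing = {field: raw.get(field, "") for field in TOP_FIELDS}
    raw_loc = raw.get("localization", {})
    listing["localization"] = {f: raw_loc.get(f, "") for f in LOCALIZATION_FIELDS}
    return listing
```

This keeps the extraction logic (selectors, parsing) separate from the output shape, so the schema stays consistent even when individual fields fail to parse.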
#### Second stage
The bot should be able to iterate through all pages of listings. The listings should again be collected, and duplicates should be removed.
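One way to remove the duplicates, assuming `otodom_id` uniquely identifies a listing across pages (an assumption worth verifying against the site), is to key on that id and keep the first occurrence:

```python
def deduplicate(listings: list[dict]) -> list[dict]:
    # Keep the first occurrence of each otodom_id. Listings with an
    # empty id cannot be matched, so they are all kept.
    seen: set[str] = set()
    unique = []
    for listing in listings:
        key = listing.get("otodom_id", "")
        if key and key in seen:
            continue
        seen.add(key)
        unique.append(listing)
    return unique
```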
### Task 2
Create a **settings.json** file. It should contain the settings that define what the bot is going to scrape. An example may look like:
```json
{
    "base_url": "str",
    "price_min": "str",
    "price_max": "str",
    "city": "str",
    "property_type": "str",
    "only_for_sale": "bool",
    "only_for_rent": "bool",
    ...
}
```
and so on. Please try to include anything that may be useful. The URL should then be generated from these settings. Inspect how the site's URL changes as you apply different search parameters.
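As a starting point, the query string can be generated with `urllib.parse.urlencode`. The query parameter names below (`priceMin`, `priceMax`) are assumptions; check them against the actual URL otodom produces when you apply the corresponding filters:

```python
from urllib.parse import urlencode

def build_search_url(settings: dict) -> str:
    # Map settings.json keys to query parameters. The parameter names
    # here are assumptions -- verify them against a real filtered URL.
    param_map = {"price_min": "priceMin", "price_max": "priceMax"}
    params = {query_key: settings[settings_key]
              for settings_key, query_key in param_map.items()
              if settings.get(settings_key)}
    base = settings["base_url"]
    return base + "?" + urlencode(params) if params else base
```

Extending the bot to a new filter then only requires adding one entry to `param_map`.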
You can create your **solutions** in the **pracuj/task1/<your_name>** file and then open a pull request.
### Task 1

Let the user provide a link to a page of listings, for example [this one](https://www.pracuj.pl/praca/warszawa;wp?rd=30&cc=5016%2C5015&sal=1). **We only want to fetch listings within a given salary range.** You should be able to collect the information from the page and build a JSON (dict) from it as follows:
```json
{
    "url": "str",
    "pracuj_id": "str",
    "title": "str",
    "company": "str",
    "type_of_contract": "list[str]",
    "salary": "int",
    "specialization": "str",
    "operating_mode": "list[str]",
    "promoted": "bool"
}
```
If a value is missing, you can leave it as an empty string.
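Since only listings within a given salary range should be kept, a simple post-collection filter might look like this. It is a sketch: it assumes `salary` has already been parsed to an `int` per the schema above, and it drops listings where the salary is missing:

```python
def within_salary_range(listing: dict, salary_min: int, salary_max: int) -> bool:
    # Assumes "salary" was parsed to an int; listings with a missing
    # salary (empty string) are excluded from the results.
    salary = listing.get("salary", "")
    if not isinstance(salary, int):
        return False
    return salary_min <= salary <= salary_max
```

Whether salary-less listings should be dropped or kept is a design choice worth deciding up front.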
#### Second stage
The bot should be able to iterate through all pages of listings. The listings should again be collected, and duplicates should be removed.
### Task 2
Create a **settings.json** file. It should contain the settings that define what the bot is going to scrape. An example may look like:
```json
{
    "base_url": "str",
    "salary_min": "str",
    "salary_max": "str",
    "city": "str",
    "category": "str",
    ...
}
```
and so on. Please try to include anything that may be useful, starting with the most important settings. The URL should then be generated from these settings. Inspect how the site's URL changes as you apply different search parameters.
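For example, the settings file could be loaded and turned into a query string like this. The `sal` parameter appears in the example link above, but whether it carries a salary value is an assumption; compare against the URL pracuj.pl generates for the same filters:

```python
import json
from urllib.parse import urlencode

def load_settings(path: str) -> dict:
    # Read settings.json from disk.
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def build_search_url(settings: dict) -> str:
    # Assumed mapping of settings keys to pracuj.pl query parameters;
    # verify each against a real filtered URL from the site.
    params = {}
    if settings.get("salary_min"):
        params["sal"] = settings["salary_min"]
    base = settings["base_url"]
    return base + "?" + urlencode(params) if params else base
```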
You can create your **solutions** in the **pracuj/task1/<your_name>** file and then open a pull request.