Commit eb553a5

committed
add download images tutorial & fixed web-scraping topic
1 parent b5bfe53 · commit eb553a5

13 files changed: +133 −3 lines changed

Diff for: README.md (+4 −3)

```diff
@@ -40,7 +40,8 @@ This is a repository of all the tutorials of [The Python Code](https://www.thepy
 - [How to Download Files in Python](https://www.thepythoncode.com/article/download-files-python). ([code](general/file-downloader))

 ### [Web Scraping](https://www.thepythoncode.com/topic/web-scraping)
-- [How to Access Wikipedia in Python](https://www.thepythoncode.com/article/access-wikipedia-python). ([code](general/wikipedia-extractor))
-- [How to Extract YouTube Data in Python](https://www.thepythoncode.com/article/get-youtube-data-python). ([code](general/youtube-extractor))
-- [How to Extract Weather Data from Google in Python](https://www.thepythoncode.com/article/extract-weather-data-python). ([code](general/weather-extractor))
+- [How to Access Wikipedia in Python](https://www.thepythoncode.com/article/access-wikipedia-python). ([code](web-scraping/wikipedia-extractor))
+- [How to Extract YouTube Data in Python](https://www.thepythoncode.com/article/get-youtube-data-python). ([code](web-scraping/youtube-extractor))
+- [How to Extract Weather Data from Google in Python](https://www.thepythoncode.com/article/extract-weather-data-python). ([code](web-scraping/weather-extractor))
+- [How to Download All Images from a Web Page in Python](https://www.thepythoncode.com/article/download-web-page-images-python). ([code](web-scraping/download-images))
```

Diff for: web-scraping/download-images/README.md (+26, new file)

# [How to Download All Images from a Web Page in Python](https://www.thepythoncode.com/article/download-web-page-images-python)
To run this:
- `pip3 install -r requirements.txt`
- To see the available options:
    ```
    python download_images.py --help
    ```
    **Output:**
    ```
    usage: download_images.py [-h] [-p PATH] url

    This script downloads all images from a web page

    positional arguments:
      url                   The URL of the web page you want to download images from

    optional arguments:
      -h, --help            show this help message and exit
      -p PATH, --path PATH  The directory you want to store your images in; the
                            default is the domain of the URL passed
    ```
- If you want to download all images from https://www.thepythoncode.com/topic/web-scraping, for example:
    ```
    python download_images.py https://www.thepythoncode.com/topic/web-scraping
    ```
    A new folder `www.thepythoncode.com` will be created automatically, containing all the images of that web page.
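The folder name in that last step is not arbitrary: when `-p/--path` is omitted, the script falls back to the network-location part of the URL as the directory name. A minimal sketch of that fallback, using only the standard library (the URL below is the one from the example above):

```python
from urllib.parse import urlparse

# when no --path is given, the script uses the URL's domain as the folder name
url = "https://www.thepythoncode.com/topic/web-scraping"
default_folder = urlparse(url).netloc
print(default_folder)  # www.thepythoncode.com
```

This is why scraping any page under the same domain reuses the same output folder unless `--path` is passed explicitly.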

Diff for: web-scraping/download-images/download_images.py (+100, new file)

```python
import requests
import os
from tqdm import tqdm
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin, urlparse


def is_absolute(url):
    """
    Determines whether a `url` is absolute.
    """
    return bool(urlparse(url).netloc)


def is_valid(url):
    """
    Checks whether `url` is a valid URL.
    """
    parsed = urlparse(url)
    return bool(parsed.netloc) and bool(parsed.scheme)


def get_all_images(url):
    """
    Returns all image URLs found on a single `url`.
    """
    soup = bs(requests.get(url).content, "html.parser")
    urls = []
    for img in tqdm(soup.find_all("img"), "Extracting images"):
        img_url = img.attrs.get("src")
        if not img_url:
            # if img does not contain a src attribute, just skip it
            continue
        if not is_absolute(img_url):
            # if img has a relative URL, make it absolute by joining it with the page URL
            img_url = urljoin(url, img_url)
        # remove query strings from URLs like '/hsts-pixel.gif?c=3.2.5'
        try:
            pos = img_url.index("?")
            img_url = img_url[:pos]
        except ValueError:
            pass
        # finally, keep the URL only if it is valid
        if is_valid(img_url):
            urls.append(img_url)
    return urls


def download(url, pathname):
    """
    Downloads a file given a URL and puts it in the folder `pathname`.
    """
    # if the path doesn't exist, create that directory
    if not os.path.isdir(pathname):
        os.makedirs(pathname)
    # download the body of the response by chunks, not immediately
    response = requests.get(url, stream=True)
    # get the total file size
    file_size = int(response.headers.get("Content-Length", 0))
    # build the file name from the last part of the URL
    filename = os.path.join(pathname, url.split("/")[-1])
    # progress bar, changing the unit to bytes instead of iterations (tqdm's default)
    progress = tqdm(response.iter_content(1024), f"Downloading {filename}", total=file_size,
                    unit="B", unit_scale=True, unit_divisor=1024)
    with open(filename, "wb") as f:
        for data in progress:
            # write the chunk read to the file
            f.write(data)
            # update the progress bar manually
            progress.update(len(data))


def main(url, path):
    # get all image URLs on the page
    imgs = get_all_images(url)
    for img in imgs:
        # download each image into `path`
        download(img, path)


if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser(description="This script downloads all images from a web page")
    parser.add_argument("url", help="The URL of the web page you want to download images from")
    parser.add_argument("-p", "--path",
                        help="The directory you want to store your images in; the default is the domain of the URL passed")

    args = parser.parse_args()
    url = args.url
    path = args.path

    if not path:
        # if the path isn't specified, use the domain name of the URL as the folder name
        path = urlparse(url).netloc

    main(url, path)
```

Diff for: web-scraping/download-images/requirements.txt (+3, new file)

```
requests
bs4
tqdm
```
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
