Commit e7fe8ee

added link extractor tutorial
1 parent ad4d82d commit e7fe8ee

4 files changed, +149 -0 lines changed

Diff for: README.md

+1

```diff
@@ -46,5 +46,6 @@
 - [How to Extract YouTube Data in Python](https://www.thepythoncode.com/article/get-youtube-data-python). ([code](web-scraping/youtube-extractor))
 - [How to Extract Weather Data from Google in Python](https://www.thepythoncode.com/article/extract-weather-data-python). ([code](web-scraping/weather-extractor))
 - [How to Download All Images from a Web Page in Python](https://www.thepythoncode.com/article/download-web-page-images-python). ([code](web-scraping/download-images))
+- [How to Extract All Website Links in Python](https://www.thepythoncode.com/article/extract-all-website-links-python). ([code](web-scraping/link-extractor))
 
 For any feedback, please consider pulling requests.
```

Diff for: web-scraping/link-extractor/README.md

+37

# [How to Extract All Website Links in Python](https://www.thepythoncode.com/article/extract-all-website-links-python)

To run this:

- `pip3 install -r requirements.txt`
- Get the available options:
    ```
    python link_extractor.py --help
    ```
    **Output:**
    ```
    usage: link_extractor.py [-h] [-m MAX_URLS] url

    Link Extractor Tool with Python

    positional arguments:
      url                   The URL to extract links from.

    optional arguments:
      -h, --help            show this help message and exit
      -m MAX_URLS, --max-urls MAX_URLS
                            Number of max URLs to crawl, default is 30.
    ```
- For instance, to extract all links from the first 2 URLs that appear on github.com:
    ```
    python link_extractor.py https://github.com -m 2
    ```
    This will result in a long list; here are the last 5 links:
    ```
    [!] External link: https://developer.github.com/
    [*] Internal link: https://help.github.com/
    [!] External link: https://github.blog/
    [*] Internal link: https://help.github.com/articles/github-terms-of-service/
    [*] Internal link: https://help.github.com/articles/github-privacy-statement/
    [+] Total Internal links: 85
    [+] Total External links: 21
    [+] Total URLs: 106
    ```
    This will also save these URLs in `github.com_external_links.txt` for external links and `github.com_internal_links.txt` for internal links.
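
Since the script writes one URL per line, the saved files are easy to reuse later. A minimal sketch in Python, assuming the `github.com` crawl above has already been run so `github.com_internal_links.txt` exists:

```python
# load the saved internal links back into a set (one URL per line)
with open("github.com_internal_links.txt") as f:
    internal_links = {line.strip() for line in f if line.strip()}

print(f"[+] Loaded {len(internal_links)} internal links")
```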

Diff for: web-scraping/link-extractor/link_extractor.py

+108

```python
import requests
from urllib.parse import urlparse, urljoin
from bs4 import BeautifulSoup
import colorama

# init the colorama module
colorama.init()

GREEN = colorama.Fore.GREEN
GRAY = colorama.Fore.LIGHTBLACK_EX
RESET = colorama.Fore.RESET

# initialize the sets of links (unique links)
internal_urls = set()
external_urls = set()

total_urls_visited = 0


def is_valid(url):
    """
    Checks whether `url` is a valid URL.
    """
    parsed = urlparse(url)
    return bool(parsed.netloc) and bool(parsed.scheme)


def get_all_website_links(url):
    """
    Returns all URLs found on `url` that belong to the same website.
    """
    # all URLs of `url`
    urls = set()
    # domain name of the URL without the protocol
    domain_name = urlparse(url).netloc
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    for a_tag in soup.find_all("a"):
        href = a_tag.attrs.get("href")
        if href == "" or href is None:
            # href empty tag
            continue
        # join the URL if it's relative (not an absolute link)
        href = urljoin(url, href)
        parsed_href = urlparse(href)
        # remove URL GET parameters, URL fragments, etc.
        href = parsed_href.scheme + "://" + parsed_href.netloc + parsed_href.path
        if not is_valid(href):
            # not a valid URL
            continue
        if href in internal_urls:
            # already in the set
            continue
        if domain_name not in href:
            # external link
            if href not in external_urls:
                print(f"{GRAY}[!] External link: {href}{RESET}")
                external_urls.add(href)
            continue
        print(f"{GREEN}[*] Internal link: {href}{RESET}")
        urls.add(href)
        internal_urls.add(href)
    return urls


def crawl(url, max_urls=30):
    """
    Crawls a web page and extracts all links.
    You'll find all links in the `external_urls` and `internal_urls` global set variables.
    params:
        max_urls (int): number of max URLs to crawl, default is 30.
    """
    global total_urls_visited
    total_urls_visited += 1
    links = get_all_website_links(url)
    for link in links:
        if total_urls_visited > max_urls:
            break
        crawl(link, max_urls=max_urls)


if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser(description="Link Extractor Tool with Python")
    parser.add_argument("url", help="The URL to extract links from.")
    parser.add_argument("-m", "--max-urls", help="Number of max URLs to crawl, default is 30.", default=30, type=int)

    args = parser.parse_args()
    url = args.url
    max_urls = args.max_urls

    crawl(url, max_urls=max_urls)

    print("[+] Total Internal links:", len(internal_urls))
    print("[+] Total External links:", len(external_urls))
    print("[+] Total URLs:", len(external_urls) + len(internal_urls))

    domain_name = urlparse(url).netloc

    # save the internal links to a file
    with open(f"{domain_name}_internal_links.txt", "w") as f:
        for internal_link in internal_urls:
            print(internal_link.strip(), file=f)

    # save the external links to a file
    with open(f"{domain_name}_external_links.txt", "w") as f:
        for external_link in external_urls:
            print(external_link.strip(), file=f)
```
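
Because the command-line logic sits under the `if __name__ == "__main__"` guard, the crawler can also be reused from another script. A minimal sketch, assuming the file above is saved as `link_extractor.py` on the import path (the `crawl`, `internal_urls`, and `external_urls` names come from the module itself):

```python
# reuse the crawler programmatically instead of via the CLI
import link_extractor

# crawl a small number of pages; the module-level sets accumulate the results
link_extractor.crawl("https://github.com", max_urls=2)

print("internal:", len(link_extractor.internal_urls))
print("external:", len(link_extractor.external_urls))
```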

Diff for: web-scraping/link-extractor/requirements.txt

+3

```
requests
bs4
colorama
```
