Commit e1c2cb8

committed Dec 3, 2019
added email extractor tutorial
1 parent 56f9836 commit e1c2cb8

4 files changed

+25
-0
lines changed


README.md

+1
@@ -62,6 +62,7 @@ This is a repository of all the tutorials of [The Python Code](https://www.thepy
 - [How to Extract Weather Data from Google in Python](https://www.thepythoncode.com/article/extract-weather-data-python). ([code](web-scraping/weather-extractor))
 - [How to Download All Images from a Web Page in Python](https://www.thepythoncode.com/article/download-web-page-images-python). ([code](web-scraping/download-images))
 - [How to Extract All Website Links in Python](https://www.thepythoncode.com/article/extract-all-website-links-python). ([code](web-scraping/link-extractor))
+- [How to Make an Email Extractor in Python](https://www.thepythoncode.com/article/extracting-email-addresses-from-web-pages-using-python). ([code](web-scraping/email-extractor))
 
 - ### [Python Standard Library](https://www.thepythoncode.com/topic/python-standard-library)
 - [How to Use Pickle for Object Serialization in Python](https://www.thepythoncode.com/article/object-serialization-saving-and-loading-objects-using-pickle-python). ([code](general/object-serialization))
+7
@@ -0,0 +1,7 @@
+# [How to Make an Email Extractor in Python](https://www.thepythoncode.com/article/extracting-email-addresses-from-web-pages-using-python)
+To run this:
+- `pip3 install -r requirements.txt`
+- To extract email addresses from the `"https://www.randomlists.com/email-addresses"` website and save them to the file `emails.txt`:
+```
+python email_harvester.py https://www.randomlists.com/email-addresses emails.txt
+```
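
The command above passes the URL and the output file as two positional arguments, which the script in this commit reads directly from `sys.argv`. A minimal sketch of the same interface built on `argparse`, so that a missing argument produces a usage message rather than an `IndexError`. The names `parse_args`, `url` and `output_file` are hypothetical and do not appear in the commit:

```python
import argparse


def parse_args(argv=None):
    """Parse the two positional arguments described in the README.

    `argv` defaults to None, in which case argparse reads sys.argv[1:].
    The argument names are illustrative, not taken from the commit.
    """
    parser = argparse.ArgumentParser(
        description="Extract email addresses from a web page."
    )
    parser.add_argument("url", help="address of the page to scan")
    parser.add_argument("output_file", help="file to append the addresses to")
    return parser.parse_args(argv)


if __name__ == "__main__":
    # Explicit argument list used here only to demonstrate the parser.
    args = parse_args(
        ["https://www.randomlists.com/email-addresses", "emails.txt"]
    )
    print(args.url, args.output_file)
```

Calling `parse_args()` with no argument list would read the real command line, which is the form a script would normally use.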
@@ -0,0 +1,16 @@
+import re
+from requests_html import HTMLSession
+import sys
+
+url = sys.argv[1]
+EMAIL_REGEX = r"""(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])"""
+
+# initiate an HTTP session
+session = HTMLSession()
+# get the HTTP Response
+r = session.get(url)
+# for JavaScript-driven websites
+r.html.render()
+with open(sys.argv[2], "a") as f:
+    for re_match in re.finditer(EMAIL_REGEX, r.html.raw_html.decode()):
+        print(re_match.group().strip(), file=f)
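
The script applies a lowercase-only pattern without `re.IGNORECASE`, so an address such as `John@Example.com` may be matched only partially or missed, and every repeated occurrence is appended to the output file again. A minimal sketch of the matching step as a separate, testable function. It assumes a deliberately simplified pattern (`SIMPLE_EMAIL_REGEX`, which is not the regex from the commit) and a hypothetical helper name `extract_emails`:

```python
import re

# Simplified pattern for illustration only. It is NOT the RFC 5322-style
# regex used in the commit and will reject some valid addresses.
SIMPLE_EMAIL_REGEX = re.compile(
    r"[a-z0-9._%+-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)+",
    re.IGNORECASE,
)


def extract_emails(text):
    """Return unique addresses in order of first appearance.

    Addresses are lowercased before comparison; treating the local part
    as case-insensitive is a simplifying assumption of this sketch.
    """
    seen = set()
    emails = []
    for match in SIMPLE_EMAIL_REGEX.finditer(text):
        email = match.group().lower()
        if email not in seen:
            seen.add(email)
            emails.append(email)
    return emails


if __name__ == "__main__":
    sample = (
        "Contact John@Example.com or jane.doe@mail.example.org; "
        "john@example.com again."
    )
    print(extract_emails(sample))
```

Keeping the extraction separate from the HTTP request lets the pattern be checked against fixed strings without fetching or rendering a page.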
@@ -0,0 +1 @@
+requests-html

0 commit comments
