Commit 67b14ec

added extract links from pdf tutorial
1 parent 4d3c786 commit 67b14ec

7 files changed, +44 -0 lines changed

Diff for: README.md

+1
@@ -88,6 +88,7 @@ This is a repository of all the tutorials of [The Python Code](https://www.thepy
 - [How to Extract and Submit Web Forms from a URL using Python](https://www.thepythoncode.com/article/extracting-and-submitting-web-page-forms-in-python). ([code](web-scraping/extract-and-fill-forms))
 - [How to Get Domain Name Information in Python](https://www.thepythoncode.com/article/extracting-domain-name-information-in-python). ([code](web-scraping/get-domain-info))
 - [How to Extract YouTube Comments in Python](https://www.thepythoncode.com/article/extract-youtube-comments-in-python). ([code](web-scraping/youtube-comments-extractor))
+- [How to Extract All PDF Links in Python](https://www.thepythoncode.com/article/extract-pdf-links-with-python). ([code](web-scraping/pdf-url-extractor))
 
 - ### [Python Standard Library](https://www.thepythoncode.com/topic/python-standard-library)
 - [How to Transfer Files in the Network using Sockets in Python](https://www.thepythoncode.com/article/send-receive-files-using-sockets-python). ([code](general/transfer-files/))

Diff for: web-scraping/pdf-url-extractor/1710.05006.pdf

5.09 MB
Binary file not shown.

Diff for: web-scraping/pdf-url-extractor/1810.04805.pdf

757 KB
Binary file not shown.

Diff for: web-scraping/pdf-url-extractor/README.md

+4
@@ -0,0 +1,4 @@
+# [How to Extract All PDF Links in Python](https://www.thepythoncode.com/article/extract-pdf-links-with-python)
+To run this:
+- `pip3 install -r requirements.txt`
+- Use `pdf_link_extractor.py` to get clickable links, and `pdf_link_extractor_regex.py` to get links that are in text form.

Diff for: web-scraping/pdf-url-extractor/pdf_link_extractor.py

+15
@@ -0,0 +1,15 @@
+import pikepdf # pip3 install pikepdf
+
+file = "1810.04805.pdf"
+# file = "1710.05006.pdf"
+pdf_file = pikepdf.Pdf.open(file)
+urls = []
+# iterate over PDF pages; a page without annotations has no "/Annots" key
+for page in pdf_file.pages:
+    for annot in page.get("/Annots", []):
+        uri = annot.get("/A", {}).get("/URI")
+        if uri is not None:
+            print("[+] URL Found:", uri)
+            urls.append(uri)
+
+print("[*] Total URLs extracted:", len(urls))
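
The loop in `pdf_link_extractor.py` relies on how PDF link annotations are laid out: a page may carry an `/Annots` array, an annotation may carry an action dictionary under `/A`, and a URI action stores its target under `/URI`. The following is a minimal sketch of that traversal, not the committed script. `collect_uris` and `sample_pages` are hypothetical names, and the pages are plain Python dicts standing in for pikepdf objects so that the example runs without a PDF file.

```python
def collect_uris(pages):
    """Walk /Annots -> /A -> /URI on each page and return the URIs found."""
    urls = []
    for page in pages:
        annots = page.get("/Annots")
        if annots is None:
            # Page has no annotations at all
            continue
        for annot in annots:
            action = annot.get("/A")
            if action is None:
                # Annotation without an action (for example a highlight)
                continue
            uri = action.get("/URI")
            if uri is not None:
                # str() is a no-op for the plain strings used below
                urls.append(str(uri))
    return urls


# Hypothetical stand-in data, not read from a real PDF
sample_pages = [
    {
        "/Annots": [
            # External link: a URI action
            {"/Subtype": "/Link", "/A": {"/S": "/URI", "/URI": "https://github.com/google-research/bert"}},
            # Internal jump: an action without a /URI entry
            {"/Subtype": "/Link", "/A": {"/S": "/GoTo"}},
            # Annotation with no action dictionary
            {"/Subtype": "/Highlight"},
        ]
    },
    {},  # page without an /Annots entry
]

found = collect_uris(sample_pages)
print("[*] Total URLs extracted:", len(found))
```

Because the helper only calls `.get`, it should in principle also accept `pdf_file.pages` from the script above, though that is an assumption rather than something verified against every pikepdf version.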

Diff for: web-scraping/pdf-url-extractor/pdf_link_extractor_regex.py

+22
@@ -0,0 +1,22 @@
+import fitz # pip install PyMuPDF
+import re
+
+# a regular expression of URLs
+url_regex = r"https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)"
+# extract raw text from pdf
+# file = "1710.05006.pdf"
+file = "1810.04805.pdf"
+# open the PDF file
+with fitz.open(file) as pdf:
+    text = ""
+    for page in pdf:
+        # extract text of each PDF page
+        text += page.get_text()
+urls = []
+# extract all urls using the regular expression
+for match in re.finditer(url_regex, text):
+    url = match.group()
+    print("[+] URL Found:", url)
+    urls.append(url)
+print("[*] Total URLs extracted:", len(urls))
+
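
The pattern in `pdf_link_extractor_regex.py` contains two capturing groups, which affects how matches are read back. The sketch below runs the same pattern on made-up sample text to show what `match.group()` returns compared with `re.findall`, and to show one caveat of the final character class. The sample strings and the `trimmed` cleanup step are illustrative assumptions, not part of the commit.

```python
import re

# Same pattern as in pdf_link_extractor_regex.py
url_regex = r"https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)"

# Made-up text standing in for what a PDF page might contain
text = "Code: https://github.com/google-research/bert and data: http://www.example.org/data?id=1 (see paper)"

# match.group() returns the whole match, regardless of the capturing groups
full_matches = [m.group() for m in re.finditer(url_regex, text)]
print(full_matches)

# re.findall returns one tuple of group contents per match instead
group_tuples = re.findall(url_regex, text)
print(group_tuples)

# Caveat: the trailing character class includes ".", so a period that ends
# a sentence is captured as part of the URL
sentence = "See https://arxiv.org/abs/1810.04805."
raw = re.search(url_regex, sentence).group()
# Optional cleanup (an addition for illustration): drop trailing punctuation
trimmed = raw.rstrip(".,")
print(raw, "->", trimmed)
```

Reading matches through `re.finditer` and `match.group()` avoids having to reassemble URLs from the group tuples that `re.findall` produces for this pattern. Whether trailing punctuation should be stripped depends on the input, since a URL can legitimately end in a period.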

Diff for: web-scraping/pdf-url-extractor/requirements.txt

+2
@@ -0,0 +1,2 @@
+pikepdf
+PyMuPDF
