Commit 9776921

added pdf table extractor tutorial
1 parent 070ab98 commit 9776921

5 files changed: +33 -0 lines changed

Diff for: README.md

+1
@@ -47,6 +47,7 @@ This is a repository of all the tutorials of [The Python Code](https://www.thepy
 - [How to Generate and Read QR Code in Python](https://www.thepythoncode.com/article/generate-read-qr-code-python). ([code](general/generating-reading-qrcode))
 - [How to Download Files in Python](https://www.thepythoncode.com/article/download-files-python). ([code](general/file-downloader))
 - [How to Compress and Decompress Files in Python](https://www.thepythoncode.com/article/compress-decompress-files-tarfile-python). ([code](general/compressing-files))
+- [How to Extract PDF Tables in Python](https://www.thepythoncode.com/article/extract-pdf-tables-in-python-camelot). ([code](general/pdf-table-extractor))
 
 - ### [Web Scraping](https://www.thepythoncode.com/topic/web-scraping)
 - [How to Access Wikipedia in Python](https://www.thepythoncode.com/article/access-wikipedia-python). ([code](web-scraping/wikipedia-extractor))

Diff for: general/pdf-table-extractor/README.md

+8
@@ -0,0 +1,8 @@
+# [How to Extract PDF Tables in Python](https://www.thepythoncode.com/article/extract-pdf-tables-in-python-camelot)
+To run this:
+- Install the dependencies the library requires, listed [here](https://camelot-py.readthedocs.io/en/master/user/install-deps.html#install-deps).
+- `pip3 install -r requirements.txt`
+- Extract the tables from the file `foo.pdf`:
+```
+python pdf_table_extractor.py foo.pdf
+```
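
The command above passes the PDF path as the script's only argument, and the script reads it with `sys.argv[1]`, which raises an `IndexError` when the argument is omitted. A minimal sketch of stricter argument handling follows, assuming `argparse` from the standard library; the `parse_args` helper and the `.pdf` suffix check are illustrative and not part of this commit.

```python
import argparse
from pathlib import Path


def parse_args(argv):
    """Parse the command line; `argv` excludes the program name.

    Hypothetical helper: the committed script reads sys.argv[1] directly.
    """
    parser = argparse.ArgumentParser(
        description="Extract tables from a PDF file."
    )
    parser.add_argument(
        "pdf", type=Path, help="path to the PDF file, e.g. foo.pdf"
    )
    args = parser.parse_args(argv)
    # Reject anything that does not look like a PDF before handing it on.
    if args.pdf.suffix.lower() != ".pdf":
        parser.error(f"expected a .pdf file, got: {args.pdf}")
    return args
```

Calling `parse_args(sys.argv[1:])` at the top of the script would then print a usage message and exit, instead of failing with a traceback, when the path is missing or has the wrong extension.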

Diff for: general/pdf-table-extractor/foo.pdf

82.2 KB
Binary file not shown.

Diff for: general/pdf-table-extractor/pdf_table_extractor.py

+23
@@ -0,0 +1,23 @@
+import camelot
+import sys
+
+# PDF file to extract tables from (from command-line)
+file = sys.argv[1]
+
+# extract all the tables in the PDF file
+tables = camelot.read_pdf(file)
+
+# number of tables extracted
+print("Total tables extracted:", tables.n)
+
+# print the first table as Pandas DataFrame
+print(tables[0].df)
+
+# export individually
+tables[0].to_csv("foo.csv")
+
+# or export all in a zip
+tables.export("foo.csv", f="csv", compress=True)
+
+# export to HTML
+tables.export("foo.html", f="html")
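
The script above hard-codes the `foo.csv` and `foo.html` output names and indexes `tables[0]` without checking that anything was found. A slightly more defensive variant might look like the sketch below. It uses only the Camelot calls already present in the commit (`read_pdf`, `tables.n`, `to_csv`, `export`); the `output_paths` and `extract_tables` helpers are hypothetical names introduced for illustration.

```python
from pathlib import Path


def output_paths(pdf_path):
    """Derive output file names from the input PDF's name.

    Hypothetical helper: foo.pdf -> foo.csv and foo.html in the
    current working directory.
    """
    stem = Path(pdf_path).stem
    return {"csv": f"{stem}.csv", "html": f"{stem}.html"}


def extract_tables(pdf_path):
    """Extract and export tables; return the number of tables found."""
    # Imported lazily so output_paths() works without camelot installed.
    import camelot

    tables = camelot.read_pdf(str(pdf_path))
    print("Total tables extracted:", tables.n)
    if tables.n == 0:
        # tables[0] would raise IndexError on an empty result.
        return 0

    out = output_paths(pdf_path)
    print(tables[0].df)                                # first table as a DataFrame
    tables[0].to_csv(out["csv"])                       # export one table
    tables.export(out["csv"], f="csv", compress=True)  # export all, zipped
    tables.export(out["html"], f="html")               # export all as HTML
    return tables.n
```

As far as I know, `camelot.read_pdf` processes only the first page unless a `pages` argument is given, so a multi-page document would likely need something like `pages="all"`; treat that as an assumption to verify against the Camelot documentation for your installed version.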

Diff for: general/pdf-table-extractor/requirements.txt

+1
@@ -0,0 +1 @@
+camelot-py[cv]
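
Before running the extractor, it can help to confirm that the Python packages from `requirements.txt` are importable. The sketch below uses `importlib.util.find_spec` from the standard library; the `missing_modules` helper is hypothetical, and it assumes the `cv` extra supplies OpenCV under its usual import name `cv2`. It does not cover the system-level dependencies described on the linked install page.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


# "camelot" is the import name used by the script; "cv2" is assumed to be
# the module installed by the [cv] extra (OpenCV).
REQUIRED = ["camelot", "cv2"]
```

`missing_modules(REQUIRED)` returns an empty list when both packages are installed, and otherwise names the ones that `pip3 install -r requirements.txt` still needs to provide.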
