This repository was archived by the owner on Nov 30, 2022. It is now read-only.

Commit fbd51da

Merge pull request #263 from kaustubhgupta/pdf_to_csv
PDF tables to CSV files
2 parents 48fa6ed + 92d52d6 commit fbd51da

4 files changed: 50 additions, 0 deletions
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+# Vscode files
+.vscode
+
+# Sample Files
+sample.pdf
+sample2.pdf
+
+# Python
+__pycache__
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+# PDF to CSV
+This script converts the tables in a PDF file into CSV files. Each CSV file holds one table, so the number of CSV files equals the number of tables in the PDF.
+
+# Requirements
+`pip install tabula-py pandas`
+
+# How to use?
+Run the script with the following command:
+
+`python app.py location_of_pdf pages`
+
+`pages` has two options:
+- `all` extracts tables from the whole PDF
+- a specific page number (e.g. 1, 2, or 54) extracts tables from that page
+
+Example:
+- `python app.py sample.pdf all`
+- `python app.py sample2.pdf 45`
+
+# Preview
+
+![](preview.gif)
Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
+import tabula
+import pandas as pd
+import sys
+
+def extract(path, number_pages):
+    tables = tabula.read_pdf(path, multiple_tables=True, pages=number_pages)
+    count = 1
+    if len(tables) != 0:
+        for table in tables:
+            # Each table is a pandas DataFrame; write it to its own CSV file.
+            print(f"Saving file - {count}")
+            table.to_csv(f'Table- {count}.csv')
+            count += 1
+        print("All tables saved as separate files!")
+    else:
+        print("No tables found!")
+
+if __name__ == "__main__":
+    extract(sys.argv[1], sys.argv[2])
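
The script reads `sys.argv[1]` and `sys.argv[2]` directly, so a missing or malformed `pages` argument only surfaces as an `IndexError` or as an error inside `tabula.read_pdf`. One possible way to validate the two arguments up front is sketched below. The helper names `parse_pages` and `build_parser` are hypothetical and not part of this commit, and accepting a comma-separated list of pages is an assumption that goes slightly beyond the README.

```python
import argparse


def parse_pages(value):
    """Hypothetical helper: accept 'all' or comma-separated page numbers."""
    if value == "all":
        return "all"
    try:
        pages = [int(part) for part in value.split(",")]
    except ValueError:
        raise argparse.ArgumentTypeError(f"invalid pages value: {value!r}")
    if any(page < 1 for page in pages):
        raise argparse.ArgumentTypeError("page numbers start at 1")
    return pages


def build_parser():
    """Hypothetical helper: command-line parser mirroring the README usage."""
    parser = argparse.ArgumentParser(description="Convert PDF tables to CSV files")
    parser.add_argument("path", help="location of the PDF file")
    parser.add_argument(
        "pages",
        type=parse_pages,
        help="'all' or page numbers such as 45 or 1,2,54",
    )
    return parser


# Example with an explicit argument list instead of sys.argv.
args = build_parser().parse_args(["sample.pdf", "1,2,54"])
print(args.path, args.pages)  # prints: sample.pdf [1, 2, 54]
```

The parsed `args.path` and `args.pages` could then be handed to `extract` in place of the raw `sys.argv` values, with argparse printing a usage message when an argument is missing or invalid.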