This project is a modular Python program designed to download PDF files from a list of URLs in an Excel file and update a metadata file to reflect the download status. The program adheres to the principle of separation of concerns, dividing responsibilities into distinct, reusable modules. This setup includes multi-threading for improved performance.
- Project Overview
- Project Structure
- Installation
- Usage
- Requirements
- Modules
- Known Issues
- Contributing
The purpose of this project is to:
- Validate URLs: Read URLs from a primary column, falling back on a secondary column if necessary.
- Download PDFs: Save each valid PDF with a unique identifier (BRnum) as a prefix.
- Track Download Status: Log the status of each PDF ("Downloaded" or "Not downloaded") in a separate metadata Excel file.
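The primary/fallback column logic described above can be sketched as follows. This is a minimal illustration rather than the code in `validate_urls.py`: the helper names `looks_like_url` and `pick_url` are hypothetical, and the check is purely syntactic (no network request), whereas the real module may also probe the URL.

```python
from urllib.parse import urlparse


def looks_like_url(value):
    """Return True if value is a string with an http(s) scheme and a host."""
    if not isinstance(value, str):
        return False  # e.g. an empty Excel cell, which pandas reads as NaN
    parts = urlparse(value.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def pick_url(row):
    """Prefer Pdf_URL, fall back on Report Html Address, else return None.

    `row` is any mapping from column name to cell value (hypothetical shape).
    """
    for column in ("Pdf_URL", "Report Html Address"):
        value = row.get(column)
        if looks_like_url(value):
            return value.strip()
    return None
```

A row whose `Pdf_URL` cell is empty but whose `Report Html Address` holds a well-formed link would resolve to the fallback link; a row with neither resolves to `None` and can be marked as not downloaded without attempting a request.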
- Excel Integration: Reads URLs and metadata from Excel files using `pandas`.
- Multithreading: Utilizes Python's `ThreadPoolExecutor` for parallel processing of download tasks.
- Robust Validation: Validates URLs before attempting downloads to ensure efficiency and error handling.
- Metadata Management: Tracks download statuses (e.g., "Downloaded", "Not downloaded") and updates the metadata file.
- Error Handling: Logs errors for invalid URLs, failed downloads, or metadata update issues.
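As a rough sketch of how the multithreading, status tracking, and error logging features can fit together, the function below fans download jobs out over a `ThreadPoolExecutor` and collects one status string per `BRnum`. The names `download_all` and `download_one`, and the default of 8 workers, are assumptions made for this example; they are not taken from `threaded_executor.py`.

```python
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed

logger = logging.getLogger(__name__)


def download_all(jobs, download_one, max_workers=8):
    """Run download_one(brnum, url) for every job in parallel.

    jobs: iterable of (brnum, url) pairs.
    download_one: hypothetical callable returning True on success; it may raise.
    Returns a dict mapping each BRnum to "Downloaded" or "Not downloaded".
    """
    statuses = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember which future belongs to which BRnum.
        futures = {pool.submit(download_one, brnum, url): brnum
                   for brnum, url in jobs}
        for future in as_completed(futures):
            brnum = futures[future]
            try:
                ok = future.result()
            except Exception as exc:  # one failed job must not stop the rest
                logger.error("Download failed for %s: %s", brnum, exc)
                ok = False
            statuses[brnum] = "Downloaded" if ok else "Not downloaded"
    return statuses
```

Catching exceptions per future keeps a single bad URL from aborting the whole batch, and the returned dictionary is a convenient input for the metadata update step.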
project/
│
├── data/
│ ├── GRI_2017_2020.xlsx # Source Excel file with URLs and BRnum entries
│ └── test.xlsx # Lightweight Excel file used for testing
│
├── downloads/
│ └── ... # Folder where downloaded PDFs will be stored
│
├── metadata/
│ └── Metadata2024.xlsx # Metadata file tracking download statuses
│
├── src/
│ ├── __pycache__/ # Compiled Python files
│ ├── download_pdf.py # Module to handle PDF downloads
│ ├── load_excel.py # Module to load the source Excel file
│ ├── main.py # Main script orchestrating the workflow
│ ├── threaded_executor.py # Multithreading for URL validation and downloads
│ ├── update_metadata.py # Metadata update management
│ └── validate_urls.py # URL validation module
│
├── tests/
│ ├── test_download_pdf.py # Tests for the download_pdf module
│ ├── test_placeholder.py # Placeholder test file
│ ├── test_update_metadata.py # Tests for the update_metadata module
│ └── test_validate_urls.py # Tests for the validate_urls module
│
├── venv/ # Virtual environment for dependencies
│
├── .gitignore # Git ignored files configuration
├── LICENSE # Project license
├── README.md # Project documentation
├── pytest.ini # Pytest configuration file
└── requirements.txt # Python dependencies
- Python 3.12 or higher
- `pip` (Python package installer)
- Clone this repository:

      git clone https://github.com/jmbab/PDF_Downloader
      cd project

- Set up a virtual environment:

      python -m venv venv
      source venv/bin/activate   # On macOS/Linux
      venv\Scripts\activate      # On Windows

- Install dependencies:

      pip install -r requirements.txt

- Prepare the Excel File:
  - Place the Excel file with URLs in the `data/` folder (e.g., `GRI_2017_2020.xlsx`).
  - Ensure columns for `Pdf_URL` (primary) and `Report Html Address` (alternative) exist.
  - Ensure a `BRnum` column contains unique identifiers for each row.
- Run the Main Script:

      python main.py

- Output:
  - Downloaded PDFs will appear in the `downloads/` folder.
  - Metadata updates will be reflected in `metadata/Metadata2024.xlsx`.
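Before running the script, it can help to confirm that the source file really has the three columns listed above. The check below is a small sketch, not part of the project; `missing_columns` is a hypothetical helper that takes any iterable of column names, for example the `.columns` of a DataFrame returned by `pandas.read_excel`.

```python
# Column names the usage instructions above require in the source Excel file.
REQUIRED_COLUMNS = ("BRnum", "Pdf_URL", "Report Html Address")


def missing_columns(columns):
    """Return the required column names that are absent, in a stable order."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]
```

An empty result means the file is ready; otherwise the returned names say which headers to add or rename.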
- Python 3.12 or later
- Required packages specified in `requirements.txt`:
  - pandas
  - requests
  - openpyxl
- `main.py`: Initializes the process and coordinates module actions.
- `load_excel.py`: Loads and reads the Excel file.
- `validate_urls.py`: Contains functions to validate URLs.
- `download_pdf.py`: Handles downloading and naming PDFs.
- `update_metadata.py`: Updates the metadata log with each PDF's download status.
- `threaded_executor.py`: Manages multi-threading for faster execution.
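To illustrate the naming step handled by `download_pdf.py`, here is a hedged sketch of saving already-fetched bytes under a `BRnum`-based file name. The function name `save_pdf`, the exact `<BRnum>.pdf` naming scheme, and the `%PDF` signature check are all assumptions for this example; the actual module may name and validate files differently.

```python
from pathlib import Path


def save_pdf(brnum, content, out_dir="downloads"):
    """Write content to <out_dir>/<brnum>.pdf if it looks like a PDF.

    Returns the written Path, or None when the bytes are not a PDF.
    """
    # PDF files begin with the ASCII signature "%PDF".
    if not content.startswith(b"%PDF"):
        return None  # e.g. an HTML error page returned instead of the report
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)  # create downloads/ on first use
    target = folder / f"{brnum}.pdf"
    target.write_bytes(content)
    return target
```

Checking the file signature rather than trusting the URL or the response headers guards against servers that return a web page where a PDF was expected.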
- Slow URL Validation: Adjusting the `timeout` parameter in `validate_urls.py` may speed up or slow down URL validation.
- Redundant Entries in Metadata: The script will overwrite existing entries when it encounters duplicate `BRnum` values.
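Because duplicate `BRnum` values lead to overwrites, one option is to list them before a run so they can be fixed in the source file. The helper below is a sketch with a hypothetical name (`duplicate_brnums`); it works on any iterable of identifiers, such as the values of the `BRnum` column.

```python
from collections import Counter


def duplicate_brnums(brnums):
    """Return the identifiers that occur more than once, sorted."""
    counts = Counter(brnums)
    return sorted(brnum for brnum, n in counts.items() if n > 1)
```

For example, `duplicate_brnums(["BR1", "BR2", "BR1"])` returns `["BR1"]`, which points at the rows whose status would otherwise be overwritten.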
- Fork the repository.
- Create a feature branch (`git checkout -b feature/YourFeature`).
- Commit your changes (`git commit -m 'Add YourFeature'`).
- Push to the branch (`git push origin feature/YourFeature`).
- Submit a pull request with your changes.
This project is open source and available under the MIT License.
For questions or feedback, feel free to reach out via email: babonneau[at]gmail[dot]com