PDF Downloader and Metadata Updater

This project is a modular Python program that downloads PDF files from a list of URLs in an Excel file and updates a metadata file to reflect the download status of each one. The program follows the principle of separation of concerns, dividing responsibilities into distinct, reusable modules. Downloads run in multiple threads to improve throughput.

Project Overview

The purpose of this project is to:

Validate URLs: Read URLs from a primary column, falling back on a secondary column if necessary.
Download PDFs: Save each valid PDF with a unique identifier (BRnum) as a prefix.
Track Download Status: Log the status of each PDF (Downloaded or Not Downloaded) in a separate metadata Excel file.
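
The primary/secondary fallback described above can be sketched as follows. This is a minimal illustration, not the code in validate_urls.py: the column names come from the Usage section, while looks_like_url and pick_url are hypothetical helper names, and the check is purely syntactic (it does not contact the server).

```python
from urllib.parse import urlparse

PRIMARY_COLUMN = "Pdf_URL"
FALLBACK_COLUMN = "Report Html Address"


def looks_like_url(value):
    """Syntactic check only: a string with an http(s) scheme and a host."""
    if not isinstance(value, str):
        # Empty Excel cells typically arrive as NaN (a float), not a string.
        return False
    parts = urlparse(value.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def pick_url(row):
    """Return the primary URL if it looks usable, else the fallback, else None."""
    for column in (PRIMARY_COLUMN, FALLBACK_COLUMN):
        candidate = row.get(column)
        if looks_like_url(candidate):
            return candidate.strip()
    return None


row = {"Pdf_URL": float("nan"), "Report Html Address": "http://example.com/report"}
chosen = pick_url(row)  # falls back to the secondary column
```

A row for which pick_url returns None has no usable link in either column and would be recorded as not downloaded.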

Features

Excel Integration: Reads URLs and metadata from Excel files using pandas.
Multithreading: Utilizes Python's ThreadPoolExecutor for parallel processing of download tasks.
Robust Validation: Validates URLs before attempting downloads, so that invalid links are skipped early instead of failing mid-download.
Metadata Management: Tracks download statuses (e.g., "Downloaded", "Not downloaded") and updates the metadata file.
Error Handling: Logs errors for invalid URLs, failed downloads, or metadata update issues.
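
To illustrate how ThreadPoolExecutor can combine parallel downloads, status tracking, and error logging, here is a small sketch. It is not the contents of threaded_executor.py: run_downloads and fake_download are hypothetical names, and the download step is replaced by a stand-in function so the example runs without network access.

```python
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_downloads(tasks, download, max_workers=8):
    """Call download(brnum, url) for every task in parallel.

    Returns a dict mapping each BRnum to "Downloaded" or "Not downloaded".
    """
    statuses = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember which BRnum each future belongs to.
        futures = {pool.submit(download, brnum, url): brnum for brnum, url in tasks}
        for future in as_completed(futures):
            brnum = futures[future]
            try:
                future.result()  # re-raises any exception from the worker
                statuses[brnum] = "Downloaded"
            except Exception as exc:
                logging.warning("%s: download failed (%s)", brnum, exc)
                statuses[brnum] = "Not downloaded"
    return statuses


def fake_download(brnum, url):
    """Stand-in for a real download; fails for anything that is not a .pdf link."""
    if not url.endswith(".pdf"):
        raise ValueError("not a PDF link")


tasks = [
    ("BR001", "https://example.com/a.pdf"),
    ("BR002", "https://example.com/page.html"),
]
statuses = run_downloads(tasks, fake_download)
```

Because each failure is caught per task, one bad URL does not stop the remaining downloads.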

Project Structure

project/
│
├── data/
│   ├── GRI_2017_2020.xlsx     # Source Excel file with URLs and BRnum entries
│   └── test.xlsx              # Lightweight Excel file used for testing
│
├── downloads/
│   └── ...                    # Folder where downloaded PDFs will be stored
│
├── metadata/
│   └── Metadata2024.xlsx      # Metadata file tracking download statuses
│
├── src/
│   ├── __pycache__/           # Compiled Python files
│   ├── download_pdf.py        # Module to handle PDF downloads
│   ├── load_excel.py          # Module to load the source Excel file
│   ├── main.py                # Main script orchestrating the workflow
│   ├── threaded_executor.py   # Multithreading for URL validation and downloads
│   ├── update_metadata.py     # Metadata update management
│   └── validate_urls.py       # URL validation module
│
├── tests/
│   ├── test_download_pdf.py   # Tests for the download_pdf module
│   ├── test_placeholder.py    # Placeholder test file
│   ├── test_update_metadata.py # Tests for the update_metadata module
│   └── test_validate_urls.py  # Tests for the validate_urls module
│
├── venv/                      # Virtual environment for dependencies
│
├── .gitignore                 # Git ignored files configuration
├── LICENSE                    # Project license
├── README.md                  # Project documentation
├── pytest.ini                 # Pytest configuration file
└── requirements.txt           # Python dependencies

Installation

Prerequisites

Python 3.12 or higher
pip (Python package installer)

Steps

Clone this repository:

git clone https://github.com/jmbab/PDF_Downloader
cd PDF_Downloader

Set up a virtual environment:

python -m venv venv
source venv/bin/activate    # On macOS/Linux
venv\Scripts\activate       # On Windows

Install dependencies:

pip install -r requirements.txt

Usage

Prepare the Excel File:

Place the Excel file with URLs in the data/ folder (e.g., GRI_2017_2020.xlsx).
Ensure columns for Pdf_URL (primary) and Report Html Address (alternative) exist.
Ensure a BRnum column contains unique identifiers for each row.
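
Before running the script, it can help to confirm that the expected columns are present. The sketch below assumes the three column names listed above; missing_columns is a hypothetical helper, not part of load_excel.py. With pandas, the column names could be passed in as df.columns.

```python
REQUIRED_COLUMNS = ("BRnum", "Pdf_URL", "Report Html Address")


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the sheet."""
    # Strip stray whitespace, which is easy to introduce in Excel headers.
    present = {str(name).strip() for name in columns}
    return [name for name in required if name not in present]


header = ["BRnum", "Pdf_URL", "Title"]
missing = missing_columns(header)
if missing:
    print("Missing columns:", ", ".join(missing))
```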

Run the Main Script:

python src/main.py

Output:

Downloaded PDFs will appear in the downloads/ folder.
Metadata updates will be reflected in metadata/Metadata2024.xlsx.
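
As an illustration of how a BRnum-prefixed file name could be built, consider the sketch below. The exact naming scheme used by download_pdf.py is not documented here, so the pattern BRnum_originalname.pdf, the target_path helper, and the character sanitizing are all assumptions.

```python
import re
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlparse


def target_path(brnum, url, downloads_dir="downloads"):
    """Build downloads/<BRnum>_<name from URL>.pdf (assumed naming scheme)."""
    # Last path segment of the URL, decoded, e.g. "report 2019.pdf".
    url_name = PurePosixPath(unquote(urlparse(url).path)).name
    stem = PurePosixPath(url_name).stem
    # Replace anything outside a conservative character set with "_".
    safe_stem = re.sub(r"[^A-Za-z0-9._-]+", "_", stem).strip("_")
    filename = f"{brnum}_{safe_stem}.pdf" if safe_stem else f"{brnum}.pdf"
    return Path(downloads_dir) / filename


path = target_path("BR123", "https://example.com/files/report%202019.pdf")
```

Prefixing with BRnum keeps files distinct even when two URLs end in the same file name.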

Requirements

Python 3.12 or later
Required packages specified in requirements.txt:
pandas
requests
openpyxl

Modules

main.py: Initializes the process and coordinates module actions.
load_excel.py: Loads and reads the Excel file.
validate_urls.py: Contains functions to validate URLs.
download_pdf.py: Handles downloading and naming PDFs.
update_metadata.py: Updates the metadata log with each PDF’s download status.
threaded_executor.py: Manages multi-threading for faster execution.

Known Issues

Slow URL Validation: URL validation can be slow when servers are unresponsive. Lowering the timeout parameter in validate_urls.py speeds it up, but may cause slow servers to be treated as invalid.
Redundant Entries in Metadata: When the script encounters a duplicate BRnum, it overwrites the existing entry.
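
The duplicate-BRnum issue can be illustrated with a short sketch. It shows one plausible reading of the overwrite behaviour (the last result for a BRnum wins) and a way to spot duplicates beforehand; duplicate_brnums and merge_statuses are hypothetical names, not functions from update_metadata.py.

```python
from collections import Counter


def duplicate_brnums(brnums):
    """Return the BRnum values that occur more than once, sorted."""
    counts = Counter(brnums)
    return sorted(brnum for brnum, count in counts.items() if count > 1)


def merge_statuses(results):
    """Collapse (brnum, status) pairs into a dict; later pairs overwrite earlier ones."""
    statuses = {}
    for brnum, status in results:
        statuses[brnum] = status
    return statuses


results = [
    ("BR1", "Downloaded"),
    ("BR2", "Downloaded"),
    ("BR1", "Not downloaded"),  # duplicate BRnum: replaces the first result
]
duplicates = duplicate_brnums(brnum for brnum, _ in results)
merged = merge_statuses(results)
```

Under this reading, a successful download can be masked by a later failed attempt for the same BRnum, so checking for duplicates in the source file first avoids surprises.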

Contributing

Fork the repository.
Create a feature branch (git checkout -b feature/YourFeature).
Commit your changes (git commit -m 'Add YourFeature').
Push to the branch (git push origin feature/YourFeature).
Submit a pull request with your changes.

License

This project is open source and available under the MIT License.

Contact

For questions or feedback, feel free to reach out via email: babonneau[at]gmail[dot]com
