📄 PDF Outline Extractor (Adobe Hackathon Challenge-1A)

This project predicts the outline of a PDF (the title and H1–H4 headings) by extracting layout features from the PDF and feeding them to a machine‑learning classifier.

🚀 Features

Predicts heading labels via an XGBoost classifier
Outputs clean JSON outlines ready for semantic search and other downstream tasks
Ships as a lightweight, offline Docker image (≤ 200 MB, AMD64)

📂 Project Structure

Adobe-India-Hackathon25/
├── extract_features.py   # feature extraction from PDFs
├── infer.py              # runtime prediction pipeline
├── train.py              # model training script 
├── Dockerfile            # Docker container definition
├── .dockerignore         # trims build context
├── models/               # ⇢ trained model + encoders
│   ├── model_<ts>.pkl
│   ├── label_encoder.pkl
│   ├── onehot_encoder.pkl
│   └── feature_list.json
├── input/                # place PDFs here when running
└── output/               # JSON results appear here

🐳 Run via Docker

1 · Build Image

docker build -t pdf-outline .

2 · Prepare Folders

mkdir -p input output
# copy your PDFs into ./input

3 · Run Inference

docker run --rm \
  -v "$(pwd)/input:/app/input" \
  -v "$(pwd)/output:/app/output" \
  --network none \
  pdf-outline

Every *.pdf in input/ produces a *.json in output/.
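
The one-JSON-per-PDF mapping above can be sketched in a few lines. This is only an illustration, not the code in infer.py: the helper name plan_outputs is hypothetical, and it assumes each JSON file keeps the base name of its PDF (report.pdf becomes report.json).

```python
from pathlib import Path


def plan_outputs(input_dir, output_dir):
    """Pair each PDF in input_dir with the JSON path it would produce.

    Hypothetical helper: assumes report.pdf maps to report.json.
    Nothing is read or written; only the paths are computed.
    """
    in_dir, out_dir = Path(input_dir), Path(output_dir)
    pairs = []
    for pdf in sorted(in_dir.iterdir()):
        # Match the extension case-insensitively and skip non-PDF files.
        if pdf.is_file() and pdf.suffix.lower() == ".pdf":
            pairs.append((pdf, out_dir / (pdf.stem + ".json")))
    return pairs
```

Calling plan_outputs("input", "output") before a run is a quick way to see which files would be processed and where their results would land.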

💻 Local (Non‑Docker) Run

pip install pdfplumber pandas numpy regex rapidfuzz scikit-learn xgboost joblib
python infer.py input/ output/

✅ JSON Schema

{
  "title": "Document Title",
  "outline": [
    { "level": "H1", "text": "Section 1", "page": 1 },
    { "level": "H2", "text": "Subsection", "page": 2 }
  ]
}
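
A downstream consumer may want to check a result file against this shape before using it. The sketch below is one way to do that with the standard library only. The function name validate_outline is hypothetical, and the rules it enforces (levels limited to H1–H4, page numbers starting at 1) are assumptions drawn from the description in this README rather than from the project's code.

```python
import json

# Assumption: heading levels are limited to the H1-H4 range named in this README.
ALLOWED_LEVELS = {"H1", "H2", "H3", "H4"}


def validate_outline(doc):
    """Return a list of problems found in an outline document; empty means valid."""
    if not isinstance(doc, dict):
        return ["document must be a JSON object"]
    problems = []
    if not isinstance(doc.get("title"), str):
        problems.append("title must be a string")
    outline = doc.get("outline")
    if not isinstance(outline, list):
        return problems + ["outline must be a list"]
    for i, entry in enumerate(outline):
        if not isinstance(entry, dict):
            problems.append(f"outline[{i}] must be an object")
            continue
        if entry.get("level") not in ALLOWED_LEVELS:
            problems.append(f"outline[{i}].level must be one of H1-H4")
        if not isinstance(entry.get("text"), str):
            problems.append(f"outline[{i}].text must be a string")
        page = entry.get("page")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(page, bool) or not isinstance(page, int) or page < 1:
            problems.append(f"outline[{i}].page must be an integer >= 1")
    return problems


sample = json.loads(
    '{"title": "Document Title", "outline": '
    '[{"level": "H1", "text": "Section 1", "page": 1}]}'
)
print(validate_outline(sample))
```

Returning a list of problems instead of raising on the first one makes it easier to report everything wrong with a file in a single pass.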

📌 Implementation Notes

Feature set – font size, position, indent, spacing, bold/italic %, etc.
Model – XGBoost multi‑class (TITLE, H1‑H4) with class‑weighted loss.
Post‑processing – fixes curly quotes, ensures 1‑based pages, fills title fallback.
Constraints met – CPU‑only, offline, AMD64, runtime < 10 s on a 50‑page PDF.
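
The post‑processing step listed above could look roughly like the following. This is a sketch under stated assumptions, not the project's implementation: the function name postprocess is hypothetical, "fixing" curly quotes is assumed to mean replacing them with straight quotes, the parser is assumed to report zero‑based page indexes, and the title fallback is assumed to be the first H1 heading.

```python
# Assumption: curly quotes are normalized to their straight ASCII equivalents.
QUOTE_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single curly quotes
    "\u201c": '"', "\u201d": '"',   # double curly quotes
})


def postprocess(title, headings, fallback_title="Untitled", zero_based_pages=True):
    """Clean predicted headings and build the output document.

    headings: list of dicts with "level", "text", and "page" keys.
    zero_based_pages: assumption that the parser counts pages from 0.
    """
    cleaned = []
    for h in headings:
        page = h["page"] + 1 if zero_based_pages else h["page"]
        cleaned.append({
            "level": h["level"],
            "text": h["text"].translate(QUOTE_MAP).strip(),
            "page": max(1, page),  # never emit a page number below 1
        })
    title = (title or "").translate(QUOTE_MAP).strip()
    if not title:
        # Assumed fallback: the first H1 heading, else a fixed placeholder.
        title = next((h["text"] for h in cleaned if h["level"] == "H1"), fallback_title)
    return {"title": title, "outline": cleaned}
```

Keeping these fixes in one function after prediction means the classifier output stays untouched and the cleanup rules can be changed without retraining.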

Our Team

We are a cross-functional team of machine learning engineers, NLP researchers, full-stack developers, and software architects passionate about document intelligence. Our mission is to make complex document structures easily interpretable by building accurate, scalable, and user-friendly PDF outline extraction systems powered by AI.

GitHub Repository

The complete source code for this project is available on GitHub: itshivams/PDF-Outline-Extractor

Acknowledgment

Special thanks to Adobe India for organizing this hackathon.
