GPU Performance Optimization & Benchmark #3442

pyek-bot · 2026-05-13T23:32:03Z

What is the latest guidance on GPU optimization for Docling? What throughput should I expect when parsing PDFs?

Are the suggestions below still valid?

Is the pypdfium2 backend still recommended as the faster option, or has it been replaced by the DoclingParsev4 backend?
Flash attention for VLM
Granite-Docling-258M instead of SmolVLM-256M?
Disable unnecessary pipeline features (generate_picture_images, generate_page_images)
Use RapidOCR with ocr_backend set to torch. Is EasyOCR more accurate, and can it use a GPU backend?

Looking for suggestions and ideas on optimization. I'm currently running benchmarks in parallel and will report back with my findings. Hoping to get a community answer or a dosu suggestion with the latest updates, as most issues and discussions seem outdated for 2026.

AI091 · 2026-05-14T08:47:42Z

From my experience in prod:
1. Time is dominated by other stages such as OCR and the table model, so changing layout models generally didn't matter in our usage.
2, 3. Granite Docling was much more usable than SmolDocling, but we found the normal pipeline more effective.
5. RapidOCR is much more VRAM-efficient and, in our use case, slightly more accurate than EasyOCR.

1 reply

pyek-bot May 14, 2026
Author

Thanks for the response! Appreciate it!

After profiling the pipeline, I did find that OCR and VLM are the major contributors to latency.
In my experience, SmolDocling is faster but Granite Docling is more accurate.

A few follow up questions:

What kind of throughput are you able to deliver in terms of pages per second?
What GPU utilization are you seeing on average and at peak? My GPU utilization peaks at 100% at intervals, but the average is only around 28% on a 24 GB VRAM card.
For VLM, are you using a local inference server as mentioned here? https://docling-project.github.io/docling/usage/gpu/#vlm-pipeline

pyek-bot · 2026-05-14T18:38:33Z

Reporting my results

Setup:

docling 2.93.0, PyTorch 2.12.0+cu130, Python 3.11
NVIDIA L4 (24GB VRAM) on AWS g6.xlarge
ThreadedStandardPdfPipeline
Corpus: 8 PDFs, 189 pages (mix of text, images, tables, academic papers)

What we tested:

OCR on/off × VLM on/off
Concurrency: 1, 2, 4
Batch sizes: 4, 64, 128, 256 (ocr_batch, layout_batch, table_batch)
Pipeline profiling enabled
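
For anyone reproducing the sweep, the test matrix above can be enumerated as a grid. This is a minimal sketch that assumes one batch-size value is applied to ocr_batch, layout_batch, and table_batch together; the dictionary keys are illustrative names, not Docling options. Under that assumption the grid has 2 × 2 × 3 × 4 = 48 configurations.

```python
from itertools import product

ocr_options = [False, True]
vlm_options = [False, True]
concurrency_levels = [1, 2, 4]
# Assumption: the same value is used for ocr_batch, layout_batch and table_batch.
batch_sizes = [4, 64, 128, 256]


def build_grid():
    """Return one dict per benchmark configuration (keys are illustrative)."""
    return [
        {"ocr": ocr, "vlm": vlm, "concurrency": conc, "batch_size": batch}
        for ocr, vlm, conc, batch in product(
            ocr_options, vlm_options, concurrency_levels, batch_sizes
        )
    ]


grid = build_grid()
print(len(grid))  # 2 * 2 * 3 * 4 = 48
```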

Results:

Best throughput: 3.3 pages/sec (OCR=off, VLM=off)
With OCR only: 1.5 pages/sec
With VLM only: 0.83 pages/sec
With both: 0.64 pages/sec
GPU utilization avg: 24-29% (peak 100% in bursts)
Batch size (4 → 256): zero effect on throughput. VRAM increases but performance identical.
Concurrency (1, 2, 4): zero effect on throughput.
VLM is 58% of pipeline time, OCR is 38%
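
A quick way to read the four throughput numbers is to convert them to seconds per page and check whether the OCR and VLM costs simply add up. The sketch below uses only the figures reported above. If the costs are additive, the predicted "both" time should land close to the measured one, which would suggest the two stages are not overlapping much in practice.

```python
# Reported throughput in pages per second.
throughput = {
    "baseline": 3.3,   # OCR=off, VLM=off
    "ocr_only": 1.5,
    "vlm_only": 0.83,
    "both": 0.64,
}

# Invert to seconds per page.
sec_per_page = {name: 1.0 / pps for name, pps in throughput.items()}

# Extra time per page attributable to each feature.
ocr_cost = sec_per_page["ocr_only"] - sec_per_page["baseline"]
vlm_cost = sec_per_page["vlm_only"] - sec_per_page["baseline"]

# Prediction if the two costs add up with no overlap.
predicted_both = sec_per_page["baseline"] + ocr_cost + vlm_cost

print(f"OCR adds {ocr_cost:.2f} s/page, VLM adds {vlm_cost:.2f} s/page")
print(f"predicted both: {predicted_both:.2f} s/page")
print(f"measured both:  {sec_per_page['both']:.2f} s/page")
```

With these inputs the additive prediction and the measured value agree to within about 0.01 s/page.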

Key findings:

Batch size has zero effect — identical throughput from batch=4 to batch=256 across all 48 runs. VRAM increases (1.6GB → 21GB) but no performance gain.
Concurrency has zero effect — doc_batch_concurrency 1, 2, 4 all produce the same results (GIL limitation).
GPU utilization never exceeds 29% average regardless of configuration. Peak hits 100% in bursts during model forward passes.
VLM is the dominant bottleneck (58% of pipeline time), followed by OCR (38%).

Pipeline profiling breakdown (OCR=on, VLM=on):

VLM (doc_enrich): 170s
OCR: 113s
table_structure: 44s
page_parse: 43s
layout: 10s
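
The stage times sum to more than the wall time implied by the throughput, so the 58% and 38% figures look like shares of wall-clock time, not shares of the stage total. That is one reading consistent with the numbers, not something the profiler output states. The sketch below assumes wall time is 189 pages divided by 0.64 pages/sec.

```python
pages = 189
throughput_both = 0.64  # pages/sec with OCR=on, VLM=on

stage_seconds = {
    "VLM (doc_enrich)": 170,
    "OCR": 113,
    "table_structure": 44,
    "page_parse": 43,
    "layout": 10,
}

# Assumption: wall time is derived from the reported throughput.
wall_seconds = pages / throughput_both

# Each stage as a share of wall-clock time.
share_of_wall = {stage: secs / wall_seconds for stage, secs in stage_seconds.items()}

# A ratio above 1 means stage time exceeds wall time, i.e. stages overlap.
overlap_ratio = sum(stage_seconds.values()) / wall_seconds

for stage, share in share_of_wall.items():
    print(f"{stage}: {share:.0%} of wall time")
print(f"stage total / wall time = {overlap_ratio:.2f}")
```

A ratio above 1 is expected for a threaded pipeline, since stages for different pages can run at the same time.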

Further questions:

Is there a way to raise average GPU utilization above 29% on a single GPU, given that peak utilization already reaches 100%?
Does the threaded pipeline actually batch pages at each stage, or does it process them one at a time through the queue?
Any plans for multiprocessing support (separate processes instead of threads to bypass GIL)?

yudin-s · 2026-05-16T20:24:14Z

Your benchmark suggests the pipeline is not GPU-saturated; it is stage-bound.

The important clue is:

GPU avg: 24-29%, peak 100%
batch size: no throughput effect
doc_batch_concurrency: no throughput effect
VLM + OCR dominate wall time

That usually means the GPU work happens in short bursts, but the full pipeline is waiting on CPU work, preprocessing, image conversion, Python scheduling, model handoff, or per-page orchestration between bursts. In that situation, increasing batch size can raise VRAM usage without improving throughput because the stages are not actually feeding large batches into the model continuously.
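
To make the burst pattern concrete, here is a small sketch that summarizes a utilization trace into average, peak, and the fraction of samples in which the GPU was busy. The trace is made up for illustration and the 50% busy threshold is an arbitrary assumption. A GPU that is fully busy for about 30% of the samples and idle otherwise shows a 100% peak with an average near 30%, which is the shape reported above.

```python
def summarize_utilization(samples, busy_threshold=50):
    """Summarize GPU utilization samples (percentages taken at a fixed interval)."""
    if not samples:
        raise ValueError("need at least one sample")
    busy = sum(1 for s in samples if s >= busy_threshold)
    return {
        "avg": sum(samples) / len(samples),
        "peak": max(samples),
        "busy_fraction": busy / len(samples),
    }


# Synthetic trace: three busy samples out of ten, idle the rest of the time.
trace = [100, 100, 100, 0, 0, 0, 0, 0, 0, 0]
summary = summarize_utilization(trace)
print(summary)
```

A low busy fraction with a high peak points at gaps between kernels, not at a GPU that is too slow.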

For optimization, I would separate three questions:

Do you need OCR?
If the PDFs already contain text, turning OCR off is usually the largest win.
Do you need VLM enrichment for every page?
Your numbers show VLM is the dominant cost. A routing strategy can help: run the normal pipeline first, then use VLM only for pages with figures, scanned content, or low-confidence extraction.
Are you optimizing single-document latency or batch throughput?
For throughput, multiple independent worker processes may help more than threads if the bottleneck is Python-side orchestration/GIL/queueing. For latency of one PDF, process-level parallelism may not help as much.
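
The routing idea from the second question could look roughly like this. It is a sketch only: `PageSignals`, its fields, and the 0.8 confidence threshold are hypothetical and would need to be filled from whatever the first pass of your pipeline actually reports.

```python
from dataclasses import dataclass


@dataclass
class PageSignals:
    """Hypothetical per-page signals collected from a first, non-VLM pass."""
    page_no: int
    has_figures: bool
    is_scanned: bool
    extraction_confidence: float  # 0.0 to 1.0, assumed to be available


def needs_vlm(page, min_confidence=0.8):
    """Send a page to VLM enrichment only when the cheap pass looks insufficient."""
    return (
        page.has_figures
        or page.is_scanned
        or page.extraction_confidence < min_confidence
    )


pages = [
    PageSignals(1, has_figures=False, is_scanned=False, extraction_confidence=0.97),
    PageSignals(2, has_figures=True, is_scanned=False, extraction_confidence=0.95),
    PageSignals(3, has_figures=False, is_scanned=False, extraction_confidence=0.55),
]
vlm_pages = [p.page_no for p in pages if needs_vlm(p)]
print(vlm_pages)  # [2, 3]
```

The saving depends entirely on how many pages fall through to the VLM, so it is worth measuring that fraction on your own corpus first.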

I would also benchmark by stage with a bigger corpus split into homogeneous groups:

born-digital PDFs
scanned PDFs
table-heavy PDFs
image/figure-heavy PDFs
academic papers

Expected throughput will vary heavily by those categories, so a single pages/sec number can be misleading.

Based on your data, the next useful experiment is probably not batch_size=512; it is running N separate Docling worker processes against different PDFs on the same L4 and watching whether total GPU utilization rises. If it does, the current single pipeline is orchestration-bound. If it does not, the bottleneck is likely inside the OCR/VLM/model execution path itself.
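
A minimal harness for that experiment might look like the sketch below. It splits the PDF list across N operating-system processes and times the whole run. The worker command is a stand-in that exits immediately; in a real run you would replace it with your own conversion script (for example a hypothetical `convert_worker.py` that converts the PDFs passed in its arguments).

```python
import subprocess
import sys
import time


def shard(items, n_workers):
    """Round-robin split so each worker gets a similar number of PDFs."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    return [items[i::n_workers] for i in range(n_workers)]


def run_workers(pdf_paths, n_workers, worker_cmd):
    """Launch one process per non-empty shard; return (elapsed seconds, exit codes)."""
    start = time.perf_counter()
    procs = [
        subprocess.Popen([*worker_cmd, *chunk])
        for chunk in shard(pdf_paths, n_workers)
        if chunk
    ]
    exit_codes = [p.wait() for p in procs]
    return time.perf_counter() - start, exit_codes


# Stand-in worker that does nothing; swap in your own conversion script.
stand_in_cmd = [sys.executable, "-c", "pass"]
pdfs = ["a.pdf", "b.pdf", "c.pdf", "d.pdf", "e.pdf"]  # placeholder names
elapsed, exit_codes = run_workers(pdfs, 2, stand_in_cmd)
print(f"{len(exit_codes)} workers finished in {elapsed:.2f}s")
```

Run it with N = 1, 2, and 4 and compare total pages/sec while sampling utilization in another terminal, for example with `nvidia-smi --query-gpu=utilization.gpu --format=csv -l 1`. Each process loads its own copy of the models, so VRAM use will grow with N.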

RapidOCR also looks like the right default to test first if VRAM efficiency matters. I would only pay the EasyOCR cost if you have a measured accuracy win on your document set.
