Omniparse v2 #112

Open

wants to merge 4 commits into main

Conversation

adithya-s-k (Owner) commented Aug 1, 2025

Summary by CodeRabbit

  • New Features

    • Switched document parsing and OCR models from Surya OCR/Marker to Docling, improving PDF, Word, and PPT file support.
    • Updated Docker and dependency management to use newer versions of PyTorch (CUDA 12.4), Transformers, and added Docling support.
    • Added .python-version file for consistent Python 3.10 environments.
  • Bug Fixes

    • Improved image extraction and encoding from documents for more reliable processing.
  • Documentation

    • Updated all references and instructions to reflect the use of Docling models.
    • Removed Skypilot deployment instructions.
    • Clarified dependency installation and model usage in README and docs.
  • Refactor

    • Replaced internal document and image parsing logic to use Docling’s APIs and data structures.
    • Updated project metadata and licensing to MIT and Docling attribution.
    • Streamlined dependency and package management in pyproject.toml.
  • Chores

    • Updated .gitignore to exclude .vscode/ directory.

coderabbitai bot commented Aug 1, 2025

Walkthrough

This update transitions the codebase from using Marker and Surya OCR models to Docling models for document parsing, updates licensing and attribution to MIT and Docling, and migrates PDF, PPT, and DOC parsing logic to Docling-based workflows. The project configuration is modernized, dependencies updated, and documentation revised to reflect these changes.

Changes

Cohort / File(s) Change Summary
Project Configuration & Dependency Management
.python-version, pyproject.toml, Dockerfile, .gitignore
Added Python version file. Migrated project config from Poetry to PEP 621 [project] format, updated dependencies, added CUDA 12.4 PyTorch support, and updated ignore rules for VSCode.
Documentation & Usage Instructions
README.md, docs/README.md, docs/installation.md, docs/deployment.md, examples/OmniParse_GoogleColab.ipynb
Replaced references to Surya OCR/Marker with Docling, updated instructions and acknowledgements, removed Skypilot deployment docs, and adjusted example notebook for new dependency management and model names.
Model Loading & Shared State
omniparse/__init__.py
Replaced Marker-based model loading with Docling equivalents, updated licensing, attribution, and added PDF pipeline configuration.
Document Parsing Logic
omniparse/documents/__init__.py (deleted), omniparse/documents/router.py
Deleted legacy Marker-based document parsing code. Refactored router to use Docling for PDF, PPT, DOC, and generic file parsing, updating metadata and image encoding logic.
Image Parsing
omniparse/image/__init__.py, omniparse/image/router.py
Updated image parsing to use Docling, changed function signatures to include image name, and updated endpoint to pass filename.
Image Encoding Utility
omniparse/utils.py
Refactored image encoding to operate on Docling document objects and use safe temporary file handling.
Media Handling
omniparse/media/__init__.py
Cleaned up unused imports and adjusted import style for video processing.
Demo & Usage Comments
omniparse/demo.py
Updated model description in CLI usage comments to reference Docling models.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant API as API (FastAPI)
    participant DoclingConverter
    participant Florence2Model
    participant Utils

    User->>API: Uploads PDF/PPT/DOC file
    API->>DoclingConverter: convert(file bytes)
    DoclingConverter-->>API: Parsed document object
    API->>DoclingConverter: Export to markdown
    DoclingConverter-->>API: Markdown text
    API->>Utils: encode_images(document_name, document, responseDocument)
    Utils-->>API: Images encoded in response
    API-->>User: Returns responseDocument with text, metadata, images
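
For readers new to Docling, a minimal sketch of the flow the diagram describes (the file name is a placeholder; the real endpoints additionally configure PdfFormatOption pipeline options in omniparse/__init__.py):

from io import BytesIO

from docling.datamodel.base_models import DocumentStream
from docling.document_converter import DocumentConverter

# Bare converter; the PR layers PDF pipeline options on top of this
converter = DocumentConverter()

# The endpoints wrap uploaded bytes in a DocumentStream instead of passing a path
with open("sample.pdf", "rb") as f:  # "sample.pdf" is a placeholder file
    source = DocumentStream(name="sample.pdf", stream=BytesIO(f.read()))

result = converter.convert(source)
print(result.document.export_to_markdown()[:200])
print(len(result.document.pictures), "pictures,", len(result.document.tables), "tables")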

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~35 minutes

Poem

🐇
From Marker to Docling, we leap and bound,
Parsing docs with models newly found.
Licenses lighter, dependencies bright,
Images encoded, metadata just right.
With every hop, our project grows strong—
Docling and Florence now lead us along!
— Your code rabbit, singing this song.



coderabbitai bot left a comment

Actionable comments posted: 5

🔭 Outside diff range comments (9)
omniparse/media/__init__.py (2)

52-55: temp_audio_path may be undefined → UnboundLocalError during cleanup
If the function raises ValueError (invalid input), the finally block still references temp_audio_path, which was never set.

-    finally:
-        # Clean up the temporary file
-        if os.path.exists(temp_audio_path):
-            os.remove(temp_audio_path)
+    finally:
+        # Clean up the temporary file if it was created
+        if 'temp_audio_path' in locals() and os.path.exists(temp_audio_path):
+            os.remove(temp_audio_path)
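
An equivalent fix that avoids the locals() lookup is to bind the name before the try block; a minimal sketch, with the validation and transcription logic elided:

import os

temp_audio_path = None  # bound up front so the finally block never sees an unbound name
try:
    # ... validate input, write the upload to a temp file (assigning temp_audio_path),
    # then run transcription ...
    pass
finally:
    if temp_audio_path and os.path.exists(temp_audio_path):
        os.remove(temp_audio_path)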

90-95: Same undefined-variable risk for video_path / audio_path in parse_video
A validation error before these variables are assigned will crash the cleanup block.

-    finally:
-        # Clean up the temporary files
-        if os.path.exists(video_path):
-            os.remove(video_path)
-        if os.path.exists(audio_path):
-            os.remove(audio_path)
+    finally:
+        # Clean up the temporary files if they were created
+        if 'video_path' in locals() and os.path.exists(video_path):
+            os.remove(video_path)
+        if 'audio_path' in locals() and os.path.exists(audio_path):
+            os.remove(audio_path)
pyproject.toml (1)

35-47: tool.uv indexes: platform marker syntax might be ignored

sys_platform == 'linux' or sys_platform == 'win32' is not valid PEP 508 syntax inside TOML arrays.
Use a separate table per marker or wrap with extras according to uv docs, e.g.

torch = [
  { index = "pytorch-cu124", markers = "sys_platform == 'linux' or sys_platform == 'win32'" }
]

Validate with uv pip install -r pyproject.toml before merging.
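
For reference, the shape the uv docs describe is an explicit index table plus a per-package sources entry; the index name and URL below are assumptions matching the cu124 wheel index, and uv spells the key marker (singular):

[[tool.uv.index]]
name = "pytorch-cu124"
url = "https://download.pytorch.org/whl/cu124"
explicit = true

[tool.uv.sources]
torch = [
    { index = "pytorch-cu124", marker = "sys_platform == 'linux' or sys_platform == 'win32'" },
]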

Dockerfile (1)

1-1: Fix CUDA version inconsistency.

The CUDA_VERSION argument is set to "11.8.0" but the PyTorch installation uses the cu124 index, creating an inconsistency that could cause confusion or deployment issues.

Apply this diff to align the CUDA version:

-ARG CUDA_VERSION="11.8.0"
+ARG CUDA_VERSION="12.4.0"

Also applies to: 49-49

omniparse/image/__init__.py (1)

45-108: Consider adding validation for Docling converter availability

The function should verify that the Docling converter is initialized before use.

Add this validation at the beginning of the function:

 def parse_image(image_name, input_data, model_state) -> dict:
+    if not model_state.docling_converter:
+        raise ValueError("Docling converter not initialized. Please load document models first.")
+        
     temp_files = []
omniparse/documents/router.py (3)

112-112: Fix incorrect file suffix for Word documents

The parse_doc_endpoint uses ".ppt" suffix for Word documents, which is incorrect.

-    with tempfile.NamedTemporaryFile(delete=False, suffix=".ppt") as tmp_ppt:
-        tmp_ppt.write(await file.read())
-        tmp_ppt.flush()
-        input_path = tmp_ppt.name
+    with tempfile.NamedTemporaryFile(delete=False, suffix=".docx") as tmp_doc:
+        tmp_doc.write(await file.read())
+        tmp_doc.flush()
+        input_path = tmp_doc.name

62-152: Refactor duplicate code between PPT and DOC endpoints

The parse_ppt_endpoint and parse_doc_endpoint have nearly identical implementations. This violates the DRY principle.

Create a shared helper function:

async def _convert_office_to_pdf_and_parse(file: UploadFile, suffix: str):
    """Convert Office documents to PDF and parse with Docling."""
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp_file:
        tmp_file.write(await file.read())
        tmp_file.flush()
        input_path = tmp_file.name

    output_dir = tempfile.mkdtemp()
    try:
        command = [
            "libreoffice",
            "--headless",
            "--convert-to",
            "pdf",
            "--outdir",
            output_dir,
            input_path,
        ]
        result = subprocess.run(command, check=True, capture_output=True, text=True)
        
        output_pdf_path = os.path.join(
            output_dir, os.path.splitext(os.path.basename(input_path))[0] + ".pdf"
        )

        with open(output_pdf_path, "rb") as pdf_file:
            pdf_bytes = pdf_file.read()

        source = DocumentStream(name=file.filename, stream=BytesIO(pdf_bytes))
        filetype = os.path.splitext(file.filename)[1].lstrip('.').upper()
        out_meta = {"filename": file.filename, "filetype": filetype}

        docling_result = model_state.docling_converter.convert(source)
        full_text = docling_result.document.export_to_markdown()

        out_meta["block_stats"] = {
            "images": len(docling_result.document.pictures),
            "tables": len(docling_result.document.tables),
        }

        result = responseDocument(text=full_text, metadata=out_meta)
        encode_images(file.filename, docling_result.document, result)
        
        return result
    except subprocess.CalledProcessError as e:
        raise HTTPException(
            status_code=500, 
            detail=f"LibreOffice conversion failed: {e.stderr}"
        )
    finally:
        os.remove(input_path)
        if os.path.exists(output_pdf_path):
            os.remove(output_pdf_path)
        os.rmdir(output_dir)

Then simplify the endpoints:

@document_router.post("/ppt")
async def parse_ppt_endpoint(file: UploadFile = File(...)):
    try:
        result = await _convert_office_to_pdf_and_parse(file, ".pptx")
        return JSONResponse(content=result.model_dump())
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@document_router.post("/docs")
async def parse_doc_endpoint(file: UploadFile = File(...)):
    try:
        result = await _convert_office_to_pdf_and_parse(file, ".docx")
        return JSONResponse(content=result.model_dump())
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

167-208: Add proper cleanup with try/finally block

File cleanup should be in a finally block to ensure it happens even if an exception occurs.

Wrap the logic in try/finally:

 with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
     tmp_file.write(await file.read())
     tmp_file.flush()
     input_path = tmp_file.name

+output_dir = None
+try:
     if file_ext.lower() in {".ppt", ".pptx", ".doc", ".docx"}:
         output_dir = tempfile.mkdtemp()
         # ... conversion logic ...

     # ... parsing logic ...
     
     result = responseDocument(text=full_text, metadata=out_meta)
     encode_images(file.filename, docling_result.document, result)
     
     return JSONResponse(content=result.model_dump())
+finally:
+    if os.path.exists(input_path):
+        os.remove(input_path)
+    if output_dir and os.path.exists(output_dir):
+        for f in os.listdir(output_dir):
+            os.remove(os.path.join(output_dir, f))
+        os.rmdir(output_dir)
-
-    os.remove(input_path)
-
-    result = responseDocument(text=full_text, metadata=out_meta)
-    encode_images(file.filename, docling_result.document, result)
-
-    return JSONResponse(content=result.model_dump())
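
A shorter variant of that cleanup, assuming output_dir is a dedicated temporary directory that can be removed wholesale (cleanup_paths is a hypothetical helper name):

import os
import shutil

def cleanup_paths(input_path: str, output_dir: str | None) -> None:
    """Remove the uploaded temp file and the LibreOffice output directory, if present."""
    if input_path and os.path.exists(input_path):
        os.remove(input_path)
    if output_dir:
        # rmtree removes the converted PDF and the directory itself in one call
        shutil.rmtree(output_dir, ignore_errors=True)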
omniparse/__init__.py (1)

50-78: Add error handling for model loading

Model loading can fail for various reasons (network issues, insufficient memory, etc.). Add proper error handling.

Wrap each model loading section in try-except blocks:

 def load_omnimodel(load_documents: bool, load_media: bool, load_web: bool):
     global shared_state
     print_omniparse_text_art()
     device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+    
     if load_documents:
-        print("[LOG] ✅ Loading OCR Model")
-        download_models()
-        shared_state.docling_converter = DocumentConverter(
-            format_options={
-                InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
-            }
-        )
+        try:
+            print("[LOG] ✅ Loading OCR Model")
+            download_models()
+            shared_state.docling_converter = DocumentConverter(
+                format_options={
+                    InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
+                }
+            )
+        except Exception as e:
+            print(f"[ERROR] ❌ Failed to load OCR Model: {e}")
+            raise
+            
-        print("[LOG] ✅ Loading Vision Model")
-        shared_state.vision_model = AutoModelForCausalLM.from_pretrained(
-            "microsoft/Florence-2-base", torch_dtype=torch.float32, trust_remote_code=True
-        ).to(device)
-        shared_state.vision_processor = AutoProcessor.from_pretrained(
-            "microsoft/Florence-2-base", trust_remote_code=True
-        )
+        try:
+            print("[LOG] ✅ Loading Vision Model")
+            shared_state.vision_model = AutoModelForCausalLM.from_pretrained(
+                "microsoft/Florence-2-base", torch_dtype=torch.float32, trust_remote_code=True
+            ).to(device)
+            shared_state.vision_processor = AutoProcessor.from_pretrained(
+                "microsoft/Florence-2-base", trust_remote_code=True
+            )
+        except Exception as e:
+            print(f"[ERROR] ❌ Failed to load Vision Model: {e}")
+            raise

     if load_media:
-        print("[LOG] ✅ Loading Audio Model")
-        shared_state.whisper_model = whisper.load_model("small")
+        try:
+            print("[LOG] ✅ Loading Audio Model")
+            shared_state.whisper_model = whisper.load_model("small")
+        except Exception as e:
+            print(f"[ERROR] ❌ Failed to load Audio Model: {e}")
+            raise

     if load_web:
-        print("[LOG] ✅ Loading Web Crawler")
-        shared_state.crawler = WebCrawler(verbose=True)
+        try:
+            print("[LOG] ✅ Loading Web Crawler")
+            shared_state.crawler = WebCrawler(verbose=True)
+        except Exception as e:
+            print(f"[ERROR] ❌ Failed to load Web Crawler: {e}")
+            raise
🧹 Nitpick comments (4)
omniparse/media/__init__.py (1)

74-79: Use context managers to guarantee clip closure and reduce boilerplate
VideoFileClip and its audio clip both support context-manager semantics, eliminating manual close() calls.

-        video_clip = VideoFileClip(video_path)
-        audio_clip = video_clip.audio
-        audio_clip.write_audiofile(audio_path)
-        audio_clip.close()
-        video_clip.close()
+        with VideoFileClip(video_path) as video_clip:
+            with video_clip.audio as audio_clip:
+                audio_clip.write_audiofile(audio_path)
docs/README.md (1)

303-305: Fix markdown list formatting for consistency.

The static analysis tool flagged inconsistent list formatting. The project appears to use asterisks for unordered lists elsewhere.

Apply this diff to fix the formatting:

-- Docling IBM models
-- Florence-2 base
-- Whisper Small
+* Docling IBM models
+* Florence-2 base
+* Whisper Small
omniparse/image/__init__.py (1)

13-14: Fix module description to reflect image parsing functionality

The description mentions "pdf/word/ppt parsing" but this module is specifically for image parsing.

 Description:
-This section of the code was adapted from the Docling repository to enhance text pdf/word/ppt parsing.
+This section of the code was adapted from the Docling repository to enhance image parsing and text extraction.
 All credits for the original implementation go to Docling.
omniparse/__init__.py (1)

33-39: Improve type annotations for SharedState

Using Any type reduces type safety and IDE support. Consider using proper type annotations.

+from typing import Optional
+from docling.document_converter import DocumentConverter
+from transformers import PreTrainedModel, PreTrainedTokenizer

 class SharedState(BaseModel):
-    docling_converter: Any = None
-    vision_model: Any = None
-    vision_processor: Any = None
-    whisper_model: Any = None
-    crawler: Any = None
+    docling_converter: Optional[DocumentConverter] = None
+    vision_model: Optional[PreTrainedModel] = None
+    vision_processor: Optional[PreTrainedTokenizer] = None
+    whisper_model: Optional[Any] = None  # whisper doesn't provide type hints
+    crawler: Optional[WebCrawler] = None
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR, covering commits 9d1ae83 to 87f1da0.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (17)
  • .gitignore (1 hunks)
  • .python-version (1 hunks)
  • Dockerfile (2 hunks)
  • README.md (6 hunks)
  • docs/README.md (2 hunks)
  • docs/deployment.md (1 hunks)
  • docs/installation.md (1 hunks)
  • examples/OmniParse_GoogleColab.ipynb (5 hunks)
  • omniparse/__init__.py (3 hunks)
  • omniparse/demo.py (1 hunks)
  • omniparse/documents/__init__.py (0 hunks)
  • omniparse/documents/router.py (5 hunks)
  • omniparse/image/__init__.py (3 hunks)
  • omniparse/image/router.py (1 hunks)
  • omniparse/media/__init__.py (1 hunks)
  • omniparse/utils.py (1 hunks)
  • pyproject.toml (1 hunks)
💤 Files with no reviewable changes (1)
  • omniparse/documents/__init__.py
🧰 Additional context used
🧬 Code Graph Analysis (3)
omniparse/utils.py (1)
omniparse/models/__init__.py (2)
  • responseDocument (15-64)
  • add_image (21-49)
omniparse/documents/router.py (3)
omniparse/__init__.py (1)
  • get_shared_state (80-81)
omniparse/utils.py (1)
  • encode_images (8-21)
omniparse/models/__init__.py (1)
  • responseDocument (15-64)
omniparse/__init__.py (1)
omniparse/utils.py (1)
  • print_omniparse_text_art (24-34)
🪛 markdownlint-cli2 (0.17.2)
docs/README.md

303-303: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)


304-304: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)


305-305: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build-and-push
🔇 Additional comments (17)
docs/installation.md (1)

35-35: Bullet wording update looks correct
The switch to “Docling models and Florence-2” is accurate and consistent with the rest of the PR. No further action needed.

pyproject.toml (1)

12-33: Unreleased / future package versions will break installation

torch>=2.6.0, html2text>=2025.4.15, openai-whisper>=20240930, etc. do not exist on PyPI today. pip will fail immediately.

Recommend pinning to the latest released versions or using “~=latest_minor” constraints.
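
One way to check what is actually released before pinning is a small script against PyPI's JSON API (network access assumed; latest_version is an illustrative helper):

import json
import urllib.request

def latest_version(package: str) -> str:
    """Return the newest released version of a package via PyPI's JSON API."""
    url = f"https://pypi.org/pypi/{package}/json"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["info"]["version"]

for pkg in ("torch", "transformers", "openai-whisper", "html2text"):
    print(pkg, latest_version(pkg))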

.gitignore (1)

165-166: ✅ Good addition
Ignoring .vscode/ keeps editor artefacts out of the repo.

.python-version (1)

1-1: Python version pin acknowledged

Including .python-version aligns local tooling with the requires-python field.

omniparse/demo.py (1)

136-136: Description updated successfully

The flag description now reflects the Docling migration. No further changes required.

docs/deployment.md (1)

31-31: LGTM! Documentation cleanup aligns with migration.

The removal of Skypilot deployment instructions and retention of Docker-based deployment documentation is consistent with the project's modernization efforts moving to Docling models.

Dockerfile (1)

58-58: LGTM! Transformers version updated appropriately.

The upgrade from transformers 4.41.2 to 4.50.3 aligns with the migration to Docling models and ensures compatibility with newer features.

omniparse/image/router.py (1)

15-15: Verified parse_image signature accepts filename parameter.

The function in omniparse/image/__init__.py is declared as:

def parse_image(image_name, input_data, model_state) -> dict:

which directly matches the call in omniparse/image/router.py:

result: responseDocument = parse_image(file.filename, file_bytes, model_state)

No further changes needed.

docs/README.md (2)

108-108: LGTM! Documentation updated for Docling migration.

The update correctly reflects the change from "Surya OCR series of models" to "Docling models and Florence-2" in the server documentation.


299-299: LGTM! Acknowledgements updated appropriately.

The acknowledgements section correctly credits the Docling project instead of the previous Marker project, reflecting the migration.

examples/OmniParse_GoogleColab.ipynb (4)

46-54: LGTM! Added uv installation for modern dependency management.

The addition of uv installation aligns with the project's migration to modern Python tooling and dependency management.


84-84: LGTM! Dependency installation updated to use uv.

The change from pip install -e . to uv -q sync is consistent with the migration to uv for dependency management.


93-96: LGTM! Proactive fix for matplotlib backend issues.

The addition of matplotlib-inline installation with a helpful comment addresses a common issue in Colab environments.


202-202: LGTM! Server launch commands updated for uv.

The server launch commands are correctly updated to use uv run server.py instead of direct Python execution, maintaining consistency with the uv migration.

Also applies to: 250-250

README.md (2)

96-109: Documentation updates look good!

The model references have been consistently updated to reflect the migration from Surya OCR to Docling models.


296-319: Limitations and acknowledgements properly updated

The documentation correctly reflects the new Docling-based implementation and gives appropriate credit.

omniparse/__init__.py (1)

65-66: Verify torch.float32 usage for vision model

The vision model is explicitly set to use float32, which uses more memory than float16. Was this intentional?

If memory usage is a concern and the model supports it, consider using float16:

 shared_state.vision_model = AutoModelForCausalLM.from_pretrained(
-    "microsoft/Florence-2-base", torch_dtype=torch.float32, trust_remote_code=True
+    "microsoft/Florence-2-base", torch_dtype=torch.float16 if device.type == "cuda" else torch.float32, trust_remote_code=True
 ).to(device)

This would use float16 on GPU (where it's typically supported) and float32 on CPU.

@@ -163,36 +186,22 @@ async def parse_any_endpoint(file: UploadFile = File(...)):
         )
         input_path = output_pdf_path

     # Common parsing logic
-    full_text, images, out_meta = convert_single_pdf(input_path, model_state.model_list)
+    docling_result = model_state.docling_converter.convert(Path(input_path))

🛠️ Refactor suggestion

Use DocumentStream consistently across all endpoints

This endpoint uses Path while other endpoints use DocumentStream. This inconsistency could lead to different behavior.

-    docling_result = model_state.docling_converter.convert(Path(input_path))
+    with open(input_path, "rb") as f:
+        file_bytes = f.read()
+    source = DocumentStream(name=file.filename, stream=BytesIO(file_bytes))
+    docling_result = model_state.docling_converter.convert(source)

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In omniparse/documents/router.py at line 189, replace the use of
Path(input_path) with a DocumentStream instance to maintain consistency across
all endpoints. Modify the code so that docling_converter.convert receives a
DocumentStream object instead of a Path, ensuring uniform behavior and interface
usage throughout the codebase.

 from fastapi import UploadFile
 from fastapi.responses import JSONResponse
-from moviepy.editor import VideoFileClip
+from moviepy import VideoFileClip

⚠️ Potential issue

Importing VideoFileClip from the top-level moviepy module will raise ImportError on most installs
VideoFileClip is re-exported only by moviepy.editor; the root package does not expose it. Importing from the wrong path will break parse_video at runtime.

-from moviepy import VideoFileClip
+from moviepy.editor import VideoFileClip
🤖 Prompt for AI Agents
In omniparse/media/__init__.py at line 22, change the import of VideoFileClip
from the top-level moviepy module to import it from moviepy.editor instead,
because VideoFileClip is only re-exported by moviepy.editor and importing from
moviepy directly causes ImportError at runtime.

Comment on lines +8 to +21

 def encode_images(document_name, document, inputDocument: responseDocument):
     file_name = os.path.splitext(document_name)[0]
     for idx, image in enumerate(document.pictures):
         with tempfile.NamedTemporaryFile(delete=True, suffix=".png") as tmp_file:
             filename = tmp_file.name
             image.get_image(document).save(filename, "PNG")
             with open(filename, "rb") as f:
                 image_bytes = f.read()

-        inputDocument.add_image(image_name=filename, image_data=image_base64)
-
-        # Remove the temporary image file
-        os.remove(filename)
+            # Convert image to base64
+            image_base64 = base64.b64encode(image_bytes).decode("utf-8")
+            inputDocument.add_image(
+                image_name=f"{file_name}_image_{idx}", image_data=image_base64
+            )

🛠️ Refactor suggestion

Add error handling and optimize file operations

The function should handle potential errors during image extraction and can be optimized to avoid redundant file operations.

Apply this diff to improve error handling and efficiency:

 def encode_images(document_name, document, inputDocument: responseDocument):
     file_name = os.path.splitext(document_name)[0]
+    if not hasattr(document, 'pictures'):
+        return
+        
     for idx, image in enumerate(document.pictures):
-        with tempfile.NamedTemporaryFile(delete=True, suffix=".png") as tmp_file:
-            filename = tmp_file.name
-            image.get_image(document).save(filename, "PNG")
-            with open(filename, "rb") as f:
-                image_bytes = f.read()
-
-            # Convert image to base64
-            image_base64 = base64.b64encode(image_bytes).decode("utf-8")
-            inputDocument.add_image(
-                image_name=f"{file_name}_image_{idx}", image_data=image_base64
-            )
+        try:
+            # Get PIL image and convert directly to bytes
+            pil_image = image.get_image(document)
+            buffer = io.BytesIO()
+            pil_image.save(buffer, format="PNG")
+            image_bytes = buffer.getvalue()
+            
+            # Convert image to base64
+            image_base64 = base64.b64encode(image_bytes).decode("utf-8")
+            inputDocument.add_image(
+                image_name=f"{file_name}_image_{idx}", image_data=image_base64
+            )
+        except Exception as e:
+            print(f"Warning: Failed to extract image {idx} from {document_name}: {e}")
+            continue

Don't forget to add the import:

import io
🤖 Prompt for AI Agents
In omniparse/utils.py around lines 8 to 21, the encode_images function lacks
error handling and inefficiently writes images to temporary files before reading
them back. To fix this, wrap the image processing code in a try-except block to
catch and handle exceptions during image extraction. Replace the temporary file
usage by saving the image directly into an in-memory bytes buffer using
io.BytesIO, then encode the image bytes from this buffer to base64. Also, add
the import statement "import io" at the top of the file.

Comment on lines +1 to 9

 [project]
 name = "omniparse"
-version = "0.0.1"
+version = "0.0.2"
 description = "API to convert Documents | Media | Webpage into Structured Markdown for LLM usecases"
-authors = ["Adithya S K <[email protected]>"]
-license = "Apache"
+authors = [
+    {name = "Adithya S K", email = "[email protected]"}
+]
+license = {file = "LICENSE"}
 readme = "README.md"

⚠️ Potential issue

[build-system] table was removed – wheels can no longer be built

PEP 517 builds require a [build-system] section. Without it, cloning users (and CI/CD) will be unable to pip install . or publish to PyPI.

+[build-system]
+requires = ["setuptools>=69", "wheel"]
+build-backend = "setuptools.build_meta"

Add this (or an equivalent backend such as uv) to restore buildability.

📝 Committable suggestion


Suggested change

[project]
name = "omniparse"
version = "0.0.2"
description = "API to convert Documents | Media | Webpage into Structured Markdown for LLM usecases"
authors = [
    {name = "Adithya S K", email = "[email protected]"}
]
license = {file = "LICENSE"}
readme = "README.md"

[build-system]
requires = ["setuptools>=69", "wheel"]
build-backend = "setuptools.build_meta"
🤖 Prompt for AI Agents
In pyproject.toml lines 1 to 9, the [build-system] table is missing, which is
required by PEP 517 for building wheels and installing the package. Add a
[build-system] section specifying the build backend (e.g.,
"setuptools.build_meta") and the required build dependencies (e.g., "setuptools"
and "wheel") to enable pip install and PyPI publishing.

@@ -50,13 +50,15 @@ conda activate omniparse-venv
 Install Dependencies:

 ```bash
-poetry install
+uv sync
+# or
+pip install -e .
+# or
+pip install -r pyproject.toml
 ```

⚠️ Potential issue

Fix incorrect pip install command

The command pip install -r pyproject.toml is incorrect. The -r flag expects a requirements.txt file, not a pyproject.toml file.

Apply this diff to fix the installation command:

-pip install -r pyproject.toml
+pip install .

Alternatively, if you want to list all installation methods:

-uv sync
-# or
-pip install -e .
-# or
-pip install -r pyproject.toml
+uv sync              # Using uv package manager
+# or
+pip install -e .     # Editable install with pip
+# or  
+pip install .        # Regular install with pip
📝 Committable suggestion


Suggested change
-pip install -r pyproject.toml
+pip install .
🤖 Prompt for AI Agents
In README.md at line 57, the pip install command incorrectly uses the -r flag
with pyproject.toml, which is invalid because -r expects a requirements.txt
file. Replace the command with the correct installation method for
pyproject.toml, such as using a tool like poetry or pip install with the
appropriate options, or remove the -r flag if installing from a requirements
file. Update the command to reflect the proper way to install dependencies from
pyproject.toml.
