Omniparse v2 #112

Open

wants to merge 4 commits into main

Conversation

adithya-s-k (Owner) commented Aug 1, 2025

Summary by CodeRabbit

  • New Features

    • Switched document parsing and OCR models from Surya OCR/Marker to Docling, improving PDF, Word, and PPT file support.
    • Updated Docker and dependency management to use newer versions of PyTorch (CUDA 12.4), Transformers, and added Docling support.
    • Added .python-version file for consistent Python 3.10 environments.
  • Bug Fixes

    • Improved image extraction and encoding from documents for more reliable processing.
  • Documentation

    • Updated all references and instructions to reflect the use of Docling models.
    • Removed Skypilot deployment instructions.
    • Clarified dependency installation and model usage in README and docs.
  • Refactor

    • Replaced internal document and image parsing logic to use Docling’s APIs and data structures.
    • Updated project metadata and licensing to MIT and Docling attribution.
    • Streamlined dependency and package management in pyproject.toml.
  • Chores

    • Updated .gitignore to exclude .vscode/ directory.

coderabbitai bot commented Aug 1, 2025

Walkthrough

This update transitions the codebase from using Marker and Surya OCR models to Docling models for document parsing, updates licensing and attribution to MIT and Docling, and migrates PDF, PPT, and DOC parsing logic to Docling-based workflows. The project configuration is modernized, dependencies updated, and documentation revised to reflect these changes.

Changes

Cohort / File(s) Change Summary
Project Configuration & Dependency Management
.python-version, pyproject.toml, Dockerfile, .gitignore
Added Python version file. Migrated project config from Poetry to PEP 621 [project] format, updated dependencies, added CUDA 12.4 PyTorch support, and updated ignore rules for VSCode.
Documentation & Usage Instructions
README.md, docs/README.md, docs/installation.md, docs/deployment.md, examples/OmniParse_GoogleColab.ipynb
Replaced references to Surya OCR/Marker with Docling, updated instructions and acknowledgements, removed Skypilot deployment docs, and adjusted example notebook for new dependency management and model names.
Model Loading & Shared State
omniparse/__init__.py
Replaced Marker-based model loading with Docling equivalents, updated licensing, attribution, and added PDF pipeline configuration.
Document Parsing Logic
omniparse/documents/__init__.py (deleted), omniparse/documents/router.py
Deleted legacy Marker-based document parsing code. Refactored router to use Docling for PDF, PPT, DOC, and generic file parsing, updating metadata and image encoding logic.
Image Parsing
omniparse/image/__init__.py, omniparse/image/router.py
Updated image parsing to use Docling, changed function signatures to include image name, and updated endpoint to pass filename.
Image Encoding Utility
omniparse/utils.py
Refactored image encoding to operate on Docling document objects and use safe temporary file handling.
Media Handling
omniparse/media/__init__.py
Cleaned up unused imports and adjusted import style for video processing.
Demo & Usage Comments
omniparse/demo.py
Updated model description in CLI usage comments to reference Docling models.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant API as API (FastAPI)
    participant DoclingConverter
    participant Florence2Model
    participant Utils

    User->>API: Uploads PDF/PPT/DOC file
    API->>DoclingConverter: convert(file bytes)
    DoclingConverter-->>API: Parsed document object
    API->>DoclingConverter: Export to markdown
    DoclingConverter-->>API: Markdown text
    API->>Utils: encode_images(document_name, document, responseDocument)
    Utils-->>API: Images encoded in response
    API-->>User: Returns responseDocument with text, metadata, images
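
For readers new to Docling, a minimal sketch of the flow the diagram describes (the file name is a placeholder; the real endpoints additionally configure PdfFormatOption pipeline options in omniparse/__init__.py):

from io import BytesIO

from docling.datamodel.base_models import DocumentStream
from docling.document_converter import DocumentConverter

# Bare converter; the PR layers PDF pipeline options on top of this
converter = DocumentConverter()

# The endpoints wrap uploaded bytes in a DocumentStream instead of passing a path
with open("sample.pdf", "rb") as f:  # "sample.pdf" is a placeholder file
    source = DocumentStream(name="sample.pdf", stream=BytesIO(f.read()))

result = converter.convert(source)
print(result.document.export_to_markdown()[:200])
print(len(result.document.pictures), "pictures,", len(result.document.tables), "tables")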

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~35 minutes

Poem

🐇
From Marker to Docling, we leap and bound,
Parsing docs with models newly found.
Licenses lighter, dependencies bright,
Images encoded, metadata just right.
With every hop, our project grows strong—
Docling and Florence now lead us along!
— Your code rabbit, singing this song.



coderabbitai bot left a comment

Actionable comments posted: 5

🔭 Outside diff range comments (9)
omniparse/media/__init__.py (2)

52-55: temp_audio_path may be undefined → UnboundLocalError during cleanup
If the function raises ValueError (invalid input), the finally block still references temp_audio_path, which was never set.

-    finally:
-        # Clean up the temporary file
-        if os.path.exists(temp_audio_path):
-            os.remove(temp_audio_path)
+    finally:
+        # Clean up the temporary file if it was created
+        if 'temp_audio_path' in locals() and os.path.exists(temp_audio_path):
+            os.remove(temp_audio_path)
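
An equivalent fix that avoids the locals() lookup is to bind the name before the try block; a minimal sketch, with the validation and transcription logic elided:

import os

temp_audio_path = None  # bound up front so the finally block never sees an unbound name
try:
    # ... validate input, write the upload to a temp file (assigning temp_audio_path),
    # then run transcription ...
    pass
finally:
    if temp_audio_path and os.path.exists(temp_audio_path):
        os.remove(temp_audio_path)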

90-95: Same undefined-variable risk for video_path / audio_path in parse_video
A validation error before these variables are assigned will crash the cleanup block.

-    finally:
-        # Clean up the temporary files
-        if os.path.exists(video_path):
-            os.remove(video_path)
-        if os.path.exists(audio_path):
-            os.remove(audio_path)
+    finally:
+        # Clean up the temporary files if they were created
+        if 'video_path' in locals() and os.path.exists(video_path):
+            os.remove(video_path)
+        if 'audio_path' in locals() and os.path.exists(audio_path):
+            os.remove(audio_path)
pyproject.toml (1)

35-47: tool.uv indexes: platform marker syntax might be ignored

sys_platform == 'linux' or sys_platform == 'win32' is not valid PEP 508 syntax inside TOML arrays.
Use a separate table per marker or wrap with extras according to uv docs, e.g.

torch = [
  { index = "pytorch-cu124", markers = "sys_platform == 'linux' or sys_platform == 'win32'" }
]

Validate with uv pip install -r pyproject.toml before merging.
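
For reference, the shape the uv docs describe is an explicit index table plus a per-package sources entry; the index name and URL below are assumptions matching the cu124 wheel index, and uv spells the key marker (singular):

[[tool.uv.index]]
name = "pytorch-cu124"
url = "https://download.pytorch.org/whl/cu124"
explicit = true

[tool.uv.sources]
torch = [
    { index = "pytorch-cu124", marker = "sys_platform == 'linux' or sys_platform == 'win32'" },
]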

Dockerfile (1)

1-1: Fix CUDA version inconsistency.

The CUDA_VERSION argument is set to "11.8.0" but the PyTorch installation uses the cu124 index, creating an inconsistency that could cause confusion or deployment issues.

Apply this diff to align the CUDA version:

-ARG CUDA_VERSION="11.8.0"
+ARG CUDA_VERSION="12.4.0"

Also applies to: 49-49

omniparse/image/__init__.py (1)

45-108: Consider adding validation for Docling converter availability

The function should verify that the Docling converter is initialized before use.

Add this validation at the beginning of the function:

 def parse_image(image_name, input_data, model_state) -> dict:
+    if not model_state.docling_converter:
+        raise ValueError("Docling converter not initialized. Please load document models first.")
+        
     temp_files = []
omniparse/documents/router.py (3)

112-112: Fix incorrect file suffix for Word documents

The parse_doc_endpoint uses ".ppt" suffix for Word documents, which is incorrect.

-    with tempfile.NamedTemporaryFile(delete=False, suffix=".ppt") as tmp_ppt:
-        tmp_ppt.write(await file.read())
-        tmp_ppt.flush()
-        input_path = tmp_ppt.name
+    with tempfile.NamedTemporaryFile(delete=False, suffix=".docx") as tmp_doc:
+        tmp_doc.write(await file.read())
+        tmp_doc.flush()
+        input_path = tmp_doc.name

62-152: Refactor duplicate code between PPT and DOC endpoints

The parse_ppt_endpoint and parse_doc_endpoint have nearly identical implementations. This violates the DRY principle.

Create a shared helper function:

async def _convert_office_to_pdf_and_parse(file: UploadFile, suffix: str):
    """Convert Office documents to PDF and parse with Docling."""
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp_file:
        tmp_file.write(await file.read())
        tmp_file.flush()
        input_path = tmp_file.name

    output_dir = tempfile.mkdtemp()
    try:
        command = [
            "libreoffice",
            "--headless",
            "--convert-to",
            "pdf",
            "--outdir",
            output_dir,
            input_path,
        ]
        result = subprocess.run(command, check=True, capture_output=True, text=True)
        
        output_pdf_path = os.path.join(
            output_dir, os.path.splitext(os.path.basename(input_path))[0] + ".pdf"
        )

        with open(output_pdf_path, "rb") as pdf_file:
            pdf_bytes = pdf_file.read()

        source = DocumentStream(name=file.filename, stream=BytesIO(pdf_bytes))
        filetype = os.path.splitext(file.filename)[1].lstrip('.').upper()
        out_meta = {"filename": file.filename, "filetype": filetype}

        docling_result = model_state.docling_converter.convert(source)
        full_text = docling_result.document.export_to_markdown()

        out_meta["block_stats"] = {
            "images": len(docling_result.document.pictures),
            "tables": len(docling_result.document.tables),
        }

        result = responseDocument(text=full_text, metadata=out_meta)
        encode_images(file.filename, docling_result.document, result)
        
        return result
    except subprocess.CalledProcessError as e:
        raise HTTPException(
            status_code=500, 
            detail=f"LibreOffice conversion failed: {e.stderr}"
        )
    finally:
        os.remove(input_path)
        if os.path.exists(output_pdf_path):
            os.remove(output_pdf_path)
        os.rmdir(output_dir)

Then simplify the endpoints:

@document_router.post("/ppt")
async def parse_ppt_endpoint(file: UploadFile = File(...)):
    try:
        result = await _convert_office_to_pdf_and_parse(file, ".pptx")
        return JSONResponse(content=result.model_dump())
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@document_router.post("/docs")
async def parse_doc_endpoint(file: UploadFile = File(...)):
    try:
        result = await _convert_office_to_pdf_and_parse(file, ".docx")
        return JSONResponse(content=result.model_dump())
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

167-208: Add proper cleanup with try/finally block

File cleanup should be in a finally block to ensure it happens even if an exception occurs.

Wrap the logic in try/finally:

 with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
     tmp_file.write(await file.read())
     tmp_file.flush()
     input_path = tmp_file.name

+output_dir = None
+try:
     if file_ext.lower() in {".ppt", ".pptx", ".doc", ".docx"}:
         output_dir = tempfile.mkdtemp()
         # ... conversion logic ...

     # ... parsing logic ...
     
     result = responseDocument(text=full_text, metadata=out_meta)
     encode_images(file.filename, docling_result.document, result)
     
     return JSONResponse(content=result.model_dump())
+finally:
+    if os.path.exists(input_path):
+        os.remove(input_path)
+    if output_dir and os.path.exists(output_dir):
+        for f in os.listdir(output_dir):
+            os.remove(os.path.join(output_dir, f))
+        os.rmdir(output_dir)
-
-    os.remove(input_path)
-
-    result = responseDocument(text=full_text, metadata=out_meta)
-    encode_images(file.filename, docling_result.document, result)
-
-    return JSONResponse(content=result.model_dump())
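
A shorter variant of that cleanup, assuming output_dir is a dedicated temporary directory that can be removed wholesale (cleanup_paths is a hypothetical helper name):

import os
import shutil

def cleanup_paths(input_path: str, output_dir: str | None) -> None:
    """Remove the uploaded temp file and the LibreOffice output directory, if present."""
    if input_path and os.path.exists(input_path):
        os.remove(input_path)
    if output_dir:
        # rmtree removes the converted PDF and the directory itself in one call
        shutil.rmtree(output_dir, ignore_errors=True)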
omniparse/__init__.py (1)

50-78: Add error handling for model loading

Model loading can fail for various reasons (network issues, insufficient memory, etc.). Add proper error handling.

Wrap each model loading section in try-except blocks:

 def load_omnimodel(load_documents: bool, load_media: bool, load_web: bool):
     global shared_state
     print_omniparse_text_art()
     device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+    
     if load_documents:
-        print("[LOG] ✅ Loading OCR Model")
-        download_models()
-        shared_state.docling_converter = DocumentConverter(
-            format_options={
-                InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
-            }
-        )
+        try:
+            print("[LOG] ✅ Loading OCR Model")
+            download_models()
+            shared_state.docling_converter = DocumentConverter(
+                format_options={
+                    InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
+                }
+            )
+        except Exception as e:
+            print(f"[ERROR] ❌ Failed to load OCR Model: {e}")
+            raise
+            
-        print("[LOG] ✅ Loading Vision Model")
-        shared_state.vision_model = AutoModelForCausalLM.from_pretrained(
-            "microsoft/Florence-2-base", torch_dtype=torch.float32, trust_remote_code=True
-        ).to(device)
-        shared_state.vision_processor = AutoProcessor.from_pretrained(
-            "microsoft/Florence-2-base", trust_remote_code=True
-        )
+        try:
+            print("[LOG] ✅ Loading Vision Model")
+            shared_state.vision_model = AutoModelForCausalLM.from_pretrained(
+                "microsoft/Florence-2-base", torch_dtype=torch.float32, trust_remote_code=True
+            ).to(device)
+            shared_state.vision_processor = AutoProcessor.from_pretrained(
+                "microsoft/Florence-2-base", trust_remote_code=True
+            )
+        except Exception as e:
+            print(f"[ERROR] ❌ Failed to load Vision Model: {e}")
+            raise

     if load_media:
-        print("[LOG] ✅ Loading Audio Model")
-        shared_state.whisper_model = whisper.load_model("small")
+        try:
+            print("[LOG] ✅ Loading Audio Model")
+            shared_state.whisper_model = whisper.load_model("small")
+        except Exception as e:
+            print(f"[ERROR] ❌ Failed to load Audio Model: {e}")
+            raise

     if load_web:
-        print("[LOG] ✅ Loading Web Crawler")
-        shared_state.crawler = WebCrawler(verbose=True)
+        try:
+            print("[LOG] ✅ Loading Web Crawler")
+            shared_state.crawler = WebCrawler(verbose=True)
+        except Exception as e:
+            print(f"[ERROR] ❌ Failed to load Web Crawler: {e}")
+            raise
🧹 Nitpick comments (4)
omniparse/media/__init__.py (1)

74-79: Use context managers to guarantee clip closure and reduce boilerplate
VideoFileClip and its audio clip both support context-manager semantics, eliminating manual close() calls.

-        video_clip = VideoFileClip(video_path)
-        audio_clip = video_clip.audio
-        audio_clip.write_audiofile(audio_path)
-        audio_clip.close()
-        video_clip.close()
+        with VideoFileClip(video_path) as video_clip:
+            with video_clip.audio as audio_clip:
+                audio_clip.write_audiofile(audio_path)
docs/README.md (1)

303-305: Fix markdown list formatting for consistency.

The static analysis tool flagged inconsistent list formatting. The project appears to use asterisks for unordered lists elsewhere.

Apply this diff to fix the formatting:

-- Docling IBM models
-- Florence-2 base
-- Whisper Small
+* Docling IBM models
+* Florence-2 base
+* Whisper Small
omniparse/image/__init__.py (1)

13-14: Fix module description to reflect image parsing functionality

The description mentions "pdf/word/ppt parsing" but this module is specifically for image parsing.

 Description:
-This section of the code was adapted from the Docling repository to enhance text pdf/word/ppt parsing.
+This section of the code was adapted from the Docling repository to enhance image parsing and text extraction.
 All credits for the original implementation go to Docling.
omniparse/__init__.py (1)

33-39: Improve type annotations for SharedState

Using Any type reduces type safety and IDE support. Consider using proper type annotations.

+from typing import Optional
+from docling.document_converter import DocumentConverter
+from transformers import PreTrainedModel, PreTrainedTokenizer

 class SharedState(BaseModel):
-    docling_converter: Any = None
-    vision_model: Any = None
-    vision_processor: Any = None
-    whisper_model: Any = None
-    crawler: Any = None
+    docling_converter: Optional[DocumentConverter] = None
+    vision_model: Optional[PreTrainedModel] = None
+    vision_processor: Optional[PreTrainedTokenizer] = None
+    whisper_model: Optional[Any] = None  # whisper doesn't provide type hints
+    crawler: Optional[WebCrawler] = None
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR, covering commits 9d1ae83 to 87f1da0.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (17)
  • .gitignore (1 hunks)
  • .python-version (1 hunks)
  • Dockerfile (2 hunks)
  • README.md (6 hunks)
  • docs/README.md (2 hunks)
  • docs/deployment.md (1 hunks)
  • docs/installation.md (1 hunks)
  • examples/OmniParse_GoogleColab.ipynb (5 hunks)
  • omniparse/__init__.py (3 hunks)
  • omniparse/demo.py (1 hunks)
  • omniparse/documents/__init__.py (0 hunks)
  • omniparse/documents/router.py (5 hunks)
  • omniparse/image/__init__.py (3 hunks)
  • omniparse/image/router.py (1 hunks)
  • omniparse/media/__init__.py (1 hunks)
  • omniparse/utils.py (1 hunks)
  • pyproject.toml (1 hunks)
💤 Files with no reviewable changes (1)
  • omniparse/documents/__init__.py
🧰 Additional context used
🧬 Code Graph Analysis (3)
omniparse/utils.py (1)
omniparse/models/__init__.py (2)
  • responseDocument (15-64)
  • add_image (21-49)
omniparse/documents/router.py (3)
omniparse/__init__.py (1)
  • get_shared_state (80-81)
omniparse/utils.py (1)
  • encode_images (8-21)
omniparse/models/__init__.py (1)
  • responseDocument (15-64)
omniparse/__init__.py (1)
omniparse/utils.py (1)
  • print_omniparse_text_art (24-34)
🪛 markdownlint-cli2 (0.17.2)
docs/README.md

303-303: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)


304-304: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)


305-305: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build-and-push
🔇 Additional comments (17)
docs/installation.md (1)

35-35: Bullet wording update looks correct
The switch to “Docling models and Florence-2” is accurate and consistent with the rest of the PR. No further action needed.

pyproject.toml (1)

12-33: Unreleased / future package versions will break installation

torch>=2.6.0, html2text>=2025.4.15, openai-whisper>=20240930, etc. do not exist on PyPI today. pip will fail immediately.

Recommend pinning to the latest released versions or using “~=latest_minor” constraints.
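
One way to check what is actually released before pinning is a small script against PyPI's JSON API (network access assumed; latest_version is an illustrative helper):

import json
import urllib.request

def latest_version(package: str) -> str:
    """Return the newest released version of a package via PyPI's JSON API."""
    url = f"https://pypi.org/pypi/{package}/json"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["info"]["version"]

for pkg in ("torch", "transformers", "openai-whisper", "html2text"):
    print(pkg, latest_version(pkg))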

.gitignore (1)

165-166: ✅ Good addition
Ignoring .vscode/ keeps editor artefacts out of the repo.

.python-version (1)

1-1: Python version pin acknowledged

Including .python-version aligns local tooling with the requires-python field.

omniparse/demo.py (1)

136-136: Description updated successfully

The flag description now reflects the Docling migration. No further changes required.

docs/deployment.md (1)

31-31: LGTM! Documentation cleanup aligns with migration.

The removal of Skypilot deployment instructions and retention of Docker-based deployment documentation is consistent with the project's modernization efforts moving to Docling models.

Dockerfile (1)

58-58: LGTM! Transformers version updated appropriately.

The upgrade from transformers 4.41.2 to 4.50.3 aligns with the migration to Docling models and ensures compatibility with newer features.

omniparse/image/router.py (1)

15-15: Verified parse_image signature accepts filename parameter.

The function in omniparse/image/__init__.py is declared as:

def parse_image(image_name, input_data, model_state) -> dict:

which directly matches the call in omniparse/image/router.py:

result: responseDocument = parse_image(file.filename, file_bytes, model_state)

No further changes needed.

docs/README.md (2)

108-108: LGTM! Documentation updated for Docling migration.

The update correctly reflects the change from "Surya OCR series of models" to "Docling models and Florence-2" in the server documentation.


299-299: LGTM! Acknowledgements updated appropriately.

The acknowledgements section correctly credits the Docling project instead of the previous Marker project, reflecting the migration.

examples/OmniParse_GoogleColab.ipynb (4)

46-54: LGTM! Added uv installation for modern dependency management.

The addition of uv installation aligns with the project's migration to modern Python tooling and dependency management.


84-84: LGTM! Dependency installation updated to use uv.

The change from pip install -e . to uv -q sync is consistent with the migration to uv for dependency management.


93-96: LGTM! Proactive fix for matplotlib backend issues.

The addition of matplotlib-inline installation with a helpful comment addresses a common issue in Colab environments.


202-202: LGTM! Server launch commands updated for uv.

The server launch commands are correctly updated to use uv run server.py instead of direct Python execution, maintaining consistency with the uv migration.

Also applies to: 250-250

README.md (2)

96-109: Documentation updates look good!

The model references have been consistently updated to reflect the migration from Surya OCR to Docling models.


296-319: Limitations and acknowledgements properly updated

The documentation correctly reflects the new Docling-based implementation and gives appropriate credit.

omniparse/__init__.py (1)

65-66: Verify torch.float32 usage for vision model

The vision model is explicitly set to use float32, which uses more memory than float16. Was this intentional?

If memory usage is a concern and the model supports it, consider using float16:

 shared_state.vision_model = AutoModelForCausalLM.from_pretrained(
-    "microsoft/Florence-2-base", torch_dtype=torch.float32, trust_remote_code=True
+    "microsoft/Florence-2-base", torch_dtype=torch.float16 if device.type == "cuda" else torch.float32, trust_remote_code=True
 ).to(device)

This would use float16 on GPU (where it's typically supported) and float32 on CPU.

@@ -163,36 +186,22 @@ async def parse_any_endpoint(file: UploadFile = File(...)):
         )
         input_path = output_pdf_path

     # Common parsing logic
-    full_text, images, out_meta = convert_single_pdf(input_path, model_state.model_list)
+    docling_result = model_state.docling_converter.convert(Path(input_path))

🛠️ Refactor suggestion

Use DocumentStream consistently across all endpoints

This endpoint uses Path while other endpoints use DocumentStream. This inconsistency could lead to different behavior.

-    docling_result = model_state.docling_converter.convert(Path(input_path))
+    with open(input_path, "rb") as f:
+        file_bytes = f.read()
+    source = DocumentStream(name=file.filename, stream=BytesIO(file_bytes))
+    docling_result = model_state.docling_converter.convert(source)

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In omniparse/documents/router.py at line 189, replace the use of
Path(input_path) with a DocumentStream instance to maintain consistency across
all endpoints. Modify the code so that docling_converter.convert receives a
DocumentStream object instead of a Path, ensuring uniform behavior and interface
usage throughout the codebase.

 from fastapi import UploadFile
 from fastapi.responses import JSONResponse
-from moviepy.editor import VideoFileClip
+from moviepy import VideoFileClip

⚠️ Potential issue

Importing VideoFileClip from the top-level moviepy module will raise ImportError on most installs
VideoFileClip is re-exported only by moviepy.editor; the root package does not expose it. Importing from the wrong path will break parse_video at runtime.

-from moviepy import VideoFileClip
+from moviepy.editor import VideoFileClip
🤖 Prompt for AI Agents
In omniparse/media/__init__.py at line 22, change the import of VideoFileClip
from the top-level moviepy module to import it from moviepy.editor instead,
because VideoFileClip is only re-exported by moviepy.editor and importing from
moviepy directly causes ImportError at runtime.

Comment on lines +8 to +21

 def encode_images(document_name, document, inputDocument: responseDocument):
     file_name = os.path.splitext(document_name)[0]
     for idx, image in enumerate(document.pictures):
         with tempfile.NamedTemporaryFile(delete=True, suffix=".png") as tmp_file:
             filename = tmp_file.name
             image.get_image(document).save(filename, "PNG")
             with open(filename, "rb") as f:
                 image_bytes = f.read()

-        inputDocument.add_image(image_name=filename, image_data=image_base64)
-
-        # Remove the temporary image file
-        os.remove(filename)
+            # Convert image to base64
+            image_base64 = base64.b64encode(image_bytes).decode("utf-8")
+            inputDocument.add_image(
+                image_name=f"{file_name}_image_{idx}", image_data=image_base64
+            )

🛠️ Refactor suggestion

Add error handling and optimize file operations

The function should handle potential errors during image extraction and can be optimized to avoid redundant file operations.

Apply this diff to improve error handling and efficiency:

 def encode_images(document_name, document, inputDocument: responseDocument):
     file_name = os.path.splitext(document_name)[0]
+    if not hasattr(document, 'pictures'):
+        return
+        
     for idx, image in enumerate(document.pictures):
-        with tempfile.NamedTemporaryFile(delete=True, suffix=".png") as tmp_file:
-            filename = tmp_file.name
-            image.get_image(document).save(filename, "PNG")
-            with open(filename, "rb") as f:
-                image_bytes = f.read()
-
-            # Convert image to base64
-            image_base64 = base64.b64encode(image_bytes).decode("utf-8")
-            inputDocument.add_image(
-                image_name=f"{file_name}_image_{idx}", image_data=image_base64
-            )
+        try:
+            # Get PIL image and convert directly to bytes
+            pil_image = image.get_image(document)
+            buffer = io.BytesIO()
+            pil_image.save(buffer, format="PNG")
+            image_bytes = buffer.getvalue()
+            
+            # Convert image to base64
+            image_base64 = base64.b64encode(image_bytes).decode("utf-8")
+            inputDocument.add_image(
+                image_name=f"{file_name}_image_{idx}", image_data=image_base64
+            )
+        except Exception as e:
+            print(f"Warning: Failed to extract image {idx} from {document_name}: {e}")
+            continue

Don't forget to add the import:

import io
🤖 Prompt for AI Agents
In omniparse/utils.py around lines 8 to 21, the encode_images function lacks
error handling and inefficiently writes images to temporary files before reading
them back. To fix this, wrap the image processing code in a try-except block to
catch and handle exceptions during image extraction. Replace the temporary file
usage by saving the image directly into an in-memory bytes buffer using
io.BytesIO, then encode the image bytes from this buffer to base64. Also, add
the import statement "import io" at the top of the file.

Comment on lines +1 to 9

 [project]
 name = "omniparse"
-version = "0.0.1"
+version = "0.0.2"
 description = "API to convert Documents | Media | Webpage into Structured Markdown for LLM usecases"
-authors = ["Adithya S K <[email protected]>"]
-license = "Apache"
+authors = [
+    {name = "Adithya S K", email = "[email protected]"}
+]
+license = {file = "LICENSE"}
 readme = "README.md"

⚠️ Potential issue

[build-system] table was removed – wheels can no longer be built

PEP 517 builds require a [build-system] section. Without it, cloning users (and CI/CD) will be unable to pip install . or publish to PyPI.

+[build-system]
+requires = ["setuptools>=69", "wheel"]
+build-backend = "setuptools.build_meta"

Add this (or an equivalent backend such as uv) to restore buildability.

📝 Committable suggestion


Suggested change

[project]
name = "omniparse"
version = "0.0.2"
description = "API to convert Documents | Media | Webpage into Structured Markdown for LLM usecases"
authors = [
    {name = "Adithya S K", email = "[email protected]"}
]
license = {file = "LICENSE"}
readme = "README.md"

[build-system]
requires = ["setuptools>=69", "wheel"]
build-backend = "setuptools.build_meta"
🤖 Prompt for AI Agents
In pyproject.toml lines 1 to 9, the [build-system] table is missing, which is
required by PEP 517 for building wheels and installing the package. Add a
[build-system] section specifying the build backend (e.g.,
"setuptools.build_meta") and the required build dependencies (e.g., "setuptools"
and "wheel") to enable pip install and PyPI publishing.

@@ -50,13 +50,15 @@ conda activate omniparse-venv
 Install Dependencies:

 ```bash
-poetry install
+uv sync
+# or
+pip install -e .
+# or
+pip install -r pyproject.toml
 ```

⚠️ Potential issue

Fix incorrect pip install command

The command pip install -r pyproject.toml is incorrect. The -r flag expects a requirements.txt file, not a pyproject.toml file.

Apply this diff to fix the installation command:

-pip install -r pyproject.toml
+pip install .

Alternatively, if you want to list all installation methods:

-uv sync
-# or
-pip install -e .
-# or
-pip install -r pyproject.toml
+uv sync              # Using uv package manager
+# or
+pip install -e .     # Editable install with pip
+# or  
+pip install .        # Regular install with pip
📝 Committable suggestion


Suggested change
-pip install -r pyproject.toml
+pip install .
🤖 Prompt for AI Agents
In README.md at line 57, the pip install command incorrectly uses the -r flag
with pyproject.toml, which is invalid because -r expects a requirements.txt
file. Replace the command with the correct installation method for
pyproject.toml, such as using a tool like poetry or pip install with the
appropriate options, or remove the -r flag if installing from a requirements
file. Update the command to reflect the proper way to install dependencies from
pyproject.toml.
