Feature/add zerox parser (#1394)
* Add KG tests (#1351)

* cli tests

* add sdk tests

* typo fix

* change workflow ordering

* add collection integration tests (#1352)

* bump pkg

* remove workflows

* fix sdk test port

* fix delete collection return check

* Fix document info serialization (#1353)

* Update integration-test-workflow-debian.yml

* pre-commit

* slightly modify

* up

* up

* smaller file

* up

* typo, change order

* up

* up

* change order

---------

Co-authored-by: emrgnt-cmplxty <[email protected]>
Co-authored-by: emrgnt-cmplxty <[email protected]>
Co-authored-by: Nolan Tremelling <[email protected]>

* add graphrag docs (#1362)

* add documentation

* up

* Update js/sdk/src/models.tsx

Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>

* pre-commit

---------

Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>

* Concurrent index creation, allow -1 for paginated entries (#1363)

* update webdev-template for current next.js and r2r-js sdk (#1218)

Co-authored-by: Simeon <[email protected]>

* Feature/extend integration tests rebased (#1361)

* cleanups

* add back overzealous edits

* extend workflows

* fix full setup

* simplify cli

* add ymls

* rename to light

* try again

* start light

* add cli tests

* fix

* fix

* testing..

* trying complete matrix testflow

* cleanup matrix logic

* cleanup matrix logic

* cleanup matrix logic

* cleanup matrix logic

* cleanup matrix logic

* cleanup matrix logic

* cleanup matrix logic

* up

* up

* up

* All actions

* rename to runner

* rename to runner

* rename to runner

* rename to runner

* rename to runner

* rename to runner

* rename to runner

* rename to runner

* rename to runner

* try offic pgvec formula

* sudo make

* sudo make

* push and pray

* push and pray

* add new actions

* add new actions

* docker push & pray

* inspect manifests during launch

* inspect manifests during launch

* inspect manifests during launch

* inspect manifests during launch

* setup docker

* setup docker

* fix default

* fix default

* Feature/rebase to r2r vars (#1364)

* cleanups

* add back overzealous edits

* extend workflows

* fix full setup

* simplify cli

* add ymls

* rename to light

* try again

* start light

* add cli tests

* fix

* fix

* testing..

* trying complete matrix testflow

* cleanup matrix logic

* cleanup matrix logic

* cleanup matrix logic

* cleanup matrix logic

* cleanup matrix logic

* cleanup matrix logic

* cleanup matrix logic

* up

* up

* up

* All actions

* rename to runner

* rename to runner

* rename to runner

* rename to runner

* rename to runner

* rename to runner

* rename to runner

* rename to runner

* rename to runner

* try offic pgvec formula

* sudo make

* sudo make

* push and pray

* push and pray

* add new actions

* add new actions

* docker push & pray

* inspect manifests during launch

* inspect manifests during launch

* inspect manifests during launch

* inspect manifests during launch

* setup docker

* setup docker

* fix default

* fix default

* make changes

* update the windows workflow

* update the windows workflow

* remove extra workflows for now

* bump pkg

* push and pray

* revive full workflow

* revive full workflow

* revive full workflow

* revive full workflow

* revive full workflow

* revive full workflow

* revive full workflow

* revive full workflow

* revive tests

* revive tests

* revive tests

* revive tests

* update tests

* fix typos (#1366)

* update tests

* up

* up

* up

* bump max connections

* bump max connections

* bump max connections

* bump max connections

* bump max connections

* bump max connections

* bump max connections

* bump max connections

* bump max connections

* bump max connections

* bump max connections

* bump max connections

* bump max connections

* Add ingestion concurrency limit (#1367)

* up

* up

* up

---------

Co-authored-by: --global=Shreyas Pimpalgaonkar <[email protected]>

* tweaks and fixes

* Fix Ollama Tool Calling (#1372)

* Update graphrag.mdx

* Fix Ollama tool calling

---------

Co-authored-by: Shreyas Pimpalgaonkar <[email protected]>

* Clean up Docker Compose (#1368)

* Fix hatchet, dockerfile

* Update compose

* point to correct docker image

* Fix bug in deletion, better validation error handling (#1374)

* Update graphrag.mdx

* Fix bug in deletion, better validation error handling

---------

Co-authored-by: Shreyas Pimpalgaonkar <[email protected]>

* vec index creation endpoint (#1373)

* Update graphrag.mdx

* upload files

* create vector index endpoint

* add to fastapi background task

* pre-commit

* move logging

* add api spec, support for all vecs

* pre-commit

* add workflow

* Modify KG Endpoints and update API spec (#1369)

* Update graphrag.mdx

* modify API endpoints and update documentation

* Update ingestion_router.py

* try different docker setup (#1371)

* try different docker setup

* action

* add login

* add full

* update action

* cleanup upload script

* cleanup upload script

* tweak action

* tweak action

* tweak action

* tweak action

* tweak action

* tweak action

* Nolan/ingest chunks js (#1375)

* Update graphrag.mdx

* Clean up ingest chunks, add to JS SDK

* Update JS docs

---------

Co-authored-by: Shreyas Pimpalgaonkar <[email protected]>

* up (#1376)

* Bump JS package (#1378)

* add conversation

* checkin progress

* checkin progress

* Fix Create Graph (#1379)

* up

* up

* modify assertion

* up

* up

* increase entity limit

* changing aristotle back to v2

* pre-commit

* typos

* add test_ingest_sample_file_2_sdk

* Update server.py

* checkin progress

* up

* update

* Graphrag docs (#1382)

* add docs and refine code

* add python SDK documentation

* up

* update

* checkin

* up

* cleanup

* working sync logging

* test conversation history

* fix runner tests, rename `CHUNKS` to `chunks`

* adding zerox parser

---------

Co-authored-by: Shreyas Pimpalgaonkar <[email protected]>
Co-authored-by: Nolan Tremelling <[email protected]>
Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>
Co-authored-by: FutureProofTechOps <[email protected]>
Co-authored-by: Simeon <[email protected]>
Co-authored-by: --global=Shreyas Pimpalgaonkar <[email protected]>
7 people authored Oct 14, 2024
1 parent 450c993 commit 89089af
Showing 14 changed files with 900 additions and 674 deletions.
2 changes: 2 additions & 0 deletions py/Dockerfile
@@ -3,11 +3,13 @@ FROM python:3.10-slim AS builder
 # Install system dependencies
 RUN apt-get update && apt-get install -y --no-install-recommends \
     gcc g++ musl-dev curl libffi-dev gfortran libopenblas-dev \
+    poppler-utils \
     && apt-get clean && rm -rf /var/lib/apt/lists/* \
     && curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y

 RUN pip install --no-cache-dir poetry
+

 # Add Rust to PATH
 ENV PATH="/root/.cargo/bin:${PATH}"

4 changes: 3 additions & 1 deletion py/core/base/parsers/base_parser.py
@@ -10,5 +10,7 @@

 class AsyncParser(ABC, Generic[T]):
     @abstractmethod
-    async def ingest(self, data: T) -> AsyncGenerator[DataType, None]:
+    async def ingest(
+        self, data: T, **kwargs
+    ) -> AsyncGenerator[DataType, None]:
         pass

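The signature change above is small but load-bearing: every `ingest` implementation now accepts `**kwargs`, so per-request options can flow from the ingestion config down to an individual parser without touching each call site. A minimal, self-contained sketch of the idea (the `EchoParser` class and its `encoding` option are illustrative, not part of the repository):

import asyncio
from abc import ABC, abstractmethod
from typing import AsyncGenerator, Generic, TypeVar

T = TypeVar("T")


class AsyncParser(ABC, Generic[T]):
    """Mirror of the interface above: ingest yields text chunks asynchronously."""

    @abstractmethod
    async def ingest(self, data: T, **kwargs) -> AsyncGenerator[str, None]:
        pass


class EchoParser(AsyncParser[bytes]):
    """Hypothetical parser: decodes bytes and ignores kwargs it does not understand."""

    async def ingest(self, data: bytes, **kwargs) -> AsyncGenerator[str, None]:
        # An option such as `zerox_parsing_model` would arrive here via **kwargs.
        encoding = kwargs.get("encoding", "utf-8")
        yield data.decode(encoding, errors="replace")


async def demo() -> None:
    async for chunk in EchoParser().ingest(b"hello", encoding="utf-8"):
        print(chunk)


asyncio.run(demo())
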
1 change: 1 addition & 0 deletions py/core/base/providers/ingestion.py
@@ -10,6 +10,7 @@
 class IngestionConfig(ProviderConfig):
     provider: str = "r2r"
     excluded_parsers: list[str] = ["mp4"]
+    extra_parsers: dict[str, str] = {}

     @property
     def supported_providers(self) -> list[str]:

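How `extra_parsers` is meant to be populated can be inferred from `_initialize_parsers` further down in this diff: each entry maps a document type to the name of an optional parser registered under `EXTRA_PARSERS`. A hedged sketch of a config fragment (the key spelling and how the dict is loaded into `IngestionConfig` are assumptions):

# Sketch only: values mirror EXTRA_PARSERS[DocumentType.PDF]["zerox"] in r2r/base.py below.
ingestion_config = {
    "provider": "r2r",
    "excluded_parsers": ["mp4"],
    # register the zerox PDF parser in addition to the default PDFParser
    "extra_parsers": {"pdf": "zerox"},
}
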
2 changes: 2 additions & 0 deletions py/core/parsers/media/__init__.py
@@ -5,6 +5,7 @@
     PDFParser,
     PDFParserMarker,
     PDFParserUnstructured,
+    ZeroxPDFParser,
 )
 from .ppt_parser import PPTParser

@@ -14,6 +15,7 @@
     "ImageParser",
     "PDFParser",
     "PDFParserUnstructured",
+    "ZeroxPDFParser",
     "PDFParserMarker",
     "PPTParser",
 ]

2 changes: 1 addition & 1 deletion py/core/parsers/media/audio_parser.py
@@ -15,7 +15,7 @@ def __init__(
         self.openai_api_key = os.environ.get("OPENAI_API_KEY")

     async def ingest(  # type: ignore
-        self, data: bytes, chunk_size: int = 1024
+        self, data: bytes, chunk_size: int = 1024, **kwargs
     ) -> AsyncGenerator[str, None]:
         """Ingest audio data and yield a transcription."""
         temp_audio_path = "temp_audio.wav"

2 changes: 1 addition & 1 deletion py/core/parsers/media/docx_parser.py
@@ -18,7 +18,7 @@ def __init__(self):
                 "Error, `python-docx` is required to run `DOCXParser`. Please install it using `pip install python-docx`."
             )

-    async def ingest(self, data: DataType) -> AsyncGenerator[str, None]:  # type: ignore
+    async def ingest(self, data: DataType, **kwargs) -> AsyncGenerator[str, None]:  # type: ignore
         """Ingest DOCX data and yield text from each paragraph."""
         if isinstance(data, str):
             raise ValueError("DOCX data must be in bytes format.")

2 changes: 1 addition & 1 deletion py/core/parsers/media/img_parser.py
@@ -27,7 +27,7 @@ def __init__(
         self.max_image_size = max_image_size

     async def ingest(  # type: ignore
-        self, data: DataType, chunk_size: int = 1024
+        self, data: DataType, chunk_size: int = 1024, **kwargs
     ) -> AsyncGenerator[str, None]:
         """Ingest image data and yield a description."""

59 changes: 56 additions & 3 deletions py/core/parsers/media/pdf_parser.py
@@ -10,6 +10,7 @@
 from core.base.parsers.base_parser import AsyncParser

 logger = logging.getLogger(__name__)
+ZEROX_DEFAULT_MODEL = "openai/gpt-4o-mini"


 class PDFParser(AsyncParser[DataType]):
@@ -25,7 +26,9 @@ def __init__(self):
                 "Error, `pypdf` is required to run `PyPDFParser`. Please install it using `pip install pypdf`."
             )

-    async def ingest(self, data: DataType) -> AsyncGenerator[str, None]:
+    async def ingest(
+        self, data: DataType, **kwargs
+    ) -> AsyncGenerator[str, None]:
         """Ingest PDF data and yield text from each page."""
         if isinstance(data, str):
             raise ValueError("PDF data must be in bytes format.")
@@ -76,7 +79,7 @@ def __init__(self):
                 "Error, `pdfminer.six` is required to run `PDFParser`. Please install it using `pip install pdfminer.six`."
             )

-    async def ingest(self, data: bytes) -> AsyncGenerator[str, None]:
+    async def ingest(self, data: bytes, **kwargs) -> AsyncGenerator[str, None]:
         """Ingest PDF data and yield text from each page."""
         if not isinstance(data, bytes):
             raise ValueError("PDF data must be in bytes format.")
@@ -156,11 +159,61 @@ def __init__(self):
                 f"Error, marker is not installed {e}, please install using `pip install marker-pdf` "
             )

-    async def ingest(self, data: DataType) -> AsyncGenerator[str, None]:
+    async def ingest(
+        self, data: DataType, **kwargs
+    ) -> AsyncGenerator[str, None]:
         if isinstance(data, str):
             raise ValueError("PDF data must be in bytes format.")

         text, _, _ = self.convert_single_pdf(
             BytesIO(data), PDFParserMarker.model_refs
         )
         yield text
+
+
+class ZeroxPDFParser(AsyncParser[DataType]):
+    """An advanced PDF parser using zerox."""
+
+    def __init__(self):
+        """
+        Use the zerox library to parse PDF data.
+        Args:
+            cleanup (bool, optional): Whether to clean up temporary files after processing. Defaults to True.
+            concurrency (int, optional): The number of concurrent processes to run. Defaults to 10.
+            file_data (Optional[str], optional): The file data to process. Defaults to an empty string.
+            maintain_format (bool, optional): Whether to maintain the format from the previous page. Defaults to False.
+            model (str, optional): The model to use for generating completions. Defaults to "gpt-4o-mini". Refer to LiteLLM Providers for the correct model name, as it may differ depending on the provider.
+            temp_dir (str, optional): The directory to store temporary files, defaults to some named folder in system's temp directory. If already exists, the contents will be deleted before zerox uses it.
+            custom_system_prompt (str, optional): The system prompt to use for the model, this overrides the default system prompt of zerox. Generally it is not required unless you want some specific behaviour. When set, it will raise a friendly warning. Defaults to None.
+            kwargs (dict, optional): Additional keyword arguments to pass to the litellm.completion method. Refer to the LiteLLM Documentation and Completion Input for details.
+        """
+        try:
+            # from pyzerox import zerox
+            from .zerox.py_zerox.pyzerox import zerox
+
+            self.zerox = zerox
+
+        except ImportError as e:
+            raise ValueError(
+                f"Error, zerox is not installed {e}, please install using `pip install py-zerox` "
+            )
+
+    async def ingest(
+        self, data: DataType, **kwargs
+    ) -> AsyncGenerator[str, None]:
+        if isinstance(data, str):
+            raise ValueError("PDF data must be in bytes format.")
+
+        model = kwargs.get("zerox_parsing_model", ZEROX_DEFAULT_MODEL)
+        model = model.split("/")[-1]  # remove the provider prefix
+
+        result = await self.zerox(
+            file_data=data,
+            model=model,
+            verbose=True,
+        )
+
+        for page in result.pages:
+            yield page.content
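A hedged usage sketch for the new parser (the import path, input file name, and runtime environment are assumptions; zerox itself needs `py-zerox` installed plus credentials for the chosen LiteLLM model): the parser takes raw PDF bytes and yields model-generated text per page, with the model selectable through the `zerox_parsing_model` kwarg seen in `ingest` above.

import asyncio

from core.parsers.media import ZeroxPDFParser  # exported via media/__init__.py above


async def main() -> None:
    parser = ZeroxPDFParser()  # raises a ValueError if py-zerox is not installed

    with open("sample.pdf", "rb") as f:  # hypothetical input file
        data = f.read()

    pages: list[str] = []
    # the "openai/" provider prefix is stripped inside ingest before calling zerox
    async for page_text in parser.ingest(
        data, zerox_parsing_model="openai/gpt-4o-mini"
    ):
        pages.append(page_text)

    print(f"parsed {len(pages)} pages")


if __name__ == "__main__":
    asyncio.run(main())
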
2 changes: 1 addition & 1 deletion py/core/parsers/media/ppt_parser.py
@@ -18,7 +18,7 @@ def __init__(self):
                 "Error, `python-pptx` is required to run `PPTParser`. Please install it using `pip install python-pptx`."
             )

-    async def ingest(self, data: DataType) -> AsyncGenerator[str, None]:  # type: ignore
+    async def ingest(self, data: DataType, **kwargs) -> AsyncGenerator[str, None]:  # type: ignore
         """Ingest PPT data and yield text from each slide."""
         if isinstance(data, str):
             raise ValueError("PPT data must be in bytes format.")

1 change: 1 addition & 0 deletions py/core/parsers/media/zerox
Submodule zerox added at bdc5f3
91 changes: 64 additions & 27 deletions py/core/providers/ingestion/r2r/base.py
@@ -31,23 +31,33 @@ class R2RIngestionConfig(IngestionConfig):


 class R2RIngestionProvider(IngestionProvider):
-    AVAILABLE_PARSERS = {
-        DocumentType.CSV: [parsers.CSVParser, parsers.CSVParserAdvanced],
-        DocumentType.DOCX: [parsers.DOCXParser],
-        DocumentType.HTML: [parsers.HTMLParser],
-        DocumentType.HTM: [parsers.HTMLParser],
-        DocumentType.JSON: [parsers.JSONParser],
-        DocumentType.MD: [parsers.MDParser],
-        DocumentType.PDF: [parsers.PDFParser, parsers.PDFParserUnstructured],
-        DocumentType.PPTX: [parsers.PPTParser],
-        DocumentType.TXT: [parsers.TextParser],
-        DocumentType.XLSX: [parsers.XLSXParser, parsers.XLSXParserAdvanced],
-        DocumentType.GIF: [parsers.ImageParser],
-        DocumentType.JPEG: [parsers.ImageParser],
-        DocumentType.JPG: [parsers.ImageParser],
-        DocumentType.PNG: [parsers.ImageParser],
-        DocumentType.SVG: [parsers.ImageParser],
-        DocumentType.MP3: [parsers.AudioParser],
+    DEFAULT_PARSERS = {
+        DocumentType.CSV: parsers.CSVParser,
+        DocumentType.DOCX: parsers.DOCXParser,
+        DocumentType.HTML: parsers.HTMLParser,
+        DocumentType.HTM: parsers.HTMLParser,
+        DocumentType.JSON: parsers.JSONParser,
+        DocumentType.MD: parsers.MDParser,
+        DocumentType.PDF: parsers.PDFParser,
+        DocumentType.PPTX: parsers.PPTParser,
+        DocumentType.TXT: parsers.TextParser,
+        DocumentType.XLSX: parsers.XLSXParser,
+        DocumentType.GIF: parsers.ImageParser,
+        DocumentType.JPEG: parsers.ImageParser,
+        DocumentType.JPG: parsers.ImageParser,
+        DocumentType.PNG: parsers.ImageParser,
+        DocumentType.SVG: parsers.ImageParser,
+        DocumentType.MP3: parsers.AudioParser,
     }

+    EXTRA_PARSERS = {
+        DocumentType.CSV: {"advanced": parsers.CSVParserAdvanced},
+        DocumentType.PDF: {
+            "unstructured": parsers.PDFParserUnstructured,
+            "zerox": parsers.ZeroxPDFParser,
+            "marker": parsers.PDFParserMarker,
+        },
+        DocumentType.XLSX: {"advanced": parsers.XLSXParserAdvanced},
+    }
+
     IMAGE_TYPES = {
@@ -70,14 +80,14 @@ def __init__(self, config: R2RIngestionConfig):
         )

     def _initialize_parsers(self):
-        for doc_type, parser_infos in self.AVAILABLE_PARSERS.items():
-            for parser_info in parser_infos:
-                if (
-                    doc_type not in self.config.excluded_parsers
-                    and doc_type not in self.parsers
-                ):
-                    # will choose the first parser in the list
-                    self.parsers[doc_type] = parser_info()
+        for doc_type, parser in self.DEFAULT_PARSERS.items():
+            # will choose the first parser in the list
+            if doc_type not in self.config.excluded_parsers:
+                self.parsers[doc_type] = parser()
+        for doc_type, doc_parser_name in self.config.extra_parsers.items():
+            self.parsers[f"{doc_parser_name}_{str(doc_type)}"] = (
+                R2RIngestionProvider.EXTRA_PARSERS[doc_type][doc_parser_name]()
+            )

     def _build_text_splitter(
         self, ingestion_config_override: Optional[dict] = None
@@ -178,8 +188,35 @@ async def parse(  # type: ignore
         t0 = time.time()

         contents = ""
-        async for text in self.parsers[document.type].ingest(file_content):
-            contents += text + "\n"
+        parser_overrides = ingestion_config_override.get(
+            "parser_overrides", {}
+        )
+        print("parser_overrides = ", parser_overrides)
+        print("document.type.value = ", document.type.value)
+        print(
+            "document.type.value in parser_overrides = ",
+            document.type.value in parser_overrides,
+        )
+        if document.type.value in parser_overrides:
+            print("In zerox parser ...")
+            # TODO - Cleanup this approach to be less hardcoded
+            if (
+                document.type != DocumentType.PDF
+                or parser_overrides[DocumentType.PDF.value] != "zerox"
+            ):
+                raise ValueError(
+                    "Only Zerox PDF parser override is available."
+                )
+            print("keys = ", self.parsers.keys())
+            async for text in self.parsers[
+                f"zerox_{DocumentType.PDF.value}"
+            ].ingest(file_content, **ingestion_config_override):
+                contents += text + "\n"
+        else:
+            async for text in self.parsers[document.type].ingest(
+                file_content, **ingestion_config_override
+            ):
+                contents += text + "\n"

         iteration = 0
         chunks = self.chunk(contents, ingestion_config_override)
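Putting the pieces together, the per-request path looks roughly like this (a sketch, not SDK documentation): a `parser_overrides` entry in the ingestion config override routes `parse` to the zerox parser that `_initialize_parsers` registered, and any extra keys are forwarded to `ZeroxPDFParser.ingest` via `**kwargs`.

# Hedged sketch of the override payload accepted by `parse` above.
ingestion_config_override = {
    # only the zerox override for PDFs is accepted; anything else raises ValueError
    "parser_overrides": {"pdf": "zerox"},
    # forwarded untouched to ZeroxPDFParser.ingest via **ingestion_config_override
    "zerox_parsing_model": "openai/gpt-4o-mini",
}

# Inside `parse`, this resolves to roughly:
#   parser = self.parsers[f"zerox_{DocumentType.PDF.value}"]  # "zerox_" + the enum value
#   async for text in parser.ingest(file_content, **ingestion_config_override):
#       contents += text + "\n"
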
