Feature/add zerox parser (#1394)
* Add KG tests (#1351)

* cli tests

* add sdk tests

* typo fix

* change workflow ordering

* add collection integration tests (#1352)

* bump pkg

* remove workflows

* fix sdk test port

* fix delete collection return check

* Fix document info serialization (#1353)

* Update integration-test-workflow-debian.yml

* pre-commit

* slightly modify

* up

* up

* smaller file

* up

* typo, change order

* up

* up

* change order

---------

Co-authored-by: emrgnt-cmplxty <[email protected]>
Co-authored-by: emrgnt-cmplxty <[email protected]>
Co-authored-by: Nolan Tremelling <[email protected]>

* add graphrag docs (#1362)

* add documentation

* up

* Update js/sdk/src/models.tsx

Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>

* pre-commit

---------

Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>

* Concurrent index creation, allow -1 for paginated entries (#1363)

* update webdev-template for current next.js and r2r-js sdk (#1218)

Co-authored-by: Simeon <[email protected]>

* Feature/extend integration tests rebased (#1361)

* cleanups

* add back overzealous edits

* extend workflows

* fix full setup

* simplify cli

* add ymls

* rename to light

* try again

* start light

* add cli tests

* fix

* fix

* testing..

* trying complete matrix testflow

* cleanup matrix logic

* cleanup matrix logic

* cleanup matrix logic

* cleanup matrix logic

* cleanup matrix logic

* cleanup matrix logic

* cleanup matrix logic

* up

* up

* up

* All actions

* rename to runner

* rename to runner

* rename to runner

* rename to runner

* rename to runner

* rename to runner

* rename to runner

* rename to runner

* rename to runner

* try offic pgvec formula

* sudo make

* sudo make

* push and pray

* push and pray

* add new actions

* add new actions

* docker push & pray

* inspect manifests during launch

* inspect manifests during launch

* inspect manifests during launch

* inspect manifests during launch

* setup docker

* setup docker

* fix default

* fix default

* Feature/rebase to r2r vars (#1364)

* cleanups

* add back overzealous edits

* extend workflows

* fix full setup

* simplify cli

* add ymls

* rename to light

* try again

* start light

* add cli tests

* fix

* fix

* testing..

* trying complete matrix testflow

* cleanup matrix logic

* cleanup matrix logic

* cleanup matrix logic

* cleanup matrix logic

* cleanup matrix logic

* cleanup matrix logic

* cleanup matrix logic

* up

* up

* up

* All actions

* rename to runner

* rename to runner

* rename to runner

* rename to runner

* rename to runner

* rename to runner

* rename to runner

* rename to runner

* rename to runner

* try offic pgvec formula

* sudo make

* sudo make

* push and pray

* push and pray

* add new actions

* add new actions

* docker push & pray

* inspect manifests during launch

* inspect manifests during launch

* inspect manifests during launch

* inspect manifests during launch

* setup docker

* setup docker

* fix default

* fix default

* make changes

* update the windows workflow

* update the windows workflow

* remove extra workflows for now

* bump pkg

* push and pray

* revive full workflow

* revive full workflow

* revive full workflow

* revive full workflow

* revive full workflow

* revive full workflow

* revive full workflow

* revive full workflow

* revive tests

* revive tests

* revive tests

* revive tests

* update tests

* fix typos (#1366)

* update tests

* up

* up

* up

* bump max connections

* bump max connections

* bump max connections

* bump max connections

* bump max connections

* bump max connections

* bump max connections

* bump max connections

* bump max connections

* bump max connections

* bump max connections

* bump max connections

* bump max connections

* Add ingestion concurrency limit (#1367)

* up

* up

* up

---------

Co-authored-by: --global=Shreyas Pimpalgaonkar <[email protected]>

* tweaks and fixes

* Fix Ollama Tool Calling (#1372)

* Update graphrag.mdx

* Fix Ollama tool calling

---------

Co-authored-by: Shreyas Pimpalgaonkar <[email protected]>

* Clean up Docker Compose (#1368)

* Fix hatchet, dockerfile

* Update compose

* point to correct docker image

* Fix bug in deletion, better validation error handling (#1374)

* Update graphrag.mdx

* Fix bug in deletion, better validation error handling

---------

Co-authored-by: Shreyas Pimpalgaonkar <[email protected]>

* vec index creation endpoint (#1373)

* Update graphrag.mdx

* upload files

* create vector index endpoint

* add to fastapi background task

* pre-commit

* move logging

* add api spec, support for all vecs

* pre-commit

* add workflow

* Modify KG Endpoints and update API spec (#1369)

* Update graphrag.mdx

* modify API endpoints and update documentation

* Update ingestion_router.py

* try different docker setup (#1371)

* try different docker setup

* action

* add login

* add full

* update action

* cleanup upload script

* cleanup upload script

* tweak action

* tweak action

* tweak action

* tweak action

* tweak action

* tweak action

* Nolan/ingest chunks js (#1375)

* Update graphrag.mdx

* Clean up ingest chunks, add to JS SDK

* Update JS docs

---------

Co-authored-by: Shreyas Pimpalgaonkar <[email protected]>

* up (#1376)

* Bump JS package (#1378)

* add conversation

* checkin progress

* checkin progress

* Fix Create Graph (#1379)

* up

* up

* modify assertion

* up

* up

* increase entity limit

* changing aristotle back to v2

* pre-commit

* typos

* add test_ingest_sample_file_2_sdk

* Update server.py

* checkin progress

* up

* update

* Graphrag docs (#1382)

* add docs and refine code

* add python SDK documentation

* up

* update

* checkin

* up

* cleanup

* working sync logging

* test conversation history

* fix runner tests, rename `CHUNKS` to `chunks`

* adding zerox parser

---------

Co-authored-by: Shreyas Pimpalgaonkar <[email protected]>
Co-authored-by: Nolan Tremelling <[email protected]>
Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>
Co-authored-by: FutureProofTechOps <[email protected]>
Co-authored-by: Simeon <[email protected]>
Co-authored-by: --global=Shreyas Pimpalgaonkar <[email protected]>
7 people authored Oct 14, 2024
1 parent 450c993 commit 89089af
Showing 14 changed files with 900 additions and 674 deletions.
2 changes: 2 additions & 0 deletions py/Dockerfile
@@ -3,11 +3,13 @@ FROM python:3.10-slim AS builder
 # Install system dependencies
 RUN apt-get update && apt-get install -y --no-install-recommends \
     gcc g++ musl-dev curl libffi-dev gfortran libopenblas-dev \
+    poppler-utils \
     && apt-get clean && rm -rf /var/lib/apt/lists/* \
     && curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y

 RUN pip install --no-cache-dir poetry
+

 # Add Rust to PATH
 ENV PATH="/root/.cargo/bin:${PATH}"

4 changes: 3 additions & 1 deletion py/core/base/parsers/base_parser.py
@@ -10,5 +10,7 @@

 class AsyncParser(ABC, Generic[T]):
     @abstractmethod
-    async def ingest(self, data: T) -> AsyncGenerator[DataType, None]:
+    async def ingest(
+        self, data: T, **kwargs
+    ) -> AsyncGenerator[DataType, None]:
         pass

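The signature change above is small but load-bearing: every `ingest` implementation now accepts `**kwargs`, so per-request options can flow from the ingestion config down to an individual parser without touching each call site. A minimal, self-contained sketch of the idea (the `EchoParser` class and its `encoding` option are illustrative, not part of the repository):

import asyncio
from abc import ABC, abstractmethod
from typing import AsyncGenerator, Generic, TypeVar

T = TypeVar("T")


class AsyncParser(ABC, Generic[T]):
    """Mirror of the interface above: ingest yields text chunks asynchronously."""

    @abstractmethod
    async def ingest(self, data: T, **kwargs) -> AsyncGenerator[str, None]:
        pass


class EchoParser(AsyncParser[bytes]):
    """Hypothetical parser: decodes bytes and ignores kwargs it does not understand."""

    async def ingest(self, data: bytes, **kwargs) -> AsyncGenerator[str, None]:
        # An option such as `zerox_parsing_model` would arrive here via **kwargs.
        encoding = kwargs.get("encoding", "utf-8")
        yield data.decode(encoding, errors="replace")


async def demo() -> None:
    async for chunk in EchoParser().ingest(b"hello", encoding="utf-8"):
        print(chunk)


asyncio.run(demo())
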
1 change: 1 addition & 0 deletions py/core/base/providers/ingestion.py
@@ -10,6 +10,7 @@
 class IngestionConfig(ProviderConfig):
     provider: str = "r2r"
     excluded_parsers: list[str] = ["mp4"]
+    extra_parsers: dict[str, str] = {}

     @property
     def supported_providers(self) -> list[str]:

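How `extra_parsers` is meant to be populated can be inferred from `_initialize_parsers` further down in this diff: each entry maps a document type to the name of an optional parser registered under `EXTRA_PARSERS`. A hedged sketch of a config fragment (the key spelling and how the dict is loaded into `IngestionConfig` are assumptions):

# Sketch only: values mirror EXTRA_PARSERS[DocumentType.PDF]["zerox"] in r2r/base.py below.
ingestion_config = {
    "provider": "r2r",
    "excluded_parsers": ["mp4"],
    # register the zerox PDF parser in addition to the default PDFParser
    "extra_parsers": {"pdf": "zerox"},
}
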
2 changes: 2 additions & 0 deletions py/core/parsers/media/__init__.py
@@ -5,6 +5,7 @@
     PDFParser,
     PDFParserMarker,
     PDFParserUnstructured,
+    ZeroxPDFParser,
 )
 from .ppt_parser import PPTParser

@@ -14,6 +15,7 @@
     "ImageParser",
     "PDFParser",
     "PDFParserUnstructured",
+    "ZeroxPDFParser",
     "PDFParserMarker",
     "PPTParser",
 ]

2 changes: 1 addition & 1 deletion py/core/parsers/media/audio_parser.py
@@ -15,7 +15,7 @@ def __init__(
         self.openai_api_key = os.environ.get("OPENAI_API_KEY")

     async def ingest(  # type: ignore
-        self, data: bytes, chunk_size: int = 1024
+        self, data: bytes, chunk_size: int = 1024, **kwargs
     ) -> AsyncGenerator[str, None]:
         """Ingest audio data and yield a transcription."""
         temp_audio_path = "temp_audio.wav"

2 changes: 1 addition & 1 deletion py/core/parsers/media/docx_parser.py
@@ -18,7 +18,7 @@ def __init__(self):
                 "Error, `python-docx` is required to run `DOCXParser`. Please install it using `pip install python-docx`."
             )

-    async def ingest(self, data: DataType) -> AsyncGenerator[str, None]:  # type: ignore
+    async def ingest(self, data: DataType, **kwargs) -> AsyncGenerator[str, None]:  # type: ignore
         """Ingest DOCX data and yield text from each paragraph."""
         if isinstance(data, str):
             raise ValueError("DOCX data must be in bytes format.")

2 changes: 1 addition & 1 deletion py/core/parsers/media/img_parser.py
@@ -27,7 +27,7 @@ def __init__(
         self.max_image_size = max_image_size

     async def ingest(  # type: ignore
-        self, data: DataType, chunk_size: int = 1024
+        self, data: DataType, chunk_size: int = 1024, **kwargs
     ) -> AsyncGenerator[str, None]:
         """Ingest image data and yield a description."""

59 changes: 56 additions & 3 deletions py/core/parsers/media/pdf_parser.py
@@ -10,6 +10,7 @@
 from core.base.parsers.base_parser import AsyncParser

 logger = logging.getLogger(__name__)
+ZEROX_DEFAULT_MODEL = "openai/gpt-4o-mini"


 class PDFParser(AsyncParser[DataType]):
@@ -25,7 +26,9 @@ def __init__(self):
                 "Error, `pypdf` is required to run `PyPDFParser`. Please install it using `pip install pypdf`."
             )

-    async def ingest(self, data: DataType) -> AsyncGenerator[str, None]:
+    async def ingest(
+        self, data: DataType, **kwargs
+    ) -> AsyncGenerator[str, None]:
         """Ingest PDF data and yield text from each page."""
         if isinstance(data, str):
             raise ValueError("PDF data must be in bytes format.")
@@ -76,7 +79,7 @@ def __init__(self):
                 "Error, `pdfminer.six` is required to run `PDFParser`. Please install it using `pip install pdfminer.six`."
             )

-    async def ingest(self, data: bytes) -> AsyncGenerator[str, None]:
+    async def ingest(self, data: bytes, **kwargs) -> AsyncGenerator[str, None]:
         """Ingest PDF data and yield text from each page."""
         if not isinstance(data, bytes):
             raise ValueError("PDF data must be in bytes format.")
@@ -156,11 +159,61 @@ def __init__(self):
                 f"Error, marker is not installed {e}, please install using `pip install marker-pdf` "
             )

-    async def ingest(self, data: DataType) -> AsyncGenerator[str, None]:
+    async def ingest(
+        self, data: DataType, **kwargs
+    ) -> AsyncGenerator[str, None]:
         if isinstance(data, str):
             raise ValueError("PDF data must be in bytes format.")

         text, _, _ = self.convert_single_pdf(
             BytesIO(data), PDFParserMarker.model_refs
         )
         yield text
+
+
+class ZeroxPDFParser(AsyncParser[DataType]):
+    """An advanced PDF parser using zerox."""
+
+    def __init__(self):
+        """
+        Use the zerox library to parse PDF data.
+        Args:
+            cleanup (bool, optional): Whether to clean up temporary files after processing. Defaults to True.
+            concurrency (int, optional): The number of concurrent processes to run. Defaults to 10.
+            file_data (Optional[str], optional): The file data to process. Defaults to an empty string.
+            maintain_format (bool, optional): Whether to maintain the format from the previous page. Defaults to False.
+            model (str, optional): The model to use for generating completions. Defaults to "gpt-4o-mini". Refer to LiteLLM Providers for the correct model name, as it may differ depending on the provider.
+            temp_dir (str, optional): The directory to store temporary files, defaults to some named folder in system's temp directory. If already exists, the contents will be deleted before zerox uses it.
+            custom_system_prompt (str, optional): The system prompt to use for the model, this overrides the default system prompt of zerox. Generally it is not required unless you want some specific behaviour. When set, it will raise a friendly warning. Defaults to None.
+            kwargs (dict, optional): Additional keyword arguments to pass to the litellm.completion method. Refer to the LiteLLM Documentation and Completion Input for details.
+        """
+        try:
+            # from pyzerox import zerox
+            from .zerox.py_zerox.pyzerox import zerox
+
+            self.zerox = zerox
+
+        except ImportError as e:
+            raise ValueError(
+                f"Error, zerox is not installed {e}, please install using `pip install py-zerox` "
+            )
+
+    async def ingest(
+        self, data: DataType, **kwargs
+    ) -> AsyncGenerator[str, None]:
+        if isinstance(data, str):
+            raise ValueError("PDF data must be in bytes format.")
+
+        model = kwargs.get("zerox_parsing_model", ZEROX_DEFAULT_MODEL)
+        model = model.split("/")[-1]  # remove the provider prefix
+
+        result = await self.zerox(
+            file_data=data,
+            model=model,
+            verbose=True,
+        )
+
+        for page in result.pages:
+            yield page.content
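A hedged usage sketch for the new parser (the import path, input file name, and runtime environment are assumptions; zerox itself needs `py-zerox` installed plus credentials for the chosen LiteLLM model): the parser takes raw PDF bytes and yields model-generated text per page, with the model selectable through the `zerox_parsing_model` kwarg seen in `ingest` above.

import asyncio

from core.parsers.media import ZeroxPDFParser  # exported via media/__init__.py above


async def main() -> None:
    parser = ZeroxPDFParser()  # raises a ValueError if py-zerox is not installed

    with open("sample.pdf", "rb") as f:  # hypothetical input file
        data = f.read()

    pages: list[str] = []
    # the "openai/" provider prefix is stripped inside ingest before calling zerox
    async for page_text in parser.ingest(
        data, zerox_parsing_model="openai/gpt-4o-mini"
    ):
        pages.append(page_text)

    print(f"parsed {len(pages)} pages")


if __name__ == "__main__":
    asyncio.run(main())
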
2 changes: 1 addition & 1 deletion py/core/parsers/media/ppt_parser.py
@@ -18,7 +18,7 @@ def __init__(self):
                 "Error, `python-pptx` is required to run `PPTParser`. Please install it using `pip install python-pptx`."
             )

-    async def ingest(self, data: DataType) -> AsyncGenerator[str, None]:  # type: ignore
+    async def ingest(self, data: DataType, **kwargs) -> AsyncGenerator[str, None]:  # type: ignore
         """Ingest PPT data and yield text from each slide."""
         if isinstance(data, str):
             raise ValueError("PPT data must be in bytes format.")

1 change: 1 addition & 0 deletions py/core/parsers/media/zerox
Submodule zerox added at bdc5f3
91 changes: 64 additions & 27 deletions py/core/providers/ingestion/r2r/base.py
@@ -31,23 +31,33 @@ class R2RIngestionConfig(IngestionConfig):


 class R2RIngestionProvider(IngestionProvider):
-    AVAILABLE_PARSERS = {
-        DocumentType.CSV: [parsers.CSVParser, parsers.CSVParserAdvanced],
-        DocumentType.DOCX: [parsers.DOCXParser],
-        DocumentType.HTML: [parsers.HTMLParser],
-        DocumentType.HTM: [parsers.HTMLParser],
-        DocumentType.JSON: [parsers.JSONParser],
-        DocumentType.MD: [parsers.MDParser],
-        DocumentType.PDF: [parsers.PDFParser, parsers.PDFParserUnstructured],
-        DocumentType.PPTX: [parsers.PPTParser],
-        DocumentType.TXT: [parsers.TextParser],
-        DocumentType.XLSX: [parsers.XLSXParser, parsers.XLSXParserAdvanced],
-        DocumentType.GIF: [parsers.ImageParser],
-        DocumentType.JPEG: [parsers.ImageParser],
-        DocumentType.JPG: [parsers.ImageParser],
-        DocumentType.PNG: [parsers.ImageParser],
-        DocumentType.SVG: [parsers.ImageParser],
-        DocumentType.MP3: [parsers.AudioParser],
+    DEFAULT_PARSERS = {
+        DocumentType.CSV: parsers.CSVParser,
+        DocumentType.DOCX: parsers.DOCXParser,
+        DocumentType.HTML: parsers.HTMLParser,
+        DocumentType.HTM: parsers.HTMLParser,
+        DocumentType.JSON: parsers.JSONParser,
+        DocumentType.MD: parsers.MDParser,
+        DocumentType.PDF: parsers.PDFParser,
+        DocumentType.PPTX: parsers.PPTParser,
+        DocumentType.TXT: parsers.TextParser,
+        DocumentType.XLSX: parsers.XLSXParser,
+        DocumentType.GIF: parsers.ImageParser,
+        DocumentType.JPEG: parsers.ImageParser,
+        DocumentType.JPG: parsers.ImageParser,
+        DocumentType.PNG: parsers.ImageParser,
+        DocumentType.SVG: parsers.ImageParser,
+        DocumentType.MP3: parsers.AudioParser,
     }

+    EXTRA_PARSERS = {
+        DocumentType.CSV: {"advanced": parsers.CSVParserAdvanced},
+        DocumentType.PDF: {
+            "unstructured": parsers.PDFParserUnstructured,
+            "zerox": parsers.ZeroxPDFParser,
+            "marker": parsers.PDFParserMarker,
+        },
+        DocumentType.XLSX: {"advanced": parsers.XLSXParserAdvanced},
+    }
+
     IMAGE_TYPES = {
@@ -70,14 +80,14 @@ def __init__(self, config: R2RIngestionConfig):
         )

     def _initialize_parsers(self):
-        for doc_type, parser_infos in self.AVAILABLE_PARSERS.items():
-            for parser_info in parser_infos:
-                if (
-                    doc_type not in self.config.excluded_parsers
-                    and doc_type not in self.parsers
-                ):
-                    # will choose the first parser in the list
-                    self.parsers[doc_type] = parser_info()
+        for doc_type, parser in self.DEFAULT_PARSERS.items():
+            # will choose the first parser in the list
+            if doc_type not in self.config.excluded_parsers:
+                self.parsers[doc_type] = parser()
+        for doc_type, doc_parser_name in self.config.extra_parsers.items():
+            self.parsers[f"{doc_parser_name}_{str(doc_type)}"] = (
+                R2RIngestionProvider.EXTRA_PARSERS[doc_type][doc_parser_name]()
+            )

     def _build_text_splitter(
         self, ingestion_config_override: Optional[dict] = None
@@ -178,8 +188,35 @@ async def parse(  # type: ignore
         t0 = time.time()

         contents = ""
-        async for text in self.parsers[document.type].ingest(file_content):
-            contents += text + "\n"
+        parser_overrides = ingestion_config_override.get(
+            "parser_overrides", {}
+        )
+        print("parser_overrides = ", parser_overrides)
+        print("document.type.value = ", document.type.value)
+        print(
+            "document.type.value in parser_overrides = ",
+            document.type.value in parser_overrides,
+        )
+        if document.type.value in parser_overrides:
+            print("In zerox parser ...")
+            # TODO - Cleanup this approach to be less hardcoded
+            if (
+                document.type != DocumentType.PDF
+                or parser_overrides[DocumentType.PDF.value] != "zerox"
+            ):
+                raise ValueError(
+                    "Only Zerox PDF parser override is available."
+                )
+            print("keys = ", self.parsers.keys())
+            async for text in self.parsers[
+                f"zerox_{DocumentType.PDF.value}"
+            ].ingest(file_content, **ingestion_config_override):
+                contents += text + "\n"
+        else:
+            async for text in self.parsers[document.type].ingest(
+                file_content, **ingestion_config_override
+            ):
+                contents += text + "\n"

         iteration = 0
         chunks = self.chunk(contents, ingestion_config_override)
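Putting the pieces together, the per-request path looks roughly like this (a sketch, not SDK documentation): a `parser_overrides` entry in the ingestion config override routes `parse` to the zerox parser that `_initialize_parsers` registered, and any extra keys are forwarded to `ZeroxPDFParser.ingest` via `**kwargs`.

# Hedged sketch of the override payload accepted by `parse` above.
ingestion_config_override = {
    # only the zerox override for PDFs is accepted; anything else raises ValueError
    "parser_overrides": {"pdf": "zerox"},
    # forwarded untouched to ZeroxPDFParser.ingest via **ingestion_config_override
    "zerox_parsing_model": "openai/gpt-4o-mini",
}

# Inside `parse`, this resolves to roughly:
#   parser = self.parsers[f"zerox_{DocumentType.PDF.value}"]  # "zerox_" + the enum value
#   async for text in parser.ingest(file_content, **ingestion_config_override):
#       contents += text + "\n"
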
