
LlamaParse

LlamaParse is a GenAI-native document parser that can parse complex document data for any downstream LLM use case (RAG, agents).

It excels at the following:

  • Broad file type support: Parsing a variety of unstructured file types (.pdf, .pptx, .docx, .xlsx, .html) with text, tables, visual elements, irregular layouts, and more.
  • Table recognition: Parsing embedded tables accurately into text and semi-structured representations.
  • Multimodal parsing and chunking: Extracting visual elements (images/diagrams) into structured formats and returning image chunks using the latest multimodal models.
  • Custom parsing: Supplying custom prompt instructions to tailor the output to your needs.

LlamaParse directly integrates with LlamaIndex.

The free plan covers up to 1,000 pages a day. The paid plan includes 7,000 free pages per week, plus 0.3 cents per additional page by default. A sandbox for testing the API is available at https://cloud.llamaindex.ai/parse.
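As a rough illustration of the paid-plan arithmetic above, the sketch below estimates a weekly charge from a page count. The function name and parameters are hypothetical, and the defaults simply restate the figures quoted in this README (7,000 free pages per week, 0.3 cents per additional page); actual billing may differ.

```python
from decimal import Decimal


def estimate_weekly_cost_cents(
    pages: int,
    free_pages: int = 7_000,  # weekly free allowance quoted in this README
    cents_per_extra_page: Decimal = Decimal("0.3"),  # default overage rate quoted in this README
) -> Decimal:
    """Return an estimated weekly charge in cents (hypothetical helper)."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    # Only pages beyond the free allowance are billed.
    billable_pages = max(0, pages - free_pages)
    return billable_pages * cents_per_extra_page


if __name__ == "__main__":
    # 10,000 pages in a week leaves 3,000 billable pages.
    print(estimate_weekly_cost_cents(10_000))
```

Using `Decimal` rather than `float` keeps the fractional-cent rate exact.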

Read below for some quickstart information, or see the full documentation.

If you're a company interested in enterprise RAG solutions, and/or high volume/on-prem usage of LlamaParse, come talk to us.

Getting Started

First, log in and get an API key from https://cloud.llamaindex.ai/api-key.

Then, install the package:

pip install llama-cloud-services

Now you can parse your first PDF file using the command line interface. Use the command llama-parse [file_paths]. See the help text with llama-parse --help.

export LLAMA_CLOUD_API_KEY='llx-...'

# output as text
llama-parse my_file.pdf --result-type text --output-file output.txt

# output as markdown
llama-parse my_file.pdf --result-type markdown --output-file output.md

# output as raw json
llama-parse my_file.pdf --output-raw-json --output-file output.json

You can also create simple scripts:

import nest_asyncio

nest_asyncio.apply()

from llama_cloud_services import LlamaParse

parser = LlamaParse(
    api_key="llx-...",  # can also be set in your env as LLAMA_CLOUD_API_KEY
    result_type="markdown",  # "markdown" and "text" are available
    num_workers=4,  # if multiple files passed, split in `num_workers` API calls
    verbose=True,
    language="en",  # Optionally you can define a language, default=en
)

# sync
documents = parser.load_data("./my_file.pdf")

# sync batch
documents = parser.load_data(["./my_file1.pdf", "./my_file2.pdf"])

# async (await is only valid inside an async function or a notebook cell)
documents = await parser.aload_data("./my_file.pdf")

# async batch
documents = await parser.aload_data(["./my_file1.pdf", "./my_file2.pdf"])
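The `await` calls above only work where an event loop is already running, such as a notebook cell. In a plain script, one option is to wrap them in a coroutine and pass it to `asyncio.run`. The sketch below assumes only that the parser object exposes the `aload_data` coroutine shown above; `parse_files` and `main` are hypothetical names, and `main` is defined but not called here.

```python
import asyncio
from typing import Any, List, Sequence


async def parse_files(parser: Any, paths: Sequence[str]) -> List[Any]:
    """Parse several files with the parser's async batch call."""
    # aload_data accepts a list of paths, as in the async batch example above.
    return await parser.aload_data(list(paths))


def main() -> None:
    # Imported here so parse_files stays importable without the package installed.
    from llama_cloud_services import LlamaParse

    # With no api_key argument, the key is read from LLAMA_CLOUD_API_KEY.
    parser = LlamaParse(result_type="markdown")
    documents = asyncio.run(
        parse_files(parser, ["./my_file1.pdf", "./my_file2.pdf"])
    )
    print(f"parsed {len(documents)} documents")


# Call main() from your own script once LLAMA_CLOUD_API_KEY is set.
```

Keeping the coroutine separate from the client construction makes it easy to exercise with a stand-in parser before spending any page quota.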

Using with file object

You can parse a file object directly:

import nest_asyncio

nest_asyncio.apply()

from llama_cloud_services import LlamaParse

parser = LlamaParse(
    api_key="llx-...",  # can also be set in your env as LLAMA_CLOUD_API_KEY
    result_type="markdown",  # "markdown" and "text" are available
    num_workers=4,  # if multiple files passed, split in `num_workers` API calls
    verbose=True,
    language="en",  # Optionally you can define a language, default=en
)

file_name = "my_file1.pdf"
extra_info = {"file_name": file_name}

with open(f"./{file_name}", "rb") as f:
    # extra_info with a file_name key is required when passing a file object
    documents = parser.load_data(f, extra_info=extra_info)

# you can also pass file bytes directly
with open(f"./{file_name}", "rb") as f:
    file_bytes = f.read()
    # extra_info with a file_name key is required when passing file bytes
    documents = parser.load_data(file_bytes, extra_info=extra_info)
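When the bytes and the file name travel through several functions, it can help to keep them together. The helper below is a hypothetical convenience, not part of the library: it reads a file and builds the `{"file_name": ...}` dictionary that the examples above pass alongside a file object or bytes.

```python
from pathlib import Path
from typing import Dict, Tuple


def read_with_extra_info(path: str) -> Tuple[bytes, Dict[str, str]]:
    """Return the file's bytes together with an extra_info dict (hypothetical helper)."""
    file_path = Path(path)
    # The examples above pass a "file_name" key whenever bytes or a file object is used.
    extra_info = {"file_name": file_path.name}
    return file_path.read_bytes(), extra_info


# Usage with a parser configured as above (not executed here):
# file_bytes, extra_info = read_with_extra_info("./my_file1.pdf")
# documents = parser.load_data(file_bytes, extra_info=extra_info)
```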

Using with SimpleDirectoryReader

You can also integrate the parser as the default PDF loader in SimpleDirectoryReader:

import nest_asyncio

nest_asyncio.apply()

from llama_cloud_services import LlamaParse
from llama_index.core import SimpleDirectoryReader

parser = LlamaParse(
    api_key="llx-...",  # can also be set in your env as LLAMA_CLOUD_API_KEY
    result_type="markdown",  # "markdown" and "text" are available
    verbose=True,
)

file_extractor = {".pdf": parser}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()
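The `file_extractor` argument is a dictionary from file extension to reader, so the same parser can be registered for more than one extension. The sketch below builds that dictionary from a list of extensions. `build_file_extractor` is a hypothetical helper, and the extensions in the usage comment are drawn from the file types named earlier in this README.

```python
from typing import Any, Dict, Iterable


def build_file_extractor(parser: Any, extensions: Iterable[str]) -> Dict[str, Any]:
    """Map each extension to the same parser (hypothetical helper)."""
    extractor: Dict[str, Any] = {}
    for ext in extensions:
        # Normalize to a lowercase extension with a leading dot, e.g. "PDF" -> ".pdf".
        ext = ext.strip().lower()
        if not ext.startswith("."):
            ext = "." + ext
        extractor[ext] = parser
    return extractor


# Usage with the objects from the example above (not executed here):
# file_extractor = build_file_extractor(parser, [".pdf", ".docx", ".pptx"])
# documents = SimpleDirectoryReader("./data", file_extractor=file_extractor).load_data()
```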

Full documentation for SimpleDirectoryReader can be found on the LlamaIndex Documentation.

Examples

Several end-to-end indexing examples can be found in the examples folder.

Documentation

https://docs.cloud.llamaindex.ai/