Commit 067b19f

cleanup docs
1 parent 187ba14 commit 067b19f

3 files changed

+82
-195
lines changed

docs/api-reference/openapi.json

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.
Lines changed: 75 additions & 188 deletions
Original file line numberDiff line numberDiff line change
@@ -1,210 +1,97 @@
1-
---
2-
title: 'Parsing & Chunking'
3-
description: 'Learn how to configure document chunking in your R2R deployment'
4-
---
1+
## Parsing & Chunking
52

6-
## Parsing
3+
R2R supports different parsing and chunking providers to extract text from various document formats and break it down into manageable pieces for efficient processing and retrieval.
74

8-
R2R supports different parsing providers to extract text from various document formats. To configure the parsing provider:
5+
To configure the parsing and chunking settings, update the `[ingestion]` section in your `r2r.toml` file:
96

10-
```toml example r2r.toml
11-
[parsing]
12-
provider = "unstructured_local" # | rag | unstructured_api
13-
excluded_parsers = ["mp4"]
14-
```
15-
Available providers:
16-
- `r2r`: Default offering for `light` installations, a simple and lightweight parser included in R2R.
17-
- `unstructured_local`: Default offering for `full` installations, makes use of open source Unstructured package.
18-
- `unstructured_api`: Cloud offering of Unstructured
19-
20-
### Supported File Types
21-
22-
**R2R supports parsing for the following file types:**
23-
24-
- BMP (Bitmap Image)
25-
- CSV (Comma-Separated Values)
26-
- DOC (Microsoft Word Document)
27-
- DOCX (Microsoft Word Document)
28-
- EML (Electronic Mail)
29-
- EPUB (Electronic Publication)
30-
- GIF (Graphics Interchange Format)
31-
- HEIC (High-Efficiency Image Format)
32-
- HTM (HyperText Markup)
33-
- HTML (HyperText Markup Language)
34-
- JPEG (Joint Photographic Experts Group)
35-
- JPG (Joint Photographic Experts Group)
36-
- JSON (JavaScript Object Notation)
37-
- MD (Markdown)
38-
- MSG (Microsoft Outlook Message)
39-
- MP3 (MPEG Audio Layer III)
40-
- MP4 (MPEG-4 Part 14)
41-
- ODT (Open Document Text)
42-
- ORG (Org Mode)
43-
- PDF (Portable Document Format)
44-
- P7S (PKCS#7)
45-
- PNG (Portable Network Graphics)
46-
- PPT (PowerPoint)
47-
- PPTX (Microsoft PowerPoint Presentation)
48-
- RST (reStructured Text)
49-
- RTF (Rich Text Format)
50-
- SVG (Scalable Vector Graphics)
51-
- TSV (Tab-Separated Values)
52-
- TXT (Plain Text)
53-
- XLS (Microsoft Excel Spreadsheet)
54-
- XLSX (Microsoft Excel Spreadsheet)
55-
- XML (Extensible Markup Language)
56-
- TIFF (Tagged Image File Format)
57-
- MP4 (MPEG-4 Part 14)
58-
59-
<Note> Parsing providers for an R2R system cannot be configured at runtime and are instead configured server side. </Note>
60-
61-
**Refer to the [Unstructured documentation](https://docs.unstructured.io/welcome) for details about their ingestion capabilities and limitations.**
62-
63-
## Chunking
64-
65-
R2R uses chunking to break down parsed documents into smaller, manageable pieces for efficient processing and retrieval. Configure the chunking settings in `r2r.toml`:
66-
67-
```toml r2r.toml
68-
[chunking]
69-
provider = "unstructured_local"
70-
strategy = "auto"
71-
chunking_strategy = "by_title"
72-
new_after_n_chars = 512
73-
max_characters = 1_024
74-
combine_under_n_chars = 128
75-
overlap = 20
7+
```toml
8+
[ingestion]
9+
provider = "r2r" # or "unstructured_local" or "unstructured_api"
10+
# ... provider-specific settings ...
7611
```
7712

78-
Key chunking configuration options:
79-
80-
- `provider`: The chunking provider (defaults to "r2r").
81-
82-
**For R2R:**
83-
- `chunking_strategy`: The chunking method ("recursive").
84-
- `chunk_size`: The target size for each chunk.
85-
- `chunk_overlap`: The number of characters to overlap between chunks.
86-
- `excluded_parsers`: List of parsers to exclude (e.g., ["mp4"]).
87-
88-
**For Unstructured:**
89-
- `strategy`: The overall chunking strategy ("auto", "fast", or "hi_res").
90-
- `chunking_strategy`: The specific chunking method ("by_title" or "basic").
91-
- `new_after_n_chars`: Soft maximum size for a chunk.
92-
- `max_characters`: Hard maximum size for a chunk.
93-
- `combine_under_n_chars`: Minimum size for combining small sections.
94-
- `overlap`: Number of characters to overlap between chunks.
95-
96-
## Supported Providers
97-
98-
<Tabs>
99-
<Tab title="Unstructured Local">
100-
```python
101-
# Ensure unstructured is installed
102-
# Refer to the full installation docs here - [https://r2r-docs.sciphi.ai/introduction/documentation/installation/full/docker]
103-
104-
# Set 'provider = "unstructured_local"' for `ingestion` in `my_r2r.toml`.
105-
r2r serve --config-path=my_r2r.toml
106-
```
13+
### Supported Providers
14+
15+
R2R offers two main parsing and chunking providers:
16+
17+
1. **R2R (default for 'light' installation)**:
18+
- Uses R2R's built-in parsing and chunking logic.
19+
- Supports a wide range of file types, including TXT, JSON, HTML, PDF, DOCX, PPTX, XLSX, CSV, Markdown, images, audio, and video.
20+
- Configuration options:
21+
```toml
22+
[ingestion]
23+
provider = "r2r"
24+
chunking_strategy = "recursive"
25+
chunk_size = 1_024
26+
chunk_overlap = 512
27+
excluded_parsers = ["mp4"]
28+
```
29+
- `chunking_strategy`: The chunking method ("recursive").
30+
- `chunk_size`: The target size for each chunk.
31+
- `chunk_overlap`: The number of characters to overlap between chunks.
32+
- `excluded_parsers`: List of parsers to exclude (e.g., ["mp4"]).
33+
34+
2. **Unstructured (default for 'full' installation)**:
35+
- Leverages Unstructured's open-source ingestion platform.
36+
- Provides more advanced parsing capabilities.
37+
- Configuration options:
38+
```toml
39+
[ingestion]
40+
provider = "unstructured_local"
41+
strategy = "auto"
42+
chunking_strategy = "by_title"
43+
new_after_n_chars = 512
44+
max_characters = 1_024
45+
combine_under_n_chars = 128
46+
overlap = 20
47+
```
48+
- `strategy`: The overall chunking strategy ("auto", "fast", or "hi_res").
49+
- `chunking_strategy`: The specific chunking method ("by_title" or "basic").
50+
- `new_after_n_chars`: Soft maximum size for a chunk.
51+
- `max_characters`: Hard maximum size for a chunk.
52+
- `combine_under_n_chars`: Minimum size for combining small sections.
53+
- `overlap`: Number of characters to overlap between chunks.
10754

108-
This is the default `full` provider, using the open-source Unstructured library for local processing.
109-
</Tab>
110-
111-
<Tab title="Unstructured API">
112-
```python
113-
export UNSTRUCTURED_API_KEY=your_unstructured_api_key
114-
export UNSTRUCTURED_API_URL=your_unstructured_api_url
115-
# .. set other environment variables
116-
117-
# Optional - Update default provider
118-
# Set 'provider = "unstructured_api"' for `ingestion` in `my_r2r.toml`.
119-
r2r serve --config-path=my_r2r.toml
120-
```
55+
### Supported File Types
12156

122-
Uses the Unstructured platform API for chunking, which may offer additional features or performance benefits.
123-
</Tab>
57+
Both R2R and Unstructured providers support parsing a wide range of file types, including:
12458

125-
<Tab title="R2R">
126-
```python
127-
# No additional setup required
59+
- TXT, JSON, HTML, PDF, DOCX, PPTX, XLSX, CSV, Markdown, images (BMP, GIF, HEIC, JPEG, JPG, PNG, SVG, TIFF), audio (MP3), video (MP4), and more.
12860

129-
# Optional - Update default provider
130-
# Set 'provider = "r2r"' for `ingestion` in `my_r2r.toml`.
131-
r2r serve --config-path=my_r2r.toml
132-
```
133-
This is the default `light` provider, using the open-source R2R library for local processing.
61+
Refer to the [Unstructured documentation](https://docs.unstructured.io/welcome) for more details on their ingestion capabilities and limitations.
13462

135-
</Tab>
136-
</Tabs>
63+
### Configuring Parsing & Chunking
13764

138-
### Advanced Configuration Options
65+
To configure parsing and chunking settings, update the `[ingestion]` section in your `r2r.toml` file with the desired provider and its specific settings.
13966

140-
When using the Unstructured chunking provider, you can specify additional parameters in the configuration file:
67+
For example, to use the R2R provider with custom chunk size and overlap:
14168

14269
```toml
14370
[ingestion]
144-
provider = "unstructured_local"
145-
strategy = "auto" # "auto", "fast", or "hi_res"
146-
chunking_strategy = "by_title" # "by_title" or "basic"
147-
148-
# Core chunking parameters
149-
combine_under_n_chars = 128
150-
max_characters = 500
151-
new_after_n_chars = 1500
152-
overlap = 0
153-
154-
# Additional chunking options
155-
coordinates = false
156-
encoding = "utf-8"
157-
extract_image_block_types = [] # List of image block types to extract
158-
gz_uncompressed_content_type = null
159-
hi_res_model_name = null
160-
include_orig_elements = true
161-
include_page_breaks = false
162-
163-
languages = [] # List of languages to consider
164-
multipage_sections = true
165-
ocr_languages = [] # List of languages for OCR
166-
output_format = "application/json"
167-
overlap_all = false
168-
pdf_infer_table_structure = true
169-
170-
similarity_threshold = null
171-
skip_infer_table_types = [] # List of table types to skip inference
172-
split_pdf_concurrency_level = 5
173-
split_pdf_page = true
174-
starting_page_number = null
175-
unique_element_ids = false
176-
xml_keep_tags = false
71+
provider = "r2r"
72+
chunking_strategy = "recursive"
73+
chunk_size = 2_048
74+
chunk_overlap = 256
75+
excluded_parsers = ["mp4"]
17776
```
17877

179-
These options allow fine-tuning of the chunking process for specific document types or requirements. Refer to the Unstructured [documentation here](https://docs.unstructured.io/open-source/core-functionality/chunking) for more details on the available settings.
180-
181-
### Runtime Configuration
182-
183-
The chunking configuration can be specified at runtime with the [`ingest_files`](/api-reference/endpoint/ingest_files) endpoint, allowing dynamic adjustment of chunking parameters based on the input documents or specific use cases.
78+
Or, to use the Unstructured provider with a specific chunking strategy and character limits:
18479

185-
### Combining Chunking with Other R2R Components
186-
187-
Chunking is a crucial part of the document processing pipeline in R2R. It works in conjunction with other components such as parsing, embedding, and retrieval. For example:
188-
189-
```python
190-
response = client.ingest_files(
191-
file_paths=["document.pdf"],
192-
ingestion_config={
193-
"provider": "unstructured_local",
194-
"chunking_strategy": "by_title",
195-
"max_characters": 1000
196-
},
197-
embedding_config={...},
198-
# ... other configurations
199-
)
80+
```toml
81+
[ingestion]
82+
provider = "unstructured_local"
83+
strategy = "hi_res"
84+
chunking_strategy = "basic"
85+
new_after_n_chars = 1_000
86+
max_characters = 2_000
87+
combine_under_n_chars = 256
88+
overlap = 50
20089
```
20190

202-
For more detailed information on configuring chunking and other ingestion settings, please refer to the [Ingestion Configuration documentation](/documentation/configuration/ingestion/overview).
203-
204-
## Next Steps
91+
Adjust the settings based on your specific requirements and the characteristics of your input documents.
20592

206-
To learn more about configuring other components of R2R, explore the following pages:
93+
### Next Steps
20794

208-
- [Embedding Configuration](/documentation/configuration/ingestion/embedding)
209-
- [Knowledge Graph Configuration](/documentation/configuration/knowledge-graph/overview)
210-
- [Retrieval Configuration](/documentation/configuration/retrieval/overview)
95+
- Learn more about [Embedding Configuration](/documentation/configuration/ingestion/embedding).
96+
- Explore [Knowledge Graph Configuration](/documentation/configuration/knowledge-graph/overview).
97+
- Check out [Retrieval Configuration](/documentation/configuration/retrieval/overview).
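Note on the `[ingestion]` settings documented in this file: the option descriptions imply two relationships that are easy to break when editing `r2r.toml` by hand. `new_after_n_chars` is a soft maximum and `max_characters` a hard maximum, so the former should not exceed the latter; likewise `chunk_overlap` is expected to stay below `chunk_size`. The sketch below checks both. It is only an illustration, not part of R2R: `check_ingestion` and `load_config` are hypothetical helpers, the fallback to `"r2r"` when `provider` is missing is an assumption taken from the old docs text, and `tomllib` requires Python 3.11 or newer.

```python
from typing import Any


def load_config(path: str) -> dict[str, Any]:
    """Read a TOML file such as r2r.toml (hypothetical helper, Python 3.11+)."""
    import tomllib  # imported lazily so the checker itself works on older Pythons

    with open(path, "rb") as fh:
        return tomllib.load(fh)


def check_ingestion(cfg: dict[str, Any]) -> list[str]:
    """Return human-readable problems found in the [ingestion] table."""
    problems: list[str] = []
    ing = cfg.get("ingestion", {})
    provider = ing.get("provider", "r2r")  # assumed default

    if provider == "r2r":
        size = ing.get("chunk_size")
        overlap = ing.get("chunk_overlap")
        if size is not None and overlap is not None and overlap >= size:
            problems.append("chunk_overlap should be smaller than chunk_size")
    elif provider in ("unstructured_local", "unstructured_api"):
        soft = ing.get("new_after_n_chars")
        hard = ing.get("max_characters")
        if soft is not None and hard is not None and soft > hard:
            problems.append(
                "new_after_n_chars (soft max) exceeds max_characters (hard max)"
            )
    else:
        problems.append(f"unknown provider: {provider!r}")
    return problems


# The Unstructured example added by this commit passes the check.
added_example = {
    "ingestion": {
        "provider": "unstructured_local",
        "strategy": "hi_res",
        "chunking_strategy": "basic",
        "new_after_n_chars": 1_000,
        "max_characters": 2_000,
        "combine_under_n_chars": 256,
        "overlap": 50,
    }
}
print(check_ingestion(added_example))
```

For a file on disk, `check_ingestion(load_config("r2r.toml"))` would run the same checks. The removed "Advanced Configuration Options" example (`max_characters = 500`, `new_after_n_chars = 1500`) would be flagged by this check, which may be one reason to prefer the values in the new example.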

py/core/main/api/data/retrieval_router_openapi.yml

Lines changed: 6 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -12,7 +12,7 @@ search:
1212
query="Who is Aristotle?",
1313
vector_search_settings={
1414
"use_vector_search": True,
15-
"filters": {"document_id": {"eq": "3e157b3a-8469-51db-90d9-52e7d896b49b"}},
15+
"filters": {"document_id": {"$eq": "3e157b3a-8469-51db-90d9-52e7d896b49b"}},
1616
"search_limit": 20,
1717
"use_hybrid_search": True
1818
},
@@ -42,7 +42,7 @@ search:
4242
"query": "Who is Aristotle?",
4343
"vector_search_settings": {
4444
"use_vector_search": true,
45-
"filters": {"document_id": {"eq": "3e157b3a-8469-51db-90d9-52e7d896b49b"}},
45+
"filters": {"document_id": {"$eq": "3e157b3a-8469-51db-90d9-52e7d896b49b"}},
4646
"search_limit": 20,
4747
"use_hybrid_search": true
4848
},
@@ -83,7 +83,7 @@ rag:
8383
query="Who is Aristotle?",
8484
vector_search_settings={
8585
"use_vector_search": True,
86-
"filters": {"document_id": {"eq": "3e157b3a-8469-51db-90d9-52e7d896b49b"}},
86+
"filters": {"document_id": {"$eq": "3e157b3a-8469-51db-90d9-52e7d896b49b"}},
8787
"search_limit": 20,
8888
"use_hybrid_search": True
8989
},
@@ -118,7 +118,7 @@ rag:
118118
"query": "Who is Aristotle?",
119119
"vector_search_settings": {
120120
"use_vector_search": true,
121-
"filters": {"document_id": {"eq": "3e157b3a-8469-51db-90d9-52e7d896b49b"}},
121+
"filters": {"document_id": {"$eq": "3e157b3a-8469-51db-90d9-52e7d896b49b"}},
122122
"search_limit": 20,
123123
"use_hybrid_search": True
124124
},
@@ -173,7 +173,7 @@ agent:
173173
],
174174
vector_search_settings={
175175
"use_vector_search": True,
176-
"filters": {"document_id": {"eq": "5e157b3a-8469-51db-90d9-52e7d896b49b"}},
176+
"filters": {"document_id": {"$eq": "5e157b3a-8469-51db-90d9-52e7d896b49b"}},
177177
"search_limit": 20,
178178
"use_hybrid_search": True
179179
},
@@ -197,7 +197,7 @@ agent:
197197
],
198198
"vector_search_settings": {
199199
"use_vector_search": true,
200-
"filters": {"document_id": {"eq": "5e157b3a-8469-51db-90d9-52e7d896b49b"}},
200+
"filters": {"document_id": {"$eq": "5e157b3a-8469-51db-90d9-52e7d896b49b"}},
201201
"search_limit": 20,
202202
"use_hybrid_search": true
203203
},
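Note on the filter change in this file: all six hunks rewrite the operator key from `eq` to `$eq`. Client code that still builds filters in the old form could be adapted with a small helper along these lines. This is a sketch under stated assumptions: `normalize_filters` and `LEGACY_OPERATORS` are hypothetical names, not part of the R2R client, and only the `eq` to `$eq` rename is confirmed by this diff.

```python
from typing import Any

# Only "eq" -> "$eq" is confirmed by the diff; any other operator names
# added here would be an assumption.
LEGACY_OPERATORS = {"eq"}


def normalize_filters(filters: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `filters` with bare operator keys given a `$` prefix."""
    normalized: dict[str, Any] = {}
    for field, condition in filters.items():
        if isinstance(condition, dict):
            normalized[field] = {
                (f"${op}" if op in LEGACY_OPERATORS else op): value
                for op, value in condition.items()
            }
        else:
            # Non-dict conditions are passed through untouched.
            normalized[field] = condition
    return normalized


old_filters = {"document_id": {"eq": "3e157b3a-8469-51db-90d9-52e7d896b49b"}}
new_filters = normalize_filters(old_filters)
print(new_filters)
```

The helper builds a new dictionary rather than mutating its input, and keys that already carry the `$` prefix are left alone, so applying it twice gives the same result as applying it once.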
