|
1 |
| ---- |
2 |
| -title: 'Parsing & Chunking' |
3 |
| -description: 'Learn how to configure document chunking in your R2R deployment' |
4 |
| ---- |
| 1 | +## Parsing & Chunking |
5 | 2 |
|
6 |
| -## Parsing |
| 3 | +R2R supports different parsing and chunking providers to extract text from various document formats and break it down into manageable pieces for efficient processing and retrieval. |
7 | 4 |
|
8 |
| -R2R supports different parsing providers to extract text from various document formats. To configure the parsing provider: |
| 5 | +To configure the parsing and chunking settings, update the `[ingestion]` section in your `r2r.toml` file: |
9 | 6 |
|
10 |
| -```toml example r2r.toml |
11 |
| -[parsing] |
12 |
| -provider = "unstructured_local" # | rag | unstructured_api |
13 |
| -excluded_parsers = ["mp4"] |
14 |
| -``` |
15 |
| -Available providers: |
16 |
| -- `r2r`: Default offering for `light` installations, a simple and lightweight parser included in R2R. |
17 |
| -- `unstructured_local`: Default offering for `full` installations, makes use of open source Unstructured package. |
18 |
| -- `unstructured_api`: Cloud offering of Unstructured |
19 |
| - |
20 |
| -### Supported File Types |
21 |
| - |
22 |
| -**R2R supports parsing for the following file types:** |
23 |
| - |
24 |
| -- BMP (Bitmap Image) |
25 |
| -- CSV (Comma-Separated Values) |
26 |
| -- DOC (Microsoft Word Document) |
27 |
| -- DOCX (Microsoft Word Document) |
28 |
| -- EML (Electronic Mail) |
29 |
| -- EPUB (Electronic Publication) |
30 |
| -- GIF (Graphics Interchange Format) |
31 |
| -- HEIC (High-Efficiency Image Format) |
32 |
| -- HTM (HyperText Markup) |
33 |
| -- HTML (HyperText Markup Language) |
34 |
| -- JPEG (Joint Photographic Experts Group) |
35 |
| -- JPG (Joint Photographic Experts Group) |
36 |
| -- JSON (JavaScript Object Notation) |
37 |
| -- MD (Markdown) |
38 |
| -- MSG (Microsoft Outlook Message) |
39 |
| -- MP3 (MPEG Audio Layer III) |
40 |
| -- MP4 (MPEG-4 Part 14) |
41 |
| -- ODT (Open Document Text) |
42 |
| -- ORG (Org Mode) |
43 |
| -- PDF (Portable Document Format) |
44 |
| -- P7S (PKCS#7) |
45 |
| -- PNG (Portable Network Graphics) |
46 |
| -- PPT (PowerPoint) |
47 |
| -- PPTX (Microsoft PowerPoint Presentation) |
48 |
| -- RST (reStructured Text) |
49 |
| -- RTF (Rich Text Format) |
50 |
| -- SVG (Scalable Vector Graphics) |
51 |
| -- TSV (Tab-Separated Values) |
52 |
| -- TXT (Plain Text) |
53 |
| -- XLS (Microsoft Excel Spreadsheet) |
54 |
| -- XLSX (Microsoft Excel Spreadsheet) |
55 |
| -- XML (Extensible Markup Language) |
56 |
| -- TIFF (Tagged Image File Format) |
57 |
| -- MP4 (MPEG-4 Part 14) |
58 |
| - |
59 |
| -<Note> Parsing providers for an R2R system cannot be configured at runtime and are instead configured server side. </Note> |
60 |
| - |
61 |
| -**Refer to the [Unstructured documentation](https://docs.unstructured.io/welcome) for details about their ingestion capabilities and limitations.** |
62 |
| - |
63 |
| -## Chunking |
64 |
| - |
65 |
| -R2R uses chunking to break down parsed documents into smaller, manageable pieces for efficient processing and retrieval. Configure the chunking settings in `r2r.toml`: |
66 |
| - |
67 |
| -```toml r2r.toml |
68 |
| -[chunking] |
69 |
| -provider = "unstructured_local" |
70 |
| -strategy = "auto" |
71 |
| -chunking_strategy = "by_title" |
72 |
| -new_after_n_chars = 512 |
73 |
| -max_characters = 1_024 |
74 |
| -combine_under_n_chars = 128 |
75 |
| -overlap = 20 |
| 7 | +```toml |
| 8 | +[ingestion] |
| 9 | +provider = "r2r" # or "unstructured_local" or "unstructured_api" |
| 10 | +# ... provider-specific settings ... |
76 | 11 | ```
|
77 | 12 |
|
78 |
| -Key chunking configuration options: |
79 |
| - |
80 |
| -- `provider`: The chunking provider (defaults to "r2r"). |
81 |
| - |
82 |
| -**For R2R:** |
83 |
| -- `chunking_strategy`: The chunking method ("recursive"). |
84 |
| -- `chunk_size`: The target size for each chunk. |
85 |
| -- `chunk_overlap`: The number of characters to overlap between chunks. |
86 |
| -- `excluded_parsers`: List of parsers to exclude (e.g., ["mp4"]). |
87 |
| - |
88 |
| -**For Unstructured:** |
89 |
| -- `strategy`: The overall chunking strategy ("auto", "fast", or "hi_res"). |
90 |
| -- `chunking_strategy`: The specific chunking method ("by_title" or "basic"). |
91 |
| -- `new_after_n_chars`: Soft maximum size for a chunk. |
92 |
| -- `max_characters`: Hard maximum size for a chunk. |
93 |
| -- `combine_under_n_chars`: Minimum size for combining small sections. |
94 |
| -- `overlap`: Number of characters to overlap between chunks. |
95 |
| - |
96 |
| -## Supported Providers |
97 |
| - |
98 |
| -<Tabs> |
99 |
| - <Tab title="Unstructured Local"> |
100 |
| - ```python |
101 |
| - # Ensure unstructured is installed |
102 |
| - # Refer to the full installation docs here - [https://r2r-docs.sciphi.ai/introduction/documentation/installation/full/docker] |
103 |
| - |
104 |
| - # Set 'provider = "unstructured_local"' for `ingestion` in `my_r2r.toml`. |
105 |
| - r2r serve --config-path=my_r2r.toml |
106 |
| - ``` |
| 13 | +### Supported Providers |
| 14 | + |
| 15 | +R2R offers two main parsing and chunking providers (Unstructured can run locally as `unstructured_local` or through the hosted `unstructured_api`): |
| 16 | + |
| 17 | +1. **R2R (default for 'light' installation)**: |
| 18 | + - Uses R2R's built-in parsing and chunking logic. |
| 19 | + - Supports a wide range of file types, including TXT, JSON, HTML, PDF, DOCX, PPTX, XLSX, CSV, Markdown, images, audio, and video. |
| 20 | + - Configuration options: |
| 21 | + ```toml |
| 22 | + [ingestion] |
| 23 | + provider = "r2r" |
| 24 | + chunking_strategy = "recursive" |
| 25 | + chunk_size = 1_024 |
| 26 | + chunk_overlap = 512 |
| 27 | + excluded_parsers = ["mp4"] |
| 28 | + ``` |
| 29 | + - `chunking_strategy`: The chunking method ("recursive"). |
| 30 | + - `chunk_size`: The target size for each chunk. |
| 31 | + - `chunk_overlap`: The number of characters to overlap between chunks. |
| 32 | + - `excluded_parsers`: List of parsers to exclude (e.g., ["mp4"]). |
| 33 | + |
| 34 | +2. **Unstructured (default for 'full' installation)**: |
| 35 | + - Leverages Unstructured's open-source ingestion platform. |
| 36 | + - Provides more advanced parsing capabilities. |
| 37 | + - Configuration options: |
| 38 | + ```toml |
| 39 | + [ingestion] |
| 40 | + provider = "unstructured_local" |
| 41 | + strategy = "auto" |
| 42 | + chunking_strategy = "by_title" |
| 43 | + new_after_n_chars = 512 |
| 44 | + max_characters = 1_024 |
| 45 | + combine_under_n_chars = 128 |
| 46 | + overlap = 20 |
| 47 | + ``` |
| 48 | + - `strategy`: The partitioning strategy passed to Unstructured ("auto", "fast", or "hi_res"). |
| 49 | + - `chunking_strategy`: The specific chunking method ("by_title" or "basic"). |
| 50 | + - `new_after_n_chars`: Soft maximum size for a chunk. |
| 51 | + - `max_characters`: Hard maximum size for a chunk. |
| 52 | + - `combine_under_n_chars`: Minimum size for combining small sections. |
| 53 | + - `overlap`: Number of characters to overlap between chunks. |
107 | 54 |
|
108 |
| - This is the default `full` provider, using the open-source Unstructured library for local processing. |
109 |
| - </Tab> |
110 |
| - |
111 |
| - <Tab title="Unstructured API"> |
112 |
| - ```python |
113 |
| - export UNSTRUCTURED_API_KEY=your_unstructured_api_key |
114 |
| - export UNSTRUCTURED_API_URL=your_unstructured_api_url |
115 |
| - # .. set other environment variables |
116 |
| - |
117 |
| - # Optional - Update default provider |
118 |
| - # Set 'provider = "unstructured_api"' for `ingestion` in `my_r2r.toml`. |
119 |
| - r2r serve --config-path=my_r2r.toml |
120 |
| - ``` |
| 55 | +### Supported File Types |
121 | 56 |
|
122 |
| - Uses the Unstructured platform API for chunking, which may offer additional features or performance benefits. |
123 |
| - </Tab> |
| 57 | +Both R2R and Unstructured providers support parsing a wide range of file types, including: |
124 | 58 |
|
125 |
| - <Tab title="R2R"> |
126 |
| - ```python |
127 |
| - # No additional setup required |
| 59 | +- TXT, JSON, HTML, PDF, DOCX, PPTX, XLSX, CSV, Markdown, images (BMP, GIF, HEIC, JPEG, JPG, PNG, SVG, TIFF), audio (MP3), video (MP4), and more. |
128 | 60 |
|
129 |
| - # Optional - Update default provider |
130 |
| - # Set 'provider = "r2r"' for `ingestion` in `my_r2r.toml`. |
131 |
| - r2r serve --config-path=my_r2r.toml |
132 |
| - ``` |
133 |
| - This is the default `light` provider, using the open-source R2R library for local processing. |
| 61 | +Refer to the [Unstructured documentation](https://docs.unstructured.io/welcome) for more details on their ingestion capabilities and limitations. |
134 | 62 |
|
135 |
| - </Tab> |
136 |
| -</Tabs> |
| 63 | +### Configuring Parsing & Chunking |
137 | 64 |
|
138 |
| -### Advanced Configuration Options |
| 65 | +To configure parsing and chunking settings, update the `[ingestion]` section in your `r2r.toml` file with the desired provider and its specific settings. |
139 | 66 |
|
140 |
| -When using the Unstructured chunking provider, you can specify additional parameters in the configuration file: |
| 67 | +For example, to use the R2R provider with custom chunk size and overlap: |
141 | 68 |
|
142 | 69 | ```toml
|
143 | 70 | [ingestion]
|
144 |
| -provider = "unstructured_local" |
145 |
| -strategy = "auto" # "auto", "fast", or "hi_res" |
146 |
| -chunking_strategy = "by_title" # "by_title" or "basic" |
147 |
| - |
148 |
| -# Core chunking parameters |
149 |
| -combine_under_n_chars = 128 |
150 |
| -max_characters = 500 |
151 |
| -new_after_n_chars = 1500 |
152 |
| -overlap = 0 |
153 |
| - |
154 |
| -# Additional chunking options |
155 |
| -coordinates = false |
156 |
| -encoding = "utf-8" |
157 |
| -extract_image_block_types = [] # List of image block types to extract |
158 |
| -gz_uncompressed_content_type = null |
159 |
| -hi_res_model_name = null |
160 |
| -include_orig_elements = true |
161 |
| -include_page_breaks = false |
162 |
| - |
163 |
| -languages = [] # List of languages to consider |
164 |
| -multipage_sections = true |
165 |
| -ocr_languages = [] # List of languages for OCR |
166 |
| -output_format = "application/json" |
167 |
| -overlap_all = false |
168 |
| -pdf_infer_table_structure = true |
169 |
| - |
170 |
| -similarity_threshold = null |
171 |
| -skip_infer_table_types = [] # List of table types to skip inference |
172 |
| -split_pdf_concurrency_level = 5 |
173 |
| -split_pdf_page = true |
174 |
| -starting_page_number = null |
175 |
| -unique_element_ids = false |
176 |
| -xml_keep_tags = false |
| 71 | +provider = "r2r" |
| 72 | +chunking_strategy = "recursive" |
| 73 | +chunk_size = 2_048 |
| 74 | +chunk_overlap = 256 |
| 75 | +excluded_parsers = ["mp4"] |
177 | 76 | ```
|
178 | 77 |
|
179 |
| -These options allow fine-tuning of the chunking process for specific document types or requirements. Refer to the Unstructured [documentation here](https://docs.unstructured.io/open-source/core-functionality/chunking) for more details on the available settings. |
180 |
| - |
181 |
| -### Runtime Configuration |
182 |
| - |
183 |
| -The chunking configuration can be specified at runtime with the [`ingest_files`](/api-reference/endpoint/ingest_files) endpoint, allowing dynamic adjustment of chunking parameters based on the input documents or specific use cases. |
| 78 | +Or, to use the Unstructured provider with a specific chunking strategy and character limits: |
184 | 79 |
|
185 |
| -### Combining Chunking with Other R2R Components |
186 |
| - |
187 |
| -Chunking is a crucial part of the document processing pipeline in R2R. It works in conjunction with other components such as parsing, embedding, and retrieval. For example: |
188 |
| - |
189 |
| -```python |
190 |
| -response = client.ingest_files( |
191 |
| - file_paths=["document.pdf"], |
192 |
| - ingestion_config={ |
193 |
| - "provider": "unstructured_local", |
194 |
| - "chunking_strategy": "by_title", |
195 |
| - "max_characters": 1000 |
196 |
| - }, |
197 |
| - embedding_config={...}, |
198 |
| - # ... other configurations |
199 |
| -) |
| 80 | +```toml |
| 81 | +[ingestion] |
| 82 | +provider = "unstructured_local" |
| 83 | +strategy = "hi_res" |
| 84 | +chunking_strategy = "basic" |
| 85 | +new_after_n_chars = 1_000 |
| 86 | +max_characters = 2_000 |
| 87 | +combine_under_n_chars = 256 |
| 88 | +overlap = 50 |
200 | 89 | ```
|
201 | 90 |
|
202 |
| -For more detailed information on configuring chunking and other ingestion settings, please refer to the [Ingestion Configuration documentation](/documentation/configuration/ingestion/overview). |
203 |
| - |
204 |
| -## Next Steps |
| 91 | +Adjust the settings based on your specific requirements and the characteristics of your input documents. |
205 | 92 |
|
206 |
| -To learn more about configuring other components of R2R, explore the following pages: |
| 93 | +### Next Steps |
207 | 94 |
|
208 |
| -- [Embedding Configuration](/documentation/configuration/ingestion/embedding) |
209 |
| -- [Knowledge Graph Configuration](/documentation/configuration/knowledge-graph/overview) |
210 |
| -- [Retrieval Configuration](/documentation/configuration/retrieval/overview) |
| 95 | +- Learn more about [Embedding Configuration](/documentation/configuration/ingestion/embedding). |
| 96 | +- Explore [Knowledge Graph Configuration](/documentation/configuration/knowledge-graph/overview). |
| 97 | +- Check out [Retrieval Configuration](/documentation/configuration/retrieval/overview). |