v2.23.0 - 2025-02-17
- Support cuda:n GPU device allocation (#694) (77eb77b)
- xml-jats: Parse XML JATS documents (#967) (428b656)
v2.22.0 - 2025-02-14
- Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) (00d9405)
- Introduce the enable_remote_services option to allow remote connections while processing (#941) (2716c7d)
- Allow artifacts_path to be defined as ENV (#940) (5101e25)

- Update Pillow constraints (#958) (af19c03)
- Fix the initialization of the TesseractOcrModel (#935) (c47ae70)

- Update example Dockerfile with download CLI (#929) (7493d5b)
- Examples for picture descriptions (#951) (2d66e99)
v2.21.0 - 2025-02-10
v2.20.0 - 2025-02-07
v2.19.0 - 2025-02-07
- markdown: Handle nested lists (#910) (90b766e)
- Test cases for RTL programmatic PDFs and fixes for the formula model (#903) (9114ada)
- msword_backend: Handle conversion error in label parsing (#896) (722a6eb)
- Enrichment models batch size and expose picture classifier (#878) (5ad6de0)
v2.18.0 - 2025-02-03
- Expose equation exports (#869) (6a76b49)
- Add option to define page range (#852) (70d68b6)
- docx: Support of SDTs in docx backend (#853) (d727b04)
- Python 3.13 support (#841) (4df085a)

- markdown: Fix parsing if doc ending with table (#873) (5ac2887)
- markdown: Add support for HTML content (#855) (94751a7)
- docx: Merged table cells not properly converted (#857) (0cd81a8)
- Processing of placeholder shapes in pptx that have text but no bbox (#868) (eff16b6)
- KeyError in tableformer prediction (#854) (b1cf796)
- Fixed docx import with headers that are also lists (#842) (2c037ae)
- Use new add_code in html backend and add more typing hints (#850) (2a1f8af)
- markdown: Fix empty block handling (#843) (bccb022)
- Fix for the crash when encountering WMF images in pptx and docx (#837) (fea0a99)

- Updated the readme with upcoming features (#831) (d7c0828)
- Add example for inspection of picture content (#624) (f9144f2)
v2.17.0 - 2025-01-28
- CLI: Expose code and formula models in the CLI (#820) (6882e6c)
- Add platform info to CLI version printout (#816) (95b293a)
- ocr: Expose `rec_keys_path` in RapidOcrOptions to support custom dictionaries (#786) (5332755)
- Introduce automatic language detection in TesseractOcrCliModel (#800) (3be2fb5)

- Fix single newline handling in MD backend (#824) (5aed9f8)
- Use file extension if filetype fails with PDF (#827) (adf6353)
- Parse html with omitted body tag (#818) (a112d7a)

- Document Docling JSON parsing (#819) (6875913)
- Add SSL verification error mitigation (#821) (5139b48)
- backend XML: Do not delete temp file in notebook (#817) (4d41db3)
- Typo (#814) (8a4ec77)
- Added markdown headings to enable TOC in github pages (#808) (b885b2f)
- Description of supported formats and backends (#788) (c2ae1cc)
v2.16.0 - 2025-01-24
- New document picture classifier (#805) (16a218d)
- Add Docling JSON ingestion (#783) (88a0e66)
- Code and equation model for PDF and code blocks in markdown (#752) (3213b24)
- Add "auto" language for TesseractOcr (#759) (8543c22)

- Added extraction of byte-images in excel (#804) (a458e29)
- Update docling-parse-v2 backend version with new parsing fixes (#769) (670a08b)

- Fix minor typos (#801) (c58f75d)
- Add Azure RAG example (#675) (9020a93)
- Fix links between docs pages (#697) (c49b352)
- Fix correct Accelerator pipeline options in docs/examples/custom_convert.py (#733) (7686083)
- Example to translate documents (#739) (f7e1cbf)
v2.15.1 - 2025-01-10
- Improve OCR results, stricten criteria before dropping bitmap areas (#719) (5a060f2)
- Allow earlier requests versions (#716) (e64b5a2)
v2.15.0 - 2025-01-08
- Correct scaling of debug visualizations, tune OCR (#700) (5cb4cf6)
- Let BeautifulSoup detect the HTML encoding (#695) (42856fd)
- mspowerpoint: Handle invalid images in PowerPoint slides (#650) (d49650c)

- Specify docstring types (#702) (ead396a)
- Add link to rag with granite (#698) (6701f34)
- Add integrations, revamp docs (#693) (2d24fae)
- Add OpenContracts as an integration (#679) (569038d)
- Add Weaviate RAG recipe notebook (#451) (2b591f9)
- Document Haystack & Vectara support (#628) (fc645ea)
v2.14.0 - 2024-12-18
v2.13.0 - 2024-12-17
- Updated Layout processing with forms and key-value areas (#530) (60dc852)
- Create a backend to parse USPTO patents into DoclingDocument (#606) (4e08750)
- Add Easyocr parameter recog_network (#613) (3b53bd3)

- Add Haystack RAG example (#615) (3e599c7)
- Fix the path to the run_with_accelerator.py example (#608) (3bb3bf5)
v2.12.0 - 2024-12-13
v2.11.0 - 2024-12-12
- Do not import python modules from deepsearch-glm (#569) (aee9c0b)
- Handle no result from RapidOcr reader (#558) (f45499c)
- Make enum serializable with human-readable value (#555) (a7df337)
v2.10.0 - 2024-12-09
- Call into docling-core for legacy document transform (#551) (7972d47)
- Introduce Image format options in CLI. Silence the tqdm downloading messages. (#544) (78f61a8)
v2.9.0 - 2024-12-09
- Expose new hybrid chunker, update docs (#384) (c8ecdd9)
- MS Word backend: Make detection of headers and other styles localization agnostic (#534) (3e073df)

- Correcting DefaultText ID for MS Word backend (#537) (eb7ffcd)
- Add `py.typed` marker file (#531) (9102fe1)
- Enable HTML export in CLI and add options for image mode (#513) (0d11e30)
- Missing text in docx (t tag) when embedded in a table (#528) (b730b2d)
- Restore pydantic version pin after fixes (#512) (c830b92)
- Folder input in cli (#511) (8ada0bc)
v2.8.3 - 2024-12-03
v2.8.2 - 2024-12-03
- ParserError EOF inside string (#470) (#472) (c90c41c)
- PermissionError when using tesseract_ocr_cli_model (#496) (d3f84b2)

- Add styling for faq (#502) (5ba3807)
- Typo in faq (#484) (33cff98)
- Add automatic api reference (#475) (d487210)
- Introduce faq section (#468) (8ccb3c6)
v2.8.1 - 2024-11-29
v2.8.0 - 2024-11-27
- Use correct image index in word backend (#442) (767563b)
- Update tests and examples for docling-core 2.5.1 (#449) (29807a2)
v2.7.1 - 2024-11-26
v2.7.0 - 2024-11-20
v2.6.0 - 2024-11-19
- Added support for exporting DocItem to an image when page image is available (#379) (3f91e7d)
- Expose ocr-lang in CLI (#375) (ed785ea)
- Added excel backend (#334) (926dfd2)
- Extracting picture data for raster images found in PPTX (#349) (7a97d71)

- Fixing images in the input Word files (#330) (8533039)
- Reduce logging by keeping option for more verbose (#323) (8b437ad)

- Fixed typo in v2 example v2 (#378) (911c3bd)
- Add automatic generation of CLI reference (#325) (ca8524e)
- Add architecture outline (#341) (25fd149)
- Fix parameter in usage.md (#332) (835e077)
v2.5.2 - 2024-11-13
v2.5.1 - 2024-11-12
v2.5.0 - 2024-11-12
- OCR: Introduce the OcrOptions.force_full_page_ocr parameter that forces a full page OCR scanning (#290) (c6b3763)

- Configure env prefix for docling settings (#315) (5d4a10b)
- Added handling of grouped elements in pptx backend (#307) (81c8243)
- Allow mps usage for easyocr (#286) (97f214e)
v2.4.2 - 2024-11-08
- EasyOcrModel: Support the use_gpu pipeline parameter in EasyOcrModel. Initialize easyocr (#282) (0eb065e)
v2.4.1 - 2024-11-08
- tesserocr: Raise Exception if tesserocr has not loaded any languages (#279) (704d792)
- Dockerfile example copy command (#234) (90836db)

- Update badges & credits (#248) (a84ec27)
- Add coming-soon section (#235) (5ce02c5)
- Add artifacts-path param to CLI (#233) (d5e65ae)
v2.4.0 - 2024-11-04
- Add explicit artifacts path example (#224) (eeee3b4)
- Update custom convert and dockerfile (#226) (5f5fea9)
- Correct spelling of 'individual' (#219) (41acaa9)
- Update LlamaIndex docs (#196) (244ca69)
v2.3.1 - 2024-10-30
- Simplify torch dependencies and update pinned docling deps (#190) (eb679cc)
- Allow to explicitly initialize the pipeline (#189) (904d24d)
v2.3.0 - 2024-10-30
v2.2.1 - 2024-10-28
- Fix header levels for DOCX & HTML (#184) (b9f5c74)
- Handling of long sequence of unescaped underscore chars in markdown (#173) (94d0729)
- HTML backend, fixes for Lists and nested texts (#180) (7d19418)
- MD Backend, fixes to properly handle trailing inline text and emphasis in headers (#178) (88c1673)

- Update LlamaIndex docs for Docling v2 (#182) (2cece27)
- Fix batch convert (#177) (189d3c2)
- Add export with embedded images (#175) (8d356aa)
v2.2.0 - 2024-10-23
- Update to docling-parse v2 without history (#170) (4116819)
- Support AsciiDoc and Markdown input format (#168) (3023f18)
v2.1.0 - 2024-10-18
- Typo fix (#155) (f799e77)
- Add graphical band in readme (#154) (034a411)
- Add use docling (#150) (61c092f)
v2.0.0 - 2024-10-16
v1.20.0 - 2024-10-11
v1.19.1 - 2024-10-11
- Remove stderr from tesseract cli and introduce fuzziness in the text validation of OCR tests (#138) (dae2a3b)
v1.19.0 - 2024-10-08
v1.18.0 - 2024-10-03
v1.17.0 - 2024-10-03
v1.16.1 - 2024-09-27
v1.16.0 - 2024-09-27
v1.15.0 - 2024-09-24
v1.14.0 - 2024-09-24
v1.13.1 - 2024-09-23
v1.13.0 - 2024-09-18
v1.12.2 - 2024-09-17
v1.12.1 - 2024-09-16
v1.12.0 - 2024-09-13
v1.11.0 - 2024-09-10
v1.10.0 - 2024-09-10
v1.9.0 - 2024-09-03
v1.8.5 - 2024-08-30
v1.8.4 - 2024-08-30
v1.8.3 - 2024-08-28
v1.8.2 - 2024-08-27
v1.8.1 - 2024-08-26
v1.8.0 - 2024-08-23
v1.7.1 - 2024-08-23
- Better raise exception when a page fails to parse (#46) (8808463)
- Upgrade docling-parse to 1.1.1, safety checks for failed parse on pages (#45) (7e84533)