Add support for async DocParse calls in aryn-sdk #1116

Merged
merged 58 commits into from
Jan 24, 2025
Changes from 33 commits
d6f8cb4
Add `partition_file_submit_async`
MarkLindblad Jan 16, 2025
2216c01
Add `partition_file_result_async`
MarkLindblad Jan 16, 2025
eb327a8
Fix linting
MarkLindblad Jan 16, 2025
fec3c51
Add docstring, return type for `partition_file_submit_async`
MarkLindblad Jan 16, 2025
dc8059f
Fix linting
MarkLindblad Jan 16, 2025
735c828
Make `partition_file_result_async` easier to use
MarkLindblad Jan 16, 2025
f2ebd14
Add `partition_file_result_async` docstring
MarkLindblad Jan 16, 2025
cc0ce7c
Fix linting
MarkLindblad Jan 16, 2025
c80bbdd
Add tests
MarkLindblad Jan 16, 2025
85678fb
Add general error when unexpected http code encountered
MarkLindblad Jan 17, 2025
a220bef
Fix linting
MarkLindblad Jan 17, 2025
784122e
Rename `NoSuchAsyncPartitionerJob` to `NoSuchAsyncPartitionerJobError`
MarkLindblad Jan 17, 2025
e08bd74
Fix `test_partition` with current exception (PartitionError not Value…
MarkLindblad Jan 17, 2025
6c4b9dd
Fix expected type of Exception in `test_partition_it_zero_page`
MarkLindblad Jan 18, 2025
3d73b60
Update reference json for `test_partition_it` integ test
MarkLindblad Jan 18, 2025
2ed99a2
Add support for positional arguments in `partition_file_submit_async`
MarkLindblad Jan 18, 2025
aae7356
Add explanatory comments
MarkLindblad Jan 18, 2025
bcd92af
Add support for positional arguments in `partition_file_submit_async`
MarkLindblad Jan 18, 2025
3e2c970
Fix handling of positional and keyword arguments in `partition_file_s…
MarkLindblad Jan 18, 2025
c68cda6
Fix linting
MarkLindblad Jan 18, 2025
3d2eac3
Move async tests together
MarkLindblad Jan 18, 2025
e5d2f88
Add url forwarding validation
MarkLindblad Jan 18, 2025
d4b072c
Add `test_partition_file_submit_async` unit test
MarkLindblad Jan 18, 2025
e55687e
Fix mypy issues
MarkLindblad Jan 18, 2025
a82b90c
Fix relevant docstrings. Add a multi-job example
MarkLindblad Jan 18, 2025
c38418d
Add async examples to readme
MarkLindblad Jan 18, 2025
d081f9c
Narrow excepted Exceptions in multi-document async examples
MarkLindblad Jan 18, 2025
cd94238
Add synchronous test checking unsupported file format behavior
MarkLindblad Jan 22, 2025
ab6aa3f
Make `partition_file_submit_async` response more intuitive
MarkLindblad Jan 22, 2025
49a393b
Fix examples in README
MarkLindblad Jan 22, 2025
ddf6b74
Add test that checks multiple simultaneous async requests
MarkLindblad Jan 23, 2025
fc4e95a
Add cancel API, cancel test, and reduce wait between polls in test
MarkLindblad Jan 23, 2025
df9d70f
Add support for webhooks in async requests
MarkLindblad Jan 23, 2025
a9d8749
Add webhook instructions to README
MarkLindblad Jan 23, 2025
9b2cbd5
Fix examples
MarkLindblad Jan 23, 2025
1186025
Change TOKEN to API-KEY in example
MarkLindblad Jan 23, 2025
24d43eb
Change `job` to `response` in examples, fix multi-job examples
MarkLindblad Jan 24, 2025
feca8d5
Improve single async job example
MarkLindblad Jan 24, 2025
28141eb
Make `test_smoke_webhook` more strict
MarkLindblad Jan 24, 2025
da8259c
Make two-call loop into one-call loop in `test_partition_file_async`
MarkLindblad Jan 24, 2025
aa56a33
Remove call to `split`
MarkLindblad Jan 24, 2025
b9f7b0f
Remove unneeded parameter from `.decode`
MarkLindblad Jan 24, 2025
2c435d7
Move stream setting logic into function
MarkLindblad Jan 24, 2025
011527e
Rename async functions to have standard form
MarkLindblad Jan 24, 2025
b7be7a8
Rename `_partition_file_inner`'s `_webhook_dest_url` to just `webhook…
MarkLindblad Jan 24, 2025
e16f926
Fix function names in README, change double-call loop to single-call …
MarkLindblad Jan 24, 2025
e572033
Make return type of `partition_file_async_submit` more strict
MarkLindblad Jan 24, 2025
4fd7721
Remove pydantic, enum
MarkLindblad Jan 24, 2025
14c146e
Remove use of `inspect`, add * barrier to parameters, parse urls
MarkLindblad Jan 24, 2025
4df23ce
Change two-call loop to single-call loop in `test_partition_file_asyn…
MarkLindblad Jan 24, 2025
75646ba
Change two-call loop to single-call loop in `test_multiple_partition_…
MarkLindblad Jan 24, 2025
e2edada
Fix and improve async aryn-sdk examples
MarkLindblad Jan 24, 2025
a0ae467
Treat all 2** status codes as successful
MarkLindblad Jan 24, 2025
cd31818
Simplify mocked function signature
MarkLindblad Jan 24, 2025
61a21af
Add documentation for orientation correction, add test
MarkLindblad Jan 24, 2025
419a63b
Add `partition_file_async_list` to `aryn-sdk`, multi-test
MarkLindblad Jan 24, 2025
b9abcd0
Make return type of `partition_file_async_result` more strict
MarkLindblad Jan 24, 2025
0ae3757
Make return type of `partition_file_async_list` more strict
MarkLindblad Jan 24, 2025
61 changes: 61 additions & 0 deletions lib/aryn-sdk/README.md
@@ -96,3 +96,64 @@ pil_img = convert_image_element(image_elts[0])
jpg_bytes = convert_image_element(image_elts[1], format='JPEG')
png_str = convert_image_element(image_elts[2], format="PNG", b64encode=True)
```

### Async Aryn DocParse

#### Single Job Example
```python
import time
from aryn_sdk.partition import partition_file_submit_async, partition_file_result_async, JobStatus

with open("my-favorite-pdf.pdf", "rb") as f:
    job = partition_file_submit_async(
        f,
        use_ocr=True,
        extract_table_structure=True,
    )

job_id = job["job_id"]

# Poll until the job leaves the IN_PROGRESS state
result = partition_file_result_async(job_id)
while result.status == JobStatus.IN_PROGRESS:
    time.sleep(5)
    result = partition_file_result_async(job_id)
```
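
The polling loop above runs until the job leaves `IN_PROGRESS`, so a stuck job would block forever. A minimal sketch of a bounded variant follows. `wait_for_result` and its parameters are hypothetical helpers written for this illustration and are not part of `aryn-sdk`; the fetch and status checks are passed in as callables so the helper makes no assumptions about the SDK's return types.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def wait_for_result(
    fetch: Callable[[], T],
    in_progress: Callable[[T], bool],
    *,
    interval: float = 5.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() until in_progress(result) is false or the timeout elapses.

    Hypothetical helper for illustration; not an aryn-sdk function.
    """
    # Use a monotonic clock so wall-clock adjustments cannot extend the wait.
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if not in_progress(result):
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still in progress after {timeout} seconds")
        sleep(interval)
```

With the names from the example above, a call might look like `wait_for_result(lambda: partition_file_result_async(job_id), lambda r: r.status == JobStatus.IN_PROGRESS)`. The injectable `sleep` argument exists only so the helper can be tested without real delays.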

#### Multi-Job Example

```python
import logging
import time
from aryn_sdk.partition import partition_file_submit_async, partition_file_result_async, JobStatus, PartitionError

files = [open("file1.pdf", "rb"), open("file2.docx", "rb")]
job_ids = {}
for i, f in enumerate(files):
    try:
        # Keep only the job ID from the submit response
        job_ids[i] = partition_file_submit_async(f)["job_id"]
    except Exception as e:
        logging.warning(f"Failed to submit {f}: {e}")

results = {}
for i, job_id in job_ids.items():
    # Poll each job in turn until it leaves the IN_PROGRESS state
    result = partition_file_result_async(job_id)
    while result.status == JobStatus.IN_PROGRESS:
        time.sleep(5)
        result = partition_file_result_async(job_id)
    results[i] = result
```
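
The multi-job loop above waits for each job to finish before checking the next one. One possible alternative is to check every pending job once per round and sleep only between rounds, so that jobs which finish early are collected early. The sketch below assumes nothing about the SDK beyond "a function that takes a job ID and returns a result"; `collect_results`, `fetch`, and `in_progress` are hypothetical names introduced for this illustration.

```python
import time
from typing import Callable, Dict, TypeVar

K = TypeVar("K")
R = TypeVar("R")


def collect_results(
    job_ids: Dict[K, str],
    fetch: Callable[[str], R],
    in_progress: Callable[[R], bool],
    *,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[K, R]:
    """Poll all pending jobs once per round until none are in progress.

    Hypothetical helper for illustration; not an aryn-sdk function.
    """
    pending = dict(job_ids)  # copy so the caller's mapping is not mutated
    results: Dict[K, R] = {}
    while pending:
        # Iterate over a snapshot because finished jobs are removed in the loop.
        for key, job_id in list(pending.items()):
            result = fetch(job_id)
            if not in_progress(result):
                results[key] = result
                del pending[key]
        if pending:
            sleep(interval)
    return results
```

With the names from the example above, `fetch` would be `partition_file_result_async` and `in_progress` would be `lambda r: r.status == JobStatus.IN_PROGRESS`. This sketch has no timeout, so a job that never leaves the in-progress state keeps the loop running.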

#### Cancelling an async job

```python
from aryn_sdk.partition import partition_file_submit_async, cancel_async_partition_job

job_id = partition_file_submit_async(
    "path/to/file.pdf",
    use_ocr=True,
    extract_table_structure=True,
    extract_images=True,
)["job_id"]

cancel_async_partition_job(job_id)
```
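
A common reason to cancel is that the calling code gave up, for example because of an exception or a Ctrl-C while waiting. A minimal sketch of that pattern is a context manager that cancels the job only when its block exits with an error. `cancel_on_error` is a hypothetical helper, not part of `aryn-sdk`, and the sketch does not assume anything about how `cancel_async_partition_job` behaves for jobs that have already finished.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def cancel_on_error(job_id: str, cancel: Callable[[str], object]) -> Iterator[str]:
    """Yield job_id; call cancel(job_id) if the with-block raises.

    Hypothetical helper for illustration; not an aryn-sdk function.
    """
    try:
        yield job_id
    except BaseException:
        # BaseException also covers KeyboardInterrupt, so Ctrl-C cancels the job.
        cancel(job_id)
        raise
```

With the names from the example above, usage might look like `with cancel_on_error(job_id, cancel_async_partition_job): ...`, with the polling code placed inside the block. On a normal exit the job is left untouched.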
16 changes: 15 additions & 1 deletion lib/aryn-sdk/aryn_sdk/partition/__init__.py
@@ -1,4 +1,14 @@
-from .partition import partition_file, tables_to_pandas, table_elem_to_dataframe, convert_image_element, PartitionError
+from .partition import (
+    partition_file,
+    partition_file_submit_async,
+    partition_file_result_async,
+    cancel_async_partition_job,
+    tables_to_pandas,
+    table_elem_to_dataframe,
+    convert_image_element,
+    PartitionError,
+    JobStatus,
+)
 from .art import draw_with_boxes

 __all__ = [
@@ -8,4 +18,8 @@
     "draw_with_boxes",
     "convert_image_element",
     "PartitionError",
+    "partition_file_submit_async",
+    "partition_file_result_async",
+    "JobStatus",
+    "cancel_async_partition_job",
 ]