feat: AsyncPipeline that can schedule components to run concurrently #8812

Merged
mathislucka merged 99 commits into main from feat/async_pipeline
Feb 7, 2025

Conversation

mathislucka
Member

@mathislucka mathislucka commented Feb 4, 2025

Related Issues

Proposed Changes:

Implements an AsyncPipeline that supports:

  • running pipelines asynchronously
  • step-by-step execution through an async generator
  • concurrent execution of components whenever possible (e.g. hybrid retrieval, multiple generators that can run in parallel)
  • sync run-method with concurrent execution of components
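
To illustrate the kind of scheduling described in the list above, here is a minimal asyncio sketch. It does not use Haystack or the AsyncPipeline API; `retriever`, `run_steps`, `run_all`, and `run_sync` are hypothetical names made up for this example. Two independent stand-in components are started together, and an async generator yields each output as soon as it is ready, which is roughly the idea behind step-by-step and concurrent execution.

```python
import asyncio


async def retriever(name: str, delay: float) -> dict:
    # Hypothetical stand-in for an I/O-bound component such as a retriever.
    await asyncio.sleep(delay)
    return {name: [f"{name}-doc"]}


async def run_steps(delays: dict):
    # Start every independent component right away so their waits overlap.
    tasks = [asyncio.create_task(retriever(n, d)) for n, d in delays.items()]
    # Yield each partial output as soon as the corresponding task finishes.
    for finished in asyncio.as_completed(tasks):
        yield await finished


async def run_all(delays: dict) -> list:
    # Collect the partial outputs in completion order.
    return [partial async for partial in run_steps(delays)]


def run_sync(delays: dict) -> list:
    # Synchronous entry point that still runs the components concurrently.
    return asyncio.run(run_all(delays))
```

Calling `run_sync({"bm25": 0.2, "embedding": 0.05})` yields the faster stand-in component first even though it was declared second, which would not happen if the components ran one after another.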

How did you test it?

  • unit tests
  • adapted behavioral tests to use Pipeline and AsyncPipeline

Notes for the reviewer

Review after #8707
Code was reviewed here before: deepset-ai/haystack-experimental#180

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue

# Conflicts:
#	releasenotes/notes/fix-pipeline-run-2fefeafc705a6d91.yaml
#	test/conftest.py
#	test/core/pipeline/features/conftest.py
#	test/core/pipeline/features/pipeline_run.feature
#	test/core/pipeline/features/test_run.py
#	test/core/pipeline/test_component_checks.py
#	test/core/pipeline/test_pipeline.py
#	test/core/pipeline/test_pipeline_base.py
@mathislucka mathislucka requested review from davidsbatista and Amnah199 and removed request for vblagoje February 6, 2025 15:34
@mathislucka
Member Author

@Amnah199 @davidsbatista much smaller diff now that the other PR is merged.

This is largely the same as the PR that we already merged to experimental with the following differences:

  • fixed bug where we didn't wait long enough for DEFER(_LAST)
  • added pipeline type to telemetry

@@ -23,6 +23,7 @@
"default_to_dict",
"DeserializationError",
"ComponentError",
"AsyncPipeline",
Contributor

(nit) suggestion: keeping these imports ordered alphabetically helps locate something as the list grows

__all__ = [
    "Answer",
    "AsyncPipeline",
    "ComponentError",
    "DeserializationError",
    "Document",
    "ExtractedAnswer",
    "GeneratedAnswer",
    "Pipeline",
    "PredefinedPipeline",
    "component",
    "default_from_dict",
    "default_to_dict",
]
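
As a small illustration of the ordering suggested above, the following sketch checks that a list of exported names is sorted. `EXPORTED_NAMES` and `is_sorted` are hypothetical helpers for this example, not part of Haystack. Python's default string ordering places uppercase names before lowercase ones, which matches the order in the suggestion.

```python
# Names copied from the suggested __all__ above.
EXPORTED_NAMES = [
    "Answer",
    "AsyncPipeline",
    "ComponentError",
    "DeserializationError",
    "Document",
    "ExtractedAnswer",
    "GeneratedAnswer",
    "Pipeline",
    "PredefinedPipeline",
    "component",
    "default_from_dict",
    "default_to_dict",
]


def is_sorted(names: list) -> bool:
    # Compare against Python's default (code point) string ordering.
    return list(names) == sorted(names)
```

A check like `assert is_sorted(EXPORTED_NAMES)` could live in a unit test if the project wanted to enforce the ordering; whether that is worthwhile is a judgment call.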

@davidsbatista
Contributor

I did another quick review, although most of this was already reviewed before.

From my side it's approved, but to play it safe, let's wait for Amna to also do another quick review before merging.

Contributor

@davidsbatista davidsbatista left a comment

LGTM

Comment on lines 12 to 14
async_loop = asyncio.new_event_loop()
asyncio.set_event_loop(async_loop)

Contributor

Here as well, we can avoid manual handling of loops by using asyncio.run, if you feel that would be better.
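
For comparison, a minimal sketch of the two styles side by side. It uses only the standard library; `wait_for`, `run_all`, `run_with_manual_loop`, and `run_with_asyncio_run` are hypothetical names for this example and do not come from the PR.

```python
import asyncio


async def wait_for(seconds: float) -> float:
    # Stand-in for an awaitable unit of work.
    await asyncio.sleep(seconds)
    return seconds


async def run_all(delays: list) -> list:
    # Run all waits concurrently; gather preserves the input order.
    return await asyncio.gather(*(wait_for(d) for d in delays))


def run_with_manual_loop(delays: list) -> list:
    # Manual handling: create, register, run, and always close the loop.
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    try:
        return loop.run_until_complete(run_all(delays))
    finally:
        loop.close()
        asyncio.set_event_loop(None)


def run_with_asyncio_run(delays: list) -> list:
    # asyncio.run creates a fresh loop, runs the coroutine, and closes the loop.
    return asyncio.run(run_all(delays))
```

Both functions return the same result; the `asyncio.run` variant simply moves the loop setup and teardown into the standard library.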

Comment on lines 6 to 26
def test_async_pipeline_reentrance(waiting_component, spying_tracer):
    pp = AsyncPipeline()
    pp.add_component("wait", waiting_component())

    run_data = [{"wait_for": 1}, {"wait_for": 2}]

    async_loop = asyncio.new_event_loop()
    asyncio.set_event_loop(async_loop)

    async def run_all():
        # Create concurrent tasks for each pipeline run
        tasks = [pp.run_async(data) for data in run_data]
        await asyncio.gather(*tasks)

    try:
        async_loop.run_until_complete(run_all())
        component_spans = [sp for sp in spying_tracer.spans if sp.operation_name == "haystack.component.run_async"]
        for span in component_spans:
            assert span.tags["haystack.component.visits"] == 1
    finally:
        async_loop.close()
Contributor

@Amnah199 Amnah199 Feb 7, 2025

Something like this? (although I didn't test it)

Suggested change
def test_async_pipeline_reentrance(waiting_component, spying_tracer):
    """
    Test that the AsyncPipeline can execute multiple runs concurrently and that
    each component is called exactly once per run (as indicated by the 'visits' tag).
    """
    async_pipeline = AsyncPipeline()
    async_pipeline.add_component("wait", waiting_component())
    run_data = [{"wait_for": 1}, {"wait_for": 2}]

    async def run_all():
        tasks = [async_pipeline.run_async(data) for data in run_data]
        await asyncio.gather(*tasks)
        component_spans = [
            sp for sp in spying_tracer.spans
            if sp.operation_name == "haystack.component.run_async"
        ]
        for span in component_spans:
            expected_visits = 1
            actual_visits = span.tags.get("haystack.component.visits")
            assert actual_visits == expected_visits, (
                f"Expected {expected_visits} visit, got {actual_visits} for span {span}"
            )

    # Use asyncio.run to manage the event loop.
    asyncio.run(run_all())
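
For readers without the test fixtures at hand, here is a self-contained sketch of the same idea. `ToyPipeline` is a hypothetical stand-in, not Haystack's AsyncPipeline, and its `spans` list only imitates what the `spying_tracer` fixture records. The assumption being illustrated is that per-run state (the visit count) is kept local to each run, so concurrent runs on one pipeline object do not interfere.

```python
import asyncio


class ToyPipeline:
    """Hypothetical stand-in pipeline used only for this illustration."""

    def __init__(self):
        # Plays the role of the spans collected by a spying tracer.
        self.spans = []

    async def run_async(self, data: dict) -> dict:
        visits = 0  # per-run state, deliberately not stored on self
        visits += 1
        # Other runs can interleave here while this one is waiting.
        await asyncio.sleep(data["wait_for"])
        self.spans.append({"wait_for": data["wait_for"], "visits": visits})
        return {"waited": data["wait_for"]}


async def run_concurrently(pipeline: ToyPipeline, run_data: list) -> list:
    # Schedule all runs on the same pipeline object at once.
    return await asyncio.gather(*(pipeline.run_async(d) for d in run_data))


def check_reentrance(run_data: list) -> list:
    pipeline = ToyPipeline()
    results = asyncio.run(run_concurrently(pipeline, run_data))
    # One span per run, and each run visited the component exactly once.
    assert len(pipeline.spans) == len(run_data)
    assert all(span["visits"] == 1 for span in pipeline.spans)
    return results
```

If the visit counter were stored on the pipeline instance instead, the second run would observe the first run's count, which is the kind of bug a reentrance test is meant to catch.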

Contributor

@Amnah199 Amnah199 left a comment

LGTM! Thanks again @mathislucka.
Much appreciated!

@mathislucka mathislucka merged commit e5b9bde into main Feb 7, 2025
18 checks passed
@mathislucka mathislucka deleted the feat/async_pipeline branch February 7, 2025 15:37
Linked issue (may be closed by merging this pull request):

components should run concurrently when not explicitly waiting on inputs
4 participants