Feature Request: Configurable request parallelism for model definition #256

@yossig

Description

Problem

Calls to LLMs from Flock functions such as llm_complete, llm_filter, and llm_embedding are dominated by long-running, high-latency IO to the LLM provider.
batch_size helps improve throughput, but in practice it has diminishing returns as batches grow (especially with smaller/faster models). At some point, increasing the batch size no longer improves total throughput and may even hurt latency or efficiency, depending on provider behavior.

This creates a throughput ceiling for large datasets and makes end-to-end runtime too long for many analytical workloads.

A dedicated parallelism control would allow fan-out of requests per DataChunk, significantly improving throughput (often by an order of magnitude depending on provider/model/setup), while keeping user control over resource usage and provider rate limits.

Proposal: Add a Model-Level parallelism Configuration

Suggested Interface Changes

  • Introduce an optional integer argument (parallelism) to model definitions, e.g.:
    CREATE MODEL(
      'my-model',
      'gpt-4o',
      'openai',
      {"batch_size": 16, "parallelism": 4}
    );
  • If not set, default to serial execution (parallelism=1).
  • When set to N>1, Flock function batching logic should process up to N batches concurrently per data chunk, for the relevant functions:
    • llm_complete
    • llm_filter
    • llm_embedding

Additional points to consider

  1. Function coverage - Most important for llm_complete, llm_filter, and llm_embedding; the aggregate functions could also be supported, but that is probably less critical.
  2. Ordering - Output row ordering must be preserved deterministically.
  3. Error handling - Clear handling of provider errors/timeouts/rate limits under parallel execution. Retry/backoff behavior should remain safe and predictable.

Implementation Notes

  • The current provider handler layer already supports executing multiple HTTP requests in parallel using libcurl's multi-handle. For an easy win, the batching logic in the scalar function implementations can be refactored to enqueue up to parallelism batch requests before invoking the provider handler's CollectCompletions, collect results as they complete, then re-order them by batch index.
  • A more robust and cleaner approach could be to introduce provider-level async/multiplexed request handling by adding a submission API plus completion polling/collection, then execute requests with bounded async HTTP concurrency controlled by parallelism.

Metadata

Labels

enhancement (New feature or request)
