Problem
Calling LLMs from Flock functions like `llm_complete`, `llm_filter`, and `llm_embedding` is bound by long-running, high-latency IO to the LLM provider.
`batch_size` helps improve throughput, but in practice it has diminishing returns as batches grow (especially with smaller/faster models). At some point, increasing the batch size no longer improves total throughput and may even hurt latency/efficiency depending on provider behavior.
This creates a throughput ceiling for large datasets and makes end-to-end runtime too long for many analytical workloads.
A dedicated parallelism control would allow fan-out of requests per DataChunk, significantly improving throughput (often by an order of magnitude depending on provider/model/setup), while keeping user control over resource usage and provider rate limits.
Proposal: Add a Model-Level `parallelism` Configuration
Suggested Interface Changes
- Introduce an optional integer argument (`parallelism`) to model definitions, e.g.:

```sql
CREATE MODEL(
    'my-model',
    'gpt-4o',
    'openai',
    {"batch_size": 16, "parallelism": 4}
);
```

- If not set, default to serial execution (`parallelism = 1`).
- When set to N > 1, Flock's function batching logic should process up to N batches concurrently per data chunk for the relevant functions: `llm_complete`, `llm_filter`, and `llm_embedding`.
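To make the intended semantics concrete, here is a minimal sketch (not Flock's actual internals; `CompleteBatch` is a hypothetical stand-in for one provider batch request) of splitting a chunk into batches, running up to `parallelism` of them concurrently, and flattening results back in the original row order:

```cpp
#include <algorithm>
#include <future>
#include <string>
#include <vector>

// Hypothetical stand-in for one LLM batch request to the provider.
std::vector<std::string> CompleteBatch(const std::vector<std::string> &rows) {
    std::vector<std::string> out;
    for (const auto &row : rows) {
        out.push_back("completed:" + row);
    }
    return out;
}

// Split a chunk into batches of `batch_size`, run up to `parallelism`
// batches at a time, and flatten results in the original row order.
std::vector<std::string> ProcessChunk(const std::vector<std::string> &rows,
                                      size_t batch_size, size_t parallelism) {
    std::vector<std::vector<std::string>> batches;
    for (size_t i = 0; i < rows.size(); i += batch_size) {
        batches.emplace_back(
            rows.begin() + i,
            rows.begin() + std::min(i + batch_size, rows.size()));
    }
    std::vector<std::vector<std::string>> results(batches.size());
    for (size_t start = 0; start < batches.size(); start += parallelism) {
        size_t end = std::min(start + parallelism, batches.size());
        std::vector<std::future<std::vector<std::string>>> inflight;
        for (size_t b = start; b < end; ++b) {
            inflight.push_back(std::async(std::launch::async, CompleteBatch,
                                          std::cref(batches[b])));
        }
        // Each result lands at its batch index, so completion order is
        // irrelevant to the final output order.
        for (size_t b = start; b < end; ++b) {
            results[b] = inflight[b - start].get();
        }
    }
    std::vector<std::string> flat;
    for (auto &r : results) {
        flat.insert(flat.end(), r.begin(), r.end());
    }
    return flat;
}
```

Note that with `parallelism = 1` this degenerates to the current serial behavior, which keeps the default path unchanged.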
Additional points to consider
- Function coverage: most important for `llm_complete`, `llm_filter`, and `llm_embedding`; it could also be supported for aggregate functions, but that is probably less critical.
- Ordering: output row ordering must be preserved deterministically.
- Error handling: provider errors, timeouts, and rate limits need clear handling under parallel execution. Retry/backoff behavior should remain safe and predictable.
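As an illustration of the "safe and predictable" retry behavior above, here is a minimal sketch of bounded retry with exponential backoff for a single request. `RequestWithRetry` and its parameters are hypothetical, not existing Flock code:

```cpp
#include <chrono>
#include <functional>
#include <stdexcept>
#include <string>
#include <thread>

// Retries `request` up to `max_attempts` times, doubling the delay after
// each failure; rethrows the last error if every attempt fails. Because
// each in-flight batch retries independently with a bounded attempt count,
// behavior stays predictable even when many batches run in parallel.
std::string RequestWithRetry(const std::function<std::string()> &request,
                             int max_attempts,
                             std::chrono::milliseconds initial_delay) {
    std::chrono::milliseconds delay = initial_delay;
    for (int attempt = 1;; ++attempt) {
        try {
            return request();
        } catch (const std::exception &) {
            if (attempt >= max_attempts) {
                throw;  // give up; the caller surfaces the error
            }
            std::this_thread::sleep_for(delay);
            delay *= 2;  // exponential backoff between attempts
        }
    }
}
```

A production version would likely also honor provider `Retry-After` headers and add jitter so parallel batches do not retry in lockstep.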
Implementation Notes
- The current provider handler layer already supports executing multiple HTTP requests in parallel using libcurl's multi-handle. As an easy win, the batching logic in the scalar function implementations can be refactored to enqueue up to `parallelism` batch requests before invoking the provider handler's `CollectCompletions`, collect results as they complete, then re-order them by batch index.
- A more robust and cleaner approach would be to introduce provider-level async/multiplexed request handling by adding a submission API plus completion polling/collection, then execute requests with bounded async HTTP concurrency controlled by `parallelism`.
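The second approach could look roughly like the following sketch (all names are hypothetical, and the real HTTP dispatch via libcurl is replaced by a plain callable): a submission API that tags each request with its batch index, worker threads bounded by `parallelism`, and a collection call that blocks until everything completes and returns results re-ordered by batch index:

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

class AsyncProvider {
public:
    // Concurrency is bounded by the number of worker threads.
    explicit AsyncProvider(size_t parallelism) {
        for (size_t i = 0; i < parallelism; ++i) {
            workers_.emplace_back([this] { WorkerLoop(); });
        }
    }
    ~AsyncProvider() {
        {
            std::lock_guard<std::mutex> lock(mu_);
            done_ = true;
        }
        cv_.notify_all();
        for (auto &w : workers_) w.join();
    }
    // Submission API: enqueue a request tagged with its batch index.
    void Submit(size_t batch_index, std::function<std::string()> request) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            pending_.push_back({batch_index, std::move(request)});
        }
        cv_.notify_one();
    }
    // Completion collection: block until `total` requests have finished,
    // then return their results ordered by batch index.
    std::vector<std::string> Collect(size_t total) {
        std::unique_lock<std::mutex> lock(mu_);
        collected_cv_.wait(lock, [&] { return completed_.size() == total; });
        std::vector<std::string> ordered(total);
        for (auto &c : completed_) ordered[c.first] = std::move(c.second);
        completed_.clear();
        return ordered;
    }

private:
    struct Task {
        size_t index;
        std::function<std::string()> fn;
    };
    void WorkerLoop() {
        for (;;) {
            Task task;
            {
                std::unique_lock<std::mutex> lock(mu_);
                cv_.wait(lock, [&] { return done_ || !pending_.empty(); });
                if (done_ && pending_.empty()) return;
                task = std::move(pending_.front());
                pending_.pop_front();
            }
            std::string result = task.fn();  // runs outside the lock
            {
                std::lock_guard<std::mutex> lock(mu_);
                completed_.emplace_back(task.index, std::move(result));
            }
            collected_cv_.notify_all();
        }
    }
    std::mutex mu_;
    std::condition_variable cv_, collected_cv_;
    std::deque<Task> pending_;
    std::vector<std::pair<size_t, std::string>> completed_;
    std::vector<std::thread> workers_;
    bool done_ = false;
};
```

In a real implementation the worker pool would be replaced by libcurl multi-handle polling, but the submit/collect surface and index-based re-ordering would stay the same.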