Feature Request: Configurable request parallelism for model definition #256

@yossig

Description

Problem

Calls to LLMs from Flock functions such as llm_complete, llm_filter, and llm_embedding are dominated by long-running, high-latency IO to the LLM provider.
batch_size helps improve throughput, but in practice it has diminishing returns as batches grow (especially with smaller/faster models). At some point, increasing the batch size no longer improves total throughput and may even hurt latency or efficiency, depending on provider behavior.

This creates a throughput ceiling for large datasets and makes end-to-end runtime too long for many analytical workloads.

A dedicated parallelism control would allow fan-out of requests per DataChunk, significantly improving throughput (often by an order of magnitude depending on provider/model/setup), while keeping user control over resource usage and provider rate limits.

Proposal: Add a Model-Level parallelism Configuration

Suggested Interface Changes

  • Introduce an optional integer argument (parallelism) to model definitions, e.g.:
    CREATE MODEL(
      'my-model',
      'gpt-4o',
      'openai',
      {"batch_size": 16, "parallelism": 4}
    );
  • If not set, default to serial execution (parallelism=1).
  • When set to N>1, Flock function batching logic should process up to N batches concurrently per data chunk, for the relevant functions:
    • llm_complete
    • llm_filter
    • llm_embedding

Additional points to consider

  1. Function coverage - Most important for llm_complete, llm_filter, and llm_embedding; the aggregate functions could also be supported, but that is probably less critical.
  2. Ordering - Output row ordering must be preserved deterministically.
  3. Error handling - Clear handling of provider errors/timeouts/rate limits under parallel execution. Retry/backoff behavior should remain safe and predictable.

Implementation Notes

  • The current provider handler layer already supports executing multiple HTTP requests in parallel using libcurl's multi-handle. For an easy win, the batching logic in the scalar function implementations can be refactored to enqueue up to parallelism batch requests before invoking the provider handler's CollectCompletions, collect results as they complete, then re-order them by batch index.
  • A more robust and cleaner approach could be to introduce provider-level async/multiplexed request handling by adding a submission API plus completion polling/collection, then execute requests with bounded async HTTP concurrency controlled by parallelism.

Metadata

Labels

enhancement (New feature or request)
