
Request for comment: batched inference API for high-throughput redaction #6

@solomonneas

Background

The current Python API exposes a single-input redaction surface:

  • opf.redact(text: str) -> str (module-level convenience)
  • OPF.redact(text: str, *, decode: DecodeOptions | None = None) -> str | RedactionResult
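For illustration, the current surface in use (the OPF() construction below is an assumption about the constructor, which I have not verified; only the two call signatures above are confirmed):

import opf

# Module-level convenience: plain string in, redacted string out.
clean: str = opf.redact("ticket from alice@example.com")

# Method form; per the signature above the return is str | RedactionResult.
engine = opf.OPF()  # assumed no-arg constructor, for illustration only
result = engine.redact("ticket from alice@example.com", decode=None)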

Internally, predict_text in opf/_core/runtime.py builds a batch-of-one tensor per call:

window_tokens = torch.tensor(
    [list(window.tokens)],
    device=runtime.device,
    dtype=torch.int32,
)

The CLI wrapper in opf/__main__.py also iterates inputs one at a time via iter_inputs().
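A batched forward pass would instead need to pad each example's windows to a common length and stack them. A rough sketch of what the tensor build could look like (build_batch, pad_id, and the mask handling are my assumptions, not existing code in opf):

import torch

def build_batch(
    windows: list[list[int]],
    pad_id: int,
    device: torch.device,
) -> tuple[torch.Tensor, torch.Tensor]:
    # Pad variable-length token windows to the longest one and stack them.
    max_len = max(len(w) for w in windows)
    token_ids = torch.full((len(windows), max_len), pad_id, dtype=torch.int32, device=device)
    mask = torch.zeros((len(windows), max_len), dtype=torch.bool, device=device)
    for i, w in enumerate(windows):
        token_ids[i, : len(w)] = torch.tensor(w, dtype=torch.int32, device=device)
        mask[i, : len(w)] = True
    return token_ids, mask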

Motivation

The README positions OPF for "high-throughput data sanitization workflows." For that use case, batch-size-1 inference leaves a lot on the table, particularly for this architecture:

  • The model is a sparse MoE (128 experts total, top-4 per token, 1.5B total / 50M active)
  • Attention is banded with band size 128 (effective window 257), so per-token cost is relatively stable and doesn't scale quadratically with sequence length
  • Throughput is largely bounded by expert dispatch/gather overhead rather than per-token compute, so amortizing that overhead across a batch should give a meaningful speedup on short-to-medium inputs.

Concretely, realistic workflows that would benefit:

  • Sanitizing a corpus of chat logs / support tickets / log lines (thousands of small inputs)
  • Pipeline preprocessors that redact in a streaming fashion
  • CI-style batch sweeps (e.g., find . -name "*.md" | xargs -P ... opf, which today serializes anyway because each invocation reloads the runtime)

Proposed scope (for discussion, not committing to any shape yet)

A public redact_many or redact_batch entrypoint:

def redact_many(
    self,
    texts: Sequence[str],
    *,
    decode: DecodeOptions | None = None,
    batch_size: int | None = None,
) -> list[str | RedactionResult]: ...

And optionally a matching CLI mode so cat inputs.txt | opf --stdin-mode line can batch internally instead of serializing window-by-window.
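To anchor the discussion, one naive shape for the Python side (purely illustrative: the _chunks helper and the default batch size are my inventions, and the interesting work would be replacing the inner per-item loop with a single batched forward pass in predict_text):

from collections.abc import Iterator, Sequence

def _chunks(texts: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    # Hypothetical helper: split the input into fixed-size chunks.
    for start in range(0, len(texts), size):
        yield texts[start : start + size]

def redact_many(self, texts, *, decode=None, batch_size=None):
    # Sketch of the method body as it might live on OPF.
    size = batch_size or 32  # a sensible default is itself an open question
    results = []
    for chunk in _chunks(texts, size):
        # Placeholder: today this would just call self.redact() per item;
        # the point of the proposal is one batched forward pass per chunk.
        results.extend(self.redact(text, decode=decode) for text in chunk)
    return results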

Open questions for maintainers

Before filing a PR, I would appreciate guidance on:

  1. Appetite: is a batched public API in-scope for this repo, or would you prefer users drive batching externally (e.g., construct their own batches via the private predict_text path)?
  2. API surface: redact_many(texts) returning a list, or a streaming generator redact_iter(texts) that yields each result as its windows complete?
  3. Batching axis: fixed batch_size vs. token-budget packing (pack until a budget of N tokens is reached; see the packing sketch after this list), vs. both?
  4. Windowing interaction: examples with different window counts complicate batching. Is it acceptable to pad the shorter examples, or is per-example sequential windowing with batched token-classification forward passes a better split?
  5. CLI exposure: should a batched mode be exposed via a flag (e.g. --batch-size N), or kept as Python-API-only initially?
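To clarify what I mean by token-budget packing in (3): a greedy packer that closes a batch once the padded cost (batch size times longest example) would exceed a budget. count_tokens is a hypothetical hook, not an existing opf API:

from collections.abc import Callable, Iterator, Sequence

def pack_by_token_budget(
    texts: Sequence[str],
    count_tokens: Callable[[str], int],  # hypothetical tokenizer hook
    max_tokens: int = 4096,
) -> Iterator[list[str]]:
    batch: list[str] = []
    max_len = 0
    for text in texts:
        n = count_tokens(text)
        new_max = max(max_len, n)
        # Padded cost of the batch if this example were added.
        if batch and (len(batch) + 1) * new_max > max_tokens:
            yield batch
            batch, max_len = [], 0
            new_max = n
        batch.append(text)
        max_len = new_max
    if batch:
        yield batch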

Happy to prototype whichever shape aligns with maintainer preference. I'd rather ask than submit a large PR that touches the public API in a direction you'd push back on.

Not requesting in this issue

  • Changes to the Viterbi decoder or label taxonomy
  • Async / multi-GPU / model-parallel inference
  • Any change to the default CLI single-input behavior
