feat: support token-based budget in LostInTheMiddleRanker #11351

Description

@Aarkin7

Is your feature request related to a problem? Please describe.
I was reading through LostInTheMiddleRanker and noticed the ranker's word_count_threshold parameter is implemented with content.split() — so it counts whitespace-separated words — but I'd guess what most users actually care about is the downstream LLM's context budget, which is measured in tokens.

The two are very different in practice. English prose is roughly 1.3 tokens per word, but code-heavy documents can be 2.5–4 tokens per word (each "(", ".", or "=" tends to become its own token), non-Latin scripts are similarly inflated, and things like URLs or hashes can be 5–15 tokens for a single "word". So if I set word_count_threshold=8000 thinking I'm targeting an 8k context window, I might actually be sending 20k+ tokens downstream. That either hits a hard API error or, worse, produces a silently truncated prompt where the LLM answers without context I thought I'd given it. It's a quiet failure mode that's hard to notice until something goes wrong.
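
A rough way to see the gap without installing a tokenizer: the sketch below compares whitespace-separated words against a crude regex split that separates punctuation from alphanumeric runs. The regex is only an illustrative stand-in (real BPE tokenizers such as tiktoken's split text differently), and `pieces_per_word` is a hypothetical helper name.

```python
import re

# Crude proxy for tokenization: runs of word characters, or single punctuation marks.
# This is NOT a real tokenizer; it only illustrates why punctuation-heavy text
# yields more pieces per whitespace-separated "word".
_PIECE = re.compile(r"\w+|[^\w\s]")


def pieces_per_word(text: str) -> float:
    """Return the ratio of regex pieces to whitespace-separated words."""
    words = text.split()
    if not words:
        return 0.0
    return len(_PIECE.findall(text)) / len(words)


prose = "The quick brown fox jumps over the lazy dog"
code = "result = model.predict(x[0], y=2.5)"

print(f"prose: {pieces_per_word(prose):.2f} pieces per word")
print(f"code:  {pieces_per_word(code):.2f} pieces per word")
```

Even with this simplified split, the code line produces several times more pieces per "word" than the prose line, which is the same direction of error a word-based threshold makes against a token budget.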

Describe the solution you'd like
A small, additive change: a new keyword-only count_mode: Literal["word", "char", "token"] = "word" parameter on LostInTheMiddleRanker, plus an optional tokenizer_encoding string for when token mode is selected. Default stays "word" so existing users see no behavior change at all.
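
One possible shape for the counting logic, as a minimal standalone sketch rather than Haystack's actual implementation. `count_units` and its `encode` argument are hypothetical names; in token mode the caller would pass something like the `encode` method of a tiktoken encoding.

```python
from typing import Callable, Literal, Optional, Sequence

CountMode = Literal["word", "char", "token"]


def count_units(
    text: str,
    mode: CountMode = "word",
    encode: Optional[Callable[[str], Sequence[int]]] = None,
) -> int:
    """Count the size of `text` in the unit selected by `mode` (hypothetical helper)."""
    if mode == "word":
        # Same idea as the current behavior described above: whitespace-separated words.
        return len(text.split())
    if mode == "char":
        return len(text)
    if mode == "token":
        if encode is None:
            raise ValueError("count_mode='token' requires an encode callable")
        return len(encode(text))
    raise ValueError(f"Unknown count_mode: {mode!r}")


# Default mode keeps word counting, so existing callers would see no change.
print(count_units("fitting content inside a budget"))
print(count_units("fitting content inside a budget", mode="char"))
```

Passing the tokenizer in as a callable keeps the sketch free of optional dependencies; the real component would more likely hold the tokenizer on the instance.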

The nice thing is that there's already a precedent for exactly this pattern in the codebase - RecursiveDocumentSplitter uses the same Literal["word", "char", "token"] signature, lazy-imports tiktoken, defaults to the o200k_base encoding, and initializes the tokenizer in warm_up(). So this proposal would just be applying an already-accepted pattern to a sister component that needs it for the same reason: fitting content inside an LLM's real budget.
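
The lazy-import and warm_up() pattern described above could look roughly like this. It is a simplified sketch under stated assumptions, not a copy of RecursiveDocumentSplitter: `BudgetCounter` is a hypothetical class, and a plain `try`/`except ImportError` stands in for whatever lazy-import helper the codebase uses. Only `tiktoken.get_encoding` and `Encoding.encode` are real tiktoken calls.

```python
from typing import Any, Literal, Optional


class BudgetCounter:
    """Hypothetical helper, not part of Haystack."""

    def __init__(
        self,
        count_mode: Literal["word", "char", "token"] = "word",
        tokenizer_encoding: str = "o200k_base",
    ) -> None:
        if count_mode not in ("word", "char", "token"):
            raise ValueError(f"Unknown count_mode: {count_mode!r}")
        self.count_mode = count_mode
        self.tokenizer_encoding = tokenizer_encoding
        self._tokenizer: Optional[Any] = None

    def warm_up(self) -> None:
        # Only token mode needs the optional dependency, so word and char
        # users never pay for (or need to install) tiktoken.
        if self.count_mode != "token" or self._tokenizer is not None:
            return
        try:
            import tiktoken  # imported lazily, on first warm_up() in token mode
        except ImportError as exc:
            raise ImportError(
                "count_mode='token' requires tiktoken (pip install tiktoken)"
            ) from exc
        self._tokenizer = tiktoken.get_encoding(self.tokenizer_encoding)

    def count(self, text: str) -> int:
        if self.count_mode == "word":
            return len(text.split())
        if self.count_mode == "char":
            return len(text)
        if self._tokenizer is None:
            raise RuntimeError("Call warm_up() before counting in token mode")
        return len(self._tokenizer.encode(text))


counter = BudgetCounter()  # default "word" mode needs no tokenizer
counter.warm_up()          # no-op outside token mode
print(counter.count("three short words"))
```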

Describe alternatives you've considered
A few things I thought about and rejected:

  • Rename word_count_threshold -> threshold. Cleaner, but it breaks every existing user and every serialized pipeline using this ranker. Not worth it. Keeping the name (with a docstring note that it is slightly inaccurate in token mode) seems much less disruptive. An alias could be a follow-up, but I'd keep it out of this PR.
  • Pre-truncate documents myself before the ranker. Defeats the purpose — the ranker exists precisely so I don't have to. Also, pre-truncation doesn't compose with the U-shaped reordering, since the reordering itself decides which docs end up included.
  • Just document the limitation. A docstring warning helps a careful reader, but it doesn't fix the silent-failure case where a user has set a "reasonable" word threshold and is getting truncated prompts they don't know about.

Additional context
The existing precedent I mentioned is in haystack/components/preprocessors/recursive_splitter.py (the split_unit parameter, the tiktoken_imports lazy import, and the warm_up() block that calls tiktoken.get_encoding("o200k_base")). If a token-counting mode is the right call here, I'd want to mirror those exact choices for consistency rather than introduce new ones.

Labels: P3 (Low priority, leave it in the backlog)