Monitor TermFilteredPresearcher does not return stored query if it contains filter field #14427

bjacobowitz · 2025-04-02T16:33:45Z

Description

TermFilteredPresearcher may fail to return stored queries if those queries contain the filter field in the query itself (not just in the metadata).

When building the presearcher query (here) , TermFilteredPresearcher removes filter field tokens from the main clause of the presearcher query (via acceptor here) , presumably under the assumption that they will be added back to the query in a separate clause for filter fields, ANDed with the first clause (here).

We end up with a presearcher query that is something like

+(some_other_field_in_query_index: value) #(+(my_filter_field:value))

However, if an indexed query contains that filter field, and if that field was the only indexed field for the query's associated document in the query index, the first of the AND'd clauses cannot match (because the filter field was omitted), so the overall AND'd presearcher query cannot match, and the presearcher fails to return the query.

A user can work around this by using an additional dedicated field for the filter field (i.e. adding it on both the query metadata and the document), but this seems like an easy trap to fall into.

My question here: is this intentional? Is the idea of a "filter field" that it appears in documents and in MonitorQuery metadata but must not appear in a itself query? I'm aware of another issue about the intended behavior of Monitor filter fields (#11607), so I'm unsure.

If intentional, I think we should document that more directly. If unintentional, we might consider removing the check on filter fields (here) when building the first part of the presearcher query.

I've set up a test project here to demonstrate the problem (specifically in this test file, where testWithEmptyMetadata works but testWithMetadata fails).

Version and environment details

Tested with Lucene 10.1.0 on macOS Sequoia 15.3.2 (but I think the problem has been around since at least Lucene version 8, if not earlier).

The text was updated successfully, but these errors were encountered:

bjacobowitz · 2025-05-05T17:54:03Z

On further reflection / investigation, I think updating the documentation is the way to go here.

I tried out permitting the filter fields on both sides of the presearcher query (in the terms clause and the filter clause) and that fixes the correctness issue, but it comes at a potentially unacceptable performance cost. See the attached patch here (or this commit) for the implementation I tried out.

With this (inadvisable) change, if we allow the filter field on both sides, we can end up with a presearcher query like this:

+((field:(test) language:(en)) __anytokenfield:__ANYTOKEN__) #(+(language:en))

This presearcher query would still match a stored query involving field:test and the filter field language:en, but by including the language on both sides the presearcher may return any query that includes language:en for matching. That runs the risk of entirely subverting the optimization of only running queries involving field:test.

The magnitude of the performance cost here depends on how specific the filter field's values are. With a somewhat specific filter field, where relatively few queries share the same value, inadvertently running all queries with the same value incurs a relatively low cost, but in the case where the filter field is something common like language, where many queries will share the same value for the filter field, the cost of running those extra queries would be high, and is probably not acceptable.

There is a larger question around "correctness" when a filter field appears in the query itself, because the query could contradict the filter field's value from metadata! For example, if the query's filter field in metadata indicates "+language:en" and the query itself has a clause "-language:en", what is the correct behavior? Hard to say. Allowing that field to float freely in the stored query doesn't really make sense and should be avoided, but I wouldn't say this is obvious to the user.

I think it would be ideal to block the monitor from storing a query which contains a filter field in the query itself, to prevent the user from stumbling into errors like I did, but to do that we would potentially need to throw a new exception when storing queries and it would run the risk of breaking existing users with a new error at runtime, so I don't think it's such a clean option.

In the absence of a more direct solution here, I think we should update the documentation around filter fields, so that anyone newly learning about them will also learn that they should not appear in the query. I will open a PR to do this.

bjacobowitz added the type:bug label Apr 2, 2025

github-project-automation bot added this to OpenSearch Lucene & Core Performance Tracking Apr 2, 2025

github-project-automation bot moved this to Open in OpenSearch Lucene & Core Performance Tracking Apr 2, 2025

bjacobowitz linked a pull request May 13, 2025 that will close this issue

Clarify filter fields usage in javadocs #14660

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Monitor TermFilteredPresearcher does not return stored query if it contains filter field #14427

Monitor TermFilteredPresearcher does not return stored query if it contains filter field #14427

bjacobowitz commented Apr 2, 2025

bjacobowitz commented May 5, 2025

Uh oh!

Monitor TermFilteredPresearcher does not return stored query if it contains filter field #14427

Monitor TermFilteredPresearcher does not return stored query if it contains filter field #14427

Comments

bjacobowitz commented Apr 2, 2025

Description

Version and environment details

bjacobowitz commented May 5, 2025

Uh oh!