Commit 2353a42

Avoid negative scores returned from multi_match query with cross_fields
Under specific circumstances, when using `cross_fields` scoring on a `multi_match` query, we can end up with negative scores from the inverse document frequency calculation in the BM25 formula. Specifically, the IDF is calculated as:

```
log(1 + (N - n + 0.5) / (n + 0.5))
```

where `N` is the number of documents containing the field and `n` is the number of documents containing the given term in the field. Obviously, `n` should always be less than or equal to `N`. Unfortunately, `cross_fields` makes up a new value for `n` and tries to use it across all fields.

This change finds the minimum (nonzero) value of `N` and uses that as an upper bound for the new value of `n`.

Signed-off-by: Michael Froh <[email protected]>
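To make the failure mode concrete, here is a minimal sketch of the IDF expression quoted above. It is not the Lucene similarity source; the class name `IdfSketch` and the sample values are hypothetical. The expression simplifies to `log((N + 1) / (n + 0.5))`, so it turns negative as soon as `n` exceeds `N`.

```java
// Illustrative sketch only: a direct transcription of the formula in the
// commit message, not code taken from Lucene or OpenSearch.
final class IdfSketch {

    /** IDF as quoted in the commit message: log(1 + (N - n + 0.5) / (n + 0.5)). */
    static double idf(long docCount, long docFreq) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        // Consistent statistics (n <= N): the logarithm's argument is above 1.
        System.out.println("N=3, n=3 -> " + idf(3, 3));
        // Inconsistent statistics (n > N): the argument drops below 1,
        // so the IDF, and with it the score contribution, is negative.
        System.out.println("N=1, n=3 -> " + idf(1, 3));
    }
}
```

The second call uses hypothetical numbers in which a document frequency taken from one field is applied to a field that only one document contains.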
1 parent 56d8dc6 commit 2353a42

3 files changed: +38 -0 lines changed

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -41,6 +41,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 - Fix get field mapping API returns 404 error in mixed cluster with multiple versions ([#13624](https://github.com/opensearch-project/OpenSearch/pull/13624))
 - Allow clearing `remote_store.compatibility_mode` setting ([#13646](https://github.com/opensearch-project/OpenSearch/pull/13646))
 - Fix ReplicaShardBatchAllocator to batch shards without duplicates ([#13710](https://github.com/opensearch-project/OpenSearch/pull/13710))
+- Don't return negative scores from `multi_match` query with `cross_fields` type ([#13829](https://github.com/opensearch-project/OpenSearch/pull/13829))
 
 ### Security
 
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+"Cross fields do not return negative scores":
+  - do:
+      index:
+        index: test
+        id: 1
+        body: { "color" : "orange red yellow" }
+  - do:
+      index:
+        index: test
+        id: 2
+        body: { "color": "orange red purple", "shape": "red square" }
+  - do:
+      index:
+        index: test
+        id: 3
+        body: { "color" : "orange red yellow purple" }
+  - do:
+      indices.refresh: { }
+  - do:
+      search:
+        index: test
+        body:
+          query:
+            multi_match:
+              query: "red"
+              type: "cross_fields"
+              fields: [ "color", "shape^100"]
+              tie_breaker: 0.1
+          explain: true
+  - match: { hits.total.value: 3 }
+  - match: { hits.hits.0._id: "2" }
+  - gt: { hits.hits.2._score: 0.0 }
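The three test documents appear to be chosen so that the two fields have mismatched statistics for the term `red`. The sketch below recomputes those statistics in plain Java as a rough illustration. It assumes simple whitespace tokenization rather than the index's real analyzer, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: recomputes per-field statistics for the three
// documents in the YAML test, assuming whitespace tokenization.
final class CrossFieldsStatsSketch {

    // The documents indexed by the test, as field -> text maps.
    static final List<Map<String, String>> DOCS = List.of(
        Map.of("color", "orange red yellow"),
        Map.of("color", "orange red purple", "shape", "red square"),
        Map.of("color", "orange red yellow purple")
    );

    /** Number of documents that have a value for the field (N in the formula). */
    static long docCount(String field) {
        return DOCS.stream().filter(d -> d.containsKey(field)).count();
    }

    /** Number of documents whose field contains the term (n in the formula). */
    static long docFreq(String field, String term) {
        return DOCS.stream()
            .filter(d -> d.containsKey(field))
            .filter(d -> Arrays.asList(d.get(field).split("\\s+")).contains(term))
            .count();
    }

    public static void main(String[] args) {
        for (String field : List.of("color", "shape")) {
            System.out.println(field + ": N=" + docCount(field) + ", n(red)=" + docFreq(field, "red"));
        }
    }
}
```

Under these assumptions `color` has `N = 3` and `n = 3`, while `shape` has `N = 1` and `n = 1`. If the larger document frequency (3) is reused for `shape`, then `n` exceeds that field's `N`, which is the condition the commit message describes.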

server/src/main/java/org/apache/lucene/queries/BlendedTermQuery.java

Lines changed: 5 additions & 0 deletions
@@ -120,6 +120,7 @@ protected void blend(final TermStates[] contexts, int maxDoc, IndexReader reader
         }
         int max = 0;
         long minSumTTF = Long.MAX_VALUE;
+        int minDocCount = Integer.MAX_VALUE;
         for (int i = 0; i < contexts.length; i++) {
             TermStates ctx = contexts[i];
             int df = ctx.docFreq();
@@ -133,11 +134,15 @@ protected void blend(final TermStates[] contexts, int maxDoc, IndexReader reader
                 // we need to find out the minimum sumTTF to adjust the statistics
                 // otherwise the statistics don't match
                 minSumTTF = Math.min(minSumTTF, reader.getSumTotalTermFreq(terms[i].field()));
+                minDocCount = Math.min(minDocCount, reader.getDocCount(terms[i].field()));
             }
         }
         if (maxDoc > minSumTTF) {
             maxDoc = (int) minSumTTF;
         }
+        if (maxDoc > minDocCount) {
+            maxDoc = minDocCount;
+        }
         if (max == 0) {
             return; // we are done that term doesn't exist at all
         }
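The effect of the added lines can be approximated in isolation. The sketch below is a simplified stand-in, not the actual `blend` method: it ignores `maxDoc` and `sumTTF` handling, takes plain arrays instead of `TermStates` and an `IndexReader`, and assumes that fields where the term is absent are skipped when computing the minimum doc count. The class and method names are hypothetical.

```java
// Illustrative sketch only: a stripped-down version of the capping idea in
// the diff above, operating on plain arrays instead of Lucene types.
final class BlendClampSketch {

    /**
     * Takes the largest per-field document frequency, then caps it by the
     * smallest doc count among the fields where the term occurs.
     */
    static int blendedDocFreq(int[] docFreqs, int[] docCounts) {
        int max = 0;
        int minDocCount = Integer.MAX_VALUE;
        for (int i = 0; i < docFreqs.length; i++) {
            max = Math.max(max, docFreqs[i]);
            if (docFreqs[i] > 0) { // assumption: skip fields where the term is absent
                minDocCount = Math.min(minDocCount, docCounts[i]);
            }
        }
        // Without the cap this would return max, which may exceed a field's doc count.
        return Math.min(max, minDocCount);
    }

    public static void main(String[] args) {
        // Hypothetical statistics: field 0 has N=3, n=3; field 1 has N=1, n=1.
        System.out.println(blendedDocFreq(new int[] { 3, 1 }, new int[] { 3, 1 }));
    }
}
```

With the cap in place, the shared document frequency never exceeds the doc count of any field that contains the term, so the IDF argument stays above 1 for each of those fields.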
