Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix derived source for binary and byte vectors #2533

Merged
merged 1 commit into from
Feb 18, 2025

Conversation

jmazanec15
Copy link
Member

Description

For binary and byte vectors, for derived source, we were not formatting them before adding them back to the source. Thus, they were binary strings in the source. This change fixes this formatting to format them as ints before adding back.

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • New functionality includes testing.
  • Commits are signed per the DCO using --signoff.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@jmazanec15 jmazanec15 added Bug Fixes Changes to a system or product designed to handle a programming bug/glitch backport 2.x labels Feb 18, 2025
For binary and byte vectors, for derived source, we were not formatting
them before adding them back to the source. Thus, they were binary
strings in the source. This change fixes this formatting to format them
as ints before adding back.

Signed-off-by: John Mazanec <[email protected]>
Copy link
Collaborator

@Vikasht34 Vikasht34 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

protected Object formatVector(FieldInfo fieldInfo, KNNVectorValues<?> vectorValues) throws IOException {
Object vectorValue = vectorValues.getVector();
// If the vector value is a byte[], we must deserialize
if (vectorValue instanceof byte[]) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use the datatype of the field here, rather than instance of check on byte[].

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a byte[] in order to deserialize, so this check is required. In terms of displaying, deserializeStoredVector takes the vectorDataType, so we can be sure that it will format it properly.

Comment on lines +34 to +35
VectorDataType vectorDataType = FieldInfoExtractor.extractVectorDataType(fieldInfo);
return KNNVectorFieldMapperUtil.deserializeStoredVector(vectorBytesRef, vectorDataType);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate why we need to do this? I am trying to understand this like why we need it, since we already have byte[]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, this is what the IT I added looks like before:

  2> java.lang.AssertionError: Docs do not match: 1 expected:<{test_vector=[115, -43, 26, -69, -40, -100, -72, 25, 111, 14, -5, 104, -110, -7, 77, 104]}> but was:<{test_vector=c9Uau9icuBlvDvtokvlNaA==}>

Basically, source is expected to be an int array, but because we are adding a byte array, it gets serialized as a byte string

@jmazanec15 jmazanec15 merged commit 0df5f62 into opensearch-project:main Feb 18, 2025
39 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Bug Fixes Changes to a system or product designed to handle a programming bug/glitch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants