-
Notifications
You must be signed in to change notification settings - Fork 143
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix derived source for binary and byte vectors #2533
Fix derived source for binary and byte vectors #2533
Conversation
For binary and byte vectors, for derived source, we were not formatting them before adding them back to the source. Thus, they were binary strings in the source. This change fixes this formatting to format them as ints before adding back. Signed-off-by: John Mazanec <[email protected]>
63cf35c
to
ed8a06e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
protected Object formatVector(FieldInfo fieldInfo, KNNVectorValues<?> vectorValues) throws IOException { | ||
Object vectorValue = vectorValues.getVector(); | ||
// If the vector value is a byte[], we must deserialize | ||
if (vectorValue instanceof byte[]) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we use the datatype of the field here, rather than instance of check on byte[].
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need a byte[] in order to deserialize, so this check is required. In terms of displaying, deserializeStoredVector takes the vectorDataType, so we can be sure that it will format it properly.
VectorDataType vectorDataType = FieldInfoExtractor.extractVectorDataType(fieldInfo); | ||
return KNNVectorFieldMapperUtil.deserializeStoredVector(vectorBytesRef, vectorDataType); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate why we need to do this? I am trying to understand this like why we need it, since we already have byte[]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, this is what the IT I added looks like before:
2> java.lang.AssertionError: Docs do not match: 1 expected:<{test_vector=[115, -43, 26, -69, -40, -100, -72, 25, 111, 14, -5, 104, -110, -7, 77, 104]}> but was:<{test_vector=c9Uau9icuBlvDvtokvlNaA==}>
Basically, source is expected to be an int array, but because we are adding a byte array, it gets serialized as a byte string
Description
For binary and byte vectors, for derived source, we were not formatting them before adding them back to the source. Thus, they were binary strings in the source. This change fixes this formatting to format them as ints before adding back.
Related Issues
Resolves #[Issue number to be closed when this PR is merged]
Check List
--signoff
.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.