v1 #280
As of the latest commit, we can compress an index ( New:
Old:
The more interesting algorithms aren't implemented yet, but stay tuned. Also, the compression is concurrent and quite fast: Robust compresses in seconds.
A quick update: I implemented precomputed scores (only as
```
Header := Version, Type, Encoding
Version := Major, Minor, Path
Type := ValueId, Count
```
I am a bit confused by `Type`. What are `ValueId` and `Count`?
`ValueId` would be the type, and `Count` would be how many of those there are: one, a pair, a tuple, etc.
Actually, so far I implemented it like this: https://github.com/pisa-engine/pisa/pull/280/files#diff-2a007c99bc1af07f1fb150c293383559R71
Another approach would be to always have scalars in one file, and join multiple ones for tuples. But then we can't store arrays of undetermined length (say, positional index). But all of this is up for discussion.
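To make the `Header := Version, Type, Encoding` discussion concrete, here is a minimal sketch of how such a header could be serialized. It is not the layout used in this PR: the one-byte width of every version and type field, the 4-byte little-endian `encoding` id, the `ValueId` tags, and the names `encode_header` / `decode_header` are all assumptions made for illustration. The third version component is named `path` only to mirror the grammar above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical value-type tags; the real set of ids is not fixed in this thread.
enum class ValueId : std::uint8_t { U32 = 0, F32 = 1 };

// Type := ValueId, Count (Count = 1 for a scalar, 2 for a pair, and so on).
struct Type {
    ValueId value_id;
    std::uint8_t count;
};

// Version := Major, Minor, Path (assumed to be one byte each).
struct Version {
    std::uint8_t major_version;
    std::uint8_t minor_version;
    std::uint8_t path;
};

// Header := Version, Type, Encoding
struct Header {
    Version version;
    Type type;
    std::uint32_t encoding;  // hypothetical numeric id of the block codec
};

constexpr std::size_t header_size = 9;  // 3 + 2 + 4 bytes under the assumptions above

// Writes the header field by field, with the encoding id in little-endian order.
std::array<std::uint8_t, header_size> encode_header(Header const& h) {
    std::array<std::uint8_t, header_size> out{};
    out[0] = h.version.major_version;
    out[1] = h.version.minor_version;
    out[2] = h.version.path;
    out[3] = static_cast<std::uint8_t>(h.type.value_id);
    out[4] = h.type.count;
    for (std::size_t i = 0; i < 4; ++i) {
        out[5 + i] = static_cast<std::uint8_t>((h.encoding >> (8 * i)) & 0xFFu);
    }
    return out;
}

// Reads a header back; returns nullopt if the buffer is too short.
std::optional<Header> decode_header(std::uint8_t const* data, std::size_t size) {
    if (size < header_size) {
        return std::nullopt;
    }
    Header h{};
    h.version = Version{data[0], data[1], data[2]};
    h.type = Type{static_cast<ValueId>(data[3]), data[4]};
    h.encoding = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        h.encoding |= static_cast<std::uint32_t>(data[5 + i]) << (8 * i);
    }
    return h;
}
```

With a layout like this, a pair of floats would be described as `Type{ValueId::F32, 2}`. Arrays of undetermined length would still need a convention of their own, which is the open question raised above.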
> The latter should be further discussed.

```
Posting File := Header, [Posting Block]
```
Is there one `Header` for each `Posting Block`?
No, one header, followed by a list of blocks.
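As a rough illustration of "one header, followed by a list of blocks", the sketch below walks a posting file held in memory. The block framing is not specified in this thread, so the sketch assumes each block is prefixed with a 4-byte little-endian length and that the header occupies a fixed 9 bytes. `BlockView`, `split_blocks`, and `posting_header_size` are hypothetical names, not part of the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed fixed header length in bytes; the real size depends on the final spec.
constexpr std::size_t posting_header_size = 9;

// Non-owning view of one block's payload inside the file buffer.
struct BlockView {
    std::uint8_t const* data;
    std::size_t size;
};

// Reads a 32-bit little-endian integer from four bytes.
std::uint32_t read_u32_le(std::uint8_t const* p) {
    return static_cast<std::uint32_t>(p[0]) | (static_cast<std::uint32_t>(p[1]) << 8)
        | (static_cast<std::uint32_t>(p[2]) << 16) | (static_cast<std::uint32_t>(p[3]) << 24);
}

// Posting File := Header, [Posting Block]
// Skips the single header, then collects length-prefixed blocks until the
// buffer ends. A truncated trailing block is dropped.
std::vector<BlockView> split_blocks(std::vector<std::uint8_t> const& file) {
    std::vector<BlockView> blocks;
    if (file.size() < posting_header_size) {
        return blocks;
    }
    std::size_t pos = posting_header_size;
    while (file.size() - pos >= 4) {
        std::size_t len = read_u32_le(file.data() + pos);
        pos += 4;
        if (len > file.size() - pos) {
            break;  // declared length runs past the end of the file
        }
        blocks.push_back(BlockView{file.data() + pos, len});
        pos += len;
    }
    return blocks;
}
```

The views point into the original buffer, so they stay valid only while that buffer is alive and unmodified.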
Quick tests on Clueweb09B show essentially the same results for BM25 ranked OR as before, while the average drops from |
I've started drafting an index format specification, along with some code examples.
Let's discuss!
include/pisa/v1
test/test_v1.cpp