Query container #373
Conversation
I think this is a great idea. Looks good to me. I was thinking about the thresholds issue just before, and how we should probably validate it somehow (at least requiring query identifier with the threshold), but this is even better.
Minor comment, looks nice! Good job.
```cpp
        data.processed_terms = std::move(terms);
        at_least_one_required = true;
    }
    if (auto term_ids = get<std::vector<std::uint32_t>>(json, "term_ids"); term_ids) {
```
Is there any chance we'd have a case where the JSON contains the raw query, terms, and identifiers all at once? And if so, which would we take as being "correct"? Or should we validate that they are equal somewhere?
I was thinking of taking whatever's most processed, so, say, IDs, and maybe having a flag `--force-parse` or something? Or we could decide depending on whether a lexicon is provided or not. But the disadvantage of that is that someone could leave it in by mistake, but not define, say, the stemmer, and then it's messed up. I don't think there's a perfect solution, but whatever we decide should be easy to reason about. I think it's easier when you say "always take IDs first; if no IDs, then terms; otherwise the raw query": there's a clear hierarchy, and if you want something else, you have to be explicit. Or we can say: always try to recompute as much as possible, and be explicit if you don't want to do that. Either of these two seems best to me at the moment.
What are your thoughts?
Overall I like the idea of having more structured queries.
I have a few concerns about the design:
- I don't really like the PIMPL design. Mostly personal taste, plus I think it complicates the code. If the reason for using this design is compilation time, then I'd prefer not to introduce it here.
- I am not sure about having queries and thresholds together. I understand it prevents the user from passing the wrong files, but on the other hand it reduces flexibility: now a query file is chained to a `k`. I prefer more flexibility. Ideally, you should pass the QueryReader a file handle or a stream, so that we can pass two separate files and the final QueryContainer will have query-threshold pairs.

All the rest looks good to me.
```cpp
    [[nodiscard]] auto id() const noexcept -> std::optional<std::string> const&;
    [[nodiscard]] auto string() const noexcept -> std::optional<std::string> const&;
    [[nodiscard]] auto terms() const noexcept -> std::optional<std::vector<std::string>> const&;
    [[nodiscard]] auto term_ids() const noexcept -> std::optional<std::vector<std::uint32_t>> const&;
```
Let's not use `std::uint32_t` unless we decide to prepend `std` to all the PODs.
Fixed-width integers like `uint32_t` are part of the standard library and are located in the `std` namespace. The fact that some headers export them at the root level is not standard. These types are defined without namespaces (for obvious reasons) in the C standard; compare the example in https://en.cppreference.com/w/cpp/types/integer with https://en.cppreference.com/w/c/types/integer. In either case, they are not part of the set of fundamental integer types: https://en.cppreference.com/w/cpp/language/types

This has nothing to do with being a POD. `struct CustomStruct { int x; };` is a POD, yet you would use it just the same as `class Complex { /* magic heap stuff going on */ };`.
include/pisa/query/parser.hpp
```cpp
  private:
    [[nodiscard]] auto is_stopword(std::uint32_t const term) const -> bool;

    std::unique_ptr<StandardTermResolverParams> m_self;
```
Why `unique_ptr`?
```cpp
{
    auto pos = std::find(line.begin(), line.end(), ':');
    QueryContainer query;
    QueryContainerInner& data = *query.m_data;
```
Why a reference? Why a pointer? I am not very comfortable with this.
I can only assume that by "pointer" you mean why `m_data` is a pointer, which I explain in a different comment. As to the reference, it's just a reference: I want a shorter alias for `m_data`, that's all. I'm not following what the issue is here.
Let me explain why we want to use a pointer.

**Move overhead.** Once you start putting many members in a structure, it gets quite big. Any time you move it, all of that data needs to be copied; I'm referring to the members themselves, i.e., the strings and vectors inside. And although I agree this on its own could be less of an issue, considering that this is not the most important thing, it strengthens the following argument.

**Compilation.** Let me start by saying that I know that the query container itself is a small object that doesn't require much time to compile. Although the argument that you need to compile it about two dozen times is still valid, this is not about that. The real problem that we've struggled with for a long time is that each small change causes recompilation of so much code that is not as fast to compile. Consider this: the query is an essential part that will be used in a majority of compilation units. Each time you change something minor in the implementation, most of the binaries will need to be recompiled. And yes, you still need to link it, but that's considerably faster than parsing, generating template code, and everything else that is so time consuming. And yes, the header still might change, but it will change much less often than the implementation. So I say we should put as little implementation as possible into such a hot header.

**Downsides.** All that said, let's talk about the actual downsides: there is the additional boilerplate of declaring the special member functions in the header and defining them in the cpp file. Note that outside of that one cpp file you never deal with the fact that there's any kind of pointer inside. The class still has normal value semantics; the only real downside is the boilerplate itself.

As far as the "additional complexity" argument goes, I don't buy it, because it just makes certain fields private. And if one wants to explore the implementation, it just goes one level down. There are no tricks here; it's just a pointer, it's not rocket science. Especially if you compare it to some of the stuff we do with templates, this is just so uncomplicated.
I thought about having `threshold`, and I agree that just a single threshold is not ideal. I think we should have `thresholds`:

```json
"thresholds": [
    {"k": 10, "score": 5.3},
    {"k": 100, "score": 3.3},
    {"k": 1000, "score": 1.3}
]
```

This way, if we ask to run an algorithm that depends on a threshold for a given `k`, it can be looked up in this list. Note that just like we can pass a term lexicon, parser, and stemmer to produce IDs, we should eventually be able to ask to estimate (one way or another) the thresholds if they are missing, or if we want to force re-computation.
Codecov Report

```diff
@@           Coverage Diff           @@
##           master     #373   +/-  ##
=======================================
  Coverage   92.42%   92.42%
=======================================
  Files          91       93     +2
  Lines        4921     4923     +2
=======================================
+ Hits         4548     4550     +2
  Misses        373      373
```

Continue to review full report at Codecov.
This is still work in progress but you might have comments already.
Basically, I introduce `QueryContainer`, which is data that can be read from input and written to output (more info later), and `QueryRequest` (names might differ in the end), which is the object that you'd pass to the processing algorithm.

The motivation is to keep track of queries and their related features. For example, you can pass a query and its threshold together, so that there's no accidental shift when passing two files (the two files could even, by accident, be completely unrelated, say, a wrong set of thresholds mixed in from another experiment). This also allows piping stuff along.

The second major concern is having data from two different tasks that are not aligned. This might not be visible right away, but it will become clearer when I introduce some additional algorithms and bigrams. Anyway, here's the gist: say you have a query, `a b c a`. Clearly, you will not open four posting lists when you start processing, so when you pass it to the query, it becomes `a b c`. But it's possible that you want to bring in some additional term-related information from a different tool, be it within PISA or not. So I think it's important to have a deterministic way of, say, producing a list of term IDs. I figured it would be good to separate all query information that would be taken in and out from the data that would only be used internally by the algorithms.

Anyway, this is my first take on it; let me know what you think, and I'll keep thinking about it as well. But this is really important for the algorithms I want to port from `v1`, and it would also simplify and improve our command line greatly.