
Query container #373

Open · wants to merge 27 commits into main
Conversation

@elshize (Member) commented Apr 26, 2020

This is still a work in progress, but you might have comments already.

Basically, I introduce QueryContainer, which is the data that can be read from input and written to output (more info later), and QueryRequest (the names might differ in the end), which is the object that you'd pass to the processing algorithm.
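To make the split concrete, here is a minimal, self-contained sketch of the two roles (illustrative only, not the actual PISA API; all field names are placeholders): QueryContainer holds whatever can be read from input and written to output, while QueryRequest is the stripped-down object handed to a processing algorithm.

#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Stripped-down view consumed by a retrieval algorithm: resolved term IDs and k.
struct QueryRequest {
    std::vector<std::uint32_t> term_ids;
    std::size_t k;
};

// I/O-facing record: every field is optional because any subset may be present
// in the input, and the whole record can be serialized back out.
struct QueryContainer {
    std::optional<std::string> id;
    std::optional<std::string> raw;                      // unparsed query string
    std::optional<std::vector<std::string>> terms;       // parsed (possibly stemmed) terms
    std::optional<std::vector<std::uint32_t>> term_ids;  // resolved term IDs
    std::optional<float> threshold;                      // score threshold, if known

    // Build the object passed to the processing algorithm (assumes IDs are resolved).
    [[nodiscard]] auto query(std::size_t k) const -> QueryRequest
    {
        return QueryRequest{term_ids.value(), k};
    }
};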

The motivation is to keep track of queries and their related features. So, for example, you can pass a query and its threshold together, so that there's no accidental shift when two files are passed (or the two files could even be completely unrelated by accident, say, a wrong set of thresholds mixed in from another experiment). This also allows piping stuff along.

The second major concern is having data from two different tasks that are not aligned. This might not be visible right away but will become clearer when I introduce some additional algorithms and bigrams. Anyway, here's the gist: say you have a query, a b c a. Clearly, you will not open four posting lists when you start the processing. So when you pass it to query processing, it becomes a b c. But it's possible that you want to bring in some additional term-related information from a different tool, be it within PISA or not. So I think it's important to have a deterministic way of, say, producing a list of term IDs. I figured it would be good to separate all the query information that would be taken in and out from the data that would only be used internally by the algorithms.
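As a toy illustration of the a b c a example (a sketch only; the lexicon and helper function are made up, not PISA code), the container would keep the terms exactly as given, while the processing side works on deduplicated term IDs produced in a deterministic way:

#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Map terms to IDs through a (hypothetical) lexicon, then drop duplicates while
// keeping the first occurrence, so that "a b c a" opens three posting lists.
auto unique_term_ids(std::vector<std::string> const& terms,
                     std::unordered_map<std::string, std::uint32_t> const& lexicon)
    -> std::vector<std::uint32_t>
{
    std::vector<std::uint32_t> ids;
    std::unordered_set<std::uint32_t> seen;
    for (auto const& term: terms) {
        auto id = lexicon.at(term);
        if (seen.insert(id).second) {
            ids.push_back(id);
        }
    }
    return ids;
}

int main()
{
    std::unordered_map<std::string, std::uint32_t> lexicon{{"a", 0}, {"b", 1}, {"c", 2}};
    for (auto id: unique_term_ids({"a", "b", "c", "a"}, lexicon)) {
        std::cout << id << ' ';  // prints: 0 1 2
    }
}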

Anyway, this is my first take on it, let me know what you think, and I'll keep thinking about it as well. But this is really important for the algorithms I want to port from v1, and would also simplify and improve our command line greatly.

@elshize added the wip label Apr 26, 2020
@elshize requested review from JMMackenzie and amallia April 26, 2020 23:29
@elshize self-assigned this Apr 26, 2020
@elshize marked this pull request as draft April 26, 2020 23:29
@JMMackenzie (Member) left a comment:
I think this is a great idea. Looks good to me. I was thinking about the thresholds issue just before, and how we should probably validate it somehow (at least requiring a query identifier with the threshold), but this is even better.

@elshize marked this pull request as ready for review April 27, 2020 15:56
@elshize marked this pull request as draft April 27, 2020 15:57
@elshize marked this pull request as ready for review April 28, 2020 23:49
@elshize added the enhancement and refactoring labels and removed the wip label Apr 29, 2020
@JMMackenzie (Member) left a comment:
Minor comment, looks nice! Good job.

data.processed_terms = std::move(terms);
at_least_one_required = true;
}
if (auto term_ids = get<std::vector<std::uint32_t>>(json, "term_ids"); term_ids) {
Member:
Is there any chance we'd have a case where the json contains raw, terms, and identifiers? And if so, which will we take as being "correct", or should we validate that they are equal somewhere?

Member Author:
I was thinking of taking whatever's most processed, so, say, IDs, and maybe having a flag like --force-parse or something? Or we could decide depending on whether a lexicon is provided or not. But the disadvantage of that is that someone could leave it in by mistake, but not define, say, a stemmer, and then it's messed up. I don't think there's a perfect solution, but whatever we decide should be easy to reason about. I think it's easier when you say "always take IDs first; if there are no IDs, then terms; otherwise the raw query": there's a clear hierarchy, and if you want something else, then be explicit. Or we can say: always try to recompute as much as possible, and be explicit if you don't want that. Either of these two seems best to me at the moment.
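For illustration, the first option (a fixed hierarchy: IDs, then terms, then the raw string) could look roughly like this; this is only a sketch, and the type aliases and helpers are placeholders rather than the actual implementation:

#include <cstdint>
#include <functional>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

using TermIds = std::vector<std::uint32_t>;
using MapTermsToIds = std::function<TermIds(std::vector<std::string> const&)>;
using ParseToTerms = std::function<std::vector<std::string>(std::string const&)>;

auto resolve_term_ids(std::optional<TermIds> const& term_ids,
                      std::optional<std::vector<std::string>> const& terms,
                      std::optional<std::string> const& raw,
                      MapTermsToIds const& map_terms,
                      ParseToTerms const& parse) -> TermIds
{
    if (term_ids) {  // 1. IDs win if they are already there
        return *term_ids;
    }
    if (terms) {  // 2. otherwise, map parsed terms through the lexicon
        return map_terms(*terms);
    }
    if (raw) {  // 3. otherwise, parse the raw query string first
        return map_terms(parse(*raw));
    }
    throw std::invalid_argument("query has neither IDs, terms, nor a raw string");
}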

What are your thoughts?

@amallia (Member) left a comment:
Overall I like the idea of having more structured queries.
I have a few concerns about the design:

  • I don't really like the PIMPL design. Mostly personal taste, plus I think it complicates the code. If the reason for using this design is compilation, then I prefer not to introduce it here.
  • I am not sure about having queries and thresholds together. I understand it prevents the user from passing the wrong files, but on the other hand it reduces flexibility. Now a query file is chained to a k. I prefer more flexibility. Ideally, you should pass a file handle or a stream to the QueryReader, so that we can pass two separate files and the final QueryContainer will have query-threshold pairs (see the sketch below).

All the rest looks good for me.
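A rough sketch of what the stream-based suggestion could look like (the helper and the input formats are hypothetical, not the actual QueryReader API): queries and thresholds come from two separate streams and are joined by query ID, so the resulting records still carry query-threshold pairs.

#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

struct QueryWithThreshold {
    std::string id;
    std::string query;
    float threshold;
};

// Assumed formats: "id:query string" in `queries`, "id score" in `thresholds`.
auto read_and_join(std::istream& queries, std::istream& thresholds)
    -> std::vector<QueryWithThreshold>
{
    std::map<std::string, float> threshold_by_id;
    std::string line;
    while (std::getline(thresholds, line)) {
        std::istringstream is(line);
        std::string id;
        float score;
        is >> id >> score;
        threshold_by_id[id] = score;
    }
    std::vector<QueryWithThreshold> result;
    while (std::getline(queries, line)) {
        auto colon = line.find(':');
        auto id = line.substr(0, colon);
        // .at() fails loudly if a query has no matching threshold.
        result.push_back({id, line.substr(colon + 1), threshold_by_id.at(id)});
    }
    return result;
}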

[[nodiscard]] auto id() const noexcept -> std::optional<std::string> const&;
[[nodiscard]] auto string() const noexcept -> std::optional<std::string> const&;
[[nodiscard]] auto terms() const noexcept -> std::optional<std::vector<std::string>> const&;
[[nodiscard]] auto term_ids() const noexcept -> std::optional<std::vector<std::uint32_t>> const&;
Member:
Let's not use std::uint32_t unless we decide to prepend std:: to all the PODs.

Member Author:
Fixed-width integers like uint32_t are part of the standard library and live in the std namespace. The fact that some headers export them at the root level is not standard. These types are defined without namespaces (for obvious reasons) in the C standard: compare the example in https://en.cppreference.com/w/cpp/types/integer with https://en.cppreference.com/w/c/types/integer. In either case, they are not part of the set of fundamental integer types: https://en.cppreference.com/w/cpp/language/types

This has nothing to do with being a POD. struct CustomStruct { int x; } is a POD, yet you would use it just the same as class Complex { /* magic heap stuff going on */ }.
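A two-line illustration of that point (my own example, not from this PR):

#include <cstdint>   // C++ header: guarantees std::uint32_t and friends in namespace std

std::uint32_t x = 42;  // always well-formed after including <cstdint>
// uint32_t y = 42;    // accepted by most implementations, but the unqualified
//                     // global-scope name is not guaranteed by the C++ standard

int main() {}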

private:
[[nodiscard]] auto is_stopword(std::uint32_t const term) const -> bool;

std::unique_ptr<StandardTermResolverParams> m_self;
Member:
Why unique_ptr?

{
auto pos = std::find(line.begin(), line.end(), ':');
QueryContainer query;
QueryContainerInner& data = *query.m_data;
Member:
Why a reference? Why a pointer? I am not very comfortable with this.

Member Author:
I can only assume that by "pointer" you mean why m_data is a pointer, which I explain in a different comment. As for the reference, it's just a reference: I want a shorter alias to m_data, that's all. I'm not following what the issue is here.

@elshize (Member Author) commented May 1, 2020

Let me explain why we want to use a pointer in QueryContainer (the same goes for the term resolver, although to a lesser extent, as it's not quite as important there as it is in this case). There are two major reasons; let me tackle them one by one.

Move overhead.

Once you start putting many members in a structure, it gets quite big. Any time you move it, all of that data needs to be copied. I'm referring to the members, i.e., std::optional, not the vector data. So each time you move a query, you copy five fields. So imagine you have a vector of queries and you want to sort them. Each swap will have to move about 4 pointers, 4 sizes and one float.

And although I agree that this on its own could be a lesser issue, given that it's not the most important thing here, it strengthens the following argument.

Compilation

Let me start by saying that I know that query container itself is a small object that doesn't require much time to compile. Although the argument that you need to compile it about two dozen times is still valid, this is not about that.

The real problem that we've struggled with for a long time is that each small change causes recompilation of so much code that is not nearly as fast to compile. Consider this: the query is an essential part that will be used in a majority of compilation units. Each time you change something minor in the implementation, it will need to recompile most of the binaries. And yes, you still need to link, but that's considerably faster than parsing, generating template code, and everything else that is so time-consuming. And yes, the header still might change, but it will change much less often than the implementation. So I say we should put as little as we need into such a hot header as query.hpp, which also means that the definitions of types we don't need to expose anywhere don't belong in the header.

Downsides

All that said, let's talk about the actual downsides. Here's the additional boilerplate:

  • Header
    1. One line of forward declaration.
    2. std::unique_ptr field (but replacing a number of fields)
  • Cpp
    1. Additional struct QueryContainerInner { and } lines (everything inside would just be in a different place).
    2. Three additional lines for copy constructor and copy assignment operator compared to = default.
    3. m_data indirection when accessing data in methods.

Note that outside of that one cpp file you never deal with the fact that there's any kind of pointer inside it. The class still has normal value semantics, move is faster, copies have exactly the same semantics, there's literally nothing tricky about using this class.

The only real downside is m_data indirection and it is in one small file only. It doesn't affect any other piece of code.

As far as "additional complexity" argument goes, I don't buy it because it just makes certain fields private. And if one wants to explore the implementation, it just goes one level down, there's no tricks here, it's just a pointer, it's not rocket science. Especially if you compare to some stuff we do with templates, this is just so damn uncomplicated.

@elshize (Member Author) commented May 1, 2020

I thought about the threshold, and I agree that it's not ideal to have just a single threshold. I think we should have thresholds:

thresholds: [
    {"k": 10, "score": 5.3},
    {"k": 100, "score": 3.3},
    {"k": 1000, "score": 1.3}
]

This way, if we ask to run an algorithm that depends on a threshold for k = 20 and we don't have these thresholds, then we know to fail and what to print out as an error.

Note that, just like we can pass a term lexicon, parser, and stemmer to produce IDs, we should eventually be able to ask to estimate (one way or another) the thresholds if they are missing or if we want to force re-computation.
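As a sketch of the failure behaviour described above (hypothetical types, not the actual API), the lookup would find the threshold for the requested k and fail with an informative error when it is missing:

#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

struct Threshold {
    std::size_t k;
    float score;
};

// Return the score threshold recorded for this k, or fail loudly so the caller
// knows exactly which threshold was missing.
auto threshold_for(std::vector<Threshold> const& thresholds, std::size_t k) -> float
{
    for (auto const& entry: thresholds) {
        if (entry.k == k) {
            return entry.score;
        }
    }
    throw std::invalid_argument(
        "no threshold available for k = " + std::to_string(k)
        + "; provide one in the query container or request estimation");
}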

@elshize mentioned this pull request May 6, 2020
@codecov bot commented Jun 15, 2020

Codecov Report

Merging #373 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##           master     #373   +/-   ##
=======================================
  Coverage   92.42%   92.42%           
=======================================
  Files          91       93    +2     
  Lines        4921     4923    +2     
=======================================
+ Hits         4548     4550    +2     
  Misses        373      373           
Impacted Files Coverage Δ
include/pisa/query.hpp 100.00% <100.00%> (ø)
include/pisa/query/query_parser.hpp 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update dde6095...8e6da84.

Labels: enhancement (New feature or request), refactoring
3 participants