
Query container #373

Open · wants to merge 27 commits into main
Conversation

@elshize (Member) commented Apr 26, 2020

This is still a work in progress, but you might have comments already.

Basically, I introduce QueryContainer, which is the data that can be read from input and written to output (more info later), and QueryRequest (the names might differ in the end), which is the object that you'd pass to the processing algorithm.
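To make the split concrete, here is a minimal, self-contained sketch of the two roles (illustrative only, not the actual PISA API; all field names are placeholders): QueryContainer holds whatever can be read from input and written to output, while QueryRequest is the stripped-down object handed to a processing algorithm.

#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Stripped-down view consumed by a retrieval algorithm: resolved term IDs and k.
struct QueryRequest {
    std::vector<std::uint32_t> term_ids;
    std::size_t k;
};

// I/O-facing record: every field is optional because any subset may be present
// in the input, and the whole record can be serialized back out.
struct QueryContainer {
    std::optional<std::string> id;
    std::optional<std::string> raw;                      // unparsed query string
    std::optional<std::vector<std::string>> terms;       // parsed (possibly stemmed) terms
    std::optional<std::vector<std::uint32_t>> term_ids;  // resolved term IDs
    std::optional<float> threshold;                      // score threshold, if known

    // Build the object passed to the processing algorithm (assumes IDs are resolved).
    [[nodiscard]] auto query(std::size_t k) const -> QueryRequest
    {
        return QueryRequest{term_ids.value(), k};
    }
};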

The motivation is to keep track of queries and their related features. So, for example, you can pass a query and its threshold together, so that there's no accidental shift when two files are passed (or the two files could even be completely unrelated by accident, say, a wrong set of thresholds mixed in from another experiment). This also allows piping stuff along.

The second major concern is having data from two different tasks that are not aligned. This might not be visible right away but will become clearer when I introduce some additional algorithms and bigrams. Anyway, here's the gist: say you have a query, a b c a. Clearly, you will not open four posting lists when you start the processing. So when you pass it to query processing, it becomes a b c. But it's possible that you want to bring in some additional term-related information from a different tool, be it within PISA or not. So I think it's important to have a deterministic way of, say, producing a list of term IDs. I figured it would be good to separate all the query information that would be taken in and out from the data that would only be used internally by the algorithms.
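As a toy illustration of the a b c a example (a sketch only; the lexicon and helper function are made up, not PISA code), the container would keep the terms exactly as given, while the processing side works on deduplicated term IDs produced in a deterministic way:

#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Map terms to IDs through a (hypothetical) lexicon, then drop duplicates while
// keeping the first occurrence, so that "a b c a" opens three posting lists.
auto unique_term_ids(std::vector<std::string> const& terms,
                     std::unordered_map<std::string, std::uint32_t> const& lexicon)
    -> std::vector<std::uint32_t>
{
    std::vector<std::uint32_t> ids;
    std::unordered_set<std::uint32_t> seen;
    for (auto const& term: terms) {
        auto id = lexicon.at(term);
        if (seen.insert(id).second) {
            ids.push_back(id);
        }
    }
    return ids;
}

int main()
{
    std::unordered_map<std::string, std::uint32_t> lexicon{{"a", 0}, {"b", 1}, {"c", 2}};
    for (auto id: unique_term_ids({"a", "b", "c", "a"}, lexicon)) {
        std::cout << id << ' ';  // prints: 0 1 2
    }
}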

Anyway, this is my first take on it, let me know what you think, and I'll keep thinking about it as well. But this is really important for the algorithms I want to port from v1, and would also simplify and improve our command line greatly.

@elshize added the wip label Apr 26, 2020
@elshize requested review from JMMackenzie and amallia April 26, 2020 23:29
@elshize self-assigned this Apr 26, 2020
@elshize marked this pull request as draft April 26, 2020 23:29
@JMMackenzie (Member) left a comment:
I think this is a great idea. Looks good to me. I was thinking about the thresholds issue just before, and how we should probably validate it somehow (at least requiring a query identifier with the threshold), but this is even better.

@elshize marked this pull request as ready for review April 27, 2020 15:56
@elshize marked this pull request as draft April 27, 2020 15:57
@elshize marked this pull request as ready for review April 28, 2020 23:49
@elshize added the enhancement and refactoring labels and removed the wip label Apr 29, 2020
@JMMackenzie (Member) left a comment:
Minor comment, looks nice! Good job.

data.processed_terms = std::move(terms);
at_least_one_required = true;
}
if (auto term_ids = get<std::vector<std::uint32_t>>(json, "term_ids"); term_ids) {
Member:
Is there any chance we'd have a case where the json contains raw, terms, and identifiers? And if so, which will we take as being "correct", or should we validate that they are equal somewhere?

Member Author:
I was thinking of taking whatever's most processed, so, say, IDs, and maybe having a flag like --force-parse or something? Or we could decide depending on whether a lexicon is provided or not. But the disadvantage of that is that someone could leave it in by mistake, but not define, say, a stemmer, and then it's messed up. I don't think there's a perfect solution, but whatever we decide should be easy to reason about. I think it's easier when you say "always take IDs first; if there are no IDs, then terms; otherwise the raw query": there's a clear hierarchy, and if you want something else, then be explicit. Or we can say: always try to recompute as much as possible, and be explicit if you don't want that. Either of these two seems best to me at the moment.
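For illustration, the first option (a fixed hierarchy: IDs, then terms, then the raw string) could look roughly like this; this is only a sketch, and the type aliases and helpers are placeholders rather than the actual implementation:

#include <cstdint>
#include <functional>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

using TermIds = std::vector<std::uint32_t>;
using MapTermsToIds = std::function<TermIds(std::vector<std::string> const&)>;
using ParseToTerms = std::function<std::vector<std::string>(std::string const&)>;

auto resolve_term_ids(std::optional<TermIds> const& term_ids,
                      std::optional<std::vector<std::string>> const& terms,
                      std::optional<std::string> const& raw,
                      MapTermsToIds const& map_terms,
                      ParseToTerms const& parse) -> TermIds
{
    if (term_ids) {  // 1. IDs win if they are already there
        return *term_ids;
    }
    if (terms) {  // 2. otherwise, map parsed terms through the lexicon
        return map_terms(*terms);
    }
    if (raw) {  // 3. otherwise, parse the raw query string first
        return map_terms(parse(*raw));
    }
    throw std::invalid_argument("query has neither IDs, terms, nor a raw string");
}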

What are your thoughts?

@amallia (Member) left a comment:
Overall I like the idea of having more structured queries.
I have a few concerns about the design:

  • I don't really like the PIMPL design. Mostly personal taste, plus I think it complicates the code. If the reason for using this design is compilation, then I prefer not to introduce it here.
  • I am not sure about having queries and thresholds together. I understand it prevents the user from passing the wrong files, but on the other hand it reduces flexibility. Now a query file is chained to a k. I prefer more flexibility. Ideally, you should pass a file handle or a stream to the QueryReader, so that we can pass two separate files and the final QueryContainer will have query-threshold pairs (see the sketch below).

All the rest looks good for me.
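A rough sketch of what the stream-based suggestion could look like (the helper and the input formats are hypothetical, not the actual QueryReader API): queries and thresholds come from two separate streams and are joined by query ID, so the resulting records still carry query-threshold pairs.

#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

struct QueryWithThreshold {
    std::string id;
    std::string query;
    float threshold;
};

// Assumed formats: "id:query string" in `queries`, "id score" in `thresholds`.
auto read_and_join(std::istream& queries, std::istream& thresholds)
    -> std::vector<QueryWithThreshold>
{
    std::map<std::string, float> threshold_by_id;
    std::string line;
    while (std::getline(thresholds, line)) {
        std::istringstream is(line);
        std::string id;
        float score;
        is >> id >> score;
        threshold_by_id[id] = score;
    }
    std::vector<QueryWithThreshold> result;
    while (std::getline(queries, line)) {
        auto colon = line.find(':');
        auto id = line.substr(0, colon);
        // .at() fails loudly if a query has no matching threshold.
        result.push_back({id, line.substr(colon + 1), threshold_by_id.at(id)});
    }
    return result;
}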

[[nodiscard]] auto id() const noexcept -> std::optional<std::string> const&;
[[nodiscard]] auto string() const noexcept -> std::optional<std::string> const&;
[[nodiscard]] auto terms() const noexcept -> std::optional<std::vector<std::string>> const&;
[[nodiscard]] auto term_ids() const noexcept -> std::optional<std::vector<std::uint32_t>> const&;
Member:
Let's not use std::uint32_t unless we decide to prepend std:: to all the PODs.

Member Author:
Fixed-width integers like uint32_t are part of the standard library and live in the std namespace. The fact that some headers export them at the root level is not standard. These types are defined without namespaces (for obvious reasons) in the C standard: compare the example in https://en.cppreference.com/w/cpp/types/integer with https://en.cppreference.com/w/c/types/integer. In either case, they are not part of the set of fundamental integer types: https://en.cppreference.com/w/cpp/language/types

This has nothing to do with being a POD. struct CustomStruct { int x; } is a POD, yet you would use it just the same as class Complex { /* magic heap stuff going on */ }.
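A two-line illustration of that point (my own example, not from this PR):

#include <cstdint>   // C++ header: guarantees std::uint32_t and friends in namespace std

std::uint32_t x = 42;  // always well-formed after including <cstdint>
// uint32_t y = 42;    // accepted by most implementations, but the unqualified
//                     // global-scope name is not guaranteed by the C++ standard

int main() {}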

private:
[[nodiscard]] auto is_stopword(std::uint32_t const term) const -> bool;

std::unique_ptr<StandardTermResolverParams> m_self;
Member:
Why unique_ptr?

{
auto pos = std::find(line.begin(), line.end(), ':');
QueryContainer query;
QueryContainerInner& data = *query.m_data;
Member:
Why a reference? Why a pointer? I am not very comfortable with this.

Member Author:
I can only assume that by "pointer" you mean why m_data is a pointer, which I explain in a different comment. As for the reference, it's just a reference: I want a shorter alias to m_data, that's all. I'm not following what the issue is here.

@elshize (Member Author) commented May 1, 2020

Let me explain why we want to use a pointer in QueryContainer (the same goes for the term resolver, although to a lesser extent, as it's not quite as important there as it is in this case). There are two major reasons; let me tackle them one by one.

Move overhead.

Once you start putting many members in a structure, it gets quite big. Any time you move it, all of that data needs to be copied. I'm referring to the members, i.e., std::optional, not the vector data. So each time you move a query, you copy five fields. So imagine you have a vector of queries and you want to sort them. Each swap will have to move about 4 pointers, 4 sizes and one float.

And although I agree that this on its own could be a lesser issue, given that it's not the most important thing here, it strengthens the following argument.

Compilation

Let me start by saying that I know that query container itself is a small object that doesn't require much time to compile. Although the argument that you need to compile it about two dozen times is still valid, this is not about that.

The real problem that we've struggled with for a long time is that each small change causes recompilation of so much code that is not nearly as fast to compile. Consider this: the query is an essential part that will be used in a majority of compilation units. Each time you change something minor in the implementation, it will need to recompile most of the binaries. And yes, you still need to link, but that's considerably faster than parsing, generating template code, and everything else that is so time-consuming. And yes, the header still might change, but it will change much less often than the implementation. So I say we should put as little as we need into such a hot header as query.hpp, which also means that the definitions of types we don't need to expose anywhere don't belong in the header.

Downsides

All that said, let's talk about the actual downsides. Here's the additional boilerplate:

  • Header
    1. One line of forward declaration.
    2. std::unique_ptr field (but replacing a number of fields)
  • Cpp
    1. Additional struct QueryContainerInner { and } lines (everything inside would just be in a different place).
    2. Three additional lines for copy constructor and copy assignment operator compared to = default.
    3. m_data indirection when accessing data in methods.

Note that outside of that one cpp file you never deal with the fact that there's any kind of pointer inside it. The class still has normal value semantics, move is faster, copies have exactly the same semantics, there's literally nothing tricky about using this class.

The only real downside is m_data indirection and it is in one small file only. It doesn't affect any other piece of code.

As far as "additional complexity" argument goes, I don't buy it because it just makes certain fields private. And if one wants to explore the implementation, it just goes one level down, there's no tricks here, it's just a pointer, it's not rocket science. Especially if you compare to some stuff we do with templates, this is just so damn uncomplicated.

@elshize (Member Author) commented May 1, 2020

I thought about the threshold, and I agree that it's not ideal to have just a single threshold. I think we should have thresholds:

thresholds: [
    {"k": 10, "score": 5.3},
    {"k": 100, "score": 3.3},
    {"k": 1000, "score": 1.3}
]

This way, if we ask to run an algorithm that depends on a threshold for k = 20 and we don't have these thresholds, then we know to fail and what to print out as an error.

Note that, just like we can pass a term lexicon, parser, and stemmer to produce IDs, we should eventually be able to ask to estimate (one way or another) the thresholds if they are missing or if we want to force re-computation.
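As a sketch of the failure behaviour described above (hypothetical types, not the actual API), the lookup would find the threshold for the requested k and fail with an informative error when it is missing:

#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

struct Threshold {
    std::size_t k;
    float score;
};

// Return the score threshold recorded for this k, or fail loudly so the caller
// knows exactly which threshold was missing.
auto threshold_for(std::vector<Threshold> const& thresholds, std::size_t k) -> float
{
    for (auto const& entry: thresholds) {
        if (entry.k == k) {
            return entry.score;
        }
    }
    throw std::invalid_argument(
        "no threshold available for k = " + std::to_string(k)
        + "; provide one in the query container or request estimation");
}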

@elshize mentioned this pull request May 6, 2020
@codecov bot commented Jun 15, 2020

Codecov Report

Merging #373 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##           master     #373   +/-   ##
=======================================
  Coverage   92.42%   92.42%           
=======================================
  Files          91       93    +2     
  Lines        4921     4923    +2     
=======================================
+ Hits         4548     4550    +2     
  Misses        373      373           
Impacted Files Coverage Δ
include/pisa/query.hpp 100.00% <100.00%> (ø)
include/pisa/query/query_parser.hpp 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update dde6095...8e6da84.

Labels: enhancement (New feature or request), refactoring
3 participants