Use the patterns from the permutations and no longer load ql:has-pattern into RAM #1223

Merged
merged 132 commits into from
Jan 18, 2024

@joka921 joka921 commented Jan 12, 2024

PRs #1168 and #1177 added the subject patterns as two additional columns to the OSP&OPS and PSO&POS permutations. PR #1226 added the triples of the ql:has-pattern predicate to the PSO&POS permutations. This PR now uses that information instead of the old patterns, which cost a lot of RAM. We tried a few queries involving patterns, and the speed is very similar to that of the previous implementation.

NOTE: This is an index-breaking change. The old .index.patterns file stored both the ql:has-pattern predicate (for each subject, its pattern) and the information about which pattern consists of which predicates. Now the .index.patterns file stores only the latter. The file is therefore significantly smaller, and its size no longer depends on the size of the dataset but only on the number of distinct patterns, which is typically small. For example, for Wikidata, the file size dropped from 17 GB to 2.8 GB. For UniProt, it dropped from 152 GB (which does not fit into the RAM of our standard machines) to something very small, because UniProt is very regular and has only very few distinct patterns.
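To illustrate the two kinds of information mentioned above, here is a minimal sketch of how patterns can be derived from sorted (subject, predicate) pairs. It is not QLever's actual code: `PatternTable`, `computePatterns`, and the ID types are hypothetical names chosen for this example. The `patterns` member corresponds to what the .index.patterns file still stores, while `subjectToPattern` corresponds to the ql:has-pattern information that now lives in the permutations.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// All names in this sketch are hypothetical and for illustration only.
using Id = std::uint64_t;         // ID of a subject or a predicate
using PatternId = std::uint32_t;  // index into PatternTable::patterns
using Pattern = std::vector<Id>;  // sorted, duplicate-free predicate IDs

struct PatternTable {
  // Pattern ID -> predicates. Grows with the number of distinct patterns.
  std::vector<Pattern> patterns;
  // Subject -> pattern ID. Grows with the number of subjects.
  std::vector<std::pair<Id, PatternId>> subjectToPattern;
};

// Input: (subject, predicate) pairs sorted by subject, then by predicate.
PatternTable computePatterns(const std::vector<std::pair<Id, Id>>& sortedPairs) {
  assert(std::is_sorted(sortedPairs.begin(), sortedPairs.end()));
  PatternTable result;
  std::map<Pattern, PatternId> idOfPattern;
  std::size_t i = 0;
  while (i < sortedPairs.size()) {
    const Id subject = sortedPairs[i].first;
    // Collect the distinct predicates of the current subject.
    Pattern pattern;
    for (; i < sortedPairs.size() && sortedPairs[i].first == subject; ++i) {
      const Id predicate = sortedPairs[i].second;
      if (pattern.empty() || pattern.back() != predicate) {
        pattern.push_back(predicate);
      }
    }
    // Reuse the ID if this predicate set was seen before, else assign a new one.
    auto [it, isNew] = idOfPattern.try_emplace(
        pattern, static_cast<PatternId>(result.patterns.size()));
    if (isNew) {
      result.patterns.push_back(std::move(pattern));
    }
    result.subjectToPattern.emplace_back(subject, it->second);
  }
  return result;
}
```

In a very regular dataset, many subjects share the same predicate set, so `patterns` stays small even when `subjectToPattern` has one entry per subject.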

Next step:
neither write nor read the old subject-to-pattern-matching.
Next step:
Prepare a preliminary PR to let Hannah try it out on real world knowledge graphs.
TODO<joka921> update the date as soon as we know on which day we merge.
# Conflicts:
#	.github/workflows/code-coverage.yml
#	test/ExceptionHandlingTest.cpp
#	test/IndexTestHelpers.h
TODO
Actually write them during CreatePermutations, and then also retrieve them during the pattern processing.
Missing piece (probably)
During the index building we need an optional join to handle the `noPattern` case for objects that don't appear as subjects.
join in a batched fashion.
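The optional join mentioned in the commit messages above can be pictured with the following minimal sketch. It is an assumption-laden illustration, not the code from this PR: `optionalJoinPatterns` and the sentinel `NO_PATTERN` are hypothetical names, and the sketch joins the whole input at once rather than in batches.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

// All names in this sketch are hypothetical and for illustration only.
using Id = std::uint64_t;
using PatternId = std::uint32_t;

// Hypothetical sentinel for "this entity has no pattern".
constexpr PatternId NO_PATTERN = std::numeric_limits<PatternId>::max();

// Left input: object IDs in ascending order (repetitions allowed).
// Right input: (subject, pattern) pairs sorted by subject, one per subject.
// Output: for each object, the pattern of the subject with the same ID, or
// NO_PATTERN if the object never appears as a subject.
std::vector<PatternId> optionalJoinPatterns(
    const std::vector<Id>& sortedObjects,
    const std::vector<std::pair<Id, PatternId>>& subjectPatterns) {
  assert(std::is_sorted(sortedObjects.begin(), sortedObjects.end()));
  assert(std::is_sorted(subjectPatterns.begin(), subjectPatterns.end()));
  std::vector<PatternId> result;
  result.reserve(sortedObjects.size());
  std::size_t j = 0;
  for (const Id object : sortedObjects) {
    // Advance the right side until it reaches or passes the current object.
    while (j < subjectPatterns.size() && subjectPatterns[j].first < object) {
      ++j;
    }
    const bool found =
        j < subjectPatterns.size() && subjectPatterns[j].first == object;
    // Optional join: unmatched left rows are kept and filled with the sentinel.
    result.push_back(found ? subjectPatterns[j].second : NO_PATTERN);
  }
  return result;
}
```

Because both inputs are sorted, a single forward pass over each side suffices, which is also what makes a batched variant straightforward.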
@hannahbast hannahbast left a comment

1-1 with Johannes, I am amazed at the amount of work (and that it works, of course)

@hannahbast hannahbast marked this pull request as ready for review January 18, 2024 17:45
@hannahbast hannahbast left a comment

Awesome, another major milestone taken!

@hannahbast hannahbast changed the title Actually use the new pattern implementation (just a draft) Use the new pattern implementation Jan 18, 2024
Quality Gate passed

The SonarCloud Quality Gate passed, but some issues were introduced.

16 New issues
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

@hannahbast hannahbast changed the title Use the new pattern implementation Use the patterns from the permutations and no longer load ql:has-pattern into RAM Jan 18, 2024
@hannahbast hannahbast merged commit d7635f0 into ad-freiburg:master Jan 18, 2024
17 of 18 checks passed
@joka921 joka921 deleted the use-new-patterns branch January 19, 2024 07:39
hannahbast pushed a commit that referenced this pull request Jan 29, 2024
So far, the method `Vocabulary::prefix_range` returned a range of
indexes of words in the internal vocabulary that match a given prefix.
This is now replaced by a method `Vocabulary::prefixRange`, which
returns two ranges, one for the internal and one for the external
vocabulary. This can easily be extended to more ranges when needed.

Based on this, our efficient implementation for REGEX when the regular
expression is a prefix now finds items in both the internal and the
external vocabulary (so far: only in the internal vocabulary).

On the side, identified and fixed a bug in the previous code, where the
special predicates starting with `@` would not be found if
`prefixes-external` matched `@`. As a consequence, QLever now works as
it should even with `"prefixes-external": [""]` (that is, everything
except the QLever-internal predicates ends up in the external vocabulary)
in the `.settings.json` file. In conjunction with #1223, this makes it
now possible to start a QLever server with very little RAM.
hannahbast added a commit that referenced this pull request Jan 30, 2024
So far, the method `Vocabulary::prefix_range` returned a range of indexes of words in the internal vocabulary that match a given prefix. This is now replaced by a method `Vocabulary::prefixRange`, which returns two ranges, one for the internal and one for the external vocabulary. This can easily be extended to more ranges when needed.

Based on this, our efficient implementation for `REGEX` when the regular expression is a prefix now finds items in both the internal and the external vocabulary (so far: only in the internal vocabulary).

On the side, identified and fixed a bug in the previous code, where the special predicates starting with `@` would not be found if `prefixes-external` matched `@`. As a consequence, QLever now works as it should even with `"prefixes-external": [""]`, that is, even when everything except the QLever-internal predicates ends up in the external vocabulary. In conjunction with #1223, this makes it now possible to start a QLever server with very little RAM.
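The idea of returning one range per vocabulary can be sketched as follows. This is only an illustration under stated assumptions, not the actual `Vocabulary::prefixRange`: the free functions `prefixRangeIn` and `prefixRange`, the struct `PrefixRanges`, and the use of plain sorted `std::vector<std::string>` vocabularies with byte-wise ordering are all hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// All names in this sketch are hypothetical and for illustration only.
using Range = std::pair<std::size_t, std::size_t>;  // half-open [begin, end)

struct PrefixRanges {
  Range internal;
  Range external;
};

// Index range of all words in a byte-wise sorted vocabulary that start with
// `prefix`. An empty prefix matches every word.
Range prefixRangeIn(const std::vector<std::string>& sortedWords,
                    std::string_view prefix) {
  assert(std::is_sorted(sortedWords.begin(), sortedWords.end()));
  // The prefix itself is the smallest string that starts with the prefix.
  auto begin = std::lower_bound(
      sortedWords.begin(), sortedWords.end(), prefix,
      [](const std::string& word, std::string_view p) {
        return std::string_view{word} < p;
      });
  // From `begin` on, the matching words come first, then the non-matching ones.
  auto end = std::partition_point(
      begin, sortedWords.end(), [&prefix](const std::string& word) {
        return std::string_view{word}.substr(0, prefix.size()) == prefix;
      });
  return {static_cast<std::size_t>(begin - sortedWords.begin()),
          static_cast<std::size_t>(end - sortedWords.begin())};
}

// One range for the internal and one for the external vocabulary.
PrefixRanges prefixRange(const std::vector<std::string>& internalVocab,
                         const std::vector<std::string>& externalVocab,
                         std::string_view prefix) {
  return {prefixRangeIn(internalVocab, prefix),
          prefixRangeIn(externalVocab, prefix)};
}
```

A caller such as a prefix filter would then iterate over both ranges; adding a third vocabulary would mean adding a third member to the result struct.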