Qlever Integration in the RDF Processing Toolkit #1818

Aklakan · 2025-02-18T15:24:05Z

Aklakan
Feb 18, 2025

I am the developer of the RDF Processing Toolkit (RPT), which is, in a nutshell, a Java-based CLI wrapper for ad-hoc loading of RDF data and running SPARQL queries against it. It is useful both for scripting and for rapid prototyping of data integration tasks.

In basic usage, you just provide data files, update statements and query statements as arguments. They are run in the given order, and the results are printed to the console (for example, multiple CONSTRUCT queries can be supplied to produce an RDF document).

By default, RPT uses the engine based on Apache Jena because we wrote many SPARQL extension functions for it.

rpt integrate data1.ttl dataN.ttl query1.rq queryN.rq
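
To make the "run in the given order" behaviour concrete, here is a minimal Python sketch. It is not RPT's implementation: the function `plan`, the extension-to-role table and the handling of compression suffixes are assumptions made purely for illustration, and RPT's real argument detection may work differently.

```python
from pathlib import Path

# Assumed mapping from file extension to action; illustrative only.
COMPRESSION = {".bz2", ".gz"}
ROLES = {".ttl": "load", ".nt": "load", ".rq": "query", ".ru": "update"}


def plan(args):
    """Turn command-line arguments into (action, argument) steps, keeping their order."""
    steps = []
    for arg in args:
        suffixes = Path(arg).suffixes
        # Ignore trailing compression suffixes such as ".bz2" when classifying.
        while suffixes and suffixes[-1] in COMPRESSION:
            suffixes = suffixes[:-1]
        role = ROLES.get(suffixes[-1]) if suffixes else None
        if role is None:
            raise ValueError(f"cannot classify argument: {arg}")
        steps.append((role, arg))
    return steps


for action, arg in plan(["data1.ttl", "dataN.ttl.bz2", "query1.rq", "queryN.rq"]):
    print(action, arg)
```

The point of the sketch is only that the steps are not regrouped by type: a query placed before a data file would run before that file is loaded.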

RPT supports different engines via the -e argument, such as tdb2 and now also 🎉 qlever 🎉.

rpt integrate -e qlever data.ttl query.rq

This will start a qlever docker container and run the data loading and querying against it.
Data loading is optimized and will use qlever's index builder. Also, compressed data such as bzip2 is automatically decompressed on the host (if lbzip2 or bzip2 is available) and supplied to the container via named pipes - so JVM overhead is avoided.
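
The named-pipe handoff can be sketched in Python roughly as follows. This only illustrates the mechanism and is not RPT's code: the names `decompress_to_fifo` and `read_via_fifo` are made up for this example, the reading side stands in for the index builder inside the container, and the fallback to Python's `bz2` module exists only so the sketch also runs when neither lbzip2 nor bzip2 is installed. It assumes a POSIX system, since it relies on `os.mkfifo`.

```python
import bz2
import os
import shutil
import subprocess
import tempfile
import threading


def decompress_to_fifo(src_path, fifo_path):
    """Decompress src_path and write the plain data into the named pipe."""
    tool = shutil.which("lbzip2") or shutil.which("bzip2")
    # Opening a FIFO for writing blocks until a reader has opened it as well.
    with open(fifo_path, "wb") as sink:
        if tool:
            # Host tool does the work; the data never passes through this process.
            subprocess.run([tool, "-dc", src_path], stdout=sink, check=True)
        else:
            # Fallback so the sketch runs without the external tools.
            with bz2.open(src_path, "rb") as src:
                shutil.copyfileobj(src, sink)


def read_via_fifo(src_path):
    """Stand-in for the consumer: read the decompressed data from the pipe."""
    workdir = tempfile.mkdtemp()
    fifo_path = os.path.join(workdir, "data.ttl")
    os.mkfifo(fifo_path)
    writer = threading.Thread(target=decompress_to_fifo, args=(src_path, fifo_path))
    writer.start()
    try:
        with open(fifo_path, "rb") as source:
            return source.read()
    finally:
        writer.join()
        os.remove(fifo_path)
        os.rmdir(workdir)
```

In a real setup the reading side would be the index builder process, which sees the pipe as an ordinary file path while the decompressed data is never written to disk.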

Use --loc to specify a folder from where to load/store the database and --db-keep to retain a created database. Without these options, the data will be stored in a temporary directory and deleted when rpt exits - recall, the main use case is scripting and rapid prototyping - for production you'd rather write e.g. a docker compose setup.

There are probably many things that could be further improved, but:
If you are looking for a quick way to try out qlever and/or compare/mix it with other engines, you may want to give RPT a try 😃

Cheers,
Claus

hannahbast · 2025-02-18T15:35:45Z

hannahbast
Feb 18, 2025
Maintainer

@Aklakan Thanks, Claus, that looks very useful and we will look into it, especially since we have been developing a similar tool, see https://github.com/ad-freiburg/qlever-control/pulls. Maybe there are some synergy effects.

One question: Docker can have a significant performance overhead, especially when high IO and multi-threading are involved. In particular, that is the case for index building (data loading). Have you thought about this?


Aklakan · 2025-02-18T16:17:59Z

Aklakan
Feb 18, 2025
Author

> Have you thought about this?

I used the Python qlever tool to load Wikidata truthy with ~8B triples in 4-5 hours on my notebook from 2022 - which is pretty impressive - and if I did not overlook anything, then this tool is also just a wrapper for docker (I extracted the IndexBuilderMain and ServerMain invocations from there). So even if the performance with a native qlever binary on the host were even better, it is still highly usable with the docker-containerized approach.
And certainly, using system tools such as lbzip2 to pipe data to the index-builder-container (or maybe also bind-mount the file to the container and run lbzip2 from within) will be significantly faster than e.g. relying on a Java implementation for that purpose (e.g. Hadoop's splittable Bzip2 Codec).

In short: it's good enough for me :)

> Maybe there are some synergy effects.

If something emerges, I am glad to contribute.
So with my work I am mostly tied to the Java / Apache Jena world, and the current goal of my efforts is to make qlever usable from this side. One direction of future work I am thinking about is to extend the work on RPT into a Fuseki plugin that wraps Qlever.
(Fuseki is the server framework of Apache Jena.)


hannahbast · 2025-02-18T16:40:22Z

hannahbast
Feb 18, 2025
Maintainer

@Aklakan Thanks for the reply + some clarifications and questions:

The qlever script can use Docker, but it does not have to use Docker. This is controlled by the variable SYSTEM or the option --system, which can be set to native, docker, or podman. For QLever's index building, the overhead when using Docker is roughly 25%. That is, for example, with Docker it takes 5 hours when it could take 4 hours. Not a big issue in practice, but it is an issue when comparing the performance of different engines (the slow-down due to Docker depends on the engine).
Regarding decompression, QLever can be fed multiple input streams via arbitrary commands. For example, you can feed a bzip2-compressed file via lbzcat -n 4 <name>.ttl.bz2. That way, you can delegate the decompression to separate tools, in this case lbzcat, which can decompress in parallel. It also allows the processing of arbitrary input formats.
Can you explain why the programming language is important for this kind of work? Isn't this about writing tools that call other programs and then measure their performance (loading time, query times, space consumption, etc.)? The programming language in which these tools are written is secondary (as long as they can be used reasonably easily).
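
The 25% figure in the first point above is the relative slowdown of the Docker run over the native run, (5 - 4) / 4. A small Python sketch of that arithmetic follows; the function names are illustrative and not part of any tool. Since the slowdown depends on the engine, the second function is only meaningful when given an overhead measured for that same engine.

```python
def docker_overhead(native_hours, docker_hours):
    """Relative slowdown of the containerized run compared with the native run."""
    return (docker_hours - native_hours) / native_hours


def estimate_native_hours(docker_hours, overhead):
    """Estimate the native time from a Docker measurement and a known overhead."""
    return docker_hours / (1.0 + overhead)


print(docker_overhead(4, 5))           # 0.25
print(estimate_native_hours(5, 0.25))  # 4.0
```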

