Documentation revamp to stress the new compute engine more
1 parent 256313f, commit 1e64ebd
Showing 4 changed files with 164 additions and 119 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2,8 +2,8 @@ | |
Python-Blosc2 | ||
============= | ||
|
||
A fast & compressed ndarray library with a flexible computational engine | ||
======================================================================== | ||
A fast & compressed ndarray library with a flexible compute engine | ||
================================================================== | ||
|
||
:Author: The Blosc development team | ||
:Contact: [email protected] | ||
|
@@ -26,58 +26,46 @@ A fast & compressed ndarray library with a flexible computational engine | |
What it is | ||
========== | ||
|
||
`C-Blosc2 <https://github.com/Blosc/c-blosc2>`_ is a blocking, shuffling and | ||
lossless compression library meant for numerical data written in C. Blosc2 | ||
is the next generation of Blosc, an | ||
`award-winning <https://www.blosc.org/posts/prize-push-Blosc2/>`_ | ||
Python-Blosc2 is a high-performance compressed ndarray library with a flexible | ||
compute engine. It uses the C-Blosc2 library as the compression backend. | ||
`C-Blosc2 <https://github.com/Blosc/c-blosc2>`_ is the next generation of | ||
Blosc, an `award-winning <https://www.blosc.org/posts/prize-push-Blosc2/>`_ | ||
library that has been around for more than a decade, and that has been used | |
by many projects, including `PyTables <https://www.pytables.org/>`_ or | ||
`Zarr <https://zarr.readthedocs.io/en/stable/>`_. | ||
|
||
On top of C-Blosc2 we built Python-Blosc2, a Python wrapper that exposes the | ||
C-Blosc2 API, plus many extensions that allow it to work transparently with | ||
NumPy arrays, while performing advanced computations on compressed data that | ||
Python-Blosc2 is a Python wrapper that exposes the C-Blosc2 API, *plus* a | |
compute engine that allows it to work transparently with NumPy arrays, | |
while performing advanced computations on compressed data that | ||
can be stored either in-memory, on-disk or on the network (via the | ||
`Caterva2 library <https://github.com/Blosc/Caterva2>`_). | ||
`Caterva2 library <https://github.com/ironArray/Caterva2>`_). | ||
|
||
Python-Blosc2 leverages both NumPy and numexpr for achieving great performance, | ||
but with a twist. Among the main differences between the new computing engine | ||
and NumPy or numexpr, you can find: | ||
Python-Blosc2 places special emphasis on interacting well with existing | |
libraries and tools. In particular, it provides: | ||
|
||
* Support for n-dim arrays that are compressed in-memory, on-disk or on the | ||
network. | ||
* High-performance compression codecs for integer, floating point, complex, | |
boolean, string and structured data. | |
* Support for the NumPy `universal functions mechanism <https://numpy.org/doc/2.1/reference/ufuncs.html>`_, | |
allowing you to mix and match the NumPy and Blosc2 computation engines. | |
* Excellent integration with Numba and Cython via | ||
`User Defined Functions <https://www.blosc.org/python-blosc2/getting_started/tutorials/03.lazyarray-udf.html>`_. | ||
* Lazy expressions that are computed only when needed, and that can be stored | ||
for later use. | ||
|
||
Python-Blosc2 leverages both `NumPy <https://numpy.org>`_ and | ||
`NumExpr <https://numexpr.readthedocs.io/en/latest/>`_ for achieving great | ||
performance, but with a twist. Among the main differences between the new | ||
computing engine and NumPy or numexpr, you can find: | ||
|
||
* Support for ndarrays that can be compressed and stored in-memory, on-disk | ||
or `on the network <https://github.com/ironArray/Caterva2>`_. | ||
* Can perform many kinds of math expressions, including reductions, indexing, | |
filters and more. | ||
* Support for NumPy ufunc mechanism, allowing to mix and match NumPy and | ||
Blosc2 computations. | ||
* Excellent integration with Numba and Cython via User Defined Functions. | ||
* Support for broadcasting operations. This is a powerful feature that | ||
allows to perform operations on arrays of different shapes. | ||
* Support for broadcasting operations. This allows operations on arrays | |
of different shapes. | |
* Much better adherence to the NumPy casting rules than numexpr. | ||
* Lazy expressions that are computed only when needed, and can be stored for | ||
later use. | ||
* Persistent reductions that can be updated incrementally. | ||
* Persistent reductions, where ndarrays can be updated incrementally. | |
* Support for proxies that allow working with compressed data on local or | |
remote machines. | ||
|
||
You can read some of our tutorials on how to perform advanced computations at: | ||
|
||
https://www.blosc.org/python-blosc2/getting_started/tutorials | ||
|
||
As well as the full documentation at: | ||
|
||
https://www.blosc.org/python-blosc2 | ||
|
||
Finally, Python-Blosc2 aims to leverage the full C-Blosc2 functionality to | ||
support a wide range of compression and decompression needs, including | ||
metadata, serialization and other bells and whistles. | ||
|
||
**Note:** Blosc2 is meant to be backward compatible with Blosc(1) data. | ||
That means that it can read data generated with Blosc, but the opposite | ||
is not true (i.e. there is no *forward* compatibility). | ||
|
||
NDArray: an N-Dimensional store | ||
=============================== | ||
|
||
|
@@ -132,21 +120,19 @@ Here it is a simple example: | |
As you can see, the ``NDArray`` instances are very similar to NumPy arrays, | ||
but behind the scenes, they store compressed data that can be processed | ||
efficiently using the new computing engine included in Python-Blosc2. | ||
[Although not exercised above, broadcasting and reductions also work, as well as | ||
filtering, indexing and sorting operations for structured arrays (tables).] | ||
|
||
To pique your interest, here is the performance (measured on a modern desktop machine) | ||
To whet your appetite, here is the performance (measured on a modern desktop machine) | |
that you can achieve when the operands in the expression above fit comfortably in memory | ||
(20_000 x 20_000): | ||
|
||
.. image:: https://github.com/Blosc/python-blosc2/blob/main/images/lazyarray-expr.png?raw=true | ||
:width: 90% | ||
:alt: Performance when operands fit in-memory | ||
|
||
In this case, the performance is somewhat below that of top-tier libraries like Numexpr, | ||
but it is still quite good, specially when compared with plain NumPy. For these short | ||
benchmarks, numba normally loses because its relatively large compiling overhead cannot be | ||
amortized. | ||
In this case, the performance is somewhat below that of top-tier libraries like | ||
Numexpr, but still quite good, especially when compared with plain NumPy. For | |
these short benchmarks, numba normally loses because its relatively large | ||
compiling overhead cannot be amortized. | ||
|
||
One important point is that the memory consumption when using the ``LazyArray.compute()`` | ||
method is pretty low (does not exceed 100 MB) because the output is an ``NDArray`` object, | ||
|
@@ -159,26 +145,29 @@ Another point is that, when using the Blosc2 engine, computation with compressio | |
actually faster than without it (not by a large margin, but still). To understand why, | ||
you may want to read `this paper <https://www.blosc.org/docs/StarvingCPUs-CISE-2010.pdf>`_. | ||
|
||
And here it is the performance when the operands barely fit in memory (50_000 x 50_000): | ||
And here is the performance when the operands and result (50_000 x 50_000) barely fit in memory | |
(a machine with 64 GB of RAM, for a working set of 60 GB): | ||
|
||
.. image:: https://github.com/Blosc/python-blosc2/blob/main/images/lazyarray-expr-large.png?raw=true | ||
:width: 90% | ||
:alt: Performance when operands do not fit well in-memory | ||
|
||
In this latter case, the memory consumption figures does not seem extreme, but this is because | ||
the displayed values represent *actual* memory consumption *during* the computation | ||
(not virtual memory); in addition, the resulting array is boolean, so it does not take too much | ||
space to store (just 2.4 GB uncompressed). In this scenario, the performance compared to top-tier | ||
libraries like Numexpr or Numba is quite competitive. | ||
In this latter case, the memory consumption figures do not seem extreme; this | ||
is because the displayed values represent *actual* memory consumption *during* | ||
the computation, and not virtual memory; in addition, the resulting array is | ||
boolean, so it does not take too much space to store (just 2.4 GB uncompressed). | ||
|
||
You can find the benchmark for the examples above at: | ||
In this latter scenario, the performance compared to Numexpr or Numba is quite | |
competitive, and actually faster than both. This is because the Blosc2 | |
compute engine is able to perform the computation by streaming over the | |
compressed chunks and blocks, for a better use of the memory and CPU caches. | |
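To make the idea of streaming over compressed chunks more concrete, here is a conceptual sketch using only the Python standard library. It is not how Blosc2 is implemented: ``zlib`` stands in for the codec, and the chunk size and helper names are hypothetical. It only shows that operands and result can stay compressed while a single chunk at a time is decompressed, evaluated and recompressed.

```python
import zlib
from array import array

CHUNK_ITEMS = 4096  # hypothetical chunk size, chosen only for illustration


def compress_chunks(values, chunk_items=CHUNK_ITEMS):
    """Split a list of floats into fixed-size chunks and compress each one."""
    chunks = []
    for start in range(0, len(values), chunk_items):
        raw = array("d", values[start:start + chunk_items]).tobytes()
        chunks.append(zlib.compress(raw))
    return chunks


def decompress_chunk(chunk):
    """Decompress one chunk back into an array of doubles."""
    out = array("d")
    out.frombytes(zlib.decompress(chunk))
    return out


def streamed_compare(chunks_a, chunks_b):
    """Evaluate a**2 < b one chunk at a time, storing the boolean result compressed."""
    result = []
    for ca, cb in zip(chunks_a, chunks_b):
        a = decompress_chunk(ca)
        b = decompress_chunk(cb)
        # One byte per element: 1 where the condition holds, 0 otherwise.
        flags = bytes(x * x < y for x, y in zip(a, b))
        result.append(zlib.compress(flags))
    return result


n = 20_000
a_vals = [i / n for i in range(n)]
b_vals = [0.25] * n
out_chunks = streamed_compare(compress_chunks(a_vals), compress_chunks(b_vals))

# Count the true elements, again decompressing only one chunk at a time.
true_count = sum(sum(zlib.decompress(c)) for c in out_chunks)
```

At any moment only one chunk per operand is held uncompressed, which is why the working set can stay small even when the full arrays would not fit comfortably in memory.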
|
||
You can find the notebooks for these benchmarks at: | ||
|
||
https://github.com/Blosc/python-blosc2/blob/main/bench/ndarray/lazyarray-expr.ipynb | ||
|
||
https://github.com/Blosc/python-blosc2/blob/main/bench/ndarray/lazyarray-expr-large.ipynb | ||
|
||
Feel free to run them on your own machine and compare the results. | |
|
||
Installing | ||
========== | ||
|
||
|
@@ -189,12 +178,17 @@ You can install the binary packages from PyPi using ``pip``: | |
pip install blosc2 | ||
We are in the process of releasing 3.0.0, along with wheels for various | ||
versions. For example, to install the first release candidate version, you can use: | ||
If you want to install the latest release, you can do it with pip: | ||
|
||
.. code-block:: console | ||
pip install blosc2==3.0.0rc2 | ||
pip install blosc2 --upgrade | ||
For conda users, you can install the package from the conda-forge channel: | ||
|
||
.. code-block:: console | ||
conda install -c conda-forge blosc2 | ||
Documentation | ||
============= | ||
|
@@ -209,7 +203,7 @@ https://github.com/Blosc/python-blosc2/tree/main/examples | |
|
||
Finally, we taught a tutorial at the `PyData Global 2024 <https://pydata.org/global2024/>`_ | ||
that you can find at: https://github.com/Blosc/Python-Blosc2-3.0-tutorial. There you will | ||
find differents Jupyter notebook that explains the main features of Python-Blosc2. | ||
find different Jupyter notebooks that explain the main features of Python-Blosc2. | |
|
||
Building from sources | ||
===================== | ||
|
@@ -233,18 +227,7 @@ correctly by running the tests: | |
.. code-block:: console | ||
pip install .[test] | ||
pytest (add -v for verbose mode) | ||
Benchmarking | ||
============ | ||
|
||
If you are curious, you may want to run a small benchmark that compares a plain | ||
NumPy array copy against compression using different compressors in your Blosc2 | ||
build: | ||
|
||
.. code-block:: console | ||
python bench/pack_compress.py | ||
pytest # add -v for verbose mode | ||
License | ||
======= | ||
|
@@ -287,11 +270,11 @@ to the core development of the Blosc2 library: | |
- Ivan Vilata i Balaguer | ||
- Oumaima Ech.Chdig | ||
|
||
In addition, other people have contributed to the project in different | ||
In addition, other people have participated in the project in different | |
aspects: | ||
|
||
- Jan Sellner, who contributed the mmap support for NDArray/SChunk objects. | ||
- Dimitri Papadopoulos, who contributed a large bunch of improvements to the | ||
- Jan Sellner contributed the mmap support for NDArray/SChunk objects. | |
- Dimitri Papadopoulos contributed a large number of improvements | |
in many aspects of the project. His attention to detail is remarkable. | |
- And many others that have contributed with bug reports, suggestions and | ||
improvements. | ||
|
@@ -319,4 +302,4 @@ organization, which is a non-profit that supports many open-source projects. | |
Thank you! | ||
|
||
|
||
**Make compression better!** | ||
**Compress Better, Compute Bigger** |