
Commit 1e64ebd

Documentation revamp to stress the new compute engine more
1 parent 256313f commit 1e64ebd

4 files changed

+164
-119
lines changed


ANNOUNCE.rst

Lines changed: 1 addition & 1 deletion
@@ -34,7 +34,7 @@ On top of C-Blosc2 we built Python-Blosc2, a Python wrapper that exposes the
3434
C-Blosc2 API, plus many extensions that allow it to work transparently with
3535
NumPy arrays, while performing advanced computations on compressed data that
3636
can be stored either in-memory, on-disk or on the network (via the
37-
`Caterva2 library <https://github.com/Blosc/Caterva2>`_).
37+
`Caterva2 library <https://github.com/ironArray/Caterva2>`_).
3838

3939
Python-Blosc2 leverages both NumPy and numexpr for achieving great performance,
4040
but with a twist. Among the main differences between the new computing engine

README.rst

Lines changed: 60 additions & 77 deletions
@@ -2,8 +2,8 @@
22
Python-Blosc2
33
=============
44

5-
A fast & compressed ndarray library with a flexible computational engine
6-
========================================================================
5+
A fast & compressed ndarray library with a flexible compute engine
6+
==================================================================
77

88
:Author: The Blosc development team
99
@@ -26,58 +26,46 @@ A fast & compressed ndarray library with a flexible computational engine
2626
What it is
2727
==========
2828

29-
`C-Blosc2 <https://github.com/Blosc/c-blosc2>`_ is a blocking, shuffling and
30-
lossless compression library meant for numerical data written in C. Blosc2
31-
is the next generation of Blosc, an
32-
`award-winning <https://www.blosc.org/posts/prize-push-Blosc2/>`_
29+
Python-Blosc2 is a high-performance compressed ndarray library with a flexible
30+
compute engine. It uses the C-Blosc2 library as the compression backend.
31+
`C-Blosc2 <https://github.com/Blosc/c-blosc2>`_ is the next generation of
32+
Blosc, an `award-winning <https://www.blosc.org/posts/prize-push-Blosc2/>`_
3333
library that has been around for more than a decade, and that has been used
3434
by many projects, including `PyTables <https://www.pytables.org/>`_ and
3535
`Zarr <https://zarr.readthedocs.io/en/stable/>`_.
3636

37-
On top of C-Blosc2 we built Python-Blosc2, a Python wrapper that exposes the
38-
C-Blosc2 API, plus many extensions that allow it to work transparently with
39-
NumPy arrays, while performing advanced computations on compressed data that
37+
Python-Blosc2 is a Python wrapper that exposes the C-Blosc2 API, *plus* a
38+
compute engine that allows it to work transparently with NumPy arrays,
39+
while performing advanced computations on compressed data that
4040
can be stored either in-memory, on-disk or on the network (via the
41-
`Caterva2 library <https://github.com/Blosc/Caterva2>`_).
41+
`Caterva2 library <https://github.com/ironArray/Caterva2>`_).
4242

43-
Python-Blosc2 leverages both NumPy and numexpr for achieving great performance,
44-
but with a twist. Among the main differences between the new computing engine
45-
and NumPy or numexpr, you can find:
43+
Python-Blosc2 places special emphasis on interacting well with existing
44+
libraries and tools. In particular, it provides:
4645

47-
* Support for n-dim arrays that are compressed in-memory, on-disk or on the
48-
network.
49-
* High performance compression codecs, for integer, floating point, complex
50-
booleans, string and structured data.
46+
* Support for the NumPy `universal functions mechanism <https://numpy.org/doc/2.1/reference/ufuncs.html>`_,
47+
allowing the NumPy and Blosc2 computation engines to be mixed and matched.
48+
* Excellent integration with Numba and Cython via
49+
`User Defined Functions <https://www.blosc.org/python-blosc2/getting_started/tutorials/03.lazyarray-udf.html>`_.
50+
* Lazy expressions that are computed only when needed, and that can be stored
51+
for later use.
52+
53+
Python-Blosc2 leverages both `NumPy <https://numpy.org>`_ and
54+
`NumExpr <https://numexpr.readthedocs.io/en/latest/>`_ to achieve great
55+
performance, but with a twist. The main differences between the new
56+
compute engine and NumPy or NumExpr include:
57+
58+
* Support for ndarrays that can be compressed and stored in-memory, on-disk
59+
or `on the network <https://github.com/ironArray/Caterva2>`_.
5160
* Can perform many kinds of math expressions, including reductions, indexing,
5261
filters and more.
53-
* Support for NumPy ufunc mechanism, allowing to mix and match NumPy and
54-
Blosc2 computations.
55-
* Excellent integration with Numba and Cython via User Defined Functions.
56-
* Support for broadcasting operations. This is a powerful feature that
57-
allows to perform operations on arrays of different shapes.
62+
* Support for broadcasting, which allows operations on arrays
63+
of different shapes.
5864
* Much better adherence to the NumPy casting rules than numexpr.
59-
* Lazy expressions that are computed only when needed, and can be stored for
60-
later use.
61-
* Persistent reductions that can be updated incrementally.
65+
* Persistent reductions: ndarrays that can be updated incrementally.
6266
* Support for proxies that allow working with compressed data on local or
6367
remote machines.
6468

65-
You can read some of our tutorials on how to perform advanced computations at:
66-
67-
https://www.blosc.org/python-blosc2/getting_started/tutorials
68-
69-
As well as the full documentation at:
70-
71-
https://www.blosc.org/python-blosc2
72-
73-
Finally, Python-Blosc2 aims to leverage the full C-Blosc2 functionality to
74-
support a wide range of compression and decompression needs, including
75-
metadata, serialization and other bells and whistles.
76-
77-
**Note:** Blosc2 is meant to be backward compatible with Blosc(1) data.
78-
That means that it can read data generated with Blosc, but the opposite
79-
is not true (i.e. there is no *forward* compatibility).
80-
8169
NDArray: an N-Dimensional store
8270
===============================
8371

@@ -132,21 +120,19 @@ Here it is a simple example:
132120
As you can see, the ``NDArray`` instances are very similar to NumPy arrays,
133121
but behind the scenes, they store compressed data that can be processed
134122
efficiently using the new computing engine included in Python-Blosc2.
135-
[Although not exercised above, broadcasting and reductions also work, as well as
136-
filtering, indexing and sorting operations for structured arrays (tables).]
137123

138-
To pique your interest, here is the performance (measured on a modern desktop machine)
124+
To whet your appetite, here is the performance (measured on a modern desktop machine)
139125
that you can achieve when the operands in the expression above fit comfortably in memory
140126
(20_000 x 20_000):
141127

142128
.. image:: https://github.com/Blosc/python-blosc2/blob/main/images/lazyarray-expr.png?raw=true
143129
:width: 90%
144130
:alt: Performance when operands fit in-memory
145131

146-
In this case, the performance is somewhat below that of top-tier libraries like Numexpr,
147-
but it is still quite good, specially when compared with plain NumPy. For these short
148-
benchmarks, numba normally loses because its relatively large compiling overhead cannot be
149-
amortized.
132+
In this case, the performance is somewhat below that of top-tier libraries like
133+
Numexpr, but still quite good, especially when compared with plain NumPy. For
134+
these short benchmarks, Numba normally loses because its relatively large
135+
compilation overhead cannot be amortized.
150136

151137
One important point is that the memory consumption when using the ``LazyArray.compute()``
152138
method is pretty low (does not exceed 100 MB) because the output is an ``NDArray`` object,
@@ -159,26 +145,29 @@ Another point is that, when using the Blosc2 engine, computation with compressio
159145
actually faster than without it (not by a large margin, but still). To understand why,
160146
you may want to read `this paper <https://www.blosc.org/docs/StarvingCPUs-CISE-2010.pdf>`_.
161147

162-
And here it is the performance when the operands barely fit in memory (50_000 x 50_000):
148+
And here is the performance when the operands and result (50_000 x 50_000) barely fit in memory
149+
(a machine with 64 GB of RAM, for a working set of 60 GB):
163150

164151
.. image:: https://github.com/Blosc/python-blosc2/blob/main/images/lazyarray-expr-large.png?raw=true
165152
:width: 90%
166153
:alt: Performance when operands do not fit well in-memory
167154

168-
In this latter case, the memory consumption figures does not seem extreme, but this is because
169-
the displayed values represent *actual* memory consumption *during* the computation
170-
(not virtual memory); in addition, the resulting array is boolean, so it does not take too much
171-
space to store (just 2.4 GB uncompressed). In this scenario, the performance compared to top-tier
172-
libraries like Numexpr or Numba is quite competitive.
155+
In this latter case, the memory consumption figures do not seem extreme; this
156+
is because the displayed values represent *actual* memory consumption *during*
157+
the computation, and not virtual memory; in addition, the resulting array is
158+
boolean, so it does not take too much space to store (just 2.4 GB uncompressed).
173159

174-
You can find the benchmark for the examples above at:
160+
In this latter scenario, the performance compared to Numexpr or Numba is quite
161+
competitive, and actually faster than either. This is because the Blosc2
162+
compute engine is able to stream the computation over the
163+
compressed chunks and blocks, making better use of memory and the CPU caches.
164+
165+
You can find the notebooks for these benchmarks at:
175166

176167
https://github.com/Blosc/python-blosc2/blob/main/bench/ndarray/lazyarray-expr.ipynb
177168

178169
https://github.com/Blosc/python-blosc2/blob/main/bench/ndarray/lazyarray-expr-large.ipynb
179170

180-
Feel free to run them in your own machine and compare the results.
181-
182171
Installing
183172
==========
184173

@@ -189,12 +178,17 @@ You can install the binary packages from PyPi using ``pip``:
189178
190179
pip install blosc2
191180
192-
We are in the process of releasing 3.0.0, along with wheels for various
193-
versions. For example, to install the first release candidate version, you can use:
181+
If you want to install the latest release, you can do it with pip:
194182

195183
.. code-block:: console
196184
197-
pip install blosc2==3.0.0rc2
185+
pip install blosc2 --upgrade
186+
187+
For conda users, you can install the package from the conda-forge channel:
188+
189+
.. code-block:: console
190+
191+
conda install -c conda-forge blosc2
198192
199193
Documentation
200194
=============
@@ -209,7 +203,7 @@ https://github.com/Blosc/python-blosc2/tree/main/examples
209203

210204
Finally, we taught a tutorial at the `PyData Global 2024 <https://pydata.org/global2024/>`_
211205
that you can find at: https://github.com/Blosc/Python-Blosc2-3.0-tutorial. There you will
212-
find differents Jupyter notebook that explains the main features of Python-Blosc2.
206+
find different Jupyter notebooks that explain the main features of Python-Blosc2.
213207

214208
Building from sources
215209
=====================
@@ -233,18 +227,7 @@ correctly by running the tests:
233227
.. code-block:: console
234228
235229
pip install .[test]
236-
pytest (add -v for verbose mode)
237-
238-
Benchmarking
239-
============
240-
241-
If you are curious, you may want to run a small benchmark that compares a plain
242-
NumPy array copy against compression using different compressors in your Blosc2
243-
build:
244-
245-
.. code-block:: console
246-
247-
python bench/pack_compress.py
230+
pytest # add -v for verbose mode
248231
249232
License
250233
=======
@@ -287,11 +270,11 @@ to the core development of the Blosc2 library:
287270
- Ivan Vilata i Balaguer
288271
- Oumaima Ech.Chdig
289272

290-
In addition, other people have contributed to the project in different
273+
In addition, other people have participated in the project in different
291274
aspects:
292275

293-
- Jan Sellner, who contributed the mmap support for NDArray/SChunk objects.
294-
- Dimitri Papadopoulos, who contributed a large bunch of improvements to the
276+
- Jan Sellner, who contributed the mmap support for NDArray/SChunk objects.
277+
- Dimitri Papadopoulos, who contributed a large number of improvements
295278
in many aspects of the project. His attention to detail is remarkable.
296279
- And many others that have contributed with bug reports, suggestions and
297280
improvements.
@@ -319,4 +302,4 @@ organization, which is a non-profit that supports many open-source projects.
319302
Thank you!
320303

321304

322-
**Make compression better!**
305+
**Compress Better, Compute Bigger**
