Python-Blosc2
=============

A fast & compressed ndarray library with a flexible compute engine
==================================================================
:Author: The Blosc development team
What it is
==========

Python-Blosc2 is a high-performance compressed ndarray library with a flexible
compute engine. It uses the C-Blosc2 library as the compression backend.
`C-Blosc2 <https://github.com/Blosc/c-blosc2>`_ is the next generation of
Blosc, an `award-winning <https://www.blosc.org/posts/prize-push-Blosc2/>`_
library that has been around for more than a decade, and that has been used
by many projects, including `PyTables <https://www.pytables.org/>`_ or
`Zarr <https://zarr.readthedocs.io/en/stable/>`_.

Python-Blosc2 is a Python wrapper that exposes the C-Blosc2 API, *plus* a
compute engine that allows it to work transparently with NumPy arrays,
while performing advanced computations on compressed data that
can be stored either in-memory, on-disk or on the network (via the
`Caterva2 library <https://github.com/ironArray/Caterva2>`_).

Python-Blosc2 places special emphasis on interacting well with existing
libraries and tools. In particular, it provides:

* Support for the NumPy `universal functions mechanism <https://numpy.org/doc/2.1/reference/ufuncs.html>`_,
  allowing you to mix and match NumPy and Blosc2 computation engines.
* Excellent integration with Numba and Cython via
  `User Defined Functions <https://www.blosc.org/python-blosc2/getting_started/tutorials/03.lazyarray-udf.html>`_.
* Lazy expressions that are computed only when needed, and that can be stored
  for later use.

Python-Blosc2 leverages both `NumPy <https://numpy.org>`_ and
`NumExpr <https://numexpr.readthedocs.io/en/latest/>`_ for achieving great
performance, but with a twist. Among the main differences between the new
computing engine and NumPy or numexpr, you can find:

* Support for ndarrays that can be compressed and stored in-memory, on-disk
  or `on the network <https://github.com/ironArray/Caterva2>`_.
* Can perform many kinds of math expressions, including reductions, indexing,
  filters and more.
* Support for broadcasting operations, which allows performing operations on
  arrays of different shapes.
* Much better adherence to the NumPy casting rules than numexpr.
* Persistent reductions, stored as ndarrays that can be updated incrementally.
* Support for proxies that allow working with compressed data on local or
  remote machines.
NDArray: an N-Dimensional store
===============================
As you can see, the ``NDArray`` instances are very similar to NumPy arrays,
but behind the scenes, they store compressed data that can be processed
efficiently using the new computing engine included in Python-Blosc2.

To whet your appetite, here is the performance (measured on a modern desktop
machine) that you can achieve when the operands in the expression above fit
comfortably in memory (20_000 x 20_000):

.. image:: https://github.com/Blosc/python-blosc2/blob/main/images/lazyarray-expr.png?raw=true
   :width: 90%
   :alt: Performance when operands fit in-memory

In this case, the performance is somewhat below that of top-tier libraries like
Numexpr, but still quite good, especially when compared with plain NumPy. For
these short benchmarks, numba normally loses because its relatively large
compiling overhead cannot be amortized.

One important point is that the memory consumption when using the
``LazyArray.compute()`` method is pretty low (does not exceed 100 MB) because
the output is an ``NDArray`` object.

Another point is that, when using the Blosc2 engine, computation with compression is
actually faster than without it (not by a large margin, but still). To understand why,
you may want to read `this paper <https://www.blosc.org/docs/StarvingCPUs-CISE-2010.pdf>`_.

And here is the performance when the operands and result (50_000 x 50_000)
barely fit in memory (a machine with 64 GB of RAM, for a working set of 60 GB):

.. image:: https://github.com/Blosc/python-blosc2/blob/main/images/lazyarray-expr-large.png?raw=true
   :width: 90%
   :alt: Performance when operands do not fit well in-memory

In this latter case, the memory consumption figures do not seem extreme; this
is because the displayed values represent *actual* memory consumption *during*
the computation, not virtual memory; in addition, the resulting array is
boolean, so it does not take much space to store (just 2.4 GB uncompressed).

In this scenario, the performance compared to Numexpr or Numba is quite
competitive, and actually faster than those. This is because the Blosc2
compute engine is able to perform the computation streaming over the
compressed chunks and blocks, making better use of the memory and CPU caches.

You can find the notebooks for these benchmarks at:

https://github.com/Blosc/python-blosc2/blob/main/bench/ndarray/lazyarray-expr.ipynb
https://github.com/Blosc/python-blosc2/blob/main/bench/ndarray/lazyarray-expr-large.ipynb
Installing
==========

You can install the binary packages from PyPI using ``pip``:

.. code-block:: console

    pip install blosc2

If you want to install the latest release, you can upgrade with pip:

.. code-block:: console

    pip install blosc2 --upgrade

For conda users, you can install the package from the conda-forge channel:

.. code-block:: console

    conda install -c conda-forge blosc2

Documentation
=============

https://github.com/Blosc/python-blosc2/tree/main/examples

Finally, we taught a tutorial at the `PyData Global 2024 <https://pydata.org/global2024/>`_
that you can find at: https://github.com/Blosc/Python-Blosc2-3.0-tutorial. There you will
find different Jupyter notebooks that explain the main features of Python-Blosc2.

Building from sources
=====================

You can check that the package works correctly by running the tests:

.. code-block:: console

    pip install .[test]
    pytest  # add -v for verbose mode
License
=======
- Ivan Vilata i Balaguer
- Oumaima Ech.Chdig

In addition, other people have participated in the project in different
aspects:

- Jan Sellner, who contributed the mmap support for NDArray/SChunk objects.
- Dimitri Papadopoulos, who contributed a large number of improvements
  in many aspects of the project. His attention to detail is remarkable.
- And many others that have contributed with bug reports, suggestions and
  improvements.
Thank you!

**Compress Better, Compute Bigger**