22Python-Blosc2
33=============
44
5- A fast & compressed ndarray library with a flexible computational engine
6- ========================================================================
5+ A fast & compressed ndarray library with a flexible compute engine
6+ ==================================================================
77
88:Author: The Blosc development team
99:Contact: blosc@blosc.org
@@ -26,58 +26,46 @@ A fast & compressed ndarray library with a flexible computational engine
2626What it is
2727==========
2828
29- `C-Blosc2 <https://github.com/Blosc/c-blosc2>`_ is a blocking, shuffling and
30- lossless compression library meant for numerical data written in C. Blosc2
31- is the next generation of Blosc, an
32- `award-winning <https://www.blosc.org/posts/prize-push-Blosc2/>`_
29+ Python-Blosc2 is a high-performance compressed ndarray library with a flexible
30+ compute engine. It uses the C-Blosc2 library as the compression backend.
31+ `C-Blosc2 <https://github.com/Blosc/c-blosc2>`_ is the next generation of
32+ Blosc, an `award-winning <https://www.blosc.org/posts/prize-push-Blosc2/>`_
3333library that has been around for more than a decade, and that has been used
3434by many projects, including `PyTables <https://www.pytables.org/>`_ and
3535`Zarr <https://zarr.readthedocs.io/en/stable/>`_.
3636
37- On top of C-Blosc2 we built Python-Blosc2, a Python wrapper that exposes the
38- C-Blosc2 API, plus many extensions that allow it to work transparently with
39- NumPy arrays, while performing advanced computations on compressed data that
37+ Python-Blosc2 is a Python wrapper that exposes the C-Blosc2 API, *plus* a
38+ compute engine that allows it to work transparently with NumPy arrays,
39+ while performing advanced computations on compressed data that
4040can be stored either in-memory, on-disk or on the network (via the
41- `Caterva2 library <https://github.com/Blosc/Caterva2>`_).
41+ `Caterva2 library <https://github.com/ironArray/Caterva2>`_).
4242
43- Python-Blosc2 leverages both NumPy and numexpr for achieving great performance,
44- but with a twist. Among the main differences between the new computing engine
45- and NumPy or numexpr, you can find:
43+ Python-Blosc2 places special emphasis on interacting well with existing
44+ libraries and tools. In particular, it provides:
4645
47- * Support for n-dim arrays that are compressed in-memory, on-disk or on the
48- network.
49- * High performance compression codecs, for integer, floating point, complex
50- booleans, string and structured data.
46+ * Support for the NumPy `universal functions mechanism <https://numpy.org/doc/2.1/reference/ufuncs.html>`_,
47+ which lets you mix and match the NumPy and Blosc2 computation engines.
48+ * Excellent integration with Numba and Cython via
49+ `User Defined Functions <https://www.blosc.org/python-blosc2/getting_started/tutorials/03.lazyarray-udf.html>`_.
50+ * Lazy expressions that are computed only when needed, and that can be stored
51+ for later use.
52+
53+ Python-Blosc2 leverages both `NumPy <https://numpy.org>`_ and
54+ `NumExpr <https://numexpr.readthedocs.io/en/latest/>`_ for achieving great
55+ performance, but with a twist. Among the main differences between the new
56+ computing engine and NumPy or NumExpr, you can find:
57+
58+ * Support for ndarrays that can be compressed and stored in-memory, on-disk
59+ or `on the network <https://github.com/ironArray/Caterva2>`_.
5160* Can evaluate many kinds of math expressions, including reductions, indexing,
5261 filters and more.
53- * Support for NumPy ufunc mechanism, allowing to mix and match NumPy and
54- Blosc2 computations.
55- * Excellent integration with Numba and Cython via User Defined Functions.
56- * Support for broadcasting operations. This is a powerful feature that
57- allows to perform operations on arrays of different shapes.
62+ * Support for broadcasting operations, which allow operating on arrays
63+ of different shapes.
5864* Much better adherence to the NumPy casting rules than NumExpr.
59- * Lazy expressions that are computed only when needed, and can be stored for
60- later use.
61- * Persistent reductions that can be updated incrementally.
65+ * Persistent reductions where ndarrays can be updated incrementally.
6266* Support for proxies that allow working with compressed data on local or
6367 remote machines.
6468
65- You can read some of our tutorials on how to perform advanced computations at:
66-
67- https://www.blosc.org/python-blosc2/getting_started/tutorials
68-
69- As well as the full documentation at:
70-
71- https://www.blosc.org/python-blosc2
72-
73- Finally, Python-Blosc2 aims to leverage the full C-Blosc2 functionality to
74- support a wide range of compression and decompression needs, including
75- metadata, serialization and other bells and whistles.
76-
77- **Note:** Blosc2 is meant to be backward compatible with Blosc(1) data.
78- That means that it can read data generated with Blosc, but the opposite
79- is not true (i.e. there is no *forward* compatibility).
80-
8169NDArray: an N-Dimensional store
8270===============================
8371
@@ -132,21 +120,19 @@ Here it is a simple example:
132120 As you can see, the ``NDArray`` instances are very similar to NumPy arrays,
133121but behind the scenes, they store compressed data that can be processed
134122efficiently using the new computing engine included in Python-Blosc2.
135- [Although not exercised above, broadcasting and reductions also work, as well as
136- filtering, indexing and sorting operations for structured arrays (tables).]
137123
138- To pique your interest, here is the performance (measured on a modern desktop machine)
124+ To whet your appetite, here is the performance (measured on a modern desktop machine)
139125that you can achieve when the operands in the expression above fit comfortably in memory
140126(20_000 x 20_000):
141127
142128.. image:: https://github.com/Blosc/python-blosc2/blob/main/images/lazyarray-expr.png?raw=true
143129 :width: 90%
144130 :alt: Performance when operands fit in-memory
145131
146- In this case, the performance is somewhat below that of top-tier libraries like Numexpr,
147- but it is still quite good, specially when compared with plain NumPy. For these short
148- benchmarks, numba normally loses because its relatively large compiling overhead cannot be
149- amortized.
132+ In this case, the performance is somewhat below that of top-tier libraries like
133+ NumExpr, but still quite good, especially when compared with plain NumPy. For
134+ these short benchmarks, Numba normally loses because its relatively large
135+ compiling overhead cannot be amortized.
150136
151137One important point is that the memory consumption when using the ``LazyArray.compute()``
152138method is pretty low (does not exceed 100 MB) because the output is an ``NDArray`` object,
@@ -159,26 +145,29 @@ Another point is that, when using the Blosc2 engine, computation with compressio
159145actually faster than without it (not by a large margin, but still). To understand why,
160146you may want to read `this paper <https://www.blosc.org/docs/StarvingCPUs-CISE-2010.pdf>`_.
161147
162- And here it is the performance when the operands barely fit in memory (50_000 x 50_000):
148+ And here is the performance when the operands and result (50_000 x 50_000) barely fit in memory
149+ (on a machine with 64 GB of RAM, for a working set of 60 GB):
163150
164151.. image:: https://github.com/Blosc/python-blosc2/blob/main/images/lazyarray-expr-large.png?raw=true
165152 :width: 90%
166153 :alt: Performance when operands do not fit well in-memory
167154
168- In this latter case, the memory consumption figures does not seem extreme, but this is because
169- the displayed values represent *actual* memory consumption *during* the computation
170- (not virtual memory); in addition, the resulting array is boolean, so it does not take too much
171- space to store (just 2.4 GB uncompressed). In this scenario, the performance compared to top-tier
172- libraries like Numexpr or Numba is quite competitive.
155+ In this latter case, the memory consumption figures do not seem extreme; this
156+ is because the displayed values represent *actual* memory consumption *during*
157+ the computation, and not virtual memory; in addition, the resulting array is
158+ boolean, so it does not take too much space to store (just 2.4 GB uncompressed).
173159
174- You can find the benchmark for the examples above at:
160+ In this latter scenario, the performance compared to NumExpr or Numba is quite
161+ competitive, and actually faster than those. This is because the Blosc2
162+ compute engine is able to stream the computation over the
163+ compressed chunks and blocks, for a better use of the memory and CPU caches.
164+
165+ You can find the notebooks for these benchmarks at:
175166
176167https://github.com/Blosc/python-blosc2/blob/main/bench/ndarray/lazyarray-expr.ipynb
177168
178169https://github.com/Blosc/python-blosc2/blob/main/bench/ndarray/lazyarray-expr-large.ipynb
179170
180- Feel free to run them in your own machine and compare the results.
181-
182171Installing
183172==========
184173
@@ -189,12 +178,17 @@ You can install the binary packages from PyPi using ``pip``:
189178
190179 pip install blosc2
191180
192- We are in the process of releasing 3.0.0, along with wheels for various
193- versions. For example, to install the first release candidate version, you can use:
181+ If you want to install the latest release, you can do it with pip:
194182
195183.. code-block:: console
196184
197- pip install blosc2==3.0.0rc2
185+ pip install blosc2 --upgrade
186+
187+ For conda users, you can install the package from the conda-forge channel:
188+
189+ .. code-block:: console
190+
191+ conda install -c conda-forge blosc2
198192
199193 Documentation
200194=============
@@ -209,7 +203,7 @@ https://github.com/Blosc/python-blosc2/tree/main/examples
209203
210204Finally, we taught a tutorial at the `PyData Global 2024 <https://pydata.org/global2024/>`_
211205that you can find at: https://github.com/Blosc/Python-Blosc2-3.0-tutorial. There you will
212- find differents Jupyter notebook that explains the main features of Python-Blosc2.
206+ find different Jupyter notebooks that explain the main features of Python-Blosc2.
213207
214208Building from sources
215209=====================
@@ -233,18 +227,7 @@ correctly by running the tests:
233227.. code-block:: console
234228
235229 pip install .[test]
236- pytest (add -v for verbose mode)
237-
238- Benchmarking
239- ============
240-
241- If you are curious, you may want to run a small benchmark that compares a plain
242- NumPy array copy against compression using different compressors in your Blosc2
243- build:
244-
245- .. code-block:: console
246-
247- python bench/pack_compress.py
230+ pytest # add -v for verbose mode
248231
249232 License
250233=======
@@ -287,11 +270,11 @@ to the core development of the Blosc2 library:
287270- Ivan Vilata i Balaguer
288271- Oumaima Ech.Chdig
289272
290- In addition, other people have contributed to the project in different
273+ In addition, other people have participated in the project in different
291274aspects:
292275
293- - Jan Sellner, who contributed the mmap support for NDArray/SChunk objects.
294- - Dimitri Papadopoulos, who contributed a large bunch of improvements to the
276+ - Jan Sellner contributed the mmap support for NDArray/SChunk objects.
277+ - Dimitri Papadopoulos contributed a large number of improvements
295278 in many aspects of the project. His attention to detail is remarkable.
296279- And many others that have contributed with bug reports, suggestions and
297280 improvements.
@@ -319,4 +302,4 @@ organization, which is a non-profit that supports many open-source projects.
319302Thank you!
320303
321304
322- **Make compression better!**
305+ **Compress Better, Compute Bigger**