Commit 81a9c09

Getting ready for release 4.2.0
1 parent c4d64b0 commit 81a9c09

5 files changed

Lines changed: 107 additions & 50 deletions

ANNOUNCE.rst

Lines changed: 37 additions & 44 deletions
@@ -1,7 +1,36 @@
-Announcing Python-Blosc2 4.0.0
+Announcing Python-Blosc2 4.2.0
 ===============================
 
-This is patch release which updates the ``miniexpr`` version to fix a bug for ubuntu ARM64 failure.
+We are happy to announce Python-Blosc2 4.2.0. This is a large feature
+release, with the new compressed columnar table container, ``CTable``, as the
+main development.
+
+``CTable`` brings typed, compressed, column-oriented tables to Python-Blosc2.
+It supports persistent ``.b2d`` and ``.b2z`` storage, schema-driven columns,
+nullable data, variable-length strings/bytes and object columns, computed
+columns, table views, mutations, sorting, filtering and persistent indexes. It
+also includes Arrow, Parquet and CSV interoperability, plus a new
+``parquet-to-blosc2`` command-line utility.
+
+For a deeper introduction to CTable and its motivation, see our recent blog
+post:
+
+https://blosc.org/posts/ctable-blosc2-columnar-table/
+
+Other highlights in 4.2.0 include:
+
+- A new indexing subsystem for NDArrays and CTables, including persistent
+sidecar indexes, expression indexes, sorted iteration and query caching.
+- New structured serialization facilities for persisted ``C2Array``,
+``LazyExpr`` and DSL ``LazyUDF`` objects, plus ``blosc2.Ref`` and
+``blosc2.load()``.
+- New schema helpers such as ``blosc2.struct()`` and ``blosc2.object()``.
+- Object/ListArray improvements for variable-length and general object data.
+- Faster and lower-memory ``fromiter()`` construction, improved ``BatchArray``
+defaults and continued linalg/matmul optimizations.
+- Many documentation, tutorial, example and benchmark additions.
+- Numerous fixes for Windows mmap/file-locking behavior, Python 3.14 GC/thread
+interactions, ``.b2z`` persistence, indexed queries and NumPy compatibility.
 
 You can think of Python-Blosc2 4.x as an extension of NumPy/numexpr that:
 
@@ -13,6 +42,7 @@ You can think of Python-Blosc2 4.x as an extension of NumPy/numexpr that:
 - Integrates with Numba and Cython via UDFs (User Defined Functions).
 - Adheres to modern array API standard conventions (https://data-apis.org/array-api/).
 - Can perform linear algebra operations (like ``blosc2.tensordot()``).
+- Can store and query compressed columnar tables via ``blosc2.CTable``.
 
 Install it with::
 
@@ -23,52 +53,15 @@ For more info, you can have a look at the release notes in:
 
 https://github.com/Blosc/python-blosc2/releases
 
-Code example::
+Small CTable example::
 
-from time import time
 import blosc2
-import numpy as np
 
-# Create some data operands
-N = 20_000
-a = blosc2.linspace(0, 1, N * N, dtype="float32", shape=(N, N))
-b = blosc2.linspace(1, 2, N * N, shape=(N, N))
-c = blosc2.linspace(-10, 10, N) # broadcasting is supported
+table = blosc2.CTable.from_parquet("measurements.parquet", urlpath="measurements.b2z")
+table.create_index("station_id")
 
-# Expression
-t0 = time()
-expr = ((a**3 + blosc2.sin(c * 2)) < b) & (c > 0)
-print(f"Time to create expression: {time()-t0:.5f}")
-
-# Evaluate while reducing (yep, reductions are in) along axis 1
-t0 = time()
-out = blosc2.sum(expr, axis=1)
-t1 = time() - t0
-print(f"Time to compute with Blosc2: {t1:.5f}")
-
-# Evaluate using NumPy
-na, nb, nc = a[:], b[:], c[:]
-t0 = time()
-nout = np.sum(((na**3 + np.sin(nc * 2)) < nb) & (nc > 0), axis=1)
-t2 = time() - t0
-print(f"Time to compute with NumPy: {t2:.5f}")
-print(f"Speedup: {t2/t1:.2f}x")
-
-assert np.all(out == nout)
-print("All results are equal!")
-
-
-This will output something like (using an Intel i9-13900K CPU here)::
-
-Time to create expression: 0.00033
-Time to compute with Blosc2: 0.46387
-Time to compute with NumPy: 2.57469
-Speedup: 5.55x
-All results are equal!
-
-See a more in-depth example, explaining why Python-Blosc2 is so fast, at:
-
-https://www.blosc.org/python-blosc2/getting_started/overview.html#operating-with-ndarrays
+hot = table.where("temperature > 30")
+print(hot.head())
 
 Sources repository
 ------------------

RELEASE_NOTES.md

Lines changed: 66 additions & 0 deletions
@@ -1,5 +1,71 @@
 # Release notes
 
+## Changes from 4.1.2 to 4.2.0
+
+### CTable: columnar compressed tables
+
+- Introduced `blosc2.CTable`, a new columnar table container for compressed, typed columns. CTables support dataclass- and schema-based construction, row iteration, column access, table views, `head()` / `tail()` / `sample()`, sorting, selection and compact `where` expressions.
+- Added persistent CTables backed by `TreeStore`, with support for `blosc2.open()`, `CTable.open()`, `CTable.load()`, `CTable.save()`, `CTable.to_b2d()` and `CTable.to_b2z()`. CTable views can be saved too, and `.b2z`/`.b2d` path handling has been tightened.
+- Added mutation operations for CTables, including `append()`, `extend()`, `delete()`, `compact()`, `add_column()`, `drop_column()`, `rename_column()` and related schema validation.
+- Added computed columns, including virtual computed columns backed by lazy expressions, materialized computed columns and automatic filling of materialized computed columns during inserts.
+- Added CTable indexing support, including persistent indexes, direct expression indexes, ordered index reuse, boolean `LazyExpr`/`NDArray` masks in `CTable.__getitem__`, `iter_sorted()` and indexing support for `.b2z` tables.
+- Added nullable schema support and null policies for CTable scalar columns, preserving nullable scalar Parquet round-trips.
+- Added variable-length CTable column support via `ListArray` / `ObjectArray`, including `vlstring` and `vlbytes` schema specs, fixed-length string/bytes import support and list/struct Arrow/Parquet round-trips.
+- Added Arrow, Parquet and CSV interoperability for CTables, including batch-wise Arrow/Parquet import/export, Arrow schema metadata preservation, `CTable.from_arrow_batches()` improvements and a new `parquet-to-blosc2` CLI utility.
+- Added CTable documentation, tutorials, examples and benchmarks covering schema definition, persistence, querying, indexing, mutations, nullable columns, computed columns and variable-length columns.
+
+### Indexing and ordering
+
+- Added a new indexing subsystem for NDArrays and CTables, including full, partial/bucket, light/medium and OPSI-style index kinds, out-of-core index builders and sidecar storage.
+- Added `blosc2.Index` as the unified public index handle, plus APIs such as `create_index()`, `compact_index()`, `iter_sorted()`, `will_use_index()` and related query explanation support.
+- Added materialized expression indexes for NDArrays and direct expression indexes for CTables.
+- Added persistent query-result caching for indexed lookups, with FIFO pruning and cache accounting.
+- Added `blosc2.argsort()` and refactored indexing APIs around explicit index enums and sorting helpers.
+- Improved indexed query performance with Cython accelerators, threaded chunk batching, zero-copy/cached mmap reads, chunk-aware and reduced-order layouts and faster scattered row gathering.
+- Reduced memory usage during index creation and lookup by avoiding full sidecar materialization, replacing memmap staging with Blosc2 scratch arrays and adding `tmpdir` support for full out-of-core indexes.
+
+### Persistence, stores and serialization
+
+- Added structured Blosc2 serialization based on b2object carriers, including persisted `C2Array`, `LazyExpr` and DSL `LazyUDF` objects.
+- Added `blosc2.Ref` for serializing external references, plus examples for b2object bundles and persisted expressions/UDFs.
+- Added `blosc2.load()` as a convenience loader.
+- Added `vlmeta` support to `LazyArray` objects.
+- Improved store handling by preserving lazy b2object carriers in `DictStore`, allowing reopened proxies to refill caches after read-only opens, relaxing `DictStore`/`TreeStore` suffix requirements and adding `DictStore.to_b2d()`.
+- Accelerated `blosc2.open()` by trying standard opens first and warning on implicit append mode.
+
+### Arrays, computation and containers
+
+- Added `ObjectArray` for fully general object data and renamed the earlier `VLArray` work accordingly; added `ListArray` docstrings and Arrow integration improvements.
+- Added schema helpers including numeric specs, `blosc2.struct()` and `blosc2.object()` for nested/fully general column declarations.
+- Improved `fromiter()` with direct chunked construction and substantially lower peak memory use.
+- Improved `asarray()` behavior for NDArray inputs when copy-inducing keyword arguments are supplied.
+- Added `SChunk.reorder_offsets()`.
+- Improved `BatchArray` defaults and documentation; the default compression level is now tuned for faster lookup/scan behavior.
+- Continued matmul/linalg optimization work and shared-thread-pool integration.
+
+### CLI, docs and examples
+
+- Added the `parquet-to-blosc2` command with options such as `--max-rows`, `--parquet-batch-size`, `--blosc2-items-per-block` and `--use-dict`.
+- Added new CTable, ObjectArray, BatchArray, containers, indexing and serialization tutorials and examples.
+- Reorganized and expanded the API reference for CTable, Column, schema specs, Index, save/load helpers and miscellaneous APIs.
+- Updated benchmark suites for CTables, indexing, Parquet import/export, BatchArray and NDArray construction/indexing.
+
+### Fixes and compatibility
+
+- Updated bundled C-Blosc2 to v3.0.2 and require C-Blosc2 >= 3.0.0 when building against a system library.
+- Updated bundled C-Blosc2 and miniexpr sources multiple times.
+- Restored compatibility with NumPy < 2.
+- Fixed Windows and mmap/file-locking issues in index creation, rebuilds and temporary file cleanup.
+- Fixed full-index query failures for large CTable columns and full out-of-core merge failures on systems with small `/tmp`.
+- Fixed stale sidecar/cache reuse and targeted cache invalidation when persistent sidecars are replaced.
+- Fixed `.b2z` double-open corruption caused by GC-triggered repacking and made temporary `.b2z` unpacking default to the source file directory.
+- Fixed a regression when reopening persisted proxies in read-only mode.
+- Fixed GC-induced thread hangs on macOS with Python 3.14 and hardened async chunk reading/cache cleanup paths.
+- Fixed lazy-chunk source-size handling in decode/getitem callers.
+- Fixed nullable validation, dictionary extend validation, CTable close propagation, print alignment and NumPy mask support.
+- Fixed `arange()` regressions and several pre-existing `set_slice` error-handling issues.
+- Clamped indexing/thread defaults for wasm32.
+
 ## Changes from 4.1.1 to 4.1.2
 
 - A new fast path for src/blosc2/linalg.py that uses the matmul prefilter machinery in src/blosc2/blosc2_ext.pyx.

RELEASING.rst

Lines changed: 2 additions & 4 deletions
@@ -15,10 +15,8 @@ Preliminaries
 * Make sure that the c-blosc2 repository is updated to the latest version (or a specific
 version that will be documented in the ``RELEASE_NOTES.md``). In ``CMakeLists.txt`` edit::
 
-FetchContent_Declare(blosc2
-GIT_REPOSITORY https://github.com/Blosc/c-blosc2
-GIT_TAG b179abf1132dfa5a263b2ebceb6ef7a3c2890c64
-)
+set(BLOSC2_MIN_VERSION 3.0.0)
+set(BLOSC2_BUNDLED_VERSION v3.0.2)
 
 to point to the desired commit/tag in the c-blosc2 repo. Note that ``conda-forge`` only selects the latest release, so it may be necessary to do a formal release of ``c-blosc2`` to ensure that the package is correctly generated in ```conda-forge``.
 

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -41,7 +41,7 @@ dependencies = [
 "requests",
 "threadpoolctl; platform_machine != 'wasm32'",
 ]
-version = "4.1.1.dev0"
+version = "4.2.0"
 [project.entry-points."array_api"]
 blosc2 = "blosc2"
 

src/blosc2/version.py

Lines changed: 1 addition & 1 deletion
@@ -1,2 +1,2 @@
-__version__ = "4.1.3.dev0"
+__version__ = "4.2.0"
 __array_api_version__ = "2024.12"
