Getting ready for release 4.2.0

FrancescAlted · FrancescAlted · commit 81a9c09bd897 · 2026-05-07T12:40:32.000+02:00
diff --git a/ANNOUNCE.rst b/ANNOUNCE.rst
@@ -1,7 +1,36 @@
-Announcing Python-Blosc2 4.0.0
+Announcing Python-Blosc2 4.2.0
 ===============================
 
-This is patch release which updates the ``miniexpr`` version to fix a bug for ubuntu ARM64 failure.
+We are happy to announce Python-Blosc2 4.2.0.  This is a large feature
+release, with the new compressed columnar table container, ``CTable``, as the
+main development.
+
+``CTable`` brings typed, compressed, column-oriented tables to Python-Blosc2.
+It supports persistent ``.b2d`` and ``.b2z`` storage, schema-driven columns,
+nullable data, variable-length strings/bytes and object columns, computed
+columns, table views, mutations, sorting, filtering and persistent indexes.  It
+also includes Arrow, Parquet and CSV interoperability, plus a new
+``parquet-to-blosc2`` command-line utility.
+
+For a deeper introduction to CTable and its motivation, see our recent blog
+post:
+
+https://blosc.org/posts/ctable-blosc2-columnar-table/
+
+Other highlights in 4.2.0 include:
+
+- A new indexing subsystem for NDArrays and CTables, including persistent
+  sidecar indexes, expression indexes, sorted iteration and query caching.
+- New structured serialization facilities for persisted ``C2Array``,
+  ``LazyExpr`` and DSL ``LazyUDF`` objects, plus ``blosc2.Ref`` and
+  ``blosc2.load()``.
+- New schema helpers such as ``blosc2.struct()`` and ``blosc2.object()``.
+- Object/ListArray improvements for variable-length and general object data.
+- Faster and lower-memory ``fromiter()`` construction, improved ``BatchArray``
+  defaults and continued linalg/matmul optimizations.
+- Many documentation, tutorial, example and benchmark additions.
+- Numerous fixes for Windows mmap/file-locking behavior, Python 3.14 GC/thread
+  interactions, ``.b2z`` persistence, indexed queries and NumPy compatibility.
 
 You can think of Python-Blosc2 4.x as an extension of NumPy/numexpr that:
 
@@ -13,6 +42,7 @@ You can think of Python-Blosc2 4.x as an extension of NumPy/numexpr that:
 - Integrates with Numba and Cython via UDFs (User Defined Functions).
 - Adheres to modern array API standard conventions (https://data-apis.org/array-api/).
 - Can perform linear algebra operations (like ``blosc2.tensordot()``).
+- Can store and query compressed columnar tables via ``blosc2.CTable``.
 
 Install it with::
 
@@ -23,52 +53,15 @@ For more info, you can have a look at the release notes in:
 
 https://github.com/Blosc/python-blosc2/releases
 
-Code example::
+Small CTable example::
 
-    from time import time
     import blosc2
-    import numpy as np
 
-    # Create some data operands
-    N = 20_000
-    a = blosc2.linspace(0, 1, N * N, dtype="float32", shape=(N, N))
-    b = blosc2.linspace(1, 2, N * N, shape=(N, N))
-    c = blosc2.linspace(-10, 10, N)  # broadcasting is supported
+    table = blosc2.CTable.from_parquet("measurements.parquet", urlpath="measurements.b2z")
+    table.create_index("station_id")
 
-    # Expression
-    t0 = time()
-    expr = ((a**3 + blosc2.sin(c * 2)) < b) & (c > 0)
-    print(f"Time to create expression: {time()-t0:.5f}")
-
-    # Evaluate while reducing (yep, reductions are in) along axis 1
-    t0 = time()
-    out = blosc2.sum(expr, axis=1)
-    t1 = time() - t0
-    print(f"Time to compute with Blosc2: {t1:.5f}")
-
-    # Evaluate using NumPy
-    na, nb, nc = a[:], b[:], c[:]
-    t0 = time()
-    nout = np.sum(((na**3 + np.sin(nc * 2)) < nb) & (nc > 0), axis=1)
-    t2 = time() - t0
-    print(f"Time to compute with NumPy: {t2:.5f}")
-    print(f"Speedup: {t2/t1:.2f}x")
-
-    assert np.all(out == nout)
-    print("All results are equal!")
-
-
-This will output something like (using an Intel i9-13900K CPU here)::
-
-    Time to create expression: 0.00033
-    Time to compute with Blosc2: 0.46387
-    Time to compute with NumPy: 2.57469
-    Speedup: 5.55x
-    All results are equal!
-
-See a more in-depth example, explaining why Python-Blosc2 is so fast, at:
-
-https://www.blosc.org/python-blosc2/getting_started/overview.html#operating-with-ndarrays
+    hot = table.where("temperature > 30")
+    print(hot.head())
 
 Sources repository
 ------------------
diff --git a/RELEASE_NOTES.md b/RELEASE_NOTES.md
@@ -1,5 +1,71 @@
 # Release notes
 
+## Changes from 4.1.2 to 4.2.0
+
+### CTable: columnar compressed tables
+
+- Introduced `blosc2.CTable`, a new columnar table container for compressed, typed columns.  CTables support dataclass- and schema-based construction, row iteration, column access, table views, `head()` / `tail()` / `sample()`, sorting, selection and compact `where` expressions.
+- Added persistent CTables backed by `TreeStore`, with support for `blosc2.open()`, `CTable.open()`, `CTable.load()`, `CTable.save()`, `CTable.to_b2d()` and `CTable.to_b2z()`.  CTable views can be saved too, and `.b2z`/`.b2d` path handling has been tightened.
+- Added mutation operations for CTables, including `append()`, `extend()`, `delete()`, `compact()`, `add_column()`, `drop_column()`, `rename_column()` and related schema validation.
+- Added computed columns, including virtual computed columns backed by lazy expressions, materialized computed columns and automatic filling of materialized computed columns during inserts.
+- Added CTable indexing support, including persistent indexes, direct expression indexes, ordered index reuse, boolean `LazyExpr`/`NDArray` masks in `CTable.__getitem__`, `iter_sorted()` and indexing support for `.b2z` tables.
+- Added nullable schema support and null policies for CTable scalar columns, preserving nullable scalar Parquet round-trips.
+- Added variable-length CTable column support via `ListArray` / `ObjectArray`, including `vlstring` and `vlbytes` schema specs, fixed-length string/bytes import support and list/struct Arrow/Parquet round-trips.
+- Added Arrow, Parquet and CSV interoperability for CTables, including batch-wise Arrow/Parquet import/export, Arrow schema metadata preservation, `CTable.from_arrow_batches()` improvements and a new `parquet-to-blosc2` CLI utility.
+- Added CTable documentation, tutorials, examples and benchmarks covering schema definition, persistence, querying, indexing, mutations, nullable columns, computed columns and variable-length columns.
+
+### Indexing and ordering
+
+- Added a new indexing subsystem for NDArrays and CTables, including full, partial/bucket, light/medium and OPSI-style index kinds, out-of-core index builders and sidecar storage.
+- Added `blosc2.Index` as the unified public index handle, plus APIs such as `create_index()`, `compact_index()`, `iter_sorted()`, `will_use_index()` and related query explanation support.
+- Added materialized expression indexes for NDArrays and direct expression indexes for CTables.
+- Added persistent query-result caching for indexed lookups, with FIFO pruning and cache accounting.
+- Added `blosc2.argsort()` and refactored indexing APIs around explicit index enums and sorting helpers.
+- Improved indexed query performance with Cython accelerators, threaded chunk batching, zero-copy/cached mmap reads, chunk-aware and reduced-order layouts and faster scattered row gathering.
+- Reduced memory usage during index creation and lookup by avoiding full sidecar materialization, replacing memmap staging with Blosc2 scratch arrays and adding `tmpdir` support for full out-of-core indexes.
+
+### Persistence, stores and serialization
+
+- Added structured Blosc2 serialization based on b2object carriers, including persisted `C2Array`, `LazyExpr` and DSL `LazyUDF` objects.
+- Added `blosc2.Ref` for serializing external references, plus examples for b2object bundles and persisted expressions/UDFs.
+- Added `blosc2.load()` as a convenience loader.
+- Added `vlmeta` support to `LazyArray` objects.
+- Improved store handling by preserving lazy b2object carriers in `DictStore`, allowing reopened proxies to refill caches after read-only opens, relaxing `DictStore`/`TreeStore` suffix requirements and adding `DictStore.to_b2d()`.
+- Accelerated `blosc2.open()` by trying standard opens first and warning on implicit append mode.
+
+### Arrays, computation and containers
+
+- Added `ObjectArray` for fully general object data and renamed the earlier `VLArray` work accordingly; added `ListArray` docstrings and Arrow integration improvements.
+- Added schema helpers including numeric specs, `blosc2.struct()` and `blosc2.object()` for nested/fully general column declarations.
+- Improved `fromiter()` with direct chunked construction and substantially lower peak memory use.
+- Improved `asarray()` behavior for NDArray inputs when copy-inducing keyword arguments are supplied.
+- Added `SChunk.reorder_offsets()`.
+- Improved `BatchArray` defaults and documentation; the default compression level is now tuned for faster lookup/scan behavior.
+- Continued matmul/linalg optimization work and shared-thread-pool integration.
+
+### CLI, docs and examples
+
+- Added the `parquet-to-blosc2` command with options such as `--max-rows`, `--parquet-batch-size`, `--blosc2-items-per-block` and `--use-dict`.
+- Added new CTable, ObjectArray, BatchArray, containers, indexing and serialization tutorials and examples.
+- Reorganized and expanded the API reference for CTable, Column, schema specs, Index, save/load helpers and miscellaneous APIs.
+- Updated benchmark suites for CTables, indexing, Parquet import/export, BatchArray and NDArray construction/indexing.
+
+### Fixes and compatibility
+
+- Updated bundled C-Blosc2 to v3.0.2; building against a system library now requires C-Blosc2 >= 3.0.0.
+- Updated bundled C-Blosc2 and miniexpr sources multiple times.
+- Restored compatibility with NumPy < 2.
+- Fixed Windows and mmap/file-locking issues in index creation, rebuilds and temporary file cleanup.
+- Fixed full-index query failures for large CTable columns and full out-of-core merge failures on systems with small `/tmp`.
+- Fixed stale sidecar/cache reuse and targeted cache invalidation when persistent sidecars are replaced.
+- Fixed `.b2z` double-open corruption caused by GC-triggered repacking and made temporary `.b2z` unpacking default to the source file directory.
+- Fixed a regression when reopening persisted proxies in read-only mode.
+- Fixed GC-induced thread hangs on macOS with Python 3.14 and hardened async chunk reading/cache cleanup paths.
+- Fixed lazy-chunk source-size handling in decode/getitem callers.
+- Fixed nullable validation, dictionary extend validation, CTable close propagation, print alignment and NumPy mask support.
+- Fixed `arange()` regressions and several pre-existing `set_slice` error-handling issues.
+- Clamped indexing/thread defaults for wasm32.
+
 ## Changes from 4.1.1 to 4.1.2
 
 - A new fast path for src/blosc2/linalg.py that uses the matmul prefilter machinery in src/blosc2/blosc2_ext.pyx.
diff --git a/RELEASING.rst b/RELEASING.rst
@@ -15,10 +15,8 @@ Preliminaries
 * Make sure that the c-blosc2 repository is updated to the latest version (or a specific
   version that will be documented in the ``RELEASE_NOTES.md``). In ``CMakeLists.txt`` edit::
 
-    FetchContent_Declare(blosc2
-        GIT_REPOSITORY https://github.com/Blosc/c-blosc2
-        GIT_TAG b179abf1132dfa5a263b2ebceb6ef7a3c2890c64
-    )
+    set(BLOSC2_MIN_VERSION 3.0.0)
+    set(BLOSC2_BUNDLED_VERSION v3.0.2)
 
   to point to the desired commit/tag in the c-blosc2 repo. Note that ``conda-forge`` only selects the latest release, so it may be necessary to do a formal release of ``c-blosc2`` to ensure that the package is correctly generated in ``conda-forge``.
 
diff --git a/pyproject.toml b/pyproject.toml
@@ -41,7 +41,7 @@ dependencies = [
     "requests",
     "threadpoolctl; platform_machine != 'wasm32'",
 ]
-version = "4.1.1.dev0"
+version = "4.2.0"
 [project.entry-points."array_api"]
 blosc2 = "blosc2"
 
diff --git a/src/blosc2/version.py b/src/blosc2/version.py
@@ -1,2 +1,2 @@
-__version__ = "4.1.3.dev0"
+__version__ = "4.2.0"
 __array_api_version__ = "2024.12"
