|
5 | 5 | "id": "cell-01", |
6 | 6 | "metadata": {}, |
7 | 7 | "source": [ |
8 | | - "# Working with Containers\n", |
9 | | - "\n", |
10 | | - "This notebook is a guided tour of the main data containers in `python-blosc2`.\n", |
11 | | - "\n", |
12 | | - "The goal is to build a practical mental model first: what each container is, how the containers relate, and when each one is the right tool.\n", |
13 | | - "\n", |
14 | | - "We will cover these containers in this order:\n", |
15 | | - "\n", |
16 | | - "1. `SChunk`\n", |
17 | | - "2. `NDArray`\n", |
18 | | - "3. `ObjectArray`\n", |
19 | | - "4. `BatchArray`\n", |
20 | | - "5. `EmbedStore`\n", |
21 | | - "6. `DictStore`\n", |
22 | | - "7. `TreeStore`\n", |
23 | | - "8. `C2Array`" |
| 8 | + "# Working with Containers\n\nThis notebook is a guided tour of the main data containers in `python-blosc2`.\n\nThe goal is to build a practical mental model first: what each container is, how the containers relate, and when each one is the right tool.\n\nWe will cover these containers in this order:\n\n1. `SChunk`\n2. `NDArray`\n3. `ObjectArray`\n4. `BatchArray`\n5. `EmbedStore`\n6. `DictStore`\n7. `TreeStore` (including inline `CTable` support)\n8. `C2Array`" |
24 | 9 | ] |
25 | 10 | }, |
26 | 11 | { |
|
444 | 429 | " show(\"/exp/run2/data\", tstore[\"/exp/run2/data\"][:])" |
445 | 430 | ] |
446 | 431 | }, |
| 432 | + { |
| 433 | + "cell_type": "markdown", |
| 434 | + "id": "cell-17b", |
| 435 | + "metadata": {}, |
| 436 | + "source": [ |
| 437 | + "### Storing CTables inside a TreeStore\n", |
| 438 | + "\n", |
| 439 | + "A `TreeStore` can hold **both NDArrays and CTables** in the same bundle. A `CTable` is stored inline as a named subtree — all its columns, metadata, and index sidecars live as ordinary Blosc2 leaves inside the outer store. From the outside it appears as a single key, exactly like any other leaf:\n", |
| 440 | + "\n", |
| 441 | + "* `ts[\"/table\"] = ctable` — stores the CTable inline (same syntax as NDArray).\n", |
| 442 | + "* `ts[\"/table\"]` — returns a `CTable` object transparently.\n", |
| 443 | + "* `\"/table/_meta\" not in ts` — internal keys are hidden from normal traversal.\n", |
| 444 | + "* `del ts[\"/table\"]` — removes the whole object and all its leaves at once.\n", |
| 445 | + "\n", |
| 446 | + "The inline layout means there are **no nested ZIP files**: all leaves are flat members of the outer `.b2z` archive and can be opened by offset without extraction." |
| 447 | + ] |
| 448 | + }, |
| 449 | + { |
| 450 | + "cell_type": "code", |
| 451 | + "execution_count": null, |
| 452 | + "id": "cell-17c", |
| 453 | + "metadata": {}, |
| 454 | + "outputs": [], |
| 455 | + "source": [ |
| 456 | + "from dataclasses import dataclass\n", |
| 457 | + "\n", |
| 458 | + "\n", |
| 459 | + "@dataclass\n", |
| 460 | + "class Reading:\n", |
| 461 | + " sensor_id: int = 0\n", |
| 462 | + " value: float = 0.0\n", |
| 463 | + "\n", |
| 464 | + "\n", |
| 465 | + "bundle_path = reset(\"bundle.b2z\")\n", |
| 466 | + "\n", |
| 467 | + "# --- Write: mix NDArrays and CTables in one bundle ----------------------\n", |
| 468 | + "t = blosc2.CTable(Reading)\n", |
| 469 | + "for i in range(6):\n", |
| 470 | + " t.append(Reading(sensor_id=i, value=round(i * 1.1, 2)))\n", |
| 471 | + "\n", |
| 472 | + "with blosc2.TreeStore(bundle_path, mode=\"w\") as ts:\n", |
| 473 | + " ts[\"/raw/signal\"] = np.arange(8, dtype=np.float32)\n", |
| 474 | + " ts[\"/tables/readings\"] = t # CTable stored inline\n", |
| 475 | + " show(\"keys after write\", sorted(ts.keys()))\n", |
| 476 | + " show(\"/tables/readings/_meta in ts (hidden)\", \"/tables/readings/_meta\" in ts)\n", |
| 477 | + "\n", |
| 478 | + "# --- Read back from the .b2z archive ------------------------------------\n", |
| 479 | + "with blosc2.open(bundle_path, mode=\"r\") as ts:\n", |
| 480 | + " readings = ts[\"/tables/readings\"] # returns CTable transparently\n", |
| 481 | + " show(\"type\", type(readings).__name__)\n", |
| 482 | + " show(\"rows\", len(readings))\n", |
| 483 | + " show(\"sensor_id\", list(readings[\"sensor_id\"][:]))\n", |
| 484 | + " show(\"value\", list(readings[\"value\"][:]))\n", |
| 485 | + "\n", |
| 486 | + "# --- Append a row in-place (append mode) --------------------------------\n", |
| 487 | + "with blosc2.TreeStore(bundle_path, mode=\"a\") as ts:\n", |
| 488 | + " r = ts[\"/tables/readings\"]\n", |
| 489 | + " r.append(Reading(sensor_id=99, value=-1.0))\n", |
| 490 | + " r.close() # optional; outer store also closes it on __exit__\n", |
| 491 | + " show(\"rows after append\", len(ts[\"/tables/readings\"]))\n", |
| 492 | + "\n", |
| 493 | + "# --- Delete the CTable (all internal leaves removed) -------------------\n", |
| 494 | + "with blosc2.TreeStore(bundle_path, mode=\"a\") as ts:\n", |
| 495 | + " del ts[\"/tables/readings\"]\n", |
| 496 | + " show(\"keys after delete\", sorted(ts.keys()))" |
| 497 | + ] |
| 498 | + }, |
447 | 499 | { |
448 | 500 | "cell_type": "markdown", |
449 | 501 | "id": "cell-18", |
|
494 | 546 | "id": "cell-20", |
495 | 547 | "metadata": {}, |
496 | 548 | "source": [ |
497 | | - "## Choosing The Right Container\n", |
498 | | - "\n", |
499 | | - "| Container | Backing idea | Best for |\n", |
500 | | - "| --- | --- | --- |\n", |
501 | | - "| `SChunk` | raw compressed chunks | direct chunk-level storage control |\n", |
502 | | - "| `NDArray` | `SChunk` plus array metadata | dense numeric arrays |\n", |
503 | | - "| `ObjectArray` | one variable-length entry per chunk | ragged or heterogeneous Python values |\n", |
504 | | - "| `BatchArray` | one batch per chunk | batch-oriented ingestion and access |\n", |
505 | | - "| `EmbedStore` | one bundled object store | packaging a few Blosc2 objects together |\n", |
506 | | - "| `DictStore` | keyed collection of leaves | portable multi-object datasets |\n", |
507 | | - "| `TreeStore` | hierarchical keyed collection | tree-structured datasets |\n", |
508 | | - "| `C2Array` | remote array handle | arrays hosted by a remote Caterva2 service |\n", |
509 | | - "\n", |
510 | | - "A simple rule of thumb is:\n", |
511 | | - "\n", |
512 | | - "- start with `NDArray` for dense numeric data\n", |
513 | | - "- drop down to `SChunk` if you need chunk-level control\n", |
514 | | - "- use `ObjectArray` or `BatchArray` for variable-length Python objects\n", |
515 | | - "- use `EmbedStore`, `DictStore`, or `TreeStore` when your dataset contains multiple objects" |
| 549 | + "## Choosing The Right Container\n\n| Container | Backing idea | Best for |\n| --- | --- | --- |\n| `SChunk` | raw compressed chunks | direct chunk-level storage control |\n| `NDArray` | `SChunk` plus array metadata | dense numeric arrays |\n| `ObjectArray` | one variable-length entry per chunk | ragged or heterogeneous Python values |\n| `BatchArray` | one batch per chunk | batch-oriented ingestion and access |\n| `EmbedStore` | one bundled object store | packaging a few Blosc2 objects together |\n| `DictStore` | keyed collection of leaves | portable multi-object datasets |\n| `TreeStore` | hierarchical keyed collection | tree-structured datasets with NDArrays and/or CTables |\n| `C2Array` | remote array handle | arrays hosted by a remote Caterva2 service |\n\nA simple rule of thumb is:\n\n- start with `NDArray` for dense numeric data\n- drop down to `SChunk` if you need chunk-level control\n- use `ObjectArray` or `BatchArray` for variable-length Python objects\n- use `EmbedStore`, `DictStore`, or `TreeStore` when your dataset contains multiple objects" |
516 | 550 | ] |
517 | 551 | }, |
518 | 552 | { |
519 | 553 | "cell_type": "markdown", |
520 | 554 | "id": "cell-21", |
521 | 555 | "metadata": {}, |
522 | 556 | "source": [ |
523 | | - "## Final Notes\n", |
524 | | - "\n", |
525 | | - "This notebook is intentionally organized from low-level storage to higher-level organization:\n", |
526 | | - "\n", |
527 | | - "- understand `SChunk` first\n", |
528 | | - "- use `NDArray` for most dense numeric workloads\n", |
529 | | - "- move to `ObjectArray` or `BatchArray` when entries stop being fixed-size arrays\n", |
530 | | - "- use `EmbedStore`, `DictStore`, or `TreeStore` when you need to package multiple objects together\n", |
531 | | - "- use `C2Array` when the data lives on a remote service\n", |
532 | | - "\n", |
533 | | - "For deeper details on a specific class, continue with the reference docs and the dedicated tutorials for `ObjectArray`, `BatchArray`, and indexing." |
| 557 | + "## Final Notes\n\nThis notebook is intentionally organized from low-level storage to higher-level organization:\n\n- understand `SChunk` first\n- use `NDArray` for most dense numeric workloads\n- move to `ObjectArray` or `BatchArray` when entries stop being fixed-size arrays\n- use `EmbedStore`, `DictStore`, or `TreeStore` when you need to package multiple objects together\n- use `TreeStore` + `CTable` together when your bundle mixes dense arrays with structured tables\n- use `C2Array` when the data lives on a remote service\n\nFor deeper details on a specific class, continue with the reference docs and the dedicated tutorials for `ObjectArray`, `BatchArray`, and indexing." |
534 | 558 | ] |
535 | 559 | }, |
536 | 560 | { |
|
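A note on the new `cell-17b` text: the "inline subtree with hidden internal keys" idea may be easier to follow with a toy model. The sketch below is not how `python-blosc2` implements `TreeStore` or `CTable`. `ToyTreeStore`, `put_leaf`, `put_object`, and the `_cols` prefix are hypothetical names invented for illustration, and plain Python lists and dicts stand in for Blosc2 leaves. It only shows the behavior the cell describes: one flat key space, an object that appears as a single key, internal leaves hidden from traversal, and deletion that removes the whole subtree.

```python
class ToyTreeStore:
    """Toy, dict-backed model of a flat store that hides object-internal leaves.

    Hypothetical illustration only; this is not the python-blosc2 API.
    """

    def __init__(self):
        self._leaves = {}      # flat mapping: full key -> leaf payload
        self._objects = set()  # keys that are roots of inline objects

    def put_leaf(self, key, value):
        """Store an ordinary leaf (stands in for an NDArray)."""
        self._leaves[key] = value

    def put_object(self, key, columns):
        """Store a table-like object inline as a subtree of ordinary leaves."""
        self._leaves[f"{key}/_meta"] = {"columns": list(columns)}
        for name, data in columns.items():
            self._leaves[f"{key}/_cols/{name}"] = list(data)
        self._objects.add(key)

    def _owner(self, key):
        """Return the object root whose subtree contains `key`, or None."""
        for root in self._objects:
            if key.startswith(root + "/"):
                return root
        return None

    def __contains__(self, key):
        # Object roots are visible; leaves owned by an object are hidden.
        if key in self._objects:
            return True
        return key in self._leaves and self._owner(key) is None

    def keys(self):
        visible = {k for k in self._leaves if self._owner(k) is None}
        return sorted(visible | self._objects)

    def __getitem__(self, key):
        if key in self._objects:
            # Reassemble the object from its internal leaves.
            names = self._leaves[f"{key}/_meta"]["columns"]
            return {n: self._leaves[f"{key}/_cols/{n}"] for n in names}
        if key not in self:
            raise KeyError(key)
        return self._leaves[key]

    def __delitem__(self, key):
        if key in self._objects:
            # Remove every internal leaf of the object in one go.
            for k in [k for k in self._leaves if k.startswith(key + "/")]:
                del self._leaves[k]
            self._objects.remove(key)
        elif key in self:
            del self._leaves[key]
        else:
            raise KeyError(key)


ts = ToyTreeStore()
ts.put_leaf("/raw/signal", [0.0, 1.0, 2.0])
ts.put_object("/tables/readings", {"sensor_id": [0, 1], "value": [0.0, 1.1]})

keys_after_write = ts.keys()
meta_visible = "/tables/readings/_meta" in ts
readings = ts["/tables/readings"]

del ts["/tables/readings"]
keys_after_delete = ts.keys()
```

In this toy model the store keeps a single flat mapping, which mirrors the "no nested ZIP files" point in the cell: the object is only a naming convention over ordinary leaves plus a registry of object roots.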