Commit cc55754

Write some docs (#2080)
1 parent 81d3bf4 commit cc55754

File tree

31 files changed

+1230
-474
lines changed

.github/workflows/ci.yml

Lines changed: 2 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -240,6 +240,8 @@ jobs:
240240
- uses: ./.github/actions/setup-flatc
241241
- name: Install Protoc
242242
uses: arduino/setup-protoc@v3
243+
with:
244+
repo-token: ${{ secrets.GITHUB_TOKEN }}
243245
- name: "regenerate all .fbs/.proto Rust code"
244246
run: |
245247
cargo xtask generate-fbs

.github/workflows/docs.yml

Lines changed: 2 additions & 15 deletions
@@ -22,25 +22,12 @@ jobs:
2222

2323
- name: build Python and Rust docs
2424
run: |
25-
uv run make -C docs python-and-rust-html
26-
- name: commit python docs to gh-pages-bench
27-
run: |
28-
set -ex
29-
30-
built_sha=$(git rev-parse HEAD)
31-
32-
rm -rf docs/_build/html/rust/CACHETAG.DIR docs/_build/html/rust/debug
33-
34-
mkdir /tmp/html
35-
mv docs/_build/html /tmp/html/docs
36-
37-
mkdir -p /tmp/html/dev
38-
mv benchmarks-website /tmp/html/dev/bench
25+
uv run make -C docs html
3926
- name: Upload static files as artifact
4027
id: deployment
4128
uses: actions/upload-pages-artifact@v3
4229
with:
43-
path: /tmp/html/
30+
path: docs/_build/html
4431
deploy:
4532
environment:
4633
name: github-pages

docs/Makefile

Lines changed: 2 additions & 21 deletions
@@ -8,7 +8,7 @@ SPHINXBUILD ?= sphinx-build
88
SOURCEDIR = .
99
BUILDDIR = _build
1010

11-
.PHONY: Makefile help rust-html
11+
.PHONY: Makefile help
1212

1313
# Put it first so that "make" without argument is like "make help".
1414
help:
@@ -19,24 +19,5 @@ help:
1919
%: Makefile
2020
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
2121

22-
python-and-rust-html: html rust-html
23-
true # need a non-empty rule to prevent matching the %: rule.
24-
2522
serve:
26-
mkdir -p _build/vortex
27-
-ln -s ../html _build/vortex/docs # makes absolute links like /vortex/docs/rust/html work correctly
28-
echo The docs are served at http://localhost:8000/vortex/docs/
29-
(cd _build/ && python3 -m http.server)
30-
31-
watch:
32-
fswatch -o -e '#[^#]*#' -e '__pycache__' -e '\.#.*' -e target -e .git -e .venv -e pyvortex/python/vortex/_lib.abi3.so -e docs/_build -e docs/a.vortex ../ | xargs -L1 /bin/bash -c 'make python-and-rust-html'
33-
34-
rust-html:
35-
RUSTDOCFLAGS="--enable-index-page -Z unstable-options" cargo doc \
36-
--no-deps \
37-
--workspace \
38-
--exclude bench-vortex \
39-
--exclude xtask \
40-
--all-features \
41-
--target-dir \
42-
$(BUILDDIR)/html/rust
23+
sphinx-autobuild "$(SOURCEDIR)" "$(BUILDDIR)/html"

docs/README.md

Lines changed: 4 additions & 10 deletions
@@ -9,24 +9,18 @@ inherits some of its doc strings from Rust docstrings:
99
cd ../pyvortex && uv run maturin develop
1010
```
1111

12-
Build just the Python docs:
12+
Build the Vortex docs:
1313

1414
```
1515
uv run make html
1616
```
1717

18-
Build the Python and Rust docs and place the rust docs at `_build/rust/html`:
18+
## Development
1919

20-
```
21-
uv run make python-and-rust-html
22-
```
23-
24-
## Viewing
25-
26-
After building:
20+
Live-reloading (ish) build of the docs:
2721

2822
```
29-
open pyvortex/_build/html/index.html
23+
uv run make serve
3024
```
3125

3226
## Python Doctests

docs/concepts/arrays.md

Lines changed: 142 additions & 0 deletions
@@ -0,0 +1,142 @@
1+
# Vortex Arrays
2+
3+
An array is the in-memory representation of data in Vortex. It has a [length](#length), a [data type](#data-type), an
4+
[encoding](#encodings), some number of [children](#children), and some number of [buffers](#buffers).
5+
All arrays in Vortex are represented by an `ArrayData`, which in pseudo-code looks something like this:
6+
7+
```rust
8+
struct ArrayData {
9+
encoding: Encoding,
10+
dtype: DType,
11+
len: usize,
12+
metadata: ByteBuffer,
13+
children: [ArrayData],
14+
buffers: [ByteBuffer],
15+
statistics: Statistics,
16+
}
17+
```
18+
19+
This document goes into detail about each of these fields as well as the mechanics behind the encoding vtables.
20+
21+
**Owned vs Viewed**
22+
23+
As with other possibly large recursive data structures in Vortex, arrays can be either _owned_ or _viewed_.
24+
Owned arrays are heap-allocated, while viewed arrays are lazily unwrapped from an underlying FlatBuffer representation.
25+
This allows Vortex to efficiently load and work with very wide schemas without needing to deserialize the full array
26+
in memory.
27+
28+
This abstraction is hidden from users inside an `ArrayData` object.
29+
30+
## Encodings
31+
32+
An encoding acts as the virtual function table (vtable) for an `ArrayData`.
33+
34+
### VTable
35+
36+
The full vtable definition is quite expansive, is split across many Rust traits, and has many optional functions. Here
37+
is an overview:
38+
39+
* `id`: returns the unique identifier for the encoding.
40+
* `validate`: validates the array's buffers and children after loading from disk.
41+
* `accept`: a function for accepting an `ArrayVisitor` and walking the array's children.
42+
* `into_canonical`: decodes the array into a canonical encoding.
43+
* `into_arrow`: decodes the array into an Arrow array.
44+
* `metadata`
45+
* `validate`: validates the array's metadata buffer.
46+
* `display`: returns a human-readable representation of the array metadata.
47+
* `validity`
48+
* `is_valid`: returns whether the element at a given row is valid.
49+
* `logical_validity`: returns the validity bit-mask for an array, indicating which values are non-null.
50+
* `compute`: a collection of compute function vtables.
51+
* `filter`: a function for filtering the array using a given selection mask.
52+
* ...
53+
* `statistics`: a function for computing a statistic for the array data, for example `min`.
54+
* `variants`: a collection of optional DType-specific functions for operation over the array.
55+
* `struct`: functions for operating over arrays with a `StructDType`.
56+
* `get_field`: returns the array for a given field of the struct.
57+
* ...
58+
* ...
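
As a rough illustration of the vtable idea, here is a minimal sketch that models an encoding as a Rust trait object with one required decode function and one optional compute function. The names `EncodingVTable`, `Array`, `Plain`, `Constant`, and `compute_sum` are hypothetical and far simpler than the real Vortex traits; the sketch only shows how an optional entry can fall back to decoding into a canonical form.

```rust
/// Hypothetical, heavily simplified stand-in for an encoding vtable.
trait EncodingVTable {
    /// Unique identifier for the encoding.
    fn id(&self) -> &'static str;
    /// Decode the array into its canonical form (here just a Vec<i64>).
    fn into_canonical(&self, array: &Array) -> Vec<i64>;
    /// Optional compute function; `None` means "no specialised implementation".
    fn sum(&self, _array: &Array) -> Option<i64> {
        None
    }
}

/// Hypothetical array: an encoding plus some encoded data.
struct Array {
    encoding: Box<dyn EncodingVTable>,
    len: usize,
    values: Vec<i64>, // stand-in for children and buffers
}

struct Plain;
struct Constant;

impl EncodingVTable for Plain {
    fn id(&self) -> &'static str {
        "example.plain"
    }
    fn into_canonical(&self, array: &Array) -> Vec<i64> {
        array.values.clone()
    }
}

impl EncodingVTable for Constant {
    fn id(&self) -> &'static str {
        "example.constant"
    }
    fn into_canonical(&self, array: &Array) -> Vec<i64> {
        // A constant array stores one value that is repeated `len` times.
        vec![array.values[0]; array.len]
    }
    fn sum(&self, array: &Array) -> Option<i64> {
        // Specialised path: no need to materialise the repeated values.
        Some(array.values[0] * array.len as i64)
    }
}

/// Use the encoding's own `sum` if present, otherwise decode and sum.
fn compute_sum(array: &Array) -> i64 {
    array
        .encoding
        .sum(array)
        .unwrap_or_else(|| array.encoding.into_canonical(array).iter().sum())
}

fn main() {
    let constant = Array { encoding: Box::new(Constant), len: 4, values: vec![5] };
    let plain = Array { encoding: Box::new(Plain), len: 3, values: vec![1, 2, 3] };
    println!("{} -> {}", constant.encoding.id(), compute_sum(&constant));
    println!("{} -> {}", plain.encoding.id(), compute_sum(&plain));
}
```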
59+
60+
Encoding vtables can even be constructed from non-static sources, such as _WebAssembly_ modules, which enables the
61+
[forward compatibility](/specs/file-format.md#forward-compatibility) feature of the Vortex File Format.
62+
63+
See the [Writing an Encoding](/rust/writing-an-encoding) guide for more information.
64+
65+
### Canonical Encodings
66+
67+
Each logical data type in Vortex has an associated canonical encoding. All encodings must support decompression into
68+
their canonical form.
69+
70+
Note that Vortex also supports decompressing into intermediate encodings, such as dictionary encoding, which may be
71+
better suited to a particular operation or compute engine.
72+
73+
The canonical encodings support **zero-copy** conversion to and from _Apache Arrow_ arrays.
74+
75+
| Data Type | Canonical Encoding |
76+
|--------------------|----------------------|
77+
| `DType::Null` | `NullEncoding` |
78+
| `DType::Bool` | `BoolEncoding` |
79+
| `DType::Primitive` | `PrimitiveEncoding` |
80+
| `DType::UTF8` | `VarBinViewEncoding` |
81+
| `DType::Binary` | `VarBinViewEncoding` |
82+
| `DType::Struct` | `StructEncoding` |
83+
| `DType::List` | `ListEncoding` |
84+
| `DType::Extension` | `ExtensionEncoding` |
85+
86+
(data-type)=
87+
88+
## Data Type
89+
90+
The array's [data type](/concepts/dtypes) is a logical definition of the data held within the array and does not
91+
confer any specific meaning on the array's children or buffers.
92+
93+
Another way to think about logical data types is that they represent the type of the scalar value you might read
94+
out of the array.
95+
96+
## Length
97+
98+
The length of an array can almost always be inferred by the encoding from its children and buffers. But given how
99+
important the length is for many operations, it is stored directly in the `ArrayData` object for faster access.
100+
101+
## Metadata
102+
103+
Each array can store a small amount of metadata in the form of a byte buffer. This is typically not much more than
104+
8 bytes and does not have any alignment guarantees. This is used by encodings to store any additional information they
105+
might need in order to access their children or buffers.
106+
107+
For example, a dictionary encoding stores the length of its `values` child, and the primitive type of its `codes` child.
108+
109+
## Children
110+
111+
Arrays can have some number of child arrays. These differ from buffers in that they are logically typed, meaning the
112+
encoding cannot make assumptions about the layout of these children when implementing its vtable.
113+
114+
Dictionary encoding is an example of where child arrays might be used, with one array representing the unique
115+
dictionary values and another array representing the codes indexing into those values.
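
To make the two children concrete, here is a minimal sketch of dictionary encoding over strings. The `DictArray` struct and its `encode` and `decode` methods are hypothetical names used for illustration only, and plain `Vec`s stand in for child arrays.

```rust
/// Hypothetical dictionary-encoded array: `codes[i]` indexes into `values`.
struct DictArray {
    values: Vec<String>, // unique dictionary values
    codes: Vec<u32>,     // one code per logical row
}

impl DictArray {
    /// Build the dictionary by assigning each distinct value the next free code.
    fn encode(data: &[&str]) -> Self {
        let mut values: Vec<String> = Vec::new();
        let mut codes = Vec::with_capacity(data.len());
        for item in data {
            let code = match values.iter().position(|v| v.as_str() == *item) {
                Some(idx) => idx,
                None => {
                    values.push(item.to_string());
                    values.len() - 1
                }
            };
            codes.push(code as u32);
        }
        DictArray { values, codes }
    }

    /// Decode by looking up every code in the dictionary values.
    fn decode(&self) -> Vec<String> {
        self.codes
            .iter()
            .map(|&c| self.values[c as usize].clone())
            .collect()
    }
}

fn main() {
    let array = DictArray::encode(&["a", "b", "a", "c", "b"]);
    println!("values = {:?}, codes = {:?}", array.values, array.codes);
    println!("decoded = {:?}", array.decode());
}
```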
116+
117+
## Buffers
118+
119+
Buffers store binary data with a declared alignment. They act as the terminal nodes in the recursive structure of
120+
an array.
121+
122+
They are not considered by the recursive compressor, although general-purpose compression may still be used
123+
at write-time.
124+
125+
For example, a bit-packed array stores packed integers in binary form. These would be stored in a buffer with an
126+
alignment sufficient for SIMD unpacking operations.
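
The following sketch shows what "packed integers in binary form" can look like. It is a scalar, bit-at-a-time illustration with hypothetical `pack` and `unpack` helpers; it says nothing about the actual layout Vortex uses, and it ignores the alignment and SIMD concerns mentioned above.

```rust
/// Pack each value into `width` bits, least-significant bit first.
/// Hypothetical helper for illustration only.
fn pack(values: &[u32], width: u32) -> Vec<u8> {
    assert!((1..=32).contains(&width));
    let total_bits = values.len() * width as usize;
    let mut out = vec![0u8; (total_bits + 7) / 8];
    let mut bit = 0usize;
    for &v in values {
        assert!(width == 32 || v < (1u32 << width), "value does not fit in width");
        for i in 0..width {
            if (v >> i) & 1 == 1 {
                out[bit / 8] |= 1 << (bit % 8);
            }
            bit += 1;
        }
    }
    out
}

/// Reverse of `pack`: read `len` values of `width` bits each from the buffer.
fn unpack(buffer: &[u8], width: u32, len: usize) -> Vec<u32> {
    let mut out = Vec::with_capacity(len);
    let mut bit = 0usize;
    for _ in 0..len {
        let mut v = 0u32;
        for i in 0..width {
            if (buffer[bit / 8] >> (bit % 8)) & 1 == 1 {
                v |= 1 << i;
            }
            bit += 1;
        }
        out.push(v);
    }
    out
}

fn main() {
    let values = [1u32, 5, 7, 0, 3];
    let packed = pack(&values, 3);
    println!("{} values packed into {} bytes", values.len(), packed.len());
    println!("unpacked = {:?}", unpack(&packed, 3, values.len()));
}
```

Five 3-bit values need 15 bits, so they fit in two bytes instead of the twenty bytes a plain `u32` buffer would use.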
127+
128+
## Statistics
129+
130+
Arrays carry their own statistics with them, allowing many compute functions to short-circuit or optimise their
131+
implementations. Currently, the available statistics are:
132+
133+
- `null_count`: The number of null values in the array.
134+
- `true_count`: The number of `true` values in a boolean array.
135+
- `run_count`: The number of consecutive runs in an array.
136+
- `is_constant`: Whether the array only holds a single unique value.
137+
- `is_sorted`: Whether the array values are sorted.
138+
- `is_strict_sorted`: Whether the array values are sorted and unique.
139+
- `min`: The minimum value in the array.
140+
- `max`: The maximum value in the array.
141+
- `uncompressed_size`: The size of the array in memory before any compression.
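
As an illustration of how such statistics can let a compute function short-circuit, here is a minimal sketch over a non-nullable slice of `i64`. The `Stats` struct, `compute_stats`, and `any_greater_than` are hypothetical names, and the handling of empty input (for example `is_constant` being `false`) is an arbitrary choice made for this sketch.

```rust
/// Hypothetical subset of the statistics listed above.
#[derive(Debug, PartialEq)]
struct Stats {
    min: Option<i64>,
    max: Option<i64>,
    run_count: usize,
    is_constant: bool,
    is_sorted: bool,
    is_strict_sorted: bool,
}

fn compute_stats(values: &[i64]) -> Stats {
    // A new run starts wherever two neighbouring values differ.
    let run_count = if values.is_empty() {
        0
    } else {
        1 + values.windows(2).filter(|w| w[0] != w[1]).count()
    };
    Stats {
        min: values.iter().copied().min(),
        max: values.iter().copied().max(),
        run_count,
        is_constant: run_count == 1,
        is_sorted: values.windows(2).all(|w| w[0] <= w[1]),
        is_strict_sorted: values.windows(2).all(|w| w[0] < w[1]),
    }
}

/// Answer "is any value greater than `threshold`?" using `max` when possible.
fn any_greater_than(values: &[i64], stats: &Stats, threshold: i64) -> bool {
    if let Some(max) = stats.max {
        if max <= threshold {
            return false; // short-circuit: no need to scan the data
        }
    }
    values.iter().any(|&v| v > threshold)
}

fn main() {
    let values = [1, 1, 2, 5, 5, 5];
    let stats = compute_stats(&values);
    println!("{:?}", stats);
    println!("any > 10? {}", any_greater_than(&values, &stats, 10));
}
```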
142+

docs/concepts/compute.md

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
1+
# Vortex Compute
2+
3+
Encoding vtables can define optional implementations of compute functions where it's possible to optimize the
4+
implementation beyond the default behavior of canonicalizing the array and then performing the operation.
5+
6+
For example, `DictEncoding` defines an implementation of compare where, given a constant right-hand side argument,
7+
the operation is performed only over the dictionary values and the result is wrapped up with the original dictionary
8+
codes.
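
The dictionary example can be sketched as follows. This is an illustration under simplifying assumptions, not the `DictEncoding` implementation: `Dict`, `compare_eq_constant`, and `canonicalize` are hypothetical names, and plain `Vec`s stand in for child arrays.

```rust
/// Hypothetical dictionary-encoded array over values of type `T`.
struct Dict<T> {
    values: Vec<T>,  // unique dictionary values
    codes: Vec<u32>, // one code per logical row
}

/// Compare against a constant by touching only the dictionary values.
/// The original codes are reused for the boolean result.
fn compare_eq_constant(array: &Dict<i64>, rhs: i64) -> Dict<bool> {
    Dict {
        values: array.values.iter().map(|&v| v == rhs).collect(),
        codes: array.codes.clone(),
    }
}

/// Default behaviour for comparison: expand the dictionary into plain values.
fn canonicalize<T: Clone>(array: &Dict<T>) -> Vec<T> {
    array
        .codes
        .iter()
        .map(|&c| array.values[c as usize].clone())
        .collect()
}

fn main() {
    let array = Dict { values: vec![10, 20, 30], codes: vec![0, 2, 2, 1, 0] };

    // Specialised path: three comparisons, regardless of the number of rows.
    let fast = compare_eq_constant(&array, 30);

    // Default path: decode all five rows, then compare each one.
    let slow: Vec<bool> = canonicalize(&array).iter().map(|&v| v == 30).collect();

    println!("fast = {:?}", canonicalize(&fast));
    println!("slow = {:?}", slow);
}
```

Both paths produce the same logical result; the specialised one does work proportional to the dictionary size rather than the row count.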
9+
10+
## Compute Functions
11+
12+
* `binary_boolean(lhs: ArrayData, rhs: ArrayData, BinaryOperator) -> ArrayData`
13+
* Compute `And`, `AndKleene`, `Or`, `OrKleene` operations over two boolean arrays.
14+
* `binary_numeric(lhs: ArrayData, rhs: ArrayData, BinaryOperator) -> ArrayData`
15+
* Compute `Add`, `Sub`, `RSub`, `Mul`, `Div`, `RDiv` operations over two numeric arrays.
16+
* `compare(lhs: ArrayData, rhs: ArrayData, CompareOperator) -> ArrayData`
17+
* Compute `Eq`, `NotEq`, `Gt`, `Gte`, `Lt`, `Lte` operations over two arrays.
18+
* `try_cast(ArrayData, DType) -> ArrayData`
19+
* Try to cast the array to the specified data type.
20+
* `fill_forward(ArrayData) -> ArrayData`
21+
* Fill forward null values with the most recent non-null value.
22+
* `fill_null(ArrayData, Scalar) -> ArrayData`
23+
* Fill null values with the specified scalar value.
24+
* `invert_fn(ArrayData) -> ArrayData`
25+
* Invert the boolean values of the array.
26+
* `like(ArrayData, pattern: ArrayData) -> ArrayData`
27+
* Perform a `LIKE` operation over two arrays.
28+
* `scalar_at(ArrayData, index) -> Scalar`
29+
* Get the scalar value at the specified index.
30+
* `search_sorted(ArrayData, Scalar) -> SearchResult`
31+
* Search for the specified scalar value in the sorted array.
32+
* `slice(ArrayData, start, end) -> ArrayData`
33+
* Slice the array from the start to the end index.
34+
* `take(ArrayData, indices: ArrayData) -> ArrayData`
35+
* Take the specified nullable indices from the array.
36+
* `filter(ArrayData, mask: Mask) -> ArrayData`
37+
* Filter the array based on the given mask.
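
The difference between `And` and `AndKleene` in the list above is easiest to see on nullable booleans. The sketch below models a nullable boolean as `Option<bool>` and uses standard Kleene three-valued logic for the Kleene variant. It assumes the plain `And` propagates nulls (as the equivalently named Arrow kernels do); that is an assumption about Vortex rather than something stated here, and the function names are hypothetical.

```rust
/// Plain AND, assumed to be null-propagating: any null input gives null.
fn and(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    a.zip(b).map(|(x, y)| x && y)
}

/// Kleene AND: `false` wins even when the other side is null.
fn and_kleene(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        (Some(false), _) | (_, Some(false)) => Some(false),
        (Some(true), Some(true)) => Some(true),
        _ => None,
    }
}

/// Element-wise application over two equally long nullable boolean arrays.
fn binary_boolean(
    lhs: &[Option<bool>],
    rhs: &[Option<bool>],
    op: fn(Option<bool>, Option<bool>) -> Option<bool>,
) -> Vec<Option<bool>> {
    assert_eq!(lhs.len(), rhs.len(), "arrays must have the same length");
    lhs.iter().zip(rhs).map(|(&a, &b)| op(a, b)).collect()
}

fn main() {
    let lhs = [Some(true), Some(false), None, None];
    let rhs = [None, None, Some(false), Some(true)];
    println!("And       = {:?}", binary_boolean(&lhs, &rhs, and));
    println!("AndKleene = {:?}", binary_boolean(&lhs, &rhs, and_kleene));
}
```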

docs/concepts/dtypes.md

Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
1+
# Vortex Data Types
2+
3+
A core principle of Vortex is that its data types (or `dtypes`) are _logical_ rather than _physical_.
4+
This means that the dtype has no bearing on how the data is actually stored in memory, and is instead used to define
5+
the domain of values an array may hold.
6+
7+
For example, a `u32` dtype represents an unsigned integer domain with values between `0` and `2^32 - 1`, even though
8+
the underlying array may store values dictionary-encoded, run-length encoded (RLE), or in any other format!
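
A small sketch can make the logical versus physical distinction concrete. The `U32Array` enum below is hypothetical and not a Vortex type; it shows three physical layouts that all describe the same logical sequence of `u32` values.

```rust
/// Hypothetical physical layouts for an array whose logical dtype is `u32`.
enum U32Array {
    /// Values stored one after another.
    Plain(Vec<u32>),
    /// Run-length encoded: `values[i]` repeats `run_lengths[i]` times.
    RunLength { values: Vec<u32>, run_lengths: Vec<usize> },
    /// Dictionary encoded: `codes[i]` indexes into `values`.
    Dictionary { values: Vec<u32>, codes: Vec<u8> },
}

impl U32Array {
    /// Materialise the logical values, whatever the physical layout is.
    fn to_logical(&self) -> Vec<u32> {
        match self {
            U32Array::Plain(values) => values.clone(),
            U32Array::RunLength { values, run_lengths } => values
                .iter()
                .zip(run_lengths)
                .flat_map(|(&v, &n)| std::iter::repeat(v).take(n))
                .collect(),
            U32Array::Dictionary { values, codes } => {
                codes.iter().map(|&c| values[c as usize]).collect()
            }
        }
    }
}

fn main() {
    let arrays = [
        U32Array::Plain(vec![7, 7, 7, 9, 9]),
        U32Array::RunLength { values: vec![7, 9], run_lengths: vec![3, 2] },
        U32Array::Dictionary { values: vec![7, 9], codes: vec![0, 0, 0, 1, 1] },
    ];
    for array in &arrays {
        println!("{:?}", array.to_logical());
    }
}
```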
9+
10+
This principle enables many of Vortex's advanced features, such as performing compute directly on
11+
compressed data.
12+
13+
:::{admonition} What is a schema?!
14+
:class: tip
15+
It is worth noting now that Vortex has no concept of a _schema_, instead preferring to use a struct dtype to represent
16+
columnar data. This means you can write a Vortex file containing a single integer array just as well as writing one
17+
with many columns.
18+
:::
19+
20+
**Owned vs Viewed**
21+
22+
As with other possibly large recursive data structures in Vortex, dtypes can be either _owned_ or _viewed_.
23+
Owned dtypes are heap-allocated, while viewed dtypes are lazily unwrapped from an underlying FlatBuffer representation.
24+
This allows Vortex to efficiently load and work with very wide data types without needing to deserialize the full type
25+
in memory.
26+
27+
## Logical Types
28+
29+
The following table lists the built-in dtypes in Vortex, each of which can be marked as either nullable or non-nullable.
30+
31+
| Name | Domain |
32+
|-------------|---------------------------------------------|
33+
| `Null` | `null` |
34+
| `Bool` | `true`, `false` |
35+
| `Primitive` | See [Primitive](#primitive) |
36+
| `UTF8` | Variable length valid utf-8 encoded strings |
37+
| `Binary` | Arbitrary variable length bytes |
38+
| `Struct` | See [Struct](#struct) |
39+
| `List` | See [List](#list) |
40+
| `Extension` | See [Extension](#extension) |
41+
42+
:::{note}
43+
There are additional logical types that Vortex does not yet support, for example fixed-length binary, utf-8, and list
44+
types, as well as a map type. These may be added in future versions.
45+
:::
46+
47+
### Primitive
48+
49+
Primitive dtypes are an enumeration of different fixed-width primitive values.
50+
51+
| Name | Domain |
52+
|-------|-------------------------|
53+
| `I8` | 8-bit signed integer |
54+
| `I16` | 16-bit signed integer |
55+
| `I32` | 32-bit signed integer |
56+
| `I64` | 64-bit signed integer |
57+
| `U8` | 8-bit unsigned integer |
58+
| `U16` | 16-bit unsigned integer |
59+
| `U32` | 32-bit unsigned integer |
60+
| `U64` | 64-bit unsigned integer |
61+
| `F16` | IEEE 754-2008 half |
62+
| `F32` | IEEE 754-1985 single |
63+
| `F64` | IEEE 754-1985 double |
64+
65+
### Struct
66+
67+
A `Struct` dtype is an ordered collection of named fields, each of which has its own logical dtype.
68+
69+
### List
70+
71+
A `List` dtype has a single _element type_, itself a logical dtype, and represents an array of variable-length
72+
sequences of elements of that type.
73+
74+
### Extension
75+
76+
An `Extension` dtype is a logical dtype with an `id`, a `storage` dtype, and a `metadata` field. The `id` and `metadata`
77+
fields together may implicitly restrict the domain of values of the `storage` dtype.
78+
79+
For example, a `vortex.date` type is logically stored as a `U32` representing the number of days since the Unix epoch.
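
To illustrate how an extension type narrows the meaning of its storage values, the sketch below interprets a `u32` day count as a calendar date. The `civil_from_days` function is a hypothetical helper based on a widely used days-to-civil-date algorithm for the proleptic Gregorian calendar; it is not part of Vortex.

```rust
/// Convert days since 1970-01-01 into (year, month, day).
/// Hypothetical helper; assumes the proleptic Gregorian calendar.
fn civil_from_days(days_since_epoch: u32) -> (i64, i64, i64) {
    // Shift the epoch to 0000-03-01 so leap days fall at the end of a "year".
    let z = days_since_epoch as i64 + 719_468;
    let era = z / 146_097; // 400-year cycles
    let doe = z - era * 146_097; // day of era, in [0, 146096]
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365; // year of era
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // day of shifted year
    let mp = (5 * doy + 2) / 153; // month index, where 0 means March
    let day = doy - (153 * mp + 2) / 5 + 1;
    let month = if mp < 10 { mp + 3 } else { mp - 9 };
    // January and February belong to the next civil year.
    let year = yoe + era * 400 + (month <= 2) as i64;
    (year, month, day)
}

fn main() {
    // The storage values are plain integers; the extension id gives them meaning.
    for days in [0u32, 11_016, 19_723] {
        let (y, m, d) = civil_from_days(days);
        println!("{days} -> {y:04}-{m:02}-{d:02}");
    }
}
```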
80+
81+
## Vs. Arrow
82+
83+
This section helps readers familiar with Apache Arrow quickly understand how Vortex's dtypes differ from Arrow's.
84+
85+
* In Arrow, nullability is tied to a {obj}`pyarrow.Field` rather than the data type.
86+
Data types in Vortex instead always define explicit `nullability`.
87+
* In Arrow, there are multiple ways to describe the same logical data type, for example {func}`pyarrow.string` and
88+
{func}`pyarrow.large_string` both represent UTF-8 values. In Vortex, there is a single `UTF8` dtype.
89+
* In Arrow, encoded data is described with additional data types, for example {func}`pyarrow.dictionary`. In Vortex,
90+
encodings are a distinct concept from dtypes.
91+
* In Arrow, date and time types are defined as first-class data types. In Vortex, these are represented as `Extension`
92+
dtypes since they can be composed of other more primitive logical dtypes.
93+
* In Arrow, tables and record batches have a _schema_ that defines the types of the columns. Vortex makes no
94+
distinction between a data type and a schema. Columnar data can be stored with a struct dtype, and integer data can
95+
be stored equally well without a top-level struct.
