-
Notifications
You must be signed in to change notification settings - Fork 33
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
31 changed files
with
1,230 additions
and
474 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,142 @@ | ||
# Vortex Arrays | ||
|
||
An array is the in-memory representation of data in Vortex. It has a [length](#length), a [data type](#data-type), an | ||
[encoding](#encodings), some number of [children](#children), and some number of [buffers](#buffers). | ||
All arrays in Vortex are represented by an `ArrayData`, which in psuedo-code looks something like this: | ||
|
||
```rust | ||
struct ArrayData { | ||
encoding: Encoding, | ||
dtype: DType, | ||
len: usize, | ||
metadata: ByteBuffer, | ||
children: [ArrayData], | ||
buffers: [ByteBuffer], | ||
statistics: Statistics, | ||
} | ||
``` | ||
|
||
This document goes into detail about each of these fields as well as the mechanics behind the encoding vtables. | ||
|
||
**Owned vs Viewed** | ||
|
||
As with other possibly large recursive data structures in Vortex, arrays can be either _owned_ or _viewed_. | ||
Owned arrays are heap-allocated, while viewed arrays are lazily unwrapped from an underlying FlatBuffer representation. | ||
This allows Vortex to efficiently load and work with very wide schemas without needing to deserialize the full array | ||
in memory. | ||
|
||
This abstraction is hidden from users inside an `ArrayData` object. | ||
|
||
## Encodings | ||
|
||
An encoding acts as the virtual function table (vtable) for an `ArrayData`. | ||
|
||
### VTable | ||
|
||
The full vtable definition is quite expansive, is split across many Rust traits, and has many optional functions. Here | ||
is an overview: | ||
|
||
* `id`: returns the unique identifier for the encoding. | ||
* `validate`: validates the array's buffers and children after loading from disk. | ||
* `accept`: a function for accepting an `ArrayVisitor` and walking the arrays children. | ||
* `into_canonical`: decodes the array into a canonical encoding. | ||
* `into_arrow`: decodes the array into an Arrow array. | ||
* `metadata` | ||
* `validate`: validates the array's metadata buffer. | ||
* `display`: returns a human-readable representation of the array metadata. | ||
* `validity` | ||
* `is_valid`: returns whether the element at a given row is valid. | ||
* `logical_validity`: returns the validity bit-mask for an array, indicating which values are non-null. | ||
* `compute`: a collection of compute functions vtables. | ||
* `filter`: a function for filtering the array using a given selection mask. | ||
* ... | ||
* `statistics`: a function for computing a statistic for the array data, for example `min`. | ||
* `variants`: a collection of optional DType-specific functions for operation over the array. | ||
* `struct`: functions for operating over arrays with a `StructDType`. | ||
* `get_field`: returns the array for a given field of the struct. | ||
* ... | ||
* ... | ||
|
||
Encoding vtables can even be constructed from non-static sources, such as _WebAssembly_ modules, which enables the | ||
[forward compatibility](/specs/file-format.md#forward-compatibility) feature of the Vortex File Format. | ||
|
||
See the [Writing an Encoding](/rust/writing-an-encoding) guide for more information. | ||
|
||
### Canonical Encodings | ||
|
||
Each logical data type in Vortex has an associated canonical encoding. All encodings must support decompression into | ||
their canonical form. | ||
|
||
Note that Vortex also supports decompressing into intermediate encodings, such as dictionary encoding, which may be | ||
better suited to a particular operation or compute engine. | ||
|
||
The canonical encodings are support **zero-copy** conversion to and from _Apache Arrow_ arrays. | ||
|
||
| Data Type | Canonical Encoding | | ||
|--------------------|----------------------| | ||
| `DType::Null` | `NullEncoding` | | ||
| `DType::Bool` | `BoolEncoding` | | ||
| `DType::Primitive` | `PrimitiveEncoding` | | ||
| `DType::UTF8` | `VarBinViewEncoding` | | ||
| `DType::Binary` | `VarBinViewEncoding` | | ||
| `DType::Struct` | `StructEncoding` | | ||
| `DType::List` | `ListEncoding` | | ||
| `DType::Extension` | `ExtensionEncoding` | | ||
|
||
(data-type)= | ||
|
||
## Data Type | ||
|
||
The array's [data type](/concepts/dtypes) is a logical definition of the data held within the array and does not | ||
confer any specific meaning on the array's children or buffers. | ||
|
||
Another way to think about logical data types is that they represent the type of the scalar value you might read | ||
out of the array. | ||
|
||
## Length | ||
|
||
The length of an array can almost always be inferred by encoding from its children and buffers. But given how | ||
important the length is for many operations, it is stored directly in the `ArrayData` object for faster access. | ||
|
||
## Metadata | ||
|
||
Each array can store a small amount of metadata in the form of a byte buffer. This is typically not much more than | ||
8 bytes and does not have any alignment guarantees. This is used by encodings to store any additional information they | ||
might need in order to access their children or buffers. | ||
|
||
For example, a dictionary encoding stores the length of its `values` child, and the primitive type of its `codes` child. | ||
|
||
## Children | ||
|
||
Arrays can have some number of child arrays. These differ from buffers in that they are logically typed, meaning the | ||
encoding cannot make assumptions about the layout of these children when implementing its vtable. | ||
|
||
Dictionary encoding is an example of where child arrays might be used, with one array representing the unique | ||
dictionary values and another array representing the codes indexing into those values. | ||
|
||
## Buffers | ||
|
||
Buffers store binary data with a declared alignment. They act as the terminal nodes in the recursive structure of | ||
an array. | ||
|
||
They are not considered by the recursive compressor, although general-purpose compression may still be used | ||
at write-time. | ||
|
||
For example, a bit-packed array stores packed integers in binary form. These would be stored in a buffer with an | ||
alignment sufficient for SIMD unpacking operations. | ||
|
||
## Statistics | ||
|
||
Arrays carry their own statistics with them, allowing many compute functions to short-circuit or optimise their | ||
implementations. Currently, the available statistics are: | ||
|
||
- `null_count`: The number of null values in the array. | ||
- `true_count`: The number of `true` values in a boolean array. | ||
- `run_count`: The number of consecutive runs in an array. | ||
- `is_constant`: Whether the array only holds a single unique value | ||
- `is_sorted`: Whether the array values are sorted. | ||
- `is_strict_sorted`: Whether the array values are sorted and unique. | ||
- `min`: The minimum value in the array. | ||
- `max`: The maximum value in the array. | ||
- `uncompressed_size`: The size of the array in memory before any compression. | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,37 @@ | ||
# Vortex Compute | ||
|
||
Encoding vtables can define optional implementations of compute functions where it's possible to optimize the | ||
implementation beyond the default behavior of canonicalizing the array and then performing the operation. | ||
|
||
For example, `DictEncoding` defines an implementation of compare where given a constant right-hand side argument, | ||
the operation is performed only over the dictionary values and the result is wrapped up with the original dictionary | ||
codes. | ||
|
||
## Compute Functions | ||
|
||
* `binary_boolean(lhs: ArrayData, rhs: ArrayData, BinaryOperator) -> ArrayData` | ||
* Compute `And`, `AndKleene`, `Or`, `OrKleene` operations over two boolean arrays. | ||
* `binary_numeric(lhs: ArrayData, rhs: ArrayData, BinaryOperator) -> ArrayData` | ||
* Compute `Add`, `Sub`, `RSub`, `Mul`, `Div`, `RDiv` operations over two numeric arrays. | ||
* `compare(lhs: ArrayData, rhs: ArrayData, CompareOperator) -> ArrayData` | ||
* Compute `Eq`, `NotEq`, `Gt`, `Gte`, `Lt`, `Lte` operations over two arrays. | ||
* `try_cast(ArrayData, DType) -> ArrayData` | ||
* Try to cast the array to the specified data type. | ||
* `fill_forward(ArrayData) -> ArrayData` | ||
* Fill forward null values with the most recent non-null value. | ||
* `fill_null(ArrayData, Scalar) -> ArrayData` | ||
* Fill null values with the specified scalar value. | ||
* `invert_fn(ArrayData) -> ArrayData` | ||
* Invert the boolean values of the array. | ||
* `like(ArrayData, pattern: ArrayData) -> ArrayData` | ||
* Perform a `LIKE` operation over two arrays. | ||
* `scalar_at(ArrayData, index) -> Scalar` | ||
* Get the scalar value at the specified index. | ||
* `search_sorted(ArrayData, Scalar) -> SearchResult` | ||
* Search for the specified scalar value in the sorted array. | ||
* `slice(ArrayData, start, end) -> ArrayData` | ||
* Slice the array from the start to the end index. | ||
* `take(ArrayData, indices: ArrayData) -> ArrayData` | ||
* Take the specified nullable indices from the array. | ||
* `filter(ArrayData, mask: Mask) -> ArrayData` | ||
* Filter the array based on the given mask. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,95 @@ | ||
# Vortex Data Types | ||
|
||
A core principle of Vortex is that its data types (or `dtypes`) are _logical_ rather than _physical_. | ||
This means that the dtype has no bearing on how the data is actually stored in memory, and is instead used to define | ||
the domain of values an array may hold. | ||
|
||
For example, a `u32` dtype represents an unsigned integer domain with values between `0` and `2^32 - 1`, even though | ||
the underlying array may store values dictionary-encoded, run-length encoded (RLE), or in any other format! | ||
|
||
This principle enables many of Vortex's advanced features. For example, performing compute directly on | ||
compressed data. | ||
|
||
:::{admonition} What is a schema?! | ||
:class: tip | ||
It is worth noting now that Vortex has no concept of a _schema_, instead preferring to use a struct dtype to represent | ||
columnar data. This means you can write a Vortex file containing a single integer array just as well as writing one | ||
with many columns. | ||
::: | ||
|
||
**Owned vs Viewed** | ||
|
||
As with other possibly large recursive data structures in Vortex, dtypes can be either _owned_ or _viewed_. | ||
Owned dtypes are heap-allocated, while viewed dtypes are lazily unwrapped from an underlying FlatBuffer representation. | ||
This allows Vortex to efficiently load and work with very wide data types without needing to deserialize the full type | ||
in memory. | ||
|
||
## Logical Types | ||
|
||
The following table lists the built-in dtypes in Vortex, each of which can be marked as either nullable or non-nullable. | ||
|
||
| Name | Domain | | ||
|-------------|---------------------------------------------| | ||
| `Null` | `null` | | ||
| `Bool` | `true`, `false` | | ||
| `Primitive` | See [Primitive](#primitive) | | ||
| `UTF8` | Variable length valid utf-8 encoded strings | | ||
| `Binary` | Arbitrary variable length bytes | | ||
| `Struct` | See [Struct](#struct) | | ||
| `List` | See [List](#list) | | ||
| `Extension` | See [Extension](#extension) | | ||
|
||
:::{note} | ||
There are additional logical types that Vortex does not yet support, for example fixed-length binary, utf-8, and list | ||
types, as well as a map type. These may be added in future versions. | ||
::: | ||
|
||
### Primitive | ||
|
||
Primitive dtypes are an enumeration of different fixed-width primitive values. | ||
|
||
| Name | Domain | | ||
|-------|-------------------------| | ||
| `I8` | 8-bit signed integer | | ||
| `I16` | 16-bit signed integer | | ||
| `I32` | 32-bit signed integer | | ||
| `I64` | 64-bit signed integer | | ||
| `U8` | 8-bit unsigned integer | | ||
| `U16` | 16-bit unsigned integer | | ||
| `U32` | 32-bit unsigned integer | | ||
| `U64` | 64-bit unsigned integer | | ||
| `F16` | IEEE 754-2008 half | | ||
| `F32` | IEEE 754-1985 single | | ||
| `F64` | IEEE 754-1985 double | | ||
|
||
### Struct | ||
|
||
A `Struct` dtype is an ordered collection of named fields, each of which has its own logical dtype. | ||
|
||
### List | ||
|
||
A `List` dtype has a single _element type_, itself a logical dtype, and represents an array of variable-length | ||
sequences of elements of that type. | ||
|
||
### Extension | ||
|
||
An `Extension` dtype is a logical dtype with an `id`, a `storage` dtype, and a `metadata` field. The `id` and `metadata` | ||
fields together may implicitly restrict the domain of values of the `storage` dtype. | ||
|
||
For example, a `vortex.date` type is logically stored as a `U32` representing the number of days since the Unix epoch. | ||
|
||
## Vs. Arrow | ||
|
||
This section helps those familiar with Apache Arrow to quickly understand the differences vs. Vortex's dtypes. | ||
|
||
* In Arrow, nullability is tied to a {obj}`pyarrow.Field` rather than the data type. | ||
Data types in Vortex instead always define explicit `nullability`. | ||
* In Arrow, there are multiple ways to describe the same logical data type, for example {func}`pyarrow.string` and | ||
{func}`pyarrow.large_string` both represent UTF-8 values. In Vortex, there is a single `UTF8` dtype. | ||
* In Arrow, encoded data is described with additional data types, for example {func}`pyarrow.dictionary`. In Vortex, | ||
encodings are a distinct concept from dtypes. | ||
* In Arrow, date and time types are defined as first-class data types. In Vortex, these are represented as `Extension` | ||
dtypes since that can be composed of other more primitive logical dtypes. | ||
* In Arrow, tables and record batches have a _schema_ that defines the types of the columns. Vortex makes no | ||
distinction between a data type and a schema. Columnar data can be stored with a struct dtype, and integer data can | ||
be stored equally well without a top-level struct. |
Oops, something went wrong.