|
| 1 | +# Vortex Arrays |
| 2 | + |
| 3 | +An array is the in-memory representation of data in Vortex. It has a [length](#length), a [data type](#data-type), an |
| 4 | +[encoding](#encodings), some number of [children](#children), and some number of [buffers](#buffers). |
| 5 | +All arrays in Vortex are represented by an `ArrayData`, which in pseudo-code looks something like this: |
| 6 | + |
| 7 | +```rust |
| 8 | +struct ArrayData { |
| 9 | +    encoding: Encoding, |
| 10 | +    dtype: DType, |
| 11 | +    len: usize, |
| 12 | +    metadata: ByteBuffer, |
| 13 | +    children: [ArrayData], |
| 14 | +    buffers: [ByteBuffer], |
| 15 | +    statistics: Statistics, |
| 16 | +} |
| 17 | +``` |
| 18 | + |
| 19 | +This document goes into detail about each of these fields as well as the mechanics behind the encoding vtables. |
| 20 | + |
| 21 | +**Owned vs Viewed** |
| 22 | + |
| 23 | +As with other possibly large recursive data structures in Vortex, arrays can be either _owned_ or _viewed_. |
| 24 | +Owned arrays are heap-allocated, while viewed arrays are lazily unwrapped from an underlying FlatBuffer representation. |
| 25 | +This allows Vortex to efficiently load and work with very wide schemas without needing to deserialize the full array |
| 26 | +in memory. |
| 27 | + |
| 28 | +This abstraction is hidden from users inside an `ArrayData` object. |
| 29 | + |
| 30 | +## Encodings |
| 31 | + |
| 32 | +An encoding acts as the virtual function table (vtable) for an `ArrayData`. |
| 33 | + |
| 34 | +### VTable |
| 35 | + |
| 36 | +The full vtable definition is quite expansive, is split across many Rust traits, and has many optional functions. Here |
| 37 | +is an overview: |
| 38 | + |
| 39 | +* `id`: returns the unique identifier for the encoding. |
| 40 | +* `validate`: validates the array's buffers and children after loading from disk. |
| 41 | +* `accept`: a function for accepting an `ArrayVisitor` and walking the array's children. |
| 42 | +* `into_canonical`: decodes the array into a canonical encoding. |
| 43 | +* `into_arrow`: decodes the array into an Arrow array. |
| 44 | +* `metadata` |
| 45 | +    * `validate`: validates the array's metadata buffer. |
| 46 | +    * `display`: returns a human-readable representation of the array metadata. |
| 47 | +* `validity` |
| 48 | +    * `is_valid`: returns whether the element at a given row is valid. |
| 49 | +    * `logical_validity`: returns the validity bit-mask for an array, indicating which values are non-null. |
| 50 | +* `compute`: a collection of compute function vtables. |
| 51 | +    * `filter`: a function for filtering the array using a given selection mask. |
| 52 | +    * ... |
| 53 | +* `statistics`: a function for computing a statistic for the array data, for example `min`. |
| 54 | +* `variants`: a collection of optional DType-specific functions for operating over the array. |
| 55 | +    * `struct`: functions for operating over arrays with a `StructDType`. |
| 56 | +        * `get_field`: returns the array for a given field of the struct. |
| 57 | +        * ... |
| 58 | +    * ... |
| 59 | + |
| 60 | +Encoding vtables can even be constructed from non-static sources, such as _WebAssembly_ modules, which enables the |
| 61 | +[forward compatibility](/specs/file-format.md#forward-compatibility) feature of the Vortex File Format. |
| 62 | + |
| 63 | +See the [Writing an Encoding](/rust/writing-an-encoding) guide for more information. |
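
To make the vtable idea concrete, the following is a minimal sketch of dispatching through an encoding object. The names `ToyEncoding`, `ToyArray`, `PlainEncoding`, `ConstantEncoding`, and `canonicalize` are hypothetical and are not Vortex's actual traits or types; the real vtable is split across many traits and has many more functions.

```rust
/// Hypothetical, heavily simplified stand-in for an encoding vtable.
trait ToyEncoding {
    /// Unique identifier for the encoding.
    fn id(&self) -> &'static str;
    /// Decode the array into plain values (a stand-in for `into_canonical`).
    fn decode(&self, array: &ToyArray) -> Vec<i64>;
}

/// Toy array: a reference to its encoding, a length, and one buffer of values.
struct ToyArray {
    encoding: &'static dyn ToyEncoding,
    len: usize,
    values: Vec<i64>,
}

struct PlainEncoding;
struct ConstantEncoding;

// One shared instance per encoding, playing the role of a static vtable.
static PLAIN: PlainEncoding = PlainEncoding;
static CONSTANT: ConstantEncoding = ConstantEncoding;

impl ToyEncoding for PlainEncoding {
    fn id(&self) -> &'static str {
        "toy.plain"
    }

    fn decode(&self, array: &ToyArray) -> Vec<i64> {
        // Values are already stored in decoded form.
        array.values.clone()
    }
}

impl ToyEncoding for ConstantEncoding {
    fn id(&self) -> &'static str {
        "toy.constant"
    }

    fn decode(&self, array: &ToyArray) -> Vec<i64> {
        // A constant array stores a single value and repeats it `len` times.
        match array.values.first() {
            Some(&value) => vec![value; array.len],
            None => Vec::new(),
        }
    }
}

/// Callers never inspect the encoding directly; they dispatch through it.
fn canonicalize(array: &ToyArray) -> Vec<i64> {
    array.encoding.decode(array)
}
```

In this sketch, a `ToyArray` built with `&CONSTANT`, `len: 3`, and `values: vec![7]` decodes to `[7, 7, 7]` without the caller knowing which encoding was used.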
| 64 | + |
| 65 | +### Canonical Encodings |
| 66 | + |
| 67 | +Each logical data type in Vortex has an associated canonical encoding. All encodings must support decompression into |
| 68 | +their canonical form. |
| 69 | + |
| 70 | +Note that Vortex also supports decompressing into intermediate encodings, such as dictionary encoding, which may be |
| 71 | +better suited to a particular operation or compute engine. |
| 72 | + |
| 73 | +The canonical encodings support **zero-copy** conversion to and from _Apache Arrow_ arrays. |
| 74 | + |
| 75 | +| Data Type | Canonical Encoding | |
| 76 | +|--------------------|----------------------| |
| 77 | +| `DType::Null` | `NullEncoding` | |
| 78 | +| `DType::Bool` | `BoolEncoding` | |
| 79 | +| `DType::Primitive` | `PrimitiveEncoding` | |
| 80 | +| `DType::UTF8` | `VarBinViewEncoding` | |
| 81 | +| `DType::Binary` | `VarBinViewEncoding` | |
| 82 | +| `DType::Struct` | `StructEncoding` | |
| 83 | +| `DType::List` | `ListEncoding` | |
| 84 | +| `DType::Extension` | `ExtensionEncoding` | |
| 85 | + |
| 86 | +(data-type)= |
| 87 | + |
| 88 | +## Data Type |
| 89 | + |
| 90 | +The array's [data type](/concepts/dtypes) is a logical definition of the data held within the array and does not |
| 91 | +confer any specific meaning on the array's children or buffers. |
| 92 | + |
| 93 | +Another way to think about logical data types is that they represent the type of the scalar value you might read |
| 94 | +out of the array. |
| 95 | + |
| 96 | +## Length |
| 97 | + |
| 98 | +The length of an array can almost always be inferred by the encoding from its children and buffers. But given how |
| 99 | +important the length is for many operations, it is stored directly in the `ArrayData` object for faster access. |
| 100 | + |
| 101 | +## Metadata |
| 102 | + |
| 103 | +Each array can store a small amount of metadata in the form of a byte buffer. This is typically not much more than |
| 104 | +8 bytes and does not have any alignment guarantees. This is used by encodings to store any additional information they |
| 105 | +might need in order to access their children or buffers. |
| 106 | + |
| 107 | +For example, a dictionary encoding stores the length of its `values` child, and the primitive type of its `codes` child. |
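
As an illustration only, the dictionary metadata described above could be packed into a handful of bytes as sketched here. The `DictMetadata` struct, its field widths, and the little-endian layout are assumptions made for this example and do not describe Vortex's actual metadata format.

```rust
/// Hypothetical dictionary metadata: the length of the `values` child and a
/// one-byte tag identifying the primitive type of the `codes` child.
#[derive(Debug, PartialEq)]
struct DictMetadata {
    values_len: u32,
    codes_ptype: u8,
}

impl DictMetadata {
    /// Serialize into 5 bytes: 4 bytes of little-endian length, then the type tag.
    fn to_bytes(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(5);
        out.extend_from_slice(&self.values_len.to_le_bytes());
        out.push(self.codes_ptype);
        out
    }

    /// Parse the 5-byte form, returning `None` if the length is wrong.
    fn from_bytes(bytes: &[u8]) -> Option<DictMetadata> {
        if bytes.len() != 5 {
            return None;
        }
        // Read byte by byte instead of casting a pointer, because the
        // metadata buffer has no alignment guarantees.
        let values_len = u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]);
        Some(DictMetadata {
            values_len,
            codes_ptype: bytes[4],
        })
    }
}
```

Reading the integer byte by byte keeps the parser correct regardless of where the metadata buffer happens to sit in memory.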
| 108 | + |
| 109 | +## Children |
| 110 | + |
| 111 | +Arrays can have some number of child arrays. These differ from buffers in that they are logically typed, meaning the |
| 112 | +encoding cannot make assumptions about the layout of these children when implementing its vtable. |
| 113 | + |
| 114 | +Dictionary encoding is an example of where child arrays might be used, with one array representing the unique |
| 115 | +dictionary values and another array representing the codes indexing into those values. |
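
The dictionary example can be sketched with a toy recursive type. The `Node` enum and its `decode` method are hypothetical names for this illustration, not Vortex APIs. The point is that the dictionary decodes each child through the child's own logic and makes no assumption about how that child is laid out.

```rust
/// Hypothetical toy array: either a leaf of plain values, or a dictionary
/// with two child arrays.
enum Node {
    Primitive(Vec<u32>),
    Dict { values: Box<Node>, codes: Box<Node> },
}

impl Node {
    /// Decode the array into plain values.
    fn decode(&self) -> Vec<u32> {
        match self {
            Node::Primitive(values) => values.clone(),
            Node::Dict { values, codes } => {
                // Each child decodes itself, so either child could be a
                // dictionary (or any other variant) in its own right.
                let values = values.decode();
                let codes = codes.decode();
                // Each code indexes into the unique values.
                // An out-of-range code panics in this sketch.
                codes.iter().map(|&code| values[code as usize]).collect()
            }
        }
    }
}
```

For instance, a `Node::Dict` with values `[10, 20, 30]` and codes `[2, 0, 0, 1]` decodes to `[30, 10, 10, 20]`.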
| 116 | + |
| 117 | +## Buffers |
| 118 | + |
| 119 | +Buffers store binary data with a declared alignment. They act as the terminal nodes in the recursive structure of |
| 120 | +an array. |
| 121 | + |
| 122 | +They are not considered by the recursive compressor, although general-purpose compression may still be used |
| 123 | +at write-time. |
| 124 | + |
| 125 | +For example, a bit-packed array stores packed integers in binary form. These would be stored in a buffer with an |
| 126 | +alignment sufficient for SIMD unpacking operations. |
| 127 | + |
| 128 | +## Statistics |
| 129 | + |
| 130 | +Arrays carry their own statistics with them, allowing many compute functions to short-circuit or optimise their |
| 131 | +implementations. Currently, the available statistics are: |
| 132 | + |
| 133 | +- `null_count`: The number of null values in the array. |
| 134 | +- `true_count`: The number of `true` values in a boolean array. |
| 135 | +- `run_count`: The number of consecutive runs in an array. |
| 136 | +- `is_constant`: Whether the array only holds a single unique value. |
| 137 | +- `is_sorted`: Whether the array values are sorted. |
| 138 | +- `is_strict_sorted`: Whether the array values are sorted and unique. |
| 139 | +- `min`: The minimum value in the array. |
| 140 | +- `max`: The maximum value in the array. |
| 141 | +- `uncompressed_size`: The size of the array in memory before any compression. |
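
The following sketch shows how a compute function might short-circuit using statistics like those listed above. `ToyStats`, `ToyArray`, and `array_min` are hypothetical names, the sorted order is assumed to be ascending, and nulls are ignored for simplicity; this is not how Vortex implements `min`.

```rust
/// Hypothetical cached statistics; `None` means "not yet computed".
#[derive(Default)]
struct ToyStats {
    is_sorted: Option<bool>,
    min: Option<i64>,
}

/// Toy array holding plain values alongside its statistics.
struct ToyArray {
    values: Vec<i64>,
    stats: ToyStats,
}

/// Returns the minimum value, consulting statistics before scanning.
fn array_min(array: &ToyArray) -> Option<i64> {
    // 1. A cached `min` answers the question with no work at all.
    if let Some(min) = array.stats.min {
        return Some(min);
    }
    // 2. If the array is known to be sorted (ascending), the first element
    //    is the minimum.
    if array.stats.is_sorted == Some(true) {
        return array.values.first().copied();
    }
    // 3. Otherwise fall back to a full scan.
    array.values.iter().copied().min()
}
```

Only the final branch touches every element; the first two return in constant time.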
| 142 | + |