Tuple Sets
Conceptually, tuple sets flow from one execution graph node to another. In practice, the story is far more complex. Here we discuss the tuple set idea itself and how tuple sets are implemented in Drill. The first thing to realize is that Drill code does not use the term tuple set; instead, the code uses a number of lower-level, implementation-focused terms. We use the tuple set concept as a way to organize those details.
In Drill, a tuple set is implemented as a set of related components:
- Record Batch: conceptually, about the same as a tuple set. In practice, a Drill record batch is both a tuple set and an operator, as we'll see below.
- Schema: describes the set of columns within each tuple.
- Record: the column values for a single tuple. However, records are never realized as such in Drill.
- Value Vector: the underlying columnar data representation of the values for a single column.
- Vector Container: a data structure that holds the set of value vectors for a tuple set.
- Selection Vector: identifies the set of tuples to include when passing a tuple set downstream. (Or, conversely, implies the set of tuples removed by a filter or other operation.)
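To make the relationships among these components concrete, here is a minimal sketch in Java. It is not Drill's code: the class and method names (`TupleSetSketch`, `IntColumn`, `Container`, `select`, `record`) are hypothetical, and it assumes every column holds plain `int` values. It only illustrates the idea that data is stored per column, that a container groups the columns, that a selection vector lists the visible rows, and that a "record" exists only when assembled on demand.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TupleSetSketch {

  /** Hypothetical stand-in for a value vector: all values of one column. */
  static final class IntColumn {
    final int[] values;

    IntColumn(int... values) {
      this.values = values;
    }
  }

  /** Hypothetical stand-in for a vector container: named columns of equal length. */
  static final class Container {
    final Map<String, IntColumn> columns = new LinkedHashMap<>();
    final int rowCount;

    Container(int rowCount) {
      this.rowCount = rowCount;
    }

    void add(String name, IntColumn column) {
      if (column.values.length != rowCount) {
        throw new IllegalArgumentException("column length must equal rowCount");
      }
      columns.put(name, column);
    }

    /** In this sketch the schema is simply the ordered list of column names. */
    List<String> schema() {
      return new ArrayList<>(columns.keySet());
    }
  }

  /** Builds a selection vector: indexes of the rows whose value exceeds the threshold. */
  static int[] select(IntColumn column, int threshold) {
    List<Integer> kept = new ArrayList<>();
    for (int row = 0; row < column.values.length; row++) {
      if (column.values[row] > threshold) {
        kept.add(row);
      }
    }
    int[] selection = new int[kept.size()];
    for (int i = 0; i < selection.length; i++) {
      selection[i] = kept.get(i);
    }
    return selection;
  }

  /** Assembles one record on demand; the record is never stored as a row. */
  static List<Integer> record(Container container, int row) {
    List<Integer> result = new ArrayList<>();
    for (IntColumn column : container.columns.values()) {
      result.add(column.values[row]);
    }
    return result;
  }

  public static void main(String[] args) {
    Container batch = new Container(4);
    batch.add("id", new IntColumn(1, 2, 3, 4));
    batch.add("qty", new IntColumn(10, 0, 25, 5));

    // Keep only the rows where qty > 4; the column data itself is not copied.
    int[] selection = select(batch.columns.get("qty"), 4);
    System.out.println("schema: " + batch.schema());
    for (int row : selection) {
      System.out.println(record(batch, row));
    }
  }
}
```

Note that filtering produces only a list of row indexes; the column arrays are left untouched, which is the point of a selection vector.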
The record batch in Drill can be very confusing to those first learning the code. The term seems to imply, well, a batch of records, which would seem to be the same thing as a tuple set. However, for whatever reason, as Drill evolved, a record batch became both a batch of records and a node in the operator graph. For example, the filter record batch does all of the following:
- Holds a link to the upstream node,
- Calls the upstream node to produce a new tuple set,
- Applies the specified filter operation to the incoming tuple set,
- Holds the outgoing tuple set for use by the downstream node.
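The four responsibilities listed above can be sketched as a single class. This is a simplified illustration, not Drill's actual filter record batch: the `Batch` interface, `ScanBatch`, and `FilterBatch` are hypothetical names, each tuple set is reduced to one `int` column, and `next()` returns a plain boolean rather than whatever status Drill really reports.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.IntPredicate;
import java.util.stream.IntStream;

public class FilterBatchSketch {

  /** Hypothetical interface: an operator node that also exposes its current tuple set. */
  interface Batch {
    boolean next();     // advance to the next tuple set; false when input is exhausted
    int[] values();     // single-column data of the current tuple set
    int[] selection();  // indexes into values() that are visible downstream
  }

  /** Source node that replays pre-built tuple sets. */
  static final class ScanBatch implements Batch {
    private final Iterator<int[]> input;
    private int[] current = new int[0];

    ScanBatch(List<int[]> batches) {
      this.input = batches.iterator();
    }

    public boolean next() {
      if (!input.hasNext()) {
        return false;
      }
      current = input.next();
      return true;
    }

    public int[] values() {
      return current;
    }

    public int[] selection() {
      return IntStream.range(0, current.length).toArray(); // every row is visible
    }
  }

  /** Both an operator (it filters) and a tuple set (it holds the filtered result). */
  static final class FilterBatch implements Batch {
    private final Batch upstream;            // link to the upstream node
    private final IntPredicate predicate;    // the filter operation
    private int[] values = new int[0];       // outgoing data, shared with upstream
    private int[] selection = new int[0];    // outgoing selection vector

    FilterBatch(Batch upstream, IntPredicate predicate) {
      this.upstream = upstream;
      this.predicate = predicate;
    }

    public boolean next() {
      if (!upstream.next()) {                // ask upstream for a new tuple set
        return false;
      }
      values = upstream.values();
      selection = Arrays.stream(upstream.selection())
          .filter(row -> predicate.test(values[row]))  // apply the filter
          .toArray();
      return true;                           // result is held for the downstream node
    }

    public int[] values() {
      return values;
    }

    public int[] selection() {
      return selection;
    }
  }

  public static void main(String[] args) {
    Batch scan = new ScanBatch(List.of(new int[] {1, 8, 3}, new int[] {9, 2}));
    Batch filter = new FilterBatch(scan, v -> v > 2);
    while (filter.next()) {
      System.out.println(Arrays.toString(filter.selection()));
    }
  }
}
```

The downstream node sees `FilterBatch` only through `next()`, `values()`, and `selection()`, which is why the same object can be read either as a processing node or as the tuple set that node produced.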
The details are complex, so we'll work up to them slowly. As we proceed, you'll sometimes think of a Record Batch as a processing node, and sometimes as the tuple set that emerges from that node.