-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: mask #1900
base: develop
Are you sure you want to change the base?
feat: mask #1900
Conversation
Mask sets entries of an array to null. I like the analogy to light: the array is a sequence of lights (each value might be a different wavelength). Null is represented by the absence of light. Placing a mask (i.e. a piece of plastic with slits) over the array causes those values where the mask is present (i.e. "on", "true") to be dark. An example in pseudo-code: ```rust a = [1, 2, 3, 4, 5] a_mask = [t, f, f, t, f] mask(a, a_mask) == [null, 2, 3, null, 5] ``` Specializations --------------- I only fallback to Arrow for two of the core arrays: - Sparse. I was skeptical that I could do better than decompressing and applying it. - Constant. If the mask is sparse, SparseArray might be a good choice. I didn't investigate. For the non-core arrays, I'm missing the following. I'm not clear that I can beat decompression for run end. The others are easy enough but some amount of typing and testing. - fastlanes - fsst - roaring - runend - runend-bool - zigzag Naming ------ Pandas also calls this operation [`mask`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mask.html) but accepts an optional second argument which is an array of values to use instead of null (which makes Pandas' mask more like an `if_else`). Arrow-rs calls this [`nullif`](https://arrow.apache.org/rust/arrow/compute/fn.nullif.html). Arrow-cpp has [`if_else(condition, consequent, alternate)`](https://arrow.apache.org/docs/cpp/compute.html#cpp-compute-scalar-selections) and [`replace_with_mask(array, mask, replacements)`](https://arrow.apache.org/docs/cpp/compute.html#replace-functions) both of which can implement our `mask` by passing a `NullArray` as the third argument.
Nick, dict array no longer uses |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would have been really helpful if this was done one PR per encoding.
array | ||
.patches() | ||
.map(|patches| { | ||
patches.map_values(|values| try_cast(&values, &values.dtype().as_nullable())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem to apply the mask to the patches?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is gone. When #1951 merges, I'll revisit ALP mask.
}; | ||
|
||
Ok(DateTimePartsArray::try_new( | ||
array.dtype().clone().as_nullable(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic seems wrong? You're check for eq_ignore_nullability, but you don't check which direction the cast is, yet you always return a nullable DType.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe pull this into a PR and add tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done: #1946
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And now removed from this PR.
encodings/dict/src/compute/mask.rs
Outdated
FilterIter::Slices(slices) => { | ||
for slice in slices { | ||
for index in slice.0..slice.1 { | ||
codes[index] = T::zero(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this works?
codes[slice].fill(T::zero())
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this works: codes[slice.0..slice.1].fill(T::zero())
(it needed to be a Range
). Thanks, pushed!
encodings/dict/src/compute/mask.rs
Outdated
|
||
impl MaskFn<DictArray> for DictEncoding { | ||
fn mask(&self, array: &DictArray, mask: FilterMask) -> VortexResult<ArrayData> { | ||
let new_codes = match_each_integer_ptype!(array.codes_ptype(), |$T| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't you also have to check that the array was nullable before? Otherwise zero may not be the null value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. This is fixed in 373ae02 and a test was added.
impl CastFn<BoolArray> for BoolEncoding { | ||
fn cast(&self, array: &BoolArray, dtype: &DType) -> VortexResult<ArrayData> { | ||
let DType::Bool(new_nullability) = dtype else { | ||
vortex_bail!(MismatchedTypes: "bool type", dtype); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think an error message saying "cannot cast from X to Y" is probably better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in e6bf8f0
impl MaskFn<ExtensionArray> for ExtensionEncoding { | ||
fn mask(&self, array: &ExtensionArray, filter_mask: FilterMask) -> VortexResult<ArrayData> { | ||
let DType::Extension(ext_dtype) = array.dtype() else { | ||
vortex_bail!("extension array must have extension dtype"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't ExtensionArray have a ext_dtype()
function? If not, it should. And it should panic, rather than return a result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
array.names().clone(), | ||
array | ||
.children() | ||
.zip_eq(sdtype.dtypes()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like wrong (or at the very least, undocumented) behavior. So we cast each field to the positionally equivalent DType in the destination type? But we preserve the names of the original array.
I would expect cast(array, dtype)?.dtype() == dtype
. In fact, can you add this assertion in the cast entry-point function.=
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I now require the names to match and added a test that shuffling the name order causes the cast to fail.
VarBinArray::try_new( | ||
array.offsets(), | ||
array.bytes(), | ||
array.dtype().with_nullability(*nullability), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You match on the target DType, but then you just change the nullability of the source DType. So VarBin(Binary).cast(UTF8?) => VarBin(Binary?)
??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
VarBinViewArray::try_new( | ||
array.views(), | ||
array.buffers().collect(), | ||
array.dtype().with_nullability(*nullability), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same problem here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
vortex-array/src/compute/filter.rs
Outdated
@@ -251,7 +275,7 @@ impl FilterMask { | |||
self.boolean_buffer().cloned() | |||
} | |||
|
|||
fn boolean_buffer(&self) -> VortexResult<&BooleanBuffer> { | |||
pub fn boolean_buffer(&self) -> VortexResult<&BooleanBuffer> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this pub now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is now already public in develop.
Deploying vortex-bench with
|
Latest commit: |
49d4a0d
|
Status: | ✅ Deploy successful! |
Preview URL: | https://1cd8da2e.vortex-bench.pages.dev |
Branch Preview URL: | https://dk-mask.vortex-bench.pages.dev |
Alright @gatesn , this is ready for another look. |
Mask sets entries of an array to null. I like the analogy to light: the array is a sequence of lights (each value might be a different wavelength). Null is represented by the absence oflight. Placing a mask (i.e. a piece of plastic with slits) over the array causes those values where the mask is present (i.e. "on", "true") to be dark.
An example in pseudo-code:
Specializations
I only fallback to Arrow for two of the core arrays:
For the non-core arrays, I'm missing the following. I'm not clear that I can beat decompression forrun end. The others are easy enough but some amount of typing and testing.
Naming
Pandas also calls this operation
mask
but accepts an optional second argument which is an array of values to use instead of null (which makes Pandas' mask more like anif_else
).Arrow-rs calls this
nullif
.Arrow-cpp has
if_else(condition, consequent, alternate)
andreplace_with_mask(array, mask, replacements)
both of which can implement ourmask
by passing aNullArray
as the third argument.Follow ups: