Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: mask #1900

Open
wants to merge 27 commits into
base: develop
Choose a base branch
from
Open

feat: mask #1900

wants to merge 27 commits into from

Conversation

danking
Copy link
Member

@danking danking commented Jan 10, 2025

Mask sets entries of an array to null. I like the analogy to light: the array is a sequence of lights (each value might be a different wavelength). Null is represented by the absence oflight. Placing a mask (i.e. a piece of plastic with slits) over the array causes those values where the mask is present (i.e. "on", "true") to be dark.

An example in pseudo-code:

a = [1, 2, 3, 4, 5]
a_mask = [t, f, f, t, f]
mask(a, a_mask) == [null, 2, 3, null, 5]

Specializations

I only fallback to Arrow for two of the core arrays:

  • Sparse. I was skeptical that I could do better than decompressing and applying it.
  • Constant. If the mask is sparse, SparseArray might be a good choice. I didn't investigate.
  • Extension. (Now a follow up due to comments)

For the non-core arrays, I'm missing the following. I'm not clear that I can beat decompression forrun end. The others are easy enough but some amount of typing and testing.

  • fastlanes
  • fsst
  • roaring
  • runend
  • runend-bool
  • zigzag
  • ALP (Now a follow up due to comments)

Naming

Pandas also calls this operation mask but accepts an optional second argument which is an array of values to use instead of null (which makes Pandas' mask more like an if_else).

Arrow-rs calls this nullif.

Arrow-cpp has if_else(condition, consequent, alternate) and replace_with_mask(array, mask, replacements) both of which can implement our mask by passing a NullArray as the third argument.

Follow ups:

  1. Extension array masking.
  2. ALP array masking.
  3. Add an expression for masking.
  4. Assess the situation with nested structs in files and make sure we're preserving validity correctly.

Mask sets entries of an array to null. I like the analogy to light: the array is a sequence of
lights (each value might be a different wavelength). Null is represented by the absence of
light. Placing a mask (i.e. a piece of plastic with slits) over the array causes those values where
the mask is present (i.e. "on", "true") to be dark.

An example in pseudo-code:

```rust
a = [1, 2, 3, 4, 5]
a_mask = [t, f, f, t, f]
mask(a, a_mask) == [null, 2, 3, null, 5]
```

Specializations
---------------

I only fallback to Arrow for two of the core arrays:

- Sparse. I was skeptical that I could do better than decompressing and applying it.
- Constant. If the mask is sparse, SparseArray might be a good choice. I didn't investigate.

For the non-core arrays, I'm missing the following. I'm not clear that I can beat decompression for
run end. The others are easy enough but some amount of typing and testing.

- fastlanes
- fsst
- roaring
- runend
- runend-bool
- zigzag

Naming
------

Pandas also calls this operation
[`mask`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mask.html) but accepts an
optional second argument which is an array of values to use instead of null (which makes Pandas'
mask more like an `if_else`).

Arrow-rs calls this [`nullif`](https://arrow.apache.org/rust/arrow/compute/fn.nullif.html).

Arrow-cpp has [`if_else(condition, consequent,
alternate)`](https://arrow.apache.org/docs/cpp/compute.html#cpp-compute-scalar-selections) and
[`replace_with_mask(array, mask,
replacements)`](https://arrow.apache.org/docs/cpp/compute.html#replace-functions) both of which can
implement our `mask` by passing a `NullArray` as the third argument.
@danking danking changed the title Dk/mask feat: mask Jan 13, 2025
@danking
Copy link
Member Author

danking commented Jan 13, 2025

Nick, dict array no longer uses mul. It just iterates the indices of the mask.

@danking danking marked this pull request as ready for review January 13, 2025 17:43
@danking danking enabled auto-merge (squash) January 13, 2025 20:18
Copy link
Contributor

@gatesn gatesn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would have been really helpful if this was done one PR per encoding.

array
.patches()
.map(|patches| {
patches.map_values(|values| try_cast(&values, &values.dtype().as_nullable()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem to apply the mask to the patches?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is gone. When #1951 merges, I'll revisit ALP mask.

};

Ok(DateTimePartsArray::try_new(
array.dtype().clone().as_nullable(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic seems wrong? You're check for eq_ignore_nullability, but you don't check which direction the cast is, yet you always return a nullable DType.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe pull this into a PR and add tests

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done: #1946

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And now removed from this PR.

FilterIter::Slices(slices) => {
for slice in slices {
for index in slice.0..slice.1 {
codes[index] = T::zero();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this works?
codes[slice].fill(T::zero())

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this works: codes[slice.0..slice.1].fill(T::zero()) (it needed to be a Range). Thanks, pushed!


impl MaskFn<DictArray> for DictEncoding {
fn mask(&self, array: &DictArray, mask: FilterMask) -> VortexResult<ArrayData> {
let new_codes = match_each_integer_ptype!(array.codes_ptype(), |$T| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you also have to check that the array was nullable before? Otherwise zero may not be the null value.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. This is fixed in 373ae02 and a test was added.

impl CastFn<BoolArray> for BoolEncoding {
fn cast(&self, array: &BoolArray, dtype: &DType) -> VortexResult<ArrayData> {
let DType::Bool(new_nullability) = dtype else {
vortex_bail!(MismatchedTypes: "bool type", dtype);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think an error message saying "cannot cast from X to Y" is probably better.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in e6bf8f0

impl MaskFn<ExtensionArray> for ExtensionEncoding {
fn mask(&self, array: &ExtensionArray, filter_mask: FilterMask) -> VortexResult<ArrayData> {
let DType::Extension(ext_dtype) = array.dtype() else {
vortex_bail!("extension array must have extension dtype");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't ExtensionArray have a ext_dtype() function? If not, it should. And it should panic, rather than return a result.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reverted the extension changes in 757d1b8. I will revisit when #1965 (casting) merges.

array.names().clone(),
array
.children()
.zip_eq(sdtype.dtypes())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like wrong (or at the very least, undocumented) behavior. So we cast each field to the positionally equivalent DType in the destination type? But we preserve the names of the original array.

I would expect cast(array, dtype)?.dtype() == dtype. In fact, can you add this assertion in the cast entry-point function.=

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I now require the names to match and added a test that shuffling the name order causes the cast to fail.

VarBinArray::try_new(
array.offsets(),
array.bytes(),
array.dtype().with_nullability(*nullability),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You match on the target DType, but then you just change the nullability of the source DType. So VarBin(Binary).cast(UTF8?) => VarBin(Binary?) ??

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

VarBinViewArray::try_new(
array.views(),
array.buffers().collect(),
array.dtype().with_nullability(*nullability),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same problem here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@@ -251,7 +275,7 @@ impl FilterMask {
self.boolean_buffer().cloned()
}

fn boolean_buffer(&self) -> VortexResult<&BooleanBuffer> {
pub fn boolean_buffer(&self) -> VortexResult<&BooleanBuffer> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this pub now?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now already public in develop.

@danking danking requested a review from gatesn January 21, 2025 12:15
Copy link

cloudflare-workers-and-pages bot commented Feb 3, 2025

Deploying vortex-bench with  Cloudflare Pages  Cloudflare Pages

Latest commit: 49d4a0d
Status: ✅  Deploy successful!
Preview URL: https://1cd8da2e.vortex-bench.pages.dev
Branch Preview URL: https://dk-mask.vortex-bench.pages.dev

View logs

@danking
Copy link
Member Author

danking commented Feb 3, 2025

Alright @gatesn , this is ready for another look.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants