Skip to content

Implement scalar_min and scalar_max for A: Ord #512

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
sebasv opened this issue Oct 24, 2018 · 11 comments
Closed

Implement scalar_min and scalar_max for A: Ord #512

sebasv opened this issue Oct 24, 2018 · 11 comments

Comments

@sebasv
Copy link
Contributor

sebasv commented Oct 24, 2018

Suggested by @jturner314:
I'd just do it in terms of fold with something like this (taking advantage of the first method from PR #507):

impl<A, S, D> ArrayBase<S, D>
where
    S: Data<Elem = A>,
    D: Dimension,
{
    /// Returns the minimum element, or `None` if the array is empty.
    fn scalar_min(&self) -> Option<&A>
    where
        A: Ord,
    {
        let first = self.first()?;
        Some(self.fold(first, |acc, x| acc.min(x)))
    }
}

We don't need to manually unroll this because the compiler does a good job automatically (checked with Compiler Explorer using the -O compiler option).

The desired behavior for floating-point types depends on the use-case because of NaN. One option is

arr.fold(::std::f64::NAN, |acc, &x| acc.min(x))

which ignores NaN values. (It returns NaN only if there are no non-NaN values.) The compiler does a decent job automatically unrolling this, so we don't need to manually unroll in this case either.

Originally posted by @jturner314 in #505 (comment)

@sebasv
Copy link
Contributor Author

sebasv commented Oct 24, 2018

I haven't found a neat way yet to implement this both for A: Ord and f64, ideas are welcome!

@sebasv
Copy link
Contributor Author

sebasv commented Oct 24, 2018

Suggested api:

/// return a minimum element from `self`, ignoring unorderable elements 
/// (such as `f64::NAN`)
/// **Panics** if `self` is empty
pub fn scalar_min(&self) -> &A where A: PartialOrd;

/// return a minimum element from `self`, or `None` if `self` is empty or 
/// contains unorderable elements.
pub fn scalar_partial_min(&self) -> Option<&A> where A: PartialOrd;

This does not leverage the additional strictness of A: Ord but offers more flexibility and targets more practical use cases in my opinion.

@LukeMathWalker
Copy link
Member

It would be nice to make the API naming convention consistent with what we are working on here! - rust-ndarray/ndarray-stats#1

@sebasv
Copy link
Contributor Author

sebasv commented Oct 25, 2018

I like the convention (nan)min but these names overlook the fact that ndarray can be used with self-defined types (or anything non-float for that matter). Although I don't see myself ever use ndarray for anything else than floats, I can't speak for others. What fits best with ndarrays philosophy?

@jturner314
Copy link
Member

I've been thinking about the min/max problem for a while. Originally, I planned on providing three methods for each of min and max. After some more thought, though, I think just two would be better:

/// Finds the elementwise minimum of the array.
///
/// Returns `None` if any of the pairwise orderings tested by the function
/// are undefined. (For example, this occurs if there are any
/// floating-point NaN values in the array.)
///
/// Additionally, returns `None` if the array is empty.
fn min(&self) -> Option<&A>
where
    A: PartialOrd;

/// Finds the elementwise minimum of the array, skipping NaN values.
///
/// **Warning** This method will return a NaN value if none of the values
/// in the array are non-NaN values. Note that the NaN value might not be
/// in the array.
fn min_skipnan(&self) -> &A
where
    A: MaybeNan,
    A::NotNan: Ord;

It is tempting to keep that third method

/// Finds the elementwise minimum of the array.
///
/// **Panics** if the array is empty.
fn min(&self) -> &A
where
    A: Ord;

because it returns &A which is more convenient than Option<&A>. However, the panic on empty arrays is a pretty big flaw.

One thing that's always bothered me is that a lot of methods in ndarray panic instead of returning a Result/Option. I didn't have a good way to decide between panicking and returning a Result/Option. After thinking about it some more, I like this rule as a starting point for making that decision:

A method must not panic based solely on the current state of the array (e.g. element values or shape). In other words, the condition for a panic must be based on one or more argument values passed to the method.

Note: Panicking in a method when self.len_of(axis) == 0 is allowed by this rule if axis is a parameter of the method, but should be avoided.

The reasoning behind this rule is that it tries to limit the distance in the code between the root cause of a panic and the place the panic actually occurs. In other words, when calling a method, the programmer generally has a good idea of the possible argument values because they are usually created near the call. In contrast, the array may have been created far away and undergone a complex series of operations before reaching this call. The mental overhead required to meet a constraint like "the length of this axis must be nonzero" or "the array must not be empty" can be fairly large. Of course, the programmer could manually add a check in their own code to avoid the panic, but in that case it would be more convenient for the array method to return a Result/Option so that they wouldn't have to perform the check themselves.

Additionally, I have more trouble remembering constraints like "the array must not be empty" unless I've recently read the docs, because they aren't directly related to the arguments passed into the method. By returning an Option or Result when a constraint like this is violated, we're reminding the user that they have to consider the constraint.

AFAICT, all of the currently existing methods on ArrayBase meet this rule. The same is true for Vec, with the exception of the length overflowing usize on some operations (which should basically never happen in practice).

Anyway, back on topic, that third method shown above violates the rule and IMO should be avoided.

One note of interest:

/// return a minimum element from `self`, ignoring unorderable elements 
/// (such as `f64::NAN`)
/// **Panics** if `self` is empty
pub fn scalar_min(&self) -> &A where A: PartialOrd;

It turns out that it's impossible to implement this with only the A: PartialOrd constraint. Consider the case [NAN, 1., NAN]. Ignoring NANs, the minimum is 1.. However, there's no way to determine that using only PartialOrd because for any pair of elements a and b in this array, (a < b) == false, (a == b) == false, (a > b) == false, and a.partial_cmp(&b) == None.

That's why for this type of method I defined a MaybeNan trait that can be used to convert A to Option<A::NotNan> and required that A::NotNan: Ord.

@sebasv
Copy link
Contributor Author

sebasv commented Oct 28, 2018

Regarding the last issue, a first valid element can be found by a.partial_cmp(&a) because this also returns None for NAN. This might affect the loop unrolling though.

I also like your point on Options vs panic!s. If min_skipnan(&self) -> &A is called on an empty array, it still has to panic though, because we can't be sure what to return (for example when A is u32), so also here an Option<&A> might be nicer. This fits in the panic philosophy: let the user's unwrap do the panic!ing.

@sebasv
Copy link
Contributor Author

sebasv commented Oct 28, 2018

Loop unrolling alone works fine, but my attempts to do an early return upon finding a NAN all break loop unrolling. Since early returns only result in probabilistic speedups, and only in specific use cases, I decided against early returns for now.

@jturner314
Copy link
Member

Regarding the last issue, a first valid element can be found by a.partial_cmp(&a) because this also returns None for NAN. This might affect the loop unrolling though.

Good point! I didn't think about comparing values to themselves. Unfortunately, while that may work for f32 and f64, it's not valid for all A: PartialOrd. Consider a custom type that's similar to f32/f64 but considers all NAN values to be equal to each other:

use std::cmp::{Ordering, PartialEq, PartialOrd};
use std::f64::NAN;

pub struct MyF64(pub f64);

impl PartialEq for MyF64 {
    // This implementation is symmetric and transitive, so it meets the
    // requirements of `PartialEq`.
    fn eq(&self, other: &MyF64) -> bool {
        if self.0.is_nan() && other.0.is_nan() {
            true
        } else {
            self.0.eq(&other.0)
        }
    }
}

impl PartialOrd for MyF64 {
    // This implementation is antisymmetric and transitive, so it meets the
    // requirements of `PartialOrd`.
    fn partial_cmp(&self, other: &MyF64) -> Option<Ordering> {
        if self.0.is_nan() && other.0.is_nan() {
            Some(Ordering::Equal)
        } else {
            self.0.partial_cmp(&other.0)
        }
    }
}

fn main() {
    let a = [MyF64(1.), MyF64(NAN), MyF64(NAN)];
    for i in 0..3 {
        for j in 0..3 {
            println!(
                "a[{}].partial_cmp(&a[{}]) == {:?}",
                i, j, a[i].partial_cmp(&a[j]),
            );
        }
    }
}

The output of main is

a[0].partial_cmp(&a[0]) == Some(Equal)
a[0].partial_cmp(&a[1]) == None
a[0].partial_cmp(&a[2]) == None
a[1].partial_cmp(&a[0]) == None
a[1].partial_cmp(&a[1]) == Some(Equal)
a[1].partial_cmp(&a[2]) == Some(Equal)
a[2].partial_cmp(&a[0]) == None
a[2].partial_cmp(&a[1]) == Some(Equal)
a[2].partial_cmp(&a[2]) == Some(Equal)

So, given the result of all possible comparisons in this example, we can divide the array into two disjoint subsets, {a[0]} and {a[1], a[2]}, such that the elements within each subset are equal to each other. However, we don't know which of the two subsets has the lesser/greater elements. Additionally, while we may suspect that one of the subsets contains all NAN values, we don't have any way to determine which one.

If min_skipnan(&self) -> &A is called on an empty array, it still has to panic though, because we can't be sure what to return (for example when A is u32), so also here an Option<&A> might be nicer.

If min_skipnan requires A: MaybeNan or A: Float, MaybeNan or Float can have a function that returns a NAN value (whatever NaN means for the type A). So, if the array is empty (or if all of the array elements are NAN), min_skipnan can return NAN. In other words, a NAN result from min_skipnan is analogous to a None result from min.

That's how I decided to do it in my initial implementation for ndarray-stats. I'm still not entirely satisfied with that approach, but it is convenient for f32/f64.

For the purpose of ndarray, I lean towards just providing

/// Finds the elementwise minimum of the array.
///
/// Returns `None` if any of the pairwise orderings tested by the function
/// are undefined. (For example, this occurs if there are any
/// floating-point NaN values in the array.)
///
/// Additionally, returns `None` if the array is empty.
fn min(&self) -> Option<&A>
where
    A: PartialOrd,
{
    self.fold(self.first(), |acc, elem| match elem.partial_cmp(acc?)? {
        Ordering::Less => Some(elem),
        _ => acc,
    })
}

(and a similar max) in ndarray itself, and leave sum_skipnan, mean_skipnan, sum_axis_skipnan, mean_axis_skipnan, min_skipnan, max_skipnan, etc., for a separate crate to refine the API. @LukeMathWalker and I are working on an ndarray-stats crate that provides some of this functionality.

By the way, I remembered that a related discussion is #461.

@sebasv
Copy link
Contributor Author

sebasv commented Oct 30, 2018

You raise a good point. The specification of PartialOrd is close, but not equal to the mathematical definition of a partial ordering. Reflexivity is missing from PartialOrd, which is why PartialOrd also allows elements that can't be ordered at all. The problem here thus starts with the definition of what a minimal or maximal element is.

I would argue for the following. It is both the most semantically correct, and arguably the most intuitive for users without a background in order theory:

  • If the elements in the array have a total ordering, a reference to one of the maximum values should be returned for both nanmax and max. There may be multiple candidates to be referenced, but their values will coincide.
  • If the elements are partially ordered in the mathematical sense (like your example), a reference to a maximal value in the mathematical sense should be returned. I.e. the referenced value is the maximum of one of the totally ordered subsets that comprise the array. Both nanmax and max make no promises on which subset they consider: this may or may not be an atomic set like {NAN} in your example. This is a requirement since there is no not-arbitrary way to discriminate between these subsets.
  • If the elements are partially ordered in the loose sense, meaning there are elements that have no order with respect to themselves (such as NANs), then max will return None to inform you of this fact. nanmax will ignore the unorderable elements and treat the array as one of the 2 cases above, whichever is appropriate.

It might be tempting in case 2 to search for the maximal value of the biggest totally ordered subset. There is however no way to determine if that is what the user wants and it will likely introduce a significant performance hit.

If you agree, I wil try to put this in clear wording in the docstrings and write tests to verify this behavior.

@jturner314
Copy link
Member

jturner314 commented Nov 1, 2018

That proposal for max is appealing because the name max is similar to the name of the mathematical concept "maximal element". However, maybe it's because I'm not familiar with order theory, but I would expect a method named max to return the "greatest element" (a.k.a. "maximum" for a totally-ordered set), not a "maximal element". In other words, I would expect a method named max to follow behavior 1, not 2:

  1. Returns the greatest element. If any of the tested comparisons is None or the array is empty, returns None.

    Aside: Checking if any of the tested comparisons is None is sufficient to determine if the result is actually the "greatest element" in the set of elements in the array because PartialOrd requires transitivity.

    Aside2: If all of the tested comparisons are Some, can we say that the set is totally ordered due to transitivity of <, >, and == and antisymmetry of < and >? I think the answer is "yes, except for reflexivity unless we compare all elements to themselves".

    Aside3: A weird edge case is when the array has a single element, and that element is NaN. I think the implementation should make sure to compare the first element to itself to return None in this case.

  2. Returns a maximal element (a value in the array that is not less than any other value in the array). If the comparison of any value with itself is None or the array is empty, returns None.

The behavior is only different when there are elements that are equal to themselves but have undefined ordering with respect to some other elements. The most plausible example I can think of for this case is an enum where the ordering is defined only between values of the same variant. For example, in my py_literal crate, I have an enum Value that represents a Python literal. If I wanted to define PartialOrd for this type, I might do it like this:

use std::cmp::{Ordering, PartialOrd};

#[derive(Clone, Debug, PartialEq)]
pub enum Value {
    String(String),
    Bytes(Vec<u8>),
    Integer(num::BigInt),
    Float(f64),
    // more variants...
}

impl PartialOrd for Value {
    fn partial_cmp(&self, rhs: &Value) -> Option<Ordering> {
        use Value::*;
        match (self, rhs) {
            (String(l), String(r)) => l.partial_cmp(r),
            (Bytes(l), Bytes(r)) => l.partial_cmp(r),
            (Integer(l), Integer(r)) => l.partial_cmp(r),
            (Float(l), Float(r)) => l.partial_cmp(r),
            // more variants...
            _ => None, // differing variants
        }
    }
}

Then, I might want to find the maximum of an array of Value::Integer values. In this case, I'd want max to return None if I accidentally included a Value::String in the array to let me know that something went wrong. In other words, I'd want behavior 1 in order to help catch subtle bugs.

I'm having trouble coming up with a real-world case where I'd want behavior 2 instead.

This reasoning makes me prefer option 1 because I think it's the least surprising and most conservative behavior. Does anyone have a real-world example where behavior 2 would be preferable?

It's tempting to have max/min methods for A: Ord and greatest/least methods for A: PartialOrd to clarify that the PartialOrd case would actually return the greatest/least element (a.k.a. maximum/minimum for totally-ordered sets), not a maximal/minimal element. That might be even more confusing, though.

For nanmax, one other consideration is that there are other related methods including nanmean, nansum, etc., and it would be nice if they all used the same trait to skip NaN values (something like MaybeNan or Float). I think this is more important than allowing nanmax to work for all A: PartialOrd.

jturner314 added a commit to rust-ndarray/ndarray-stats that referenced this issue Nov 18, 2018
The problem with the old methods was that they panicked when the array
was empty, which was very problematic. (See rust-ndarray/ndarray#512
for discussion.) The old `min_partialord` and `max_partialord` have
been renamed to `min`/`max`.
LukeMathWalker pushed a commit to rust-ndarray/ndarray-stats that referenced this issue Nov 19, 2018
…fix docs (#13)

* Remove min/max for A: Ord

The problem with the old methods was that they panicked when the array
was empty, which was very problematic. (See rust-ndarray/ndarray#512
for discussion.) The old `min_partialord` and `max_partialord` have
been renamed to `min`/`max`.

* Document axis_len == 0 panic for quantile_axis_mut

* Make quantile_mut return None for empty arrays

* Fix panic docs for histogram strategies
@jturner314
Copy link
Member

This functionality is now provided by the ndarray-stats crate, and there haven't been any comments on this issue in a few months, so I'm closing it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants