
FIX array api support for clip param of MinMaxScaler #29615


Closed

Conversation

StefanieSenger
Contributor

Reference Issues/PRs

closes #29607

What does this implement/fix? Explain your changes.

This fixes Array API support for the clip param of MinMaxScaler and adds testing.

Any other comments?

I am not sure if we should write some custom thing for np.r_, which is used in the test and which is not in the Array API spec.
It is used exclusively in our tests, not in the code, so do we care about it in terms of this PR?

And if so: np.r_ is not a function, but does some __getitem__ magic. If we write a custom replacement, it can still be a plain function, correct? Or could it just be a modification of the data within the test itself, so that no extra helper is needed?
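For reference, my understanding is that with 1-D inputs np.r_ is just concatenation along the first axis, so a plain-function replacement would be trivial; an untested NumPy-only sketch:

import numpy as np

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

# np.r_ with 1-D inputs concatenates along the first axis ...
np.r_[a - 10, b + 10]             # array([-9., -8., 13., 14.])

# ... which is exactly what np.concatenate does
np.concatenate([a - 10, b + 10])  # same result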

github-actions bot commented Aug 2, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 553a2e1. Link to the linter CI: here

@ogrisel ogrisel added the CUDA CI label Aug 2, 2024
@github-actions github-actions bot removed the CUDA CI label Aug 2, 2024
@StefanieSenger
Contributor Author

Oops, do I need to do something so the CUDA CI label does not get automatically removed?

@StefanieSenger
Contributor Author

In Colab, these tests fail:

FAILED sklearn/preprocessing/tests/test_data.py::test_minmax_scaler_clip[feature_range0-cupy-None-None] - TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construc...
FAILED sklearn/preprocessing/tests/test_data.py::test_minmax_scaler_clip[feature_range0-cupy.array_api-None-None] - ValueError: Expected 2D array, got scalar array instead:
FAILED sklearn/preprocessing/tests/test_data.py::test_minmax_scaler_clip[feature_range1-cupy-None-None] - TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construc...
FAILED sklearn/preprocessing/tests/test_data.py::test_minmax_scaler_clip[feature_range1-cupy.array_api-None-None] - ValueError: Expected 2D array, got scalar array instead:

I will inspect those.

@ogrisel
Member

ogrisel commented Aug 2, 2024

Oops, do I need to do something so the CUDA CI label does not get automatically removed?

It's intentional. Each time a maintainer adds the label, the CUDA CI runs once and only once: it's the labeling event that triggers the workflow, not push events. Since we have to pay for each individual GPU run, we don't want PRs to run this CI automatically on every commit; we want to keep it a manual trigger.

Note that you can run the CUDA tests for your PR's branch as much as you like on Google Colab by using the following notebook:

@betatim
Member

betatim commented Aug 5, 2024

What do people think of rewriting the uses of np.r_ with np.concatenate (or equivalent)? I don't often see np.r_ used "in the wild", and I have to look up how exactly it works for anything but the most trivial use-cases. So maybe rewriting the test code is easier than making a replacement for np.r_?

@StefanieSenger
Contributor Author

StefanieSenger commented Aug 5, 2024

What do people think of rewriting the uses of np.r_ with np.concatenate (or equivalent)? I don't often see np.r_ used "in the wild", and I have to look up how exactly it works for anything but the most trivial use-cases. So maybe rewriting the test code is easier than making a replacement for np.r_?

I think here it's enough to construct X_test in a more convenient way. (Also in general: from what I've seen, np.r_ is only used in scikit-learn's tests, and only on very simple data.)
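For example, something along these lines should work for Array API inputs (untested sketch, reusing X, X_min and X_max from the existing test):

# untested sketch: build the 2D test row with Array API operations only,
# avoiding np.r_ (X, X_min, X_max as defined in the test)
X_test = xp.reshape(
    xp.concat([X_min[:2] - 10.0, X_max[2:] + 10.0]),
    (1, X.shape[1]),
)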

While running the cupy tests in Colab, it seems I have hit a usage limit and cannot use it for a while (the limits are a bit opaque). But here is what I have found out so far:

The failing tests:

FAILED sklearn/preprocessing/tests/test_data.py::test_minmax_scaler_clip[feature_range0-cupy.array_api-None-None] - ValueError: Expected 2D array, got scalar array instead:
FAILED sklearn/preprocessing/tests/test_data.py::test_minmax_scaler_clip[feature_range1-cupy.array_api-None-None] - ValueError: Expected 2D array, got scalar array instead:

are unrelated to this PR and come from somewhere within check_array, where the <class 'cupy.array_api._array_object.Array'> object is converted into a _NumPyAPIWrapper object. I suspect it's in _asarray_with_order(), but something confusing happened around it and I cannot verify until I can run it on gpu again.
That was wrong; the failures were related to me forgetting to switch array_api_dispatch on in the test.

@betatim
Member

betatim commented Aug 6, 2024

The fact that you can't debug all the failures without a GPU (or access to one) is another thing we will need to figure out :-/


I ran the tests from this PR on a machine with cupy and got this error (stopped after the first failure)

================================================================================================ FAILURES ================================================================================================
_________________________________________________________________________ test_minmax_scaler_clip[feature_range0-cupy-None-None] _________________________________________________________________________

feature_range = (0, 1), array_namespace = 'cupy', device = None, _ = None

    @pytest.mark.parametrize(
        "array_namespace, device, _", yield_namespace_device_dtype_combinations()
    )
    @pytest.mark.parametrize("feature_range", [(0, 1), (-10, 10)])
    def test_minmax_scaler_clip(feature_range, array_namespace, device, _):
        # test behaviour of the parameter 'clip' in MinMaxScaler
        xp = _array_api_for_tests(array_namespace, device)
        X = xp.asarray(iris.data)
>       scaler = MinMaxScaler(feature_range=feature_range, clip=True).fit(X)

sklearn/preprocessing/tests/test_data.py:2487:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
sklearn/preprocessing/_data.py:444: in fit
    return self.partial_fit(X, y)
sklearn/base.py:1521: in wrapper
    return fit_method(estimator, *args, **kwargs)
sklearn/preprocessing/_data.py:491: in partial_fit
    X = self._validate_data(
sklearn/base.py:640: in _validate_data
    out = check_array(X, input_name="X", **check_params)
sklearn/utils/validation.py:1066: in check_array
    array = _asarray_with_order(array, order=order, dtype=dtype, xp=xp)
sklearn/utils/_array_api.py:866: in _asarray_with_order
    array = numpy.asarray(array, order=order, dtype=dtype)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construct a NumPy array explicitly.

cupy/_core/core.pyx:1479: TypeError
======================================================================================== short test summary info =========================================================================================
FAILED sklearn/preprocessing/tests/test_data.py::test_minmax_scaler_clip[feature_range0-cupy-None-None] - TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construct a NumPy array explicitly.

I think the reason for this is that the test passes a cupy array as input to fit but doesn't enable array API support. Something like this (with from sklearn import config_context at the top of the file) should solve that:

@pytest.mark.parametrize(
    "array_namespace, device, _", yield_namespace_device_dtype_combinations()
)
@pytest.mark.parametrize("feature_range", [(0, 1), (-10, 10)])
def test_minmax_scaler_clip(feature_range, array_namespace, device, _):
    # test behaviour of the parameter 'clip' in MinMaxScaler
    xp = _array_api_for_tests(array_namespace, device)
    X = xp.asarray(iris.data)
    with config_context(array_api_dispatch=True):
        scaler = MinMaxScaler(feature_range=feature_range, clip=True).fit(X)
        X_min, X_max = xp.min(X, axis=0), xp.max(X, axis=0)
        X_test = xp.asarray([np.r_[X_min[:2] - 10, X_max[2:] + 10]])
        X_transformed = scaler.transform(X_test)
    assert_allclose(
        X_transformed,
        [[feature_range[0], feature_range[0], feature_range[1], feature_range[1]]],
    )

That fixes the spurious failure, but it surfaces a new problem (at least not a spurious one :D). The feature_range is given as integers, but the data in X_test is floating point, which leads to an error in xp.clip with array-api-strict for me. I think we can make the range floats to solve this. Or maybe we want to handle this in the scaler class instead? What do people think? It seems silly to error at the user for giving the limit as 1 instead of 1., just because the data is of dtype float.
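To illustrate the mismatch (untested sketch against the standalone array-api-strict namespace; the exact behaviour depends on its version):

import array_api_strict as xp

x = xp.asarray([0.5, 1.5, 2.5])

# int bounds with a float array mix dtypes; depending on the version,
# array-api-strict may reject this
try:
    xp.clip(x, 0, 1)
except Exception as exc:
    print(f"mixed-dtype bounds rejected: {exc}")

# bounds with the same dtype as the data are well-defined
print(xp.clip(x, 0.0, 1.0))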

@betatim
Member

betatim commented Aug 7, 2024

https://data-apis.org/array-api/latest/API_specification/generated/array_api.clip.html#clip says in the Notes section that if x and min/max have differing data types, the behaviour is unspecified/implementation-dependent. So I think the easiest thing to do is to update our test data.

@StefanieSenger
Contributor Author

Thank you @betatim, I have fixed the issues according to your suggestions.

The fact that you can't debug all the failures without a GPU (or access to one) is another thing we will need to figure out :-/

I totally agree! Knowing that the clock is ticking while you have to push your stupidest, derailed debugging attempts is pretty stressful. When I come back from vacation, I'll use a VM for tasks like this one.

notes for myself:

  • synchronise feature_range and X's dtypes within MinMaxScaler
  • make a suggestion to re-write np.r_

@StefanieSenger
Contributor Author

StefanieSenger commented Aug 8, 2024

I was just able to run the tests on GPU, and there is a new failure as well as one of the old ones.

Unfortunately, it will take me a few weeks until I can come back to it and hopefully fix it.
I will make this a draft PR in the meantime.

@StefanieSenger StefanieSenger marked this pull request as draft August 8, 2024 13:48
scaler = MinMaxScaler(feature_range=feature_range, clip=True).fit(X)
X_min, X_max = xp.min(X, axis=0), xp.max(X, axis=0)
X_test = xp.asarray([np.r_[X_min[:2] - 10, X_max[2:] + 10]])
X_transformed = scaler.transform(X_test)
assert_allclose(
X_transformed,
Member

assert_allclose is not array API compliant. We should therefore always explicitly convert the arrays to compare to a numpy counterpart.

Suggested change
X_transformed,
_convert_to_numpy(X_transformed),

_convert_to_numpy is imported from sklearn.utils._array_api.
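The full assertion would then look roughly like this (sketch, assuming _convert_to_numpy takes the array namespace as its second argument):

from sklearn.utils._array_api import _convert_to_numpy

assert_allclose(
    _convert_to_numpy(X_transformed, xp=xp),
    [[feature_range[0], feature_range[0], feature_range[1], feature_range[1]]],
)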

"array_namespace, device, _", yield_namespace_device_dtype_combinations()
)
@pytest.mark.parametrize("feature_range", [(0.0, 1.1), (-10.0, 10.0)])
def test_minmax_scaler_clip(feature_range, array_namespace, device, _):
Member

Let's put array_api in the test name so that it can be picked up by the CUDA CI, which only runs tests with "array_api" in their name in order to avoid spending GPU time on non-array-API tests.

Suggested change
def test_minmax_scaler_clip(feature_range, array_namespace, device, _):
def test_minmax_scaler_clip_array_api(feature_range, array_namespace, device, _):

@ogrisel
Member

ogrisel commented Aug 8, 2024

This PR is probably impacted by data-apis/array-api-compat#177. In the meantime, it should be possible to call xp.asarray with the right device on the bounds, as done in #29639.
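Roughly along these lines (untested sketch; device and get_namespace are the helpers from sklearn.utils._array_api, and feature_range stands for the scaler's parameter):

from sklearn.utils._array_api import device, get_namespace

# untested sketch: materialize the clip bounds as arrays with X's dtype and
# on X's device before calling xp.clip
xp, _ = get_namespace(X)
low = xp.asarray(feature_range[0], dtype=X.dtype, device=device(X))
high = xp.asarray(feature_range[1], dtype=X.dtype, device=device(X))
X_clipped = xp.clip(X, low, high)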

@ogrisel
Member

ogrisel commented Aug 8, 2024

There is also another problem with cupy.array_api: #29639 (comment)

EDIT: this has been concurrently merged to get rid of cupy.array_api-induced complexity.

@ogrisel
Member

ogrisel commented Sep 2, 2024

I think the testing approach used in https://github.com/scikit-learn/scikit-learn/pull/29751/files#diff-7c0844db1f200a70b0665df35b84587edfd04b66f02a126c92a2b0e1806eda7aR704 is enough to ensure non-regression.

I would therefore be in favor of closing this PR in favor of #29751.

@StefanieSenger
Contributor Author

StefanieSenger commented Sep 3, 2024

I would therefore be in favor of closing this PR in favor of #29751.

Sure, I will close it then, @ogrisel.

Development

Successfully merging this pull request may close these issues.

MinMaxScaler is not array API compliant if clip=True
3 participants