
FIX array api support for clip param of MinMaxScaler #29615


Closed

Conversation

StefanieSenger
Contributor

Reference Issues/PRs

closes #29607

What does this implement/fix? Explain your changes.

This fixes Array API support for the clip param of MinMaxScaler and adds testing.

Any other comments?

I am not sure if we should write some custom thing for np.r_, which is used in the test and which is not in the Array API spec.
It is used exclusively in our tests, not in the code, so do we care about it in terms of this PR?

And if so: np.r_ is not a function, but does some __getitem__ magic. If we write a custom replacement, it can still be a plain function, correct? Or could it just be a modification of the data within the test itself, so that no extra helper is needed?
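For reference, my understanding is that with 1-D inputs np.r_ is just concatenation along the first axis, so a plain-function replacement would be trivial; an untested NumPy-only sketch:

import numpy as np

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

# np.r_ with 1-D inputs concatenates along the first axis ...
np.r_[a - 10, b + 10]             # array([-9., -8., 13., 14.])

# ... which is exactly what np.concatenate does
np.concatenate([a - 10, b + 10])  # same result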

github-actions bot commented Aug 2, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 553a2e1. Link to the linter CI: here

@ogrisel ogrisel added the CUDA CI label Aug 2, 2024
@github-actions github-actions bot removed the CUDA CI label Aug 2, 2024
@StefanieSenger
Contributor Author

Oops, do I need to do something so the CUDA CI label does not get automatically removed?

@StefanieSenger
Contributor Author

In Colab, these tests fail:

FAILED sklearn/preprocessing/tests/test_data.py::test_minmax_scaler_clip[feature_range0-cupy-None-None] - TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construc...
FAILED sklearn/preprocessing/tests/test_data.py::test_minmax_scaler_clip[feature_range0-cupy.array_api-None-None] - ValueError: Expected 2D array, got scalar array instead:
FAILED sklearn/preprocessing/tests/test_data.py::test_minmax_scaler_clip[feature_range1-cupy-None-None] - TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construc...
FAILED sklearn/preprocessing/tests/test_data.py::test_minmax_scaler_clip[feature_range1-cupy.array_api-None-None] - ValueError: Expected 2D array, got scalar array instead:

I will inspect those.

@ogrisel
Member

ogrisel commented Aug 2, 2024

Oops, do I need to do something so the CUDA CI label does not get automatically removed?

It's intentional. Each time a maintainer adds the label, the CUDA CI runs once and only once: it's the labeling event that triggers the workflow, not push events. Since we have to pay for each individual GPU run, we don't want PRs to run this CI automatically on every commit; we want to keep it a manual trigger.

Note that you can run the CUDA tests for your PR's branch as much as you like on Google Colab by using the following notebook:

@betatim
Member

betatim commented Aug 5, 2024

What do people think of rewriting the uses of np.r_ with np.concatenate (or equivalent)? I don't often see np.r_ used "in the wild", and I have to look up how exactly it works for anything but the most trivial use-cases. So maybe rewriting the test code is easier than making a replacement for np.r_?

@StefanieSenger
Contributor Author

StefanieSenger commented Aug 5, 2024

What do people think of rewriting the uses of np.r_ with np.concatenate (or equivalent)? I don't often see np.r_ used "in the wild", and I have to look up how exactly it works for anything but the most trivial use-cases. So maybe rewriting the test code is easier than making a replacement for np.r_?

I think here it's enough to construct X_test in a more convenient way. (Also in general: from what I've seen, np.r_ is only used in scikit-learn's tests, and only on very simple data.)
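For example, something along these lines should work for Array API inputs (untested sketch, reusing X, X_min and X_max from the existing test):

# untested sketch: build the 2D test row with Array API operations only,
# avoiding np.r_ (X, X_min, X_max as defined in the test)
X_test = xp.reshape(
    xp.concat([X_min[:2] - 10.0, X_max[2:] + 10.0]),
    (1, X.shape[1]),
)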

While running the cupy tests in Colab, it seems I have hit a usage limit and cannot use it for a while (the limits are a bit opaque). But here is what I have found out so far:

The failing tests:

FAILED sklearn/preprocessing/tests/test_data.py::test_minmax_scaler_clip[feature_range0-cupy.array_api-None-None] - ValueError: Expected 2D array, got scalar array instead:
FAILED sklearn/preprocessing/tests/test_data.py::test_minmax_scaler_clip[feature_range1-cupy.array_api-None-None] - ValueError: Expected 2D array, got scalar array instead:

are unrelated to this PR and come from somewhere within check_array, where the <class 'cupy.array_api._array_object.Array'> object is converted into a _NumPyAPIWrapper object. I suspect it's in _asarray_with_order(), but something confusing happened around it and I cannot verify until I can run it on gpu again.
That was wrong; the failures were related to me forgetting to switch array_api_dispatch on in the test.

@betatim
Member

betatim commented Aug 6, 2024

The fact that you can't debug all the failures without a GPU (or access to one) is another thing we will need to figure out :-/


I ran the tests from this PR on a machine with cupy and got this error (stopped after the first failure)

================================================================================================ FAILURES ================================================================================================
_________________________________________________________________________ test_minmax_scaler_clip[feature_range0-cupy-None-None] _________________________________________________________________________

feature_range = (0, 1), array_namespace = 'cupy', device = None, _ = None

    @pytest.mark.parametrize(
        "array_namespace, device, _", yield_namespace_device_dtype_combinations()
    )
    @pytest.mark.parametrize("feature_range", [(0, 1), (-10, 10)])
    def test_minmax_scaler_clip(feature_range, array_namespace, device, _):
        # test behaviour of the parameter 'clip' in MinMaxScaler
        xp = _array_api_for_tests(array_namespace, device)
        X = xp.asarray(iris.data)
>       scaler = MinMaxScaler(feature_range=feature_range, clip=True).fit(X)

sklearn/preprocessing/tests/test_data.py:2487:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
sklearn/preprocessing/_data.py:444: in fit
    return self.partial_fit(X, y)
sklearn/base.py:1521: in wrapper
    return fit_method(estimator, *args, **kwargs)
sklearn/preprocessing/_data.py:491: in partial_fit
    X = self._validate_data(
sklearn/base.py:640: in _validate_data
    out = check_array(X, input_name="X", **check_params)
sklearn/utils/validation.py:1066: in check_array
    array = _asarray_with_order(array, order=order, dtype=dtype, xp=xp)
sklearn/utils/_array_api.py:866: in _asarray_with_order
    array = numpy.asarray(array, order=order, dtype=dtype)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construct a NumPy array explicitly.

cupy/_core/core.pyx:1479: TypeError
======================================================================================== short test summary info =========================================================================================
FAILED sklearn/preprocessing/tests/test_data.py::test_minmax_scaler_clip[feature_range0-cupy-None-None] - TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construct a NumPy array explicitly.

I think the reason for this is that the test passes a cupy array as input to fit but doesn't enable array API support. Something like this (with from sklearn import config_context at the top of the file) should solve that:

@pytest.mark.parametrize(
    "array_namespace, device, _", yield_namespace_device_dtype_combinations()
)
@pytest.mark.parametrize("feature_range", [(0, 1), (-10, 10)])
def test_minmax_scaler_clip(feature_range, array_namespace, device, _):
    # test behaviour of the parameter 'clip' in MinMaxScaler
    xp = _array_api_for_tests(array_namespace, device)
    X = xp.asarray(iris.data)
    with config_context(array_api_dispatch=True):
        scaler = MinMaxScaler(feature_range=feature_range, clip=True).fit(X)
        X_min, X_max = xp.min(X, axis=0), xp.max(X, axis=0)
        X_test = xp.asarray([np.r_[X_min[:2] - 10, X_max[2:] + 10]])
        X_transformed = scaler.transform(X_test)
    assert_allclose(
        X_transformed,
        [[feature_range[0], feature_range[0], feature_range[1], feature_range[1]]],
    )

That fixes the spurious failure, but it surfaces a new problem (at least not a spurious one :D). The feature_range is given as integers, but the data in X_test is floating point, which leads to an error in xp.clip with array-api-strict for me. I think we can make the range floats to solve this. Or maybe we want to handle this in the scaler class instead? What do people think? It seems silly to error at the user for giving the limit as 1 instead of 1., just because the data is of dtype float.
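To illustrate the mismatch (untested sketch against the standalone array-api-strict namespace; the exact behaviour depends on its version):

import array_api_strict as xp

x = xp.asarray([0.5, 1.5, 2.5])

# int bounds with a float array mix dtypes; depending on the version,
# array-api-strict may reject this
try:
    xp.clip(x, 0, 1)
except Exception as exc:
    print(f"mixed-dtype bounds rejected: {exc}")

# bounds with the same dtype as the data are well-defined
print(xp.clip(x, 0.0, 1.0))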

@betatim
Member

betatim commented Aug 7, 2024

https://data-apis.org/array-api/latest/API_specification/generated/array_api.clip.html#clip says in the Notes section that if x and min/max have differing data types, the behaviour is unspecified/implementation-dependent. So I think the easiest thing to do is to update our test data.

@StefanieSenger
Contributor Author

Thank you @betatim, I have fixed the issues according to your suggestions.

The fact that you can't debug all the failures without a GPU (or access to one) is another thing we will need to figure out :-/

I totally agree! Knowing that the clock is ticking while you have to push your stupidest, derailed debugging attempts is pretty stressful. When I come back from vacation, I'll use a VM for tasks like this one.

notes for myself:

  • synchronise feature_range and X's dtypes within MinMaxScaler
  • make a suggestion to re-write np.r_

@StefanieSenger
Contributor Author

StefanieSenger commented Aug 8, 2024

I was just able to run the tests on GPU, and there is a new failure as well as one of the old ones.

Unfortunately, it will take me a few weeks until I can come back to it and hopefully fix it.
I will make this a draft PR in the meantime.

@StefanieSenger StefanieSenger marked this pull request as draft August 8, 2024 13:48
scaler = MinMaxScaler(feature_range=feature_range, clip=True).fit(X)
X_min, X_max = xp.min(X, axis=0), xp.max(X, axis=0)
X_test = xp.asarray([np.r_[X_min[:2] - 10, X_max[2:] + 10]])
X_transformed = scaler.transform(X_test)
assert_allclose(
X_transformed,
Member

assert_allclose is not array API compliant. We should therefore always explicitly convert the arrays to compare to a numpy counterpart.

Suggested change
X_transformed,
_convert_to_numpy(X_transformed),

_convert_to_numpy is imported from sklearn.utils._array_api.
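The full assertion would then look roughly like this (sketch, assuming _convert_to_numpy takes the array namespace as its second argument):

from sklearn.utils._array_api import _convert_to_numpy

assert_allclose(
    _convert_to_numpy(X_transformed, xp=xp),
    [[feature_range[0], feature_range[0], feature_range[1], feature_range[1]]],
)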

"array_namespace, device, _", yield_namespace_device_dtype_combinations()
)
@pytest.mark.parametrize("feature_range", [(0.0, 1.1), (-10.0, 10.0)])
def test_minmax_scaler_clip(feature_range, array_namespace, device, _):
Member

Let's put array_api in the test name so that it can be picked up by the CUDA CI, which only runs tests with "array_api" in their name in order to avoid spending GPU time on non-array-API tests.

Suggested change
def test_minmax_scaler_clip(feature_range, array_namespace, device, _):
def test_minmax_scaler_clip_array_api(feature_range, array_namespace, device, _):

@ogrisel
Member

ogrisel commented Aug 8, 2024

This PR is probably impacted by data-apis/array-api-compat#177. In the meantime, it should be possible to call xp.asarray with the right device on the bounds, as done in #29639.
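Roughly along these lines (untested sketch; device and get_namespace are the helpers from sklearn.utils._array_api, and feature_range stands for the scaler's parameter):

from sklearn.utils._array_api import device, get_namespace

# untested sketch: materialize the clip bounds as arrays with X's dtype and
# on X's device before calling xp.clip
xp, _ = get_namespace(X)
low = xp.asarray(feature_range[0], dtype=X.dtype, device=device(X))
high = xp.asarray(feature_range[1], dtype=X.dtype, device=device(X))
X_clipped = xp.clip(X, low, high)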

@ogrisel
Member

ogrisel commented Aug 8, 2024

There is also another problem with cupy.array_api: #29639 (comment)

EDIT: this has been concurrently merged to get rid of cupy.array_api-induced complexity.

@ogrisel
Member

ogrisel commented Sep 2, 2024

I think the testing approach used in https://github.com/scikit-learn/scikit-learn/pull/29751/files#diff-7c0844db1f200a70b0665df35b84587edfd04b66f02a126c92a2b0e1806eda7aR704 is enough to ensure non-regression.

I would therefore be in favor of closing this PR in favor of #29751.

@StefanieSenger
Contributor Author

StefanieSenger commented Sep 3, 2024

I would therefore be in favor of closing this PR in favor of #29751.

Sure, I will close it then, @ogrisel.

Development

Successfully merging this pull request may close these issues.

MinMaxScaler is not array API compliant if clip=True
3 participants