FIX array api support for `clip` param of `MinMaxScaler` #29615
Conversation
Oops, do I need to do something so that the CUDA CI label does not get automatically removed?
In Colab, those fail: …

I will inspect those.
It's intentional. Each time a maintainer adds the label, the workflow runs once and only once: it's the labeling event that triggers it, not the push events. Since we have to pay for each individual GPU run, we don't want PRs to automatically run the CI on each commit and want to keep it a manual trigger. Note that you can run the CUDA tests for your PR's branch as much as you like on Google Colab by using the following notebook: …
What do people think of rewriting the uses of …?
I think here it's enough to make up …

While running the cupy tests in Colab, it seems that I have reached a usage limit and cannot use it for a while (a bit opaque). But here is what I have found out so far. The failing tests: …
The fact that you can't debug all the failures without a GPU (or access to one) is another thing we will need to figure out :-/

I ran the tests from this PR on a machine with cupy and got this error (stopped after the first failure): …
I think the reason for this is that the test passes a cupy array as input to …

```python
@pytest.mark.parametrize(
    "array_namespace, device, _", yield_namespace_device_dtype_combinations()
)
@pytest.mark.parametrize("feature_range", [(0, 1), (-10, 10)])
def test_minmax_scaler_clip(feature_range, array_namespace, device, _):
    # test behaviour of the parameter 'clip' in MinMaxScaler
    xp = _array_api_for_tests(array_namespace, device)
    X = xp.asarray(iris.data)
    with config_context(array_api_dispatch=True):
        scaler = MinMaxScaler(feature_range=feature_range, clip=True).fit(X)
        X_min, X_max = xp.min(X, axis=0), xp.max(X, axis=0)
        X_test = xp.asarray([np.r_[X_min[:2] - 10, X_max[2:] + 10]])
        X_transformed = scaler.transform(X_test)
        assert_allclose(
            X_transformed,
            [[feature_range[0], feature_range[0], feature_range[1], feature_range[1]]],
        )
```

But it finds a new problem, though at least not a spurious one :D. The …
https://data-apis.org/array-api/latest/API_specification/generated/array_api.clip.html#clip says in the Notes section that if …
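For context, here is a minimal sketch (my own illustration, not the PR's actual diff, and `_clip_compat` is a made-up name) of clipping in an array-API-agnostic way. Casting the bounds to the input's dtype and device first sidesteps the mixed-dtype behaviour that the spec leaves implementation-defined, and `minimum`/`maximum` are available in every standard namespace:

```python
from sklearn.utils._array_api import device


def _clip_compat(xp, X, lo, hi):
    # Hypothetical helper: match the bounds' dtype and device to X, then
    # express clip as minimum(maximum(X, lo), hi).
    lo = xp.asarray(lo, dtype=X.dtype, device=device(X))
    hi = xp.asarray(hi, dtype=X.dtype, device=device(X))
    return xp.minimum(xp.maximum(X, lo), hi)
```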
Thank you @betatim, I have fixed the issues according to your suggestions.
I totally agree! Knowing that time ticks while having to push your stupidest derailed debugging attempts is pretty stressful. When I come back from vacation, I'll use a VM for tasks like this one.

Notes for myself: …
I was just able to run the tests on GPU: there is a new failure and one of the old ones. Unfortunately, it will take me a few weeks until I can come back to this and hopefully fix it.
Review comment on these lines of the test:

```python
scaler = MinMaxScaler(feature_range=feature_range, clip=True).fit(X)
X_min, X_max = xp.min(X, axis=0), xp.max(X, axis=0)
X_test = xp.asarray([np.r_[X_min[:2] - 10, X_max[2:] + 10]])
X_transformed = scaler.transform(X_test)
assert_allclose(
    X_transformed,
```
`assert_allclose` is not array API compliant. We should therefore always explicitly convert the arrays to compare to a numpy counterpart:

```diff
-        X_transformed,
+        _convert_to_numpy(X_transformed),
```

`_convert_to_numpy` is imported from `sklearn.utils._array_api`.
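For illustration, the suggested pattern would look roughly like this (a sketch, not the PR's code; note that in current scikit-learn the helper also takes the array namespace as an argument):

```python
from numpy.testing import assert_allclose

from sklearn.utils._array_api import _convert_to_numpy

# Move the result back to a host numpy array before comparing, since
# assert_allclose only understands numpy arrays.
assert_allclose(
    _convert_to_numpy(X_transformed, xp=xp),
    [[feature_range[0], feature_range[0], feature_range[1], feature_range[1]]],
)
```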
"array_namespace, device, _", yield_namespace_device_dtype_combinations() | ||
) | ||
@pytest.mark.parametrize("feature_range", [(0.0, 1.1), (-10.0, 10.0)]) | ||
def test_minmax_scaler_clip(feature_range, array_namespace, device, _): |
Let's put `array_api` in the test name so that it can be picked up by the CUDA CI, which only runs tests with "array_api" in their name to avoid spending GPU time on non-array API tests:

```diff
-def test_minmax_scaler_clip(feature_range, array_namespace, device, _):
+def test_minmax_scaler_clip_array_api(feature_range, array_namespace, device, _):
```
This PR is probably impacted by data-apis/array-api-compat#177. In the meantime, it should be possible to call …
There is also another problem with …

EDIT: this has been concurrently merged to get rid of …
I think the way of testing used in https://github.com/scikit-learn/scikit-learn/pull/29751/files#diff-7c0844db1f200a70b0665df35b84587edfd04b66f02a126c92a2b0e1806eda7aR704 is enough to ensure non-regression. I would therefore be in favor of closing this PR in favor of #29751.
Reference Issues/PRs
closes #29607
What does this implement/fix? Explain your changes.
This fixes Array API support for the `clip` param of `MinMaxScaler` and adds testing.

Any other comments?
I am not sure if we should write some custom thing for `np.r_`, which is used in the test and which is not in the Array API spec. It is used exclusively in our tests, not in the code, so do we care about it in terms of this PR?

And if so: it is not a function, but does some `__getitem__` magic. If we write a custom thing, it can still be a function, correct? Or could it just be a modification of the data within the test itself, so that no extra function is needed? (See the sketch below.)
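For what it's worth, here is a minimal sketch of that second option (my illustration, not code from this PR): the row that the test builds with `np.r_` can be constructed directly with `concat` from the array API standard, so no extra helper is needed. `xp`, `X_min`, and `X_max` are as in the test above.

```python
# Concatenate the two pieces along axis 0, then restore the leading
# sample axis that wrapping np.r_[...] in a list used to provide.
row = xp.concat([X_min[:2] - 10, X_max[2:] + 10], axis=0)
X_test = xp.reshape(row, (1, -1))
```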