Oversampling modules return a truncated array in the multi-class instance #489

samhardyhey · 2018-10-15T21:17:43Z

Description

Oversampling modules sometimes return a truncated array in the multi-class instance. Apologies if this is a user error. The example below feeds in a multi-label matrix; I'm unsure whether this has implications for the algorithm (if so, feel free to correct my understanding! :)).

Steps/Code to Reproduce

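import numpy as np
from imblearn.over_sampling import BorderlineSMOTE

bl = BorderlineSMOTE(random_state=0, n_jobs=8, k_neighbors=1)

# 1000 samples with 5 integer features each
x = np.random.randint(5, size=5000).reshape(1000, 5)
# 1000 x 10 binary matrix: rows may contain several 1s (multi-label)
y = np.random.randint(2, size=10000).reshape(1000, 10)

bl_x, bl_y = bl.fit_resample(x, y)
bl_y.shape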

Expected Results

Some array which features the same number of columns as the input.

(1000, 10)

Actual Results

Randomly truncates one of the columns during calls to fit_resample and fit_sample. I have re-run the cell in my notebook repeatedly to discern a pattern; there is none. The truncated result appears in roughly 1 in 4 runs, even after fixing the random state when creating the instance.

(1000, 9)

Versions

Linux-4.4.0-134-generic-x86_64-with-debian-stretch-sid
Python 3.6.6 |Anaconda, Inc.| (default, Oct 9 2018, 12:34:16)
[GCC 7.3.0]
NumPy 1.15.2
SciPy 1.1.0
Scikit-Learn 0.20.0
Imbalanced-Learn 0.4.1

glemaitre · 2018-10-17T15:36:27Z

Yep, I can reproduce it. This is pretty bad. Let's see where it comes from.

glemaitre · 2018-10-17T15:48:00Z

Uhm, actually your y is not something that we support. We support three cases:

binary
multiclass (1D array with multiple values)
one-hot-encoded multiclass (2D array in which there should be a single 1 per row)

The case that you are giving is actually a multilabel case which is not supported. I have to check if we can raise an error.
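To make the distinction concrete, here is a minimal sketch of how one might check whether a 2D target is a one-hot encoding of a multiclass problem rather than a true multilabel matrix. The helper name `encodes_multiclass` is hypothetical and is not part of imbalanced-learn; it only tests the "single 1 per row" condition described above.

```python
import numpy as np


def encodes_multiclass(y):
    """Return True if y is 1D, or 2D with exactly one 1 per row (one-hot).

    Hypothetical helper for illustration; not an imbalanced-learn function.
    """
    y = np.asarray(y)
    if y.ndim == 1:
        # Binary or multiclass labels stored as a flat array.
        return True
    # Every entry must be 0 or 1, and each row must contain a single 1.
    only_binary = np.isin(y, (0, 1)).all()
    single_one_per_row = (y.sum(axis=1) == 1).all()
    return bool(only_binary and single_one_per_row)


# One-hot target: 4 samples, 3 classes.
one_hot = np.eye(3, dtype=int)[[0, 2, 1, 1]]

# Multilabel target: rows with zero or several 1s.
multilabel = np.array([[1, 1, 0], [0, 0, 0], [0, 1, 1]])

print(encodes_multiclass(one_hot))     # one-hot rows pass the check
print(encodes_multiclass(multilabel))  # multilabel rows do not
```

A random 0/1 matrix like the `y` in the report will almost always fail this check, since most of its rows contain several 1s.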

jsl303 · 2019-04-16T13:22:45Z

Sorry for opening again... Somewhat related...
Isn't multi label support implemented from here? #340
Sometimes all the terms can be confusing: multi class, multi label, multi output... On top of that, one hot encoding, multi label binarizing, and so on....

glemaitre · 2019-04-16T13:31:19Z

Isn't multi label support implemented from here?

Only when it corresponds to a one-hot encoding of a multiclass problem (a single 1 per row where the row corresponds to the class). Otherwise, there is no literature to do it in a multi-label setting.
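As a rough illustration of why the one-hot case is equivalent to plain multiclass, the sketch below collapses a one-hot matrix to a 1D label array and rebuilds it with NumPy. The arrays are made-up example data, and this is not a description of what imbalanced-learn does internally.

```python
import numpy as np

# Made-up one-hot target: 4 samples, 3 classes, a single 1 per row.
y_onehot = np.array([
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 0],
    [0, 1, 0],
])

# Collapse to 1D class labels: the column index of the 1 in each row.
y_1d = y_onehot.argmax(axis=1)

# Rebuild the one-hot matrix, keeping the original number of columns.
n_classes = y_onehot.shape[1]
y_back = np.eye(n_classes, dtype=int)[y_1d]

print(y_1d)
print(y_back.shape)
```

This round trip is only lossless when each row has exactly one 1. For a multilabel row such as `[1, 1, 0]`, `argmax` keeps only the first active column, so the information cannot be recovered.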

Regarding the definition of those terms, you can refer to scikit-learn directly: https://scikit-learn.org/stable/modules/multiclass.html

glemaitre added the Type: Bug label Oct 17, 2018

glemaitre mentioned this issue Oct 17, 2018

FIX: raise an error when multilabel does not encode multiclass #490

Merged

glemaitre closed this as completed in #490 Oct 21, 2018
