Oversampling modules return a truncated array in the multi-class instance #489

Closed
samhardyhey opened this issue Oct 15, 2018 · 4 comments · Fixed by #490
Labels
Type: Bug Indicates an unexpected problem or unintended behavior

Comments

@samhardyhey
Description

Oversampling modules sometimes return a truncated array in the multi-class instance. Apologies if this is a user error. The example below feeds in a multi-label matrix; I'm unsure whether this has implications for the algorithm (if so, feel free to correct my understanding! :)).

Steps/Code to Reproduce

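import numpy as np
from imblearn.over_sampling import BorderlineSMOTE

bl = BorderlineSMOTE(random_state=0, n_jobs=8, k_neighbors=1)

# 1000 samples with 5 integer features each
x = np.random.randint(5, size=5000).reshape(1000, 5)
# 1000 x 10 binary matrix; a row may contain several 1s (multi-label)
y = np.random.randint(2, size=10000).reshape(1000, 10)

bl_x, bl_y = bl.fit_resample(x, y)
bl_y.shape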

Expected Results

Some array which features the same number of columns as the input.

(1000, 10)

Actual Results

One of the columns is randomly truncated during calls to fit_resample and fit_sample. I have re-run the cell in my notebook repeatedly to discern a pattern; there is none. The truncated result appears in roughly 1 in 4 runs, even after fixing the random state when creating the instance.

(1000, 9)

Versions

Linux-4.4.0-134-generic-x86_64-with-debian-stretch-sid
Python 3.6.6 |Anaconda, Inc.| (default, Oct 9 2018, 12:34:16)
[GCC 7.3.0]
NumPy 1.15.2
SciPy 1.1.0
Scikit-Learn 0.20.0
Imbalanced-Learn 0.4.1

@glemaitre
Member

Yep, I can reproduce it. This is pretty bad. Let's see where it comes from.

@glemaitre glemaitre added the Type: Bug Indicates an unexpected problem or unintended behavior label Oct 17, 2018
@glemaitre
Member

Uhm actually your y is not something that we are supporting. We are supporting three cases:

  • binary
  • multiclass (1D array with multiple values)
  • one-hot-encoded multiclass (2D array in which we should have a single 1 per row)

The case that you are giving is actually a multilabel case which is not supported. I have to check if we can raise an error.
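
To make the distinction concrete, here is a minimal sketch of a check that separates a one-hot-encoded multiclass target from a multilabel one. The helper name `is_one_hot_multiclass` is hypothetical and not part of imbalanced-learn; it only illustrates the "single 1 per row" condition described above, under the assumption that `y` is a 0/1 array.

```python
import numpy as np


def is_one_hot_multiclass(y):
    """Return True if y is a 2D 0/1 array with exactly one 1 in every row.

    Hypothetical helper for illustration; not an imbalanced-learn function.
    """
    y = np.asarray(y)
    if y.ndim != 2:
        return False
    is_binary = np.isin(y, (0, 1)).all()
    return bool(is_binary and (y.sum(axis=1) == 1).all())


rng = np.random.RandomState(0)

# One-hot encoding of a 3-class problem: exactly one 1 per row.
labels = rng.randint(3, size=20)
one_hot = np.eye(3, dtype=int)[labels]

# Same construction as the y in the report: rows may hold several 1s.
multilabel = rng.randint(2, size=(1000, 10))

print(is_one_hot_multiclass(one_hot))     # True
print(is_one_hot_multiclass(multilabel))  # False
```

Running a check like this before calling `fit_resample` would flag the `y` from the report as multilabel rather than one-hot.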

@jsl303

jsl303 commented Apr 16, 2019

Sorry for opening again... Somewhat related...
Isn't multi label support implemented from here? #340
Sometimes all the terms can be confusing: multi-class, multi-label, multi-output... On top of that, one-hot encoding, multi-label binarizing, and so on.

@glemaitre
Member

Isn't multi label support implemented from here?

Only when it corresponds to a one-hot encoding of a multiclass problem (a single 1 per row, where the column holding the 1 identifies the class). Otherwise, there is no literature on how to do it in a multi-label setting.
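
As a rough illustration of why the one-hot case is equivalent to an ordinary multiclass problem, the sketch below converts a one-hot matrix to 1D class labels and back using plain NumPy. This is not a claim about how imbalanced-learn handles it internally; the variable names and the choice of `n_classes = 4` are assumptions made for the example.

```python
import numpy as np

rng = np.random.RandomState(0)
n_classes = 4  # assumed number of classes for this example

# Build a one-hot target: one row per sample, a single 1 per row.
labels = rng.randint(n_classes, size=50)
y_one_hot = np.eye(n_classes, dtype=int)[labels]

# The column index of the 1 is the class label, so argmax recovers it.
y_1d = y_one_hot.argmax(axis=1)

# Re-encode with an explicit number of classes so the column count
# stays fixed even if some class is absent from the labels.
y_back = np.eye(n_classes, dtype=int)[y_1d]

print(y_1d.shape)    # (50,)
print(y_back.shape)  # (50, 4)
```

The round trip is lossless only because each row has exactly one 1. With a multilabel row such as `[1, 0, 1, 0]`, `argmax` would keep a single label and silently drop the rest.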

Regarding the definition of those terms, you can refer to scikit-learn directly: https://scikit-learn.org/stable/modules/multiclass.html
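
For a quick way to see how scikit-learn itself names these target types, `sklearn.utils.multiclass.type_of_target` can be called on small examples. The arrays below are made up for illustration. Note that a one-hot matrix and a genuinely multilabel matrix get the same type string, so telling them apart still needs a row-sum check.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

y_binary = np.array([0, 1, 1, 0])
y_multiclass = np.array([0, 1, 2, 2])
y_one_hot = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
y_multilabel = np.array([[1, 1, 0], [0, 1, 1], [1, 0, 0]])

print(type_of_target(y_binary))      # binary
print(type_of_target(y_multiclass))  # multiclass
print(type_of_target(y_one_hot))     # multilabel-indicator
print(type_of_target(y_multilabel))  # multilabel-indicator

# Only the row sums distinguish one-hot from true multilabel.
print((y_one_hot.sum(axis=1) == 1).all())     # True
print((y_multilabel.sum(axis=1) == 1).all())  # False
```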
