DOC rework the introduction section of the user guide #1110

Merged
63 changes: 34 additions & 29 deletions doc/introduction.rst
@@ -9,41 +9,45 @@ Introduction
API's of imbalanced-learn samplers
----------------------------------

The available samplers follows the scikit-learn API using the base estimator
and adding a sampling functionality through the ``sample`` method:
The available samplers follow the
`scikit-learn API <https://scikit-learn.org/stable/getting_started.html#fitting-and-predicting-estimator-basics>`_
using the base estimator
and incorporating a sampling functionality via the ``fit_resample`` method:

:Estimator:

The base object, implements a ``fit`` method to learn from data, either::
The base object implements a ``fit`` method to learn from data::

estimator = obj.fit(data, targets)

:Resampler:

To resample a data sets, each sampler implements::
To resample a data set, each sampler implements a ``fit_resample`` method::

data_resampled, targets_resampled = obj.fit_resample(data, targets)
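
As a rough illustration of this call pattern (the choice of
:class:`~imblearn.under_sampling.RandomUnderSampler` and of the toy dataset below is
arbitrary; any sampler exposes the same two methods)::

    from sklearn.datasets import make_classification
    from imblearn.under_sampling import RandomUnderSampler

    # toy imbalanced dataset: ~90% of samples in class 0, ~10% in class 1
    data, targets = make_classification(
        n_samples=1_000, weights=[0.9, 0.1], random_state=0
    )

    sampler = RandomUnderSampler(random_state=0)
    # learn the resampling strategy and return the balanced dataset in one call
    data_resampled, targets_resampled = sampler.fit_resample(data, targets)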

Imbalanced-learn samplers accept the same inputs that in scikit-learn:
Imbalanced-learn samplers accept the same inputs as scikit-learn estimators:

* `data`:
* 2-D :class:`list`,
* 2-D :class:`numpy.ndarray`,
* :class:`pandas.DataFrame`,
* :class:`scipy.sparse.csr_matrix` or :class:`scipy.sparse.csc_matrix`;
* `targets`:
* 1-D :class:`numpy.ndarray`,
* :class:`pandas.Series`.
* `data`, 2-dimensional array-like structures, such as:
* Python's list of lists :class:`list`,
* Numpy arrays :class:`numpy.ndarray`,
* Pandas dataframes :class:`pandas.DataFrame`,
* Scipy sparse matrices :class:`scipy.sparse.csr_matrix` or :class:`scipy.sparse.csc_matrix`;

* `targets`, 1-dimensional array-like structures, such as:
* Numpy arrays :class:`numpy.ndarray`,
* Pandas series :class:`pandas.Series`.

The output will be of the following type:

* `data_resampled`:
* 2-D :class:`numpy.ndarray`,
* :class:`pandas.DataFrame`,
* :class:`scipy.sparse.csr_matrix` or :class:`scipy.sparse.csc_matrix`;
* `targets_resampled`:
* 1-D :class:`numpy.ndarray`,
* :class:`pandas.Series`.
* `data_resampled`, 2-dimensional array-like structures, such as:
* Numpy arrays :class:`numpy.ndarray`,
* Pandas dataframes :class:`pandas.DataFrame`,
* Scipy sparse matrices :class:`scipy.sparse.csr_matrix` or :class:`scipy.sparse.csc_matrix`;

* `targets_resampled`, 1-dimensional array-like structures, such as:
* Numpy arrays :class:`numpy.ndarray`,
* Pandas series :class:`pandas.Series`.
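
For instance, the sketch below (the sampler, the column names and the toy dataset are
only illustrative) passes a :class:`pandas.DataFrame` and a :class:`pandas.Series` and
gets back objects of the same types::

    import pandas as pd
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import RandomOverSampler

    X, y = make_classification(
        n_samples=100, n_features=4, weights=[0.9, 0.1], random_state=0
    )
    X = pd.DataFrame(X, columns=["a", "b", "c", "d"])
    y = pd.Series(y, name="target")

    X_resampled, y_resampled = RandomOverSampler(random_state=0).fit_resample(X, y)
    # X_resampled is expected to be a DataFrame and y_resampled a Series
    print(type(X_resampled), type(y_resampled))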

.. topic:: Pandas in/out

@@ -62,18 +66,19 @@ The output will be of the following type:
Problem statement regarding imbalanced data sets
------------------------------------------------

The learning phase and the subsequent prediction of machine learning algorithms
can be affected by the problem of imbalanced data set. The balancing issue
corresponds to the difference of the number of samples in the different
classes. We illustrate the effect of training a linear SVM classifier with
different levels of class balancing.
The learning and prediction phases of machine learning algorithms
can be impacted by the issue of **imbalanced datasets**. This imbalance
refers to the difference in the number of samples across different classes.
We demonstrate the effect of training a `Logistic Regression classifier
<https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html>`_
with varying levels of class balancing by adjusting the class weights.

.. image:: ./auto_examples/over-sampling/images/sphx_glr_plot_comparison_over_sampling_001.png
:target: ./auto_examples/over-sampling/plot_comparison_over_sampling.html
:scale: 60
:align: center

As expected, the decision function of the linear SVM varies greatly depending
upon how imbalanced the data is. With a greater imbalanced ratio, the decision
function favors the class with the larger number of samples, usually referred
as the majority class.
As expected, the decision function of the Logistic Regression classifier varies significantly
depending on how imbalanced the data is. With a greater imbalance ratio, the decision function
tends to favor the class with the larger number of samples, usually referred to as the
**majority class**.
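
A rough way to observe this effect in code (only a sketch with arbitrary dataset
parameters, not the code that produces the figure above) is to refit the classifier on
datasets generated with increasingly skewed class proportions::

    from collections import Counter

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    for weights in ([0.5, 0.5], [0.9, 0.1], [0.99, 0.01]):
        X, y = make_classification(
            n_samples=5_000, weights=weights, class_sep=0.8, random_state=0
        )
        clf = LogisticRegression(max_iter=1_000).fit(X, y)
        # with a stronger imbalance, the predictions drift towards the majority class
        print(weights, "->", Counter(y), Counter(clf.predict(X)))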