Given a new sample, we denote it by $\mathbf{x} = (1, x_1, \ldots, x_d)^T$, where the first element is the bias term and the others are the feature values.
-
Binary problem
Consider a binary classification task with a positive class and a negative class. Denote the nodes in the hidden layer by $h_1, \ldots, h_m$ and the incoming weights to $h_j$ by $\mathbf{w}_j$. Then

$$a_j = \mathbf{w}_j^T \mathbf{x}$$

and

$$h_j = g(a_j),$$

where $g$ is an activation function of your choice. Using similar notations, we write $\mathbf{h} = (1, h_1, \ldots, h_m)^T$ and denote the incoming weights to the output node by $\mathbf{v}$, so we have

$$z = \mathbf{v}^T \mathbf{h},$$

and the probability that the new sample is positive is

$$P(y = +1 \mid \mathbf{x}) = \frac{1}{1 + e^{-z}}.$$
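To make this concrete, here is a minimal NumPy sketch of the binary forward pass; the relu choice for $g$ and all weight values are illustrative assumptions, not a trained model.

```python
import numpy as np

def predict_proba_positive(x, W, v, g=lambda a: np.maximum(0.0, a)):
    """Forward pass for the binary case.

    x : (d+1,) input with a leading 1 for the bias term.
    W : (m, d+1) matrix; row j holds the incoming weights w_j of hidden unit h_j.
    v : (m+1,) incoming weights of the output node (first entry is its bias weight).
    g : activation function (relu here, as an example).
    """
    h = np.concatenate(([1.0], g(W @ x)))  # h = (1, h_1, ..., h_m), with h_j = g(w_j^T x)
    z = v @ h                              # z = v^T h
    return 1.0 / (1.0 + np.exp(-z))        # sigmoid gives P(positive | x)

# Toy example: d = 2 features, m = 3 hidden units, arbitrary weights.
rng = np.random.default_rng(0)
x = np.array([1.0, 0.5, -1.2])             # (1, x_1, x_2)
print(predict_proba_positive(x, rng.normal(size=(3, 3)), rng.normal(size=4)))
```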
-
Multiclass problem
Consider a multiclass classification task with $K$ classes $C_1, \ldots, C_K$. Using the same notation as above, we have

$$h_j = g(\mathbf{w}_j^T \mathbf{x}), \quad \mathbf{h} = (1, h_1, \ldots, h_m)^T.$$

Then, define

$$z_k = \mathbf{v}_k^T \mathbf{h}, \quad k = 1, \ldots, K,$$

where $\mathbf{v}_k$ denotes the incoming weights to the output node of class $C_k$. We then get

$$P(y = C_k \mid \mathbf{x}) = \frac{e^{z_k}}{\sum_{l=1}^{K} e^{z_l}}.$$
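Analogously, here is a short sketch of the multiclass forward pass; the max-subtraction inside the softmax is a standard numerical-stability trick and does not change the probabilities.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract max(z) to avoid overflow
    return e / e.sum()

def predict_proba_multiclass(x, W, V, g=np.tanh):
    """x : (d+1,) input with a leading 1 for the bias term.
    W : (m, d+1) hidden-layer weights, one row per hidden unit.
    V : (K, m+1) output weights; row k holds v_k for class C_k.
    """
    h = np.concatenate(([1.0], g(W @ x)))  # h = (1, h_1, ..., h_m)
    return softmax(V @ h)                  # P(y = C_k | x) for k = 1..K
```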
-
Activation functions
The activation function $g$ can be any of the choices exposed by sklearn: relu ($g(a) = \max(0, a)$), tanh ($g(a) = \tanh(a)$), logistic ($g(a) = 1/(1 + e^{-a})$), or identity ($g(a) = a$).
In both cases, the weights are learned by minimizing a loss of the form

$$L(\boldsymbol{\theta}) = \mathrm{CrossEntropy}(\boldsymbol{\theta}) + \alpha \|\boldsymbol{\theta}\|_2^2,$$

where $\boldsymbol{\theta}$ is a vector containing all weights and $\alpha$ is a constant that determines the strength of regularization.
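As a sanity check on how the penalty enters the objective, here is a minimal NumPy sketch of the penalized binary loss; the exact scaling sklearn applies to the penalty term differs slightly, so treat this as illustrative.

```python
import numpy as np

def penalized_log_loss(p, y, theta, alpha):
    """Binary cross-entropy plus an L2 penalty (illustrative form).

    p     : (n,) predicted positive-class probabilities.
    y     : (n,) true labels in {0, 1}.
    theta : 1-D array collecting all weights of the network.
    alpha : regularization strength.
    """
    eps = 1e-12  # guards against log(0)
    ce = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return ce + alpha * np.dot(theta, theta)  # alpha * ||theta||_2^2
```

The software exposes the following parameters (a usage sketch in sklearn follows the list):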
- num_hidden_units: the number of units in the hidden layer
- activation: the activation function for the hidden layer
- solver: the learning algorithm used to optimize the loss function
- penalty: regularization strength, i.e. the constant $\alpha$ above (larger values lead to stronger regularization)
- batch_size: the number of samples in each batch used in stochastic optimization
- learning_rate: learning rate schedule for weight updates
- constant: uses constant rate given by learning_rate_init.
- invscaling: the learning rate gradually decreases from the initial rate given by learning_rate_init.
- adaptive: the learning rate is kept constant as long as the loss keeps decreasing, and is divided by 5 each time two consecutive iterations fail to decrease the loss. The initial rate is given by learning_rate_init.
- learning_rate_init: the initial learning rate
- early_stopping: whether to terminate training early when the validation score fails to improve
Stopping criteria:
- tol: minimum reduction in loss required for optimization to continue.
- max_iter: maximum number of iterations allowed for the learning algorithm to converge.
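As referenced above, here is a minimal sketch wiring these options into sklearn's MLPClassifier; the mapping from the software's parameter names to sklearn's (num_hidden_units → hidden_layer_sizes, penalty → alpha) is an assumption, and all values are illustrative.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = MLPClassifier(
    hidden_layer_sizes=(100,),   # num_hidden_units
    activation="relu",           # activation
    solver="sgd",                # solver (the schedule below applies to sgd)
    alpha=1e-4,                  # penalty: regularization strength
    batch_size=32,               # batch_size
    learning_rate="adaptive",    # learning_rate schedule
    learning_rate_init=0.01,     # learning_rate_init
    early_stopping=True,         # early_stopping
    tol=1e-4,                    # tol
    max_iter=200,                # max_iter
    random_state=0,
)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))      # accuracy on held-out data
print(clf.predict_proba(X_test[:3]))  # class probabilities, as derived above
```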
Check out the documentation listed below to view the attributes that are available in sklearn but not exposed to the user in the software.
- sklearn tutorial on neural networks.
- sklearn MLPClassifier documentation.
- Stanford CS231n lecture notes on neural networks.