
Help: Wizard ‐ Challenge ‐ Metric


Make sure the "Name of the metric function" is the same as the name of the metric function in your "Code". You may have multiple functions in your code, but ChaLab needs to identify the main entry point.


This page lets you define the metric used to compute the score displayed on the leaderboard. The code is written in Python.
In the field "Name of the metric function", provide a function name that MATCHES the name of one of the functions in your Python file. Other helper functions may be provided.
The argument 'solution' contains the 'ground truth' values to be predicted (provided by the organizers), with samples in rows and values in columns.
The argument 'prediction' contains the corresponding predicted values (provided by the challenge participants).
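
As a minimal sketch, assuming ChaLab calls the named function as function(solution, prediction) with array-like inputs (the name my_metric and its body are illustrative, not part of the ChaLab API):

```python
import numpy as np

def my_metric(solution, prediction):
    """Example metric: accuracy after thresholding predictions at 0.5.

    solution   : array of shape (n_samples, n_values), ground truth
    prediction : array of the same shape, participant predictions
    """
    solution = np.asarray(solution)
    prediction = np.asarray(prediction)
    # Compare thresholded predictions to the (binary) ground truth.
    return float(np.mean((prediction > 0.5) == (solution > 0.5)))
```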

You may also select from a set of default metrics that were supplied in the AutoML challenge. All such metrics are re-scaled such that "guessing at random" gives a score around 0 and perfect prediction gives a score of 1. Negative scores are possible (worse than chance).

The scores are taken from the following list:

  • R2: R-square or "coefficient of determination", used for regression problems: R2 = 1 - MSE/VAR, where MSE = < (y_i - q_i)^2 > is the mean-square error and VAR = < (y_i - m)^2 > is the variance, with m = < y_i > (see the code sketch after this list).
  • ABS: A coefficient similar to R2 but based on the mean absolute error (MAE) and the mean absolute deviation (MAD): ABS = 1 - MAE/MAD, with MAE = < abs(y_i - q_i) > and MAD = < abs(y_i - m) >.
  • BAC: Balanced accuracy, the average of class-wise accuracy for classification problems (or the average of sensitivity (true positive rate) and specificity (true negative rate) in the special case of binary classification). For binary classification problems, the class-wise accuracy is the fraction of correct class predictions when q_i is thresholded at 0.5, for each class. The class-wise accuracy is averaged over all classes for multi-label problems. For multi-class classification problems, the predictions are binarized by selecting the class with maximum prediction value, argmax_k q_ik, before computing the class-wise accuracy. We normalize the BAC with the formula BAC := (BAC - R)/(1 - R), where R is the expected value of BAC for random predictions (i.e. R = 0.5 for binary classification and R = 1/C for C-class classification problems); see the code sketch after this list.
  • AUC: Area under the ROC curve, used for ranking and for binary classification problems. The ROC curve plots sensitivity vs. 1 - specificity as a threshold on the predictions is varied. The AUC is identical to the BAC for binary predictions. The AUC is calculated for each class separately before averaging over all classes. We normalize it with the formula AUC := 2*AUC - 1, making it de facto identical to the so-called Gini index.
  • F1 score: The harmonic mean of precision and recall. Precision = positive predictive value = true_positive/all_called_positive. Recall = sensitivity = true positive rate = true_positive/all_real_positive. Prediction thresholding and class averaging are handled as for the BAC. We also normalize F1 with F1 := (F1 - R)/(1 - R), where R is the expected value of F1 for random predictions (i.e. R = 0.5 for binary classification and R = 1/C for C-class classification problems).
  • PAC: Probabilistic accuracy PAC = exp(-CE), based on the cross-entropy or log loss: CE = - < sum_k y_ik log(q_ik) > for multi-class classification and CE = - < y_i log(q_i) + (1 - y_i) log(1 - q_i) > for binary classification and multi-label problems. Class averaging is performed after taking the exponential in the multi-label case. We normalize with PAC := (PAC - R)/(1 - R), where R is the score obtained using q_i = < y_i > or q_ik = < y_ik > (i.e. using as predictions the fraction of positive-class examples, an estimate of the prior probability).
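
To illustrate the re-scaling convention, here is a hedged sketch of two of the metrics above: R2 for regression and normalized BAC for binary classification. The function names are illustrative; the actual AutoML implementations may differ in detail (e.g. multi-class handling, edge cases).

```python
import numpy as np

def r2_metric(solution, prediction):
    # R2 = 1 - MSE/VAR: 1 for perfect predictions, ~0 for predicting the mean.
    mse = np.mean((solution - prediction) ** 2)
    var = np.mean((solution - np.mean(solution)) ** 2)
    return 1.0 - mse / var

def bac_metric(solution, prediction):
    # Binary balanced accuracy, predictions thresholded at 0.5.
    pred = prediction > 0.5
    pos = solution > 0.5
    sensitivity = np.mean(pred[pos])      # true positive rate
    specificity = np.mean(~pred[~pos])    # true negative rate
    bac = 0.5 * (sensitivity + specificity)
    # Normalize so that random guessing (R = 0.5 for binary) scores ~0.
    return (bac - 0.5) / (1.0 - 0.5)
```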

We note that for R2, ABS, and PAC the normalization uses a "trivial guess" corresponding to the average target value, q_i = < y_i > or q_ik = < y_ik >. In contrast, for BAC, AUC, and F1 the "trivial guess" is a random prediction of one of the classes with uniform probability.
In all formulas the brackets < . > designate the average over all P samples indexed by i: < y_i > = (1/P) sum_i y_i. Only R2 and ABS make sense for regression; we compute the other scores for completeness by replacing the target values with binary values obtained by thresholding them at mid-range.
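
For concreteness, a hedged sketch of the "trivial guess" normalization for the binary PAC described above (the name pac_metric is illustrative; inputs are assumed to be NumPy arrays of 0/1 targets and probabilistic predictions):

```python
import numpy as np

def pac_metric(solution, prediction, eps=1e-15):
    # Binary PAC = exp(-CE), normalized by the "trivial guess" q_i = <y_i>.
    q = np.clip(prediction, eps, 1 - eps)
    ce = -np.mean(solution * np.log(q) + (1 - solution) * np.log(1 - q))
    base = np.clip(np.mean(solution), eps, 1 - eps)  # fraction of positives
    ce_base = -np.mean(solution * np.log(base)
                       + (1 - solution) * np.log(1 - base))
    pac, r = np.exp(-ce), np.exp(-ce_base)           # r is the trivial score R
    return float((pac - r) / (1.0 - r))
```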