
Commit eaf0a04

Authored by Sylvain MARIE (smarie) and Chiara Marmo (cmarmo)
FIX Libsvm and liblinear rand() fix for convergence on windows targets (and improvement on all targets) (scikit-learn#13511)
* Fixed random number generation on windows. See cjlin1/liblinear#28
* Added correct PR number
* Fixed random number generator for libsvm on windows targets
* Updated comments
* Suppressed C4293 warnings using explicit cast.
* Suppressed C4293 warnings using explicit cast.
* Following code review: `myrand` is now inline, and all integer size checks for the random generation fix are now static.
* Dummy comment edit to re-trigger build
* Fixed include error
* Attempt to provide a better and faster result by using a Mersenne Twister random number generator as suggested by https://codeforces.com/blog/entry/61587
* Attempt to integrate the extra compiler arg requesting explicit C++11 support
* Attempt to integrate the extra compiler arg requesting explicit C++11 support: liblinear extension.
* The `extra_compile_args` for C++11 was not working when used in `config.add_extension`, so the liblinear library is now pre-created first with the flag.
* Fixed liblinear extension build: there was a missing library dependency.
* Fixed liblinear library dependency for extension compilation.
* Fixed liblinear library compilation: added tron
* Now correctly setting the seed in libsvm
* Now correctly setting the seed in liblinear. This is done using a new interface function `set_seed`.
* Updated what's new accordingly
* Increased tolerance when comparing `_average_precision` with `_average_precision_slow`: `_average_precision` is based on prediction ordering and has errors in case of ties (e.g. several 0.5 prediction values).
* Minor improvement of this test method for readability (no change in output)
* Minor doc update as per review
* Dropped commented line
* Reverted test readability improvement change
* Using double backticks
* Clarified all comments
* Removed useless space
* Clarified all comments, and included a `set_seed` method as per code review.
* Updated what's new about changed models as suggested by review.
* Replaced random number generator: now using a `bounded_rand_int` with the tweaked Lemire post-processor from http://www.pcg-random.org/posts/bounded-rands.html
* Updated readme rst and moved it to 0.22
* What's new moved from 0.22 to 0.23
* Updated docs
* Update doc/whats_new/v0.23.rst (Co-Authored-By: Chiara Marmo <[email protected]>)
* Dummy change to re-trigger the CI build
* Update doc/whats_new/v0.21.rst (ten review-suggestion commits)
* Update doc/whats_new/v0.22.rst
* As per code review: added LogisticRegression to the list of changes, and repeated the list of affected classes in the changelog.
* New random number generator used in libsvm / liblinear is now in an independent header file `newrand/newrand.h`.
* Update sklearn/svm/src/liblinear/linear.cpp
* Fixed dates in headers and added a header to newrand.h
* Added a sentence in the changelog to explain the impact in user-friendly terms.
* Added link to PR in readme file and removed commented lines as per code review

Co-authored-by: Sylvain MARIE <[email protected]>
Co-authored-by: Chiara Marmo <[email protected]>
1 parent e3e9137 commit eaf0a04

File tree

9 files changed (+188 −48 lines)

doc/whats_new/v0.23.rst (+27 −1)

@@ -25,6 +25,12 @@ random sampling procedures.
 - :class:`ensemble.BaggingClassifier`, :class:`ensemble.BaggingRegressor`,
   and :class:`ensemble.IsolationForest`. |Fix|
+
+- Any model using the :func:`svm.libsvm` or the :func:`svm.liblinear` solver,
+  including :class:`svm.LinearSVC`, :class:`svm.LinearSVR`,
+  :class:`svm.NuSVC`, :class:`svm.NuSVR`, :class:`svm.OneClassSVM`,
+  :class:`svm.SVC`, :class:`svm.SVR`, :class:`linear_model.LogisticRegression`.
+  |Efficiency| |Fix|

 Details are listed in the changelog below.

 (While we are trying to better inform users by providing this information, we

@@ -296,7 +302,7 @@ Changelog
 - |API| Changed the formatting of values in
   :meth:`metrics.ConfusionMatrixDisplay.plot` and
   :func:`metrics.plot_confusion_matrix` to pick the shorter format (either '2g'
-  or 'd'). :pr:`16159` by :user:`Rick Mackenbach <Rick-Mackenbach>` and
+  or 'd'). :pr:`16159` by :user:`Rick Mackenbach <Rick-Mackenbach>` and
   `Thomas Fan`_.
(this hunk only removes trailing whitespace)

 - |Enhancement| :func:`metrics.pairwise.pairwise_distances_chunked` now allows

@@ -378,6 +384,26 @@ Changelog
 :mod:`sklearn.svm`
 ..................

+- |Fix| |Efficiency| Improved ``libsvm`` and ``liblinear`` random number
+  generators used to randomly select coordinates in the coordinate descent
+  algorithms. Platform-dependent C ``rand()`` was used, which is only able to
+  generate numbers up to ``32767`` on windows platform (see this `blog
+  post <https://codeforces.com/blog/entry/61587>`) and also has poor
+  randomization power as suggested by `this presentation
+  <https://channel9.msdn.com/Events/GoingNative/2013/rand-Considered-Harmful>`.
+  It was replaced with C++11 ``mt19937``, a Mersenne Twister that correctly
+  generates 31bits/63bits random numbers on all platforms. In addition, the
+  crude "modulo" postprocessor used to get a random number in a bounded
+  interval was replaced by the tweaked Lemire method as suggested by `this blog
+  post <http://www.pcg-random.org/posts/bounded-rands.html>`.
+  Any model using the :func:`svm.libsvm` or the :func:`svm.liblinear` solver,
+  including :class:`svm.LinearSVC`, :class:`svm.LinearSVR`,
+  :class:`svm.NuSVC`, :class:`svm.NuSVR`, :class:`svm.OneClassSVM`,
+  :class:`svm.SVC`, :class:`svm.SVR`, :class:`linear_model.LogisticRegression`,
+  is affected. In particular users can expect a better convergence when the
+  number of samples (LibSVM) or the number of features (LibLinear) is large.
+  :pr:`13511` by :user:`Sylvain Marié <smarie>`.
+
 - |API| :class:`svm.SVR` and :class:`svm.OneClassSVM` attributes, `probA_` and
   `probB_`, are now deprecated as they were not useful. :pr:`15558` by
   `Thomas Fan`_.

sklearn/metrics/tests/test_ranking.py (+2 −1)

@@ -737,8 +737,9 @@ def _test_precision_recall_curve(y_true, probas_pred):
     assert_array_almost_equal(precision_recall_auc, 0.859, 3)
     assert_array_almost_equal(precision_recall_auc,
                               average_precision_score(y_true, probas_pred))
+    # `_average_precision` is not very precise in case of 0.5 ties: be tolerant
     assert_almost_equal(_average_precision(y_true, probas_pred),
-                        precision_recall_auc, decimal=3)
+                        precision_recall_auc, decimal=2)
     assert p.size == r.size
     assert p.size == thresholds.size + 1
     # Smoke test in the case of proba having only one value

sklearn/svm/setup.py (+28 −7)

@@ -16,22 +16,27 @@ def configuration(parent_package='', top_path=None):
     config.add_library('libsvm-skl',
                        sources=[join('src', 'libsvm', 'libsvm_template.cpp')],
                        depends=[join('src', 'libsvm', 'svm.cpp'),
-                                join('src', 'libsvm', 'svm.h')],
+                                join('src', 'libsvm', 'svm.h'),
+                                join('src', 'newrand', 'newrand.h')],
                        # Force C++ linking in case gcc is picked up instead
                        # of g++ under windows with some versions of MinGW
                        extra_link_args=['-lstdc++'],
+                       # Use C++11 to use the random number generator fix
+                       extra_compiler_args=['-std=c++11'],
                        )

     libsvm_sources = ['_libsvm.pyx']
     libsvm_depends = [join('src', 'libsvm', 'libsvm_helper.c'),
                       join('src', 'libsvm', 'libsvm_template.cpp'),
                       join('src', 'libsvm', 'svm.cpp'),
-                      join('src', 'libsvm', 'svm.h')]
+                      join('src', 'libsvm', 'svm.h'),
+                      join('src', 'newrand', 'newrand.h')]

     config.add_extension('_libsvm',
                          sources=libsvm_sources,
                          include_dirs=[numpy.get_include(),
-                                       join('src', 'libsvm')],
+                                       join('src', 'libsvm'),
+                                       join('src', 'newrand')],
                          libraries=['libsvm-skl'],
                          depends=libsvm_depends,
                          )

@@ -41,16 +46,30 @@ def configuration(parent_package='', top_path=None):
     if os.name == 'posix':
         libraries.append('m')

-    liblinear_sources = ['_liblinear.pyx',
-                         join('src', 'liblinear', '*.cpp')]
+    # precompile liblinear to use C++11 flag
+    config.add_library('liblinear-skl',
+                       sources=[join('src', 'liblinear', 'linear.cpp'),
+                                join('src', 'liblinear', 'tron.cpp')],
+                       depends=[join('src', 'liblinear', 'linear.h'),
+                                join('src', 'liblinear', 'tron.h'),
+                                join('src', 'newrand', 'newrand.h')],
+                       # Force C++ linking in case gcc is picked up instead
+                       # of g++ under windows with some versions of MinGW
+                       extra_link_args=['-lstdc++'],
+                       # Use C++11 to use the random number generator fix
+                       extra_compiler_args=['-std=c++11'],
+                       )

+    liblinear_sources = ['_liblinear.pyx']
     liblinear_depends = [join('src', 'liblinear', '*.h'),
+                         join('src', 'newrand', 'newrand.h'),
                          join('src', 'liblinear', 'liblinear_helper.c')]

     config.add_extension('_liblinear',
                          sources=liblinear_sources,
-                         libraries=libraries,
+                         libraries=['liblinear-skl'] + libraries,
                          include_dirs=[join('.', 'src', 'liblinear'),
+                                       join('.', 'src', 'newrand'),
                                        join('..', 'utils'),
                                        numpy.get_include()],
                          depends=liblinear_depends,

@@ -64,8 +83,10 @@ def configuration(parent_package='', top_path=None):
     config.add_extension('_libsvm_sparse', libraries=['libsvm-skl'],
                          sources=libsvm_sparse_sources,
                          include_dirs=[numpy.get_include(),
-                                       join("src", "libsvm")],
+                                       join("src", "libsvm"),
+                                       join("src", "newrand")],
                          depends=[join("src", "libsvm", "svm.h"),
+                                  join('src', 'newrand', 'newrand.h'),
                                   join("src", "libsvm",
                                        "libsvm_sparse_helper.c")])

sklearn/svm/src/liblinear/liblinear_helper.c (+1 −1)

@@ -182,7 +182,7 @@ struct parameter *set_parameter(int solver_type, double eps, double C,
     if (param == NULL)
         return NULL;

-    srand(seed);
+    set_seed(seed);
     param->solver_type = solver_type;
     param->eps = eps;
     param->C = C;

sklearn/svm/src/liblinear/linear.cpp (+43 −33)

@@ -22,6 +22,13 @@
    Modified 2015:
    - Patched liblinear for sample_weights - Manoj Kumar
      See https://github.com/scikit-learn/scikit-learn/pull/5274
+
+   Modified 2020:
+   - Improved random number generator by using a mersenne twister + tweaked
+     lemire postprocessor. This fixed a convergence issue on windows targets.
+     Sylvain Marie
+     See <https://github.com/scikit-learn/scikit-learn/pull/13511#issuecomment-481729756>
+
 */

@@ -32,6 +39,10 @@
 #include <locale.h>
 #include "linear.h"
 #include "tron.h"
+#include <climits>
+#include <random>
+#include "../newrand/newrand.h"
+
 typedef signed char schar;
 template <class T> static inline void swap(T& x, T& y) { T t=x; x=y; y=t; }
 #ifndef min

Every shuffling call to C `rand()` is replaced by `bounded_rand_int`:

@@ -615,7 +626,7 @@ int Solver_MCSVM_CS::Solve(double *w)
 		double stopping = -INF;
 		for(i=0;i<active_size;i++)
 		{
-			int j = i+rand()%(active_size-i);
+			int j = i+bounded_rand_int(active_size-i);
 			swap(index[i], index[j]);
 		}
 		for(s=0;s<active_size;s++)

@@ -888,7 +899,7 @@ static int solve_l2r_l1l2_svc(
 	for (i=0; i<active_size; i++)
 	{
-		int j = i+rand()%(active_size-i);
+		int j = i+bounded_rand_int(active_size-i);
 		swap(index[i], index[j]);
 	}

@@ -1107,7 +1118,7 @@ static int solve_l2r_l1l2_svr(
 	for(i=0; i<active_size; i++)
 	{
-		int j = i+rand()%(active_size-i);
+		int j = i+bounded_rand_int(active_size-i);
 		swap(index[i], index[j]);
 	}

@@ -1333,7 +1344,7 @@ int solve_l2r_lr_dual(const problem *prob, double *w, double eps, double Cp, dou
 	{
 		for (i=0; i<l; i++)
 		{
-			int j = i+rand()%(l-i);
+			int j = i+bounded_rand_int(l-i);
 			swap(index[i], index[j]);
 		}
 		int newton_iter = 0;

@@ -1521,7 +1532,7 @@ static int solve_l1r_l2_svc(
 	for(j=0; j<active_size; j++)
 	{
-		int i = j+rand()%(active_size-j);
+		int i = j+bounded_rand_int(active_size-j);
 		swap(index[i], index[j]);
 	}

@@ -1903,7 +1914,7 @@ static int solve_l1r_lr(
 	for(j=0; j<QP_active_size; j++)
 	{
-		int i = j+rand()%(QP_active_size-j);
+		int i = j+bounded_rand_int(QP_active_size-j);
 		swap(index[i], index[j]);
 	}

@@ -2594,7 +2605,7 @@ void cross_validation(const problem *prob, const parameter *param, int nr_fold,
 	for(i=0;i<l;i++) perm[i]=i;
 	for(i=0;i<l;i++)
 	{
-		int j = i+rand()%(l-i);
+		int j = i+bounded_rand_int(l-i);
 		swap(perm[i],perm[j]);
 	}
 	for(i=0;i<=nr_fold;i++)

The remaining hunks (@@ -1, -456, -579, -775, -793, -1009, -1025, -1253, -2234 and -3057) only fix whitespace: trailing spaces are stripped from comment blocks and a trailing blank line is removed at end of file.

sklearn/svm/src/liblinear/linear.h (+2)

@@ -49,6 +49,8 @@ struct model
 	int *n_iter;	/* no. of iterations of each class */
 };

+void set_seed(unsigned seed);
+
 struct model* train(const struct problem *prob, const struct parameter *param, BlasFunctions *blas_functions);
 void cross_validation(const struct problem *prob, const struct parameter *param, int nr_fold, double *target);
