Coordinate descent in linear models: project discussion, summer 2012
binary:
name | size | N (train/test) | p | #nz (train) | used in | format |
---|---|---|---|---|---|---|
Leukemia | 1.9M | 72 | 3571 | dense | 1 | RData |
Newsgroup | 9.4M | 11,314 | 777,811 | 0.05% | 1 | RData |
Internet-Ad | 49K | 2359 | 1430 | 1.2% | 1 | RData |
a9a | 2.3M/1.1M | 32,561 / 16,281 | 123 | 451,592 (11%) | 2 | libsvm |
real-sim | 33.6M | 72,309 | 20,958 | 3,709,083 (0.2%) | 2 | libsvm |
rcv1 | 13.1M/432M | 20,242 / 677,399 | 47,236 | 49,556,258 (0.15%) | 2 | libsvm |
multiclass:
name | size | #class | N (train/test) | p | #nz | used in | format |
---|---|---|---|---|---|---|---|
news20 | 3.6M/0.9M | 20 | 15,935 / 3,993 | 1,355,191 | 9,097,916 (0.03%) | 2 | libsvm |
Cancer | 22M | 14 | 144 | 16,063 | dense | 1 | RData |
regression:
name | size | N | p | #nz | used in | format |
---|---|---|---|---|---|---|
Prostate Cancer Data | | 97 | 9 | dense | 3 | RData |
1 Friedman, J., T. Hastie, and R. Tibshirani. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software 33, no. 1 (2010): 1.
p.20 download data
2 Tibshirani, R., J. Bien, J. Friedman, T. Hastie, N. Simon, J. Taylor, and R.J. Tibshirani. “Strong Rules for Discarding Predictors in Lasso-type Problems.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) (2011).
p.3214 download data
3 Tibshirani, R. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society. Series B (Methodological) (1996): 267–288.
Questions:
- Which data sets should be used?
- Format to store them?
- Other regression type data sets?
agramfort : I would use mldata as much as possible, together with the mldata loader we ship with sklearn (fetch_mldata)
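For the libsvm-format sets, the N, p and #nz columns in the tables above can be checked directly from the files. Below is a minimal, dependency-free sketch. `libsvm_stats` is a hypothetical helper name, not part of scikit-learn. It assumes 1-based feature indices and takes p to be the largest index seen, so it can undercount p when trailing features never appear in the file. For actually loading the data, `sklearn.datasets.load_svmlight_file` is probably the better choice.

```python
def libsvm_stats(lines):
    """Count samples, features and non-zeros in libsvm-format text lines.

    Hypothetical helper for illustration only. Each line is expected to
    look like ``<label> <index>:<value> <index>:<value> ...``.
    """
    n_samples = 0
    max_index = 0
    nnz = 0
    for line in lines:
        # Drop trailing comments and skip blank lines.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        n_samples += 1
        for token in line.split()[1:]:  # first token is the label
            if token.startswith("qid:"):
                continue  # ranking query ids are not features
            index, value = token.split(":", 1)
            max_index = max(max_index, int(index))
            if float(value) != 0.0:
                nnz += 1
    cells = n_samples * max_index
    density = nnz / cells if cells else 0.0
    return {"N": n_samples, "p": max_index, "nnz": nnz, "density": density}


# Usage on a tiny in-memory example; a real file would be passed as
# ``libsvm_stats(open(path))``.
example = ["+1 1:0.5 3:1", "-1 2:2"]
stats = libsvm_stats(example)
```

For the two example rows this gives N = 2, p = 3 and 3 non-zeros, i.e. a density of 0.5.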
- l2 loss*
- log loss*
- multi-logit*
with l1 and l1 & l2 penalty
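As a point of reference for the l2-loss case with l1 and l1 & l2 penalties, here is a minimal sketch of cyclic coordinate descent. It assumes the glmnet-style objective from reference 1, (1/(2N)) * ||y - X b||^2 + lam * (alpha * ||b||_1 + (1 - alpha)/2 * ||b||_2^2), with no intercept and a dense list-of-rows input. The function names are illustrative; this is not the scikit-learn or glmnet implementation, and it runs a fixed number of sweeps instead of checking convergence.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def enet_coordinate_descent(X, y, lam, alpha=1.0, n_sweeps=100):
    """Cyclic coordinate descent for the elastic net with l2 loss.

    X is a list of rows, y a list of targets. alpha=1.0 gives the pure
    l1 penalty, 0 < alpha < 1 mixes l1 and l2.
    """
    n = len(X)
    p = len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, with beta = 0 initially
    # Mean squared value of each column, reused in every update.
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column keeps a zero coefficient
            # Correlation of column j with the partial residual
            # (the residual with feature j's contribution added back).
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n
            rho += col_sq[j] * beta[j]
            new = soft_threshold(rho, lam * alpha)
            new /= col_sq[j] + lam * (1.0 - alpha)
            delta = new - beta[j]
            if delta != 0.0:
                # Keep the residual in sync with the updated coefficient.
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


# Usage on a tiny orthogonal design.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [2.0, 0.1]
beta = enet_coordinate_descent(X, y, lam=0.5, alpha=1.0)  # [1.0, 0.0]
```

Keeping the residual up to date makes each coordinate update cost O(N) for dense data. The log loss and multi-logit cases need an additional outer loop and are not covered by this sketch.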
Questions:
- Settings to use in benchmarking (penalty values, etc.)?
agramfort : I would start with 2 extreme cases (high lambda or low lambda) for each of the two regimes, n_samples >> n_features and n_samples << n_features
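One way to make "high" and "low" lambda comparable across data sets is to express them as fractions of the smallest penalty at which all coefficients are zero. For the l2 loss under the glmnet-style objective of reference 1 (no intercept), that value is max_j |x_j^T y| / (N * alpha). The sketch below computes it. The helper names and the fractions 0.5 and 0.01 are assumptions made for illustration, not settings agreed in this discussion.

```python
def lambda_max(X, y, alpha=1.0):
    """Smallest lambda for which the all-zero solution is optimal.

    Assumes l2 loss scaled by 1/(2N), no intercept, and alpha > 0.
    X is a list of rows, y a list of targets.
    """
    if alpha <= 0.0:
        raise ValueError("alpha must be positive")
    n = len(X)
    p = len(X[0])
    largest = max(
        abs(sum(X[i][j] * y[i] for i in range(n))) for j in range(p)
    )
    return largest / (n * alpha)


def extreme_penalties(lam_max, high_frac=0.5, low_frac=0.01):
    """Return a 'high' and a 'low' lambda as fractions of lam_max.

    The default fractions are arbitrary placeholders.
    """
    return {"high": high_frac * lam_max, "low": low_frac * lam_max}


# Usage on a tiny example.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [2.0, 0.1]
settings = extreme_penalties(lambda_max(X, y))
```

Each data set would then be run with both penalties, once for a tall problem (n_samples >> n_features) and once for a wide one (n_samples << n_features).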
- glmnet
- glmnet-python ( version? )
- R glmnet package + rpy2 (latest version)
- liblinear
- liblinear + python interface (latest version)
Questions:
- How should execution be timed to achieve a fair comparison?
- Which glmnet interface should be used?
agramfort : I would start with rpy2, even if you pay the price of a copy, which is not really fair.
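On the timing question, one possible approach is to time only the solver call, repeat it a few times and keep the minimum, with data preparation done outside the timed region wherever the interface allows it. The sketch below uses only the standard library. `best_of` is a hypothetical helper name, and with rpy2 the copy mentioned above may still end up inside the timed call.

```python
import time


def best_of(fit, setup=None, repeats=5):
    """Return the fastest wall-clock time of ``fit`` over several runs.

    ``setup`` (optional) is called before every run, outside the timed
    region, and must return a tuple of arguments for ``fit``.
    """
    timings = []
    for _ in range(repeats):
        args = setup() if setup is not None else ()
        start = time.perf_counter()
        fit(*args)
        timings.append(time.perf_counter() - start)
    # The minimum is less sensitive to background load than the mean.
    return min(timings)


# Usage with a stand-in workload; a real benchmark would pass the
# solver's fit call and a setup that loads or converts the data.
elapsed = best_of(lambda data: sum(data), setup=lambda: (list(range(1000)),))
```

Reporting the minimum alongside the number of repeats makes the results easier to compare between glmnet, liblinear and scikit-learn runs.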
Tracking the speed of the scikit-learn implementations during development
tracking of execution times for the scikit-learn implementations of (2) on data sets (1) over time
vbench
Questions:
- Is some example code already available for scikit-learn?