A list of topics for a google summer of code (gsoc) 2012

A list of topics for a Google summer of code (GSOC) 2012

Important: Expectations for prospective students

Improvements and extensions to the Decision Tree Implementation

Possible Mentor: Andreas Mueller?

Possible Candidate: Vikram Kamath

Goal: The C5.0 is an algorithm used to construct m-ary decision trees. It is a successor to the C4.5 algorithm (which in turn is an extension of the ID3 algorithm), all of which were developed by Ross Quinlan. The C5.0 source (implemented in C) has been released under the GNU General Public License (GPL). The aim is to port it and hence make it a feature of sklearn. Additionally, documentation/examples can be created (I have learned from my interaction with Ross Quinlan that the documentation of the C5.0 has not been released under the GPL and is in fact, proprietary).

References:

1. http://www.rulequest.com/see5-info.html
1. http://www.rulequest.com/see5-unix.html
1. http://rulequest.com/see5-comparison.html.
1. http://www2.cs.uregina.ca/~dbd/cs831/notes/ml/dtrees/c4.5/tutorial.html
1. ai.stanford.edu/~ronnyk/treesHB.pdf

Online Low Rank Matrix Completion

Possible mentor: Olivier Grisel

Possible candidate: Vlad Niculae, ?

Goal: Online or Minibatch SGD or similar on a squared l2 reconstruction loss + low rank penalty (nuclear norm) on scipy.sparse matrix: the implicit components of the sparse input representation would be interpreted by the algorithms as missing values rather than zero values.

Application: Build a scalable recommender system example, e.g. on the movielens dataset.

TODO: find references in the literature.

Online Non Negative Matrix Factorization

Possible candidate: Vlad Niculae, Immanuel Bayer, ?

Goal: Online or Minibatch NMF using SGD + positive projections (or any other out-of-core algorithms) accepting both dense and sparse matrix as input (decomposition components can be dense array only).

Application: Build a scalable topic model e.g. on million of Wikipedia abstracts for instance using this script.

References:

http://research.microsoft.com/apps/pubs/default.aspx?id=143211

Robust PCA

Algorithms for decomposing a design matrix into a low rank + sparse components.

Possible mentor: ?

Possible candidate: Kerui Min (Minibio: "I'm a graduate student at UIUC who is currently pursuing the research work related to low-rank matrices recovery & Robust PCA.")

Applications: ?

References:

http://perception.csl.uiuc.edu/matrix-rank/home.html
http://www.icml-2011.org/papers/41_icmlpaper.pdf (randomized algorithm supposedly scalable to larg-ish datasets)

Multilayer Perceptron / Neural Network

Possible mentor: Andreas Mueller

Possible candidate: David Marek

Goal: Implement a stochastic gradient descent algorithm to learn a multi-layer perceptron, starting from https://gist.github.com/2061456.

References:

SVM with low rank kernel approximation

Possible mentor: Andreas Mueller

Goal: Implement a stochastic gradient descent SVM using a low-rank kernel approximation.

References:

http://pages.cs.wisc.edu/~swright/papers/sncss_tpami.pdf

Generalized Additive Models

Possible mentor: Paolo Losi, Alex Gramfort, (others?)

Goal: Implement one of the state of art methods for Generalized Additive Models Sparse Version of it is SpAM

References:

Coordinated descent in linear models beyond squared loss (eg Logistic)

Possible mentors: Alex Gramfort, Gael Varoquaux

Goal: Implement state of art methods for optimizing sparse linear models using coordinate descent.

One objective to avoid the dependency on LibLinear for the LogisticRegression model in order to allow warm restart and Elastic-Net regularization (L1 + L2)

A second objective is to improve the Lasso coordinate descent using strong rules to automatically discard features.

References:

Improve GMM

Possible mentors: Gael Varoquaux

Refurbish the current GMM code to put it to the scikit's standards
Implement a core-set strategy for GMM

http://las.ethz.ch/files/feldman11scalable-long.pdf http://videolectures.net/nips2011_faulkner_coresets/

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

A list of topics for a google summer of code (gsoc) 2012

A list of topics for a Google summer of code (GSOC) 2012

Improvements and extensions to the Decision Tree Implementation

Online Low Rank Matrix Completion

Online Non Negative Matrix Factorization

Robust PCA

Multilayer Perceptron / Neural Network

SVM with low rank kernel approximation

Generalized Additive Models

Coordinated descent in linear models beyond squared loss (eg Logistic)

Improve GMM

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally