GSoC/GCI Archive
Google Summer of Code 2012

Shogun Machine Learning Toolbox

Web Page: http://www.shogun-toolbox.org/gsoc-ideas.html

Mailing List: mailto:shogun-list@shogun-toolbox.org

 

SHOGUN is a machine learning toolbox, which is designed for unified large-scale learning for a broad range of feature types and learning settings. It offers a considerable number of machine learning models such as support vector machines for classification and regression, hidden Markov models, multiple kernel learning, linear discriminant analysis, linear programming machines, and perceptrons. Most of the specific algorithms are able to deal with several different data classes, including dense and sparse vectors and sequences using floating point or discrete data types. We have used this toolbox in several applications from computational biology, some of them coming with no less than 10 million training examples and others with 7 billion test examples. With more than a thousand installations worldwide, SHOGUN is already widely adopted in the machine learning community and beyond.  

SHOGUN is implemented in C++ and provides interfaces to many languages, including MATLAB, R, Octave, Python, Lua, Java, C#, and Ruby, as well as a stand-alone command line interface. The source code is freely available under the GNU General Public License, Version 3, at http://www.shogun-toolbox.org.

During Summer of Code 2012 we are looking to extend the library in three different ways:

  1. Improving accessibility to Shogun by improving I/O support (more file formats) and mloss.org/mldata.org integration.
  2. Framework improvements (frameworks for regression, multiclass and structured output problems, and quadratic programming solvers).
  3. Integration of existing and new machine learning algorithms.

Below is a list of suggested projects.

Please use the scheme shown below for your student application. If you have any questions, ask on the mailing list (shogun-list@shogun-toolbox.org, please note that you have to be subscribed in order to post).

 

Projects

  • Build Generic Structured Output Learning Framework The aim is to implement tools for structured output (SO) problems. The data in these problems have complex structure (e.g. graphs, sequences), and traditional learning algorithms fail to find solutions efficiently. Structured output support vector machines and conditional random fields are methods for SO learning; they will be implemented to form Shogun's first module for SO learning. Finally, these methods will be applied to hidden Markov model-type problems such as gene prediction (the standard SO-SVM formulation is sketched after this list).
  • Build generic multiclass learning framework This project will implement a generic multiclass learning framework in Shogun. While Shogun is a state-of-the-art toolbox for binary classification, more multiclass methods need to be added to make it competitive in this area. Many real-world problems are naturally multiclass, so adding strong multiclass support to Shogun would benefit a large community.
  • Bundle method solver for structured output learning Learning structured output classifiers leads to a convex minimization problem that is not tractable with standard algorithms. The ML community has put significant effort into developing specialized solvers, among which the Bundle Method for Risk Minimization (BMRM), implemented e.g. in the popular StructSVM, is the current state of the art. BMRM is a simplified variant of the bundle methods that are standard tools for non-smooth optimization, and its simplicity comes at the cost of reduced efficiency. Experiments show that a careful implementation of the classical bundle method performs significantly faster (speedup ~5-10x) than the variants (like BMRM) adopted by the ML community. The goal will be an open-source library implementing the classical bundle method for SO learning and its integration into Shogun (a sketch of the BMRM iteration follows this list).
  • Implement multitask and domain adaptation algorithms Multitask learning is a modern approach to machine learning that learns a problem together with other related problems at the same time, using a shared representation. This approach often leads to a better model for the main task, because it allows the learner to use the commonality among the tasks. The proposed project is about implementing various multitask learning algorithms for the Shogun toolbox.
  • Implementation of latent SVM Implementation of a general purpose latent SVM.
  • Implementing Gaussian Process Regression in Shogun This project focuses on implementing Gaussian Process Regression with hyperparameter learning in Shogun. The goal is to make the implementation easily extendable and able to handle large datasets through sparse approximation (the standard GP predictive equations are sketched after this list).
  • Kernel based two-sample and independence test Statistical tests for dependence or difference are an important tool in data analysis. However, when data is high-dimensional or in non-numerical form (strings, graphs), classical methods fail. This project implements recently developed kernel-based generalizations of statistical tests, which overcome this issue. The kernel two-sample test based on the Maximum Mean Discrepancy (MMD) tests whether two sets of samples are drawn from the same or from different distributions. Related to it is the Hilbert-Schmidt Independence Criterion (HSIC), which tests for statistical dependence between two sets of samples. Multiple tests based on the MMD and the HSIC will be implemented, along with a general framework for statistical tests in SHOGUN (a minimal MMD sketch follows this list).
  • Various usability improvements Shogun is a fairly large project, and it needs more than just machine learning algorithms: maintenance, improvement of the individual parts, and integrating them into the interfaces are important tasks. The proposed project is about various usability improvements for the Shogun toolbox.
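
For reference, the structured output SVM mentioned in the first project can be written in the standard margin-rescaling formulation (textbook notation with joint feature map \(\Psi\) and loss \(\Delta\); this is not Shogun-specific code):

    \[
    \min_{w,\,\xi \ge 0} \;\; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i
    \quad \text{s.t.} \quad
    \langle w, \Psi(x_i, y_i) \rangle - \langle w, \Psi(x_i, y) \rangle \ge \Delta(y_i, y) - \xi_i
    \quad \forall i,\ \forall y \ne y_i .
    \]

Training repeatedly solves the loss-augmented decoding problem \(\hat{y}_i = \arg\max_y \Delta(y_i, y) + \langle w, \Psi(x_i, y)\rangle\) (e.g. by Viterbi decoding for sequence/HMM-type problems) to find the most violated constraints.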
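
As a sketch of the bundle method project: BMRM minimizes the regularized risk \(F(w) = \frac{\lambda}{2}\|w\|^2 + R(w)\), where the convex but non-smooth empirical risk \(R\) is replaced by a cutting-plane model built from subgradients. With \(a_j \in \partial R(w_j)\) and \(b_j = R(w_j) - \langle a_j, w_j \rangle\), iteration \(t\) solves the reduced master problem

    \[
    w_{t+1} = \arg\min_{w} \; \frac{\lambda}{2}\|w\|^2 + \max_{1 \le j \le t} \big( \langle a_j, w \rangle + b_j \big),
    \]

and stops once the gap between \(F\) and the cutting-plane model falls below a tolerance. Classical bundle methods additionally stabilize the master problem with a proximal term around the current best iterate, which is where the speedup quoted above comes from.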
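
For the Gaussian process regression project, the standard predictive equations (again textbook notation, not the Shogun API) for training inputs X, targets y, kernel matrix K, noise variance \(\sigma_n^2\), and a test point \(x_*\) are

    \[
    \mu_* = k_*^\top (K + \sigma_n^2 I)^{-1} y, \qquad
    \sigma_*^2 = k(x_*, x_*) - k_*^\top (K + \sigma_n^2 I)^{-1} k_* ,
    \]

with \(k_* = (k(x_*, x_1), \ldots, k(x_*, x_n))^\top\). Hyperparameters such as the kernel width and noise variance are learned by maximizing the log marginal likelihood

    \[
    \log p(y \mid X) = -\tfrac{1}{2}\, y^\top (K + \sigma_n^2 I)^{-1} y - \tfrac{1}{2} \log\lvert K + \sigma_n^2 I \rvert - \tfrac{n}{2} \log 2\pi .
    \]

The \(O(n^3)\) cost of these solves is what motivates the sparse approximations mentioned in the project description.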
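
To make the kernel two-sample test concrete, below is a minimal NumPy sketch of the biased MMD^2 statistic with a Gaussian kernel and a permutation test for the p-value. It illustrates the statistic only; the function names and bandwidth choice are ours, and it does not use Shogun's classes:

    import numpy as np

    def gaussian_kernel(a, b, sigma=1.0):
        # Pairwise Gaussian (RBF) kernel matrix between rows of a and b.
        sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2 * a @ b.T
        return np.exp(-sq / (2 * sigma**2))

    def mmd2_biased(x, y, sigma=1.0):
        # Biased estimator: MMD^2 = E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)].
        return (gaussian_kernel(x, x, sigma).mean()
                + gaussian_kernel(y, y, sigma).mean()
                - 2 * gaussian_kernel(x, y, sigma).mean())

    def mmd_permutation_test(x, y, sigma=1.0, n_perm=500, seed=0):
        # Approximate the null distribution by randomly re-splitting the pooled sample.
        rng = np.random.default_rng(seed)
        observed = mmd2_biased(x, y, sigma)
        pooled = np.vstack([x, y])
        n = len(x)
        exceed = 0
        for _ in range(n_perm):
            idx = rng.permutation(len(pooled))
            if mmd2_biased(pooled[idx[:n]], pooled[idx[n:]], sigma) >= observed:
                exceed += 1
        return observed, (exceed + 1) / (n_perm + 1)

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        x = rng.normal(0.0, 1.0, size=(100, 2))   # samples from p
        y = rng.normal(0.5, 1.0, size=(100, 2))   # samples from q (shifted mean)
        stat, p = mmd_permutation_test(x, y)
        print(f"MMD^2 = {stat:.4f}, permutation p-value = {p:.3f}")

The HSIC independence test has the same flavour: it is the squared Hilbert-Schmidt norm of the cross-covariance operator, estimated from centred kernel matrices computed on the two sets of samples.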