GSoC/GCI Archive
Google Summer of Code 2011

Shogun Machine Learning Toolbox (Technical University Berlin / Max Planck Campus Tübingen)

Web Page: http://www.shogun-toolbox.org/gsoc-ideas.html

Mailing List: http://www.shogun-toolbox.org/

SHOGUN is a machine learning toolbox designed for unified large-scale learning across a broad range of feature types and learning settings. It offers a considerable number of machine learning models, such as support vector machines for classification and regression, hidden Markov models, multiple kernel learning, linear discriminant analysis, linear programming machines, and perceptrons. Most of the specific algorithms can deal with several different data classes, including dense and sparse vectors and sequences using floating point or discrete data types. We have used this toolbox in several applications from computational biology, some with no fewer than 10 million training examples and others with 7 billion test examples. With more than a thousand installations worldwide, SHOGUN is already widely adopted in the machine learning community and beyond.

SHOGUN is implemented in C++ and offers interfaces to MATLAB, R, Octave, and Python, as well as a stand-alone command-line interface. The source code is freely available under the GNU General Public License, Version 3 at http://www.shogun-toolbox.org.

This summer we are looking to extend the library in four different ways: improving interfaces to other machine learning libraries (or integrating them when appropriate), improved I/O support, framework improvements, and new machine learning algorithms. A set of suggested projects is listed below.

Please use the scheme shown below for your student application. If you have any questions, ask on the mailing list (shogun-list@shogun-toolbox.org; please note that you have to be subscribed in order to post).

Google Summer of Code is now over, and it was a big success for Shogun: all students completed their projects successfully, and their source code has been merged into core Shogun. In addition to Shogun's GitHub page https://github.com/shogun-toolbox/shogun/ , the students' code contributions can be found at http://code.google.com/p/google-summer-of-code-2011-shogun-machine-learning-toolbox/ .

Projects

  • Build a flexible cross-validation framework into Shogun. Nearly every learning machine has parameters that have to be determined manually, and Shogun currently lacks a model selection framework. The goal of this project is therefore to extend Shogun to make cross-validation possible. Different strategies for splitting up the training data should be available and easy to exchange. Various model selection schemes will be integrated (train/validation/test split, n-fold cross-validation, leave-one-out cross-validation, etc.).
  • Framework for Online Features and Vowpal Wabbit Integration. Introduce support for 'streaming' features into Shogun through a framework, then integrate Vowpal Wabbit (VW), a very fast large-scale online learning algorithm based on stochastic gradient descent (SGD). This framework will enable the addition of further online learning algorithms to Shogun. The efficient implementation in VW will be borrowed and integrated into Shogun.
  • Implement missing dimensionality reduction algorithms. Dimensionality reduction is the process of finding a suitable low-dimensional representation of a high-dimensional dataset. One of the most important practical issues in applied machine learning, it is widely used for preprocessing real data. Its importance derives from the challenges presented by high-dimensional spaces, performance, and other issues. Although SHOGUN targets large-scale learning, it is missing preprocessors for dimensionality reduction, and this project aims to implement some of them.
  • Interface Shogun to new languages. This project adds support for new languages to Shogun, including Java, Ruby, and Lua. This lets the Shogun toolbox use the high-performance numerical libraries of these languages, which could make Shogun more widely used and more powerful.
  • Maximum likelihood (ML) estimation of the parameters of Gaussian Mixture Models with a carefully implemented EM algorithm. The Expectation-Maximization algorithm is well known in the machine learning community. The goal of the project will be a robust implementation of the Expectation-Maximization algorithm for Gaussian Mixture Models within the Shogun Machine Learning Toolbox. Computational tricks and techniques will be used to overcome computational problems inherent in the algorithm.
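
To illustrate the exchangeable splitting strategies mentioned in the cross-validation project, here is a minimal sketch in plain Python. It is not Shogun's API: the function names `k_fold_indices` and `leave_one_out_indices` are hypothetical, and the sketch only assumes that a splitting strategy can be expressed as a generator of (training, validation) index lists.

```python
import random


def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, val_idx) index lists for n-fold cross-validation.

    Hypothetical helper for illustration only; not part of Shogun.
    """
    if not 2 <= n_folds <= n_samples:
        raise ValueError("need 2 <= n_folds <= n_samples")
    indices = list(range(n_samples))
    # Shuffle once with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(indices)
    # Distribute the remainder so fold sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


def leave_one_out_indices(n_samples):
    """Leave-one-out is the special case with one fold per sample."""
    return k_fold_indices(n_samples, n_samples)


if __name__ == "__main__":
    for train_idx, val_idx in k_fold_indices(10, 3):
        print(len(train_idx), "train /", len(val_idx), "validation")
```

Because both strategies yield the same kind of pairs, an evaluation loop written against one of them works unchanged with the other, which is one way to make the splitting strategy easy to exchange.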
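
The Gaussian Mixture Model project mentions computational tricks without naming them. One widely used trick of this kind, shown here as an assumption rather than as the method the project adopted, is to carry out the E-step in log space with log-sum-exp so that very small densities do not underflow. The sketch below fits a one-dimensional mixture in plain Python; `em_gmm_1d`, its parameters, and the variance floor `min_var` are hypothetical choices made for illustration.

```python
import math


def log_sum_exp(values):
    """Compute log(sum(exp(v))) without underflow by factoring out the max."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_gaussian(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def em_gmm_1d(data, means, variances, weights, n_iter=50, min_var=1e-6):
    """EM for a 1-D Gaussian mixture (illustrative sketch, not Shogun code).

    Returns the fitted means, variances, weights, and the log-likelihood
    computed in the E-step of the final iteration.
    """
    means, variances, weights = list(means), list(variances), list(weights)
    k, n = len(means), len(data)
    log_likelihood = -math.inf
    for _ in range(n_iter):
        # E-step: responsibilities computed in log space.
        resp = []
        log_likelihood = 0.0
        for x in data:
            log_joint = [
                math.log(weights[j]) + log_gaussian(x, means[j], variances[j])
                for j in range(k)
            ]
            log_norm = log_sum_exp(log_joint)
            log_likelihood += log_norm
            resp.append([math.exp(lj - log_norm) for lj in log_joint])
        # M-step: re-estimate each component from its responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # empty component: keep its previous parameters
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            # Floor the variance so a component cannot collapse onto one point.
            variances[j] = max(var, min_var)
        total = sum(weights)
        weights = [w / total for w in weights]
    return means, variances, weights, log_likelihood
```

Two safeguards appear here: the log-space E-step, and a variance floor that prevents the likelihood from diverging when a component shrinks onto a single data point. A fuller implementation would also handle multivariate covariances and a convergence test.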