Implement estimators of largescale sparse Gaussian densities
Soumyajit De
Short description: Computing loglikelihood for Gaussian distribution requires computation of logdeterminant of the covariance (or precision) matrix. Usual approach, based on Cholesky factorization, often suffers from huge memory requirement for fillin phenomenon when the covariance matrix is huge and sparse. This project aims for computing the logdeterminant in an efficient way, which makes use of a bunch of techniques from numerical linear algebra and complex analysis. The objective of this project is to approximate the matrixlogarithm up to an arbitrary precision and evaluate logdeterminant with reduced memory requirement, targeting for speeding up by enabling parallel computation of the components involved.
Me
 Soumyajit De (Rahul)
 Pursuing Masters in Computer Science and Engineering (prefinal year) in Indian Institute of Technology, Bombay, India
 I will be doing my Masters thesis in Machine Learning. I did my bachelor thesis in AI. This is the first time I am applying for GSoC.
 Github id.  lambday
Coding Skills(15)
 C/C++: 3 (5 years experience, used for most of my personal and bachelors projects, including building a framework for 3D image processing, writing a stochastic local search SAT solver. In Masters, used for machine learning course projects)
 Python: 2.5 (2 years experience, heavily used for course projects in natural language processing, including HMMbased POStagger, automatic sentence completion etc.)
 Matlab: 2 (1 years experience, mostly for small, personal projects)
 Java: 2.5 (4 years experience, mostly for research and industrial projects)
 Eigen3: 2 (used for entrance tasks in shogun)
 Shogun: 2 (used for one of my course projects and a few entrance tasks)
 I installed shogun and set up the development platform in a desktop with Intel Core i72600K CPU @ 3.40GHz running Fedora 16 x86_64, with 16GB RAM in my lab. I also have access to a server with 128GB of RAM and 32 Intel Xeon CPUs E7520 @ 1.87GHz running Ubuntu 11.04 x86_64
 I did courses on machine learning, natural language processing and statistical relational learning. So far I have mainly done course projects in machine learning, including a data mining contest in Kaggle (required performing classification using different families of classifiers), implementing a subsequence string kernel (I used shogun for this project), and using it for classification of protein data
Me and shogun
 Apart from course projects, I got involved with shogun in few entrance tasks related to this project and submitted patches. I also added a few utilities that might be useful for this project later
 The main target for this project is to make it memory efficient and make it ready for parallel computation for solving a number of linear systems through modularizing the subtasks carefully. I am also planning to aim for an implementation in which further extensions and/or performance tuning would be easy to integrate
 I did courses in probability and statistics, linear algebra, graph theory in my bachelors and machine learning in masters. I studied selected topics in numerical linear algebra on my own interest last semester and I'll be able to pick up specific concepts/algorithms that are required for this project on the way
My Project
 I got interested in this project mostly because of the scope of this project. I'm really interested in this area and working on this project would give me a hand on experience in dealing with many different topics and methods on a large scale. I'm hoping to learn a lot during the development phase and plan to continue working on this topic in future on possible extensions
 I think I am a suitable candidate because I have the overview of the tasks that has to be done in this project. I'm familiar with most of the concepts involved and fairly comfortable with the implementation part. I'm already familiar with shogun, eigen3, git/gitflow, valgrind, gtest and latex

I'll be free from my course loads after first week of May, so I am planning to start a bit early. I have to perform few routine tasks as a system administrator in my lab but it will not consume much time (510h a week max). I have no major engagement this summer and no courses next semester, so I'll be able to work 40+h/week from midMay to July. I may have to manage a few other things in August and Sept, but I'll still be able to manage working fulltime. After GSoC, I'll be a bit occupied till Decemeber, but from January onward I'll be working on various performance tuning related to this project
 going through the paper [1] in more detail and submit an extended schedule as required
 discussing about the implementation and alternative ideas with mentors, designing the basic structure
 studying Eigen3 sparse module intensively
 checking out various graph coloring libraries, testing
 integrating an external graph coloring library to shogun
 writing unit tests
 importing/rewriting (as required) related modules from existing implementation (Krylstat) for this
 writing unit tests
 reading up related algorithms, discussion with mentors, designing classes
 integrating eigenvalue computation module from existing implementation
 integrating the preconditioned CG solver from existing implementation
 writing unit tests
 discussing and designing about efficient implementation for solving bunch of systems (that appears in the expression for rational approximation in [1])
 writing the approximation of matrix log times vector expression using integration weight and shift coefficients (computation of which will require elliptic curve functions and extremal eigenvalues, as described in [2]) and the CG solver
 discussing ideas on incorporating parallelization targetting for running parallely on a cluster
 writing the log determinant by combining all the tasks
 write integration tests
 cleaning up the code, fixing bugs, documentation
 addition of more tests, examples
 getting the code ready for publication
I'll be blogging about my weekly progress and further plans here.
References
[1] Erlend Aune, Daniel Simpson and Jo Eidsvik, Parameter estimation in high dimensional Gaussian distributions, Preprint in Statistics, No. 5/2012, NTNU, 2012
[2] Hale, N., Higham, N. J., & Trefethen, L. N. (2008). Computing $A^{\alpha}$, $log(A)$, and related matrix functions by contour integrals. SIAM Journal on Numerical Analysis, 46(5), 25052523.