Implement estimators of large-scale sparse Gaussian densities

Soumyajit De

Short description: Computing log-likelihood for Gaussian distribution requires computation of log-determinant of the covariance (or precision) matrix. Usual approach, based on Cholesky factorization, often suffers from huge memory requirement for fill-in phenomenon when the covariance matrix is huge and sparse. This project aims for computing the log-determinant in an efficient way, which makes use of a bunch of techniques from numerical linear algebra and complex analysis. The objective of this project is to approximate the matrix-logarithm up to an arbitrary precision and evaluate log-determinant with reduced memory requirement, targeting for speeding up by enabling parallel computation of the components involved.

Me

  • Soumyajit De (Rahul)
  • Pursuing Masters in Computer Science and Engineering (pre-final year) in Indian Institute of Technology, Bombay, India
  • I will be doing my Masters thesis in Machine Learning. I did my bachelor thesis in AI. This is the first time I am applying for GSoC.
  • Github id. - lambday

Coding Skills(1-5)

  • C/C++: 3 (5 years experience, used for most of my personal and bachelors projects, including building a framework for 3D image processing, writing a stochastic local search SAT solver. In Masters, used for machine learning course projects)
  • Python: 2.5 (2 years experience, heavily used for course projects in natural language processing, including HMM-based POS-tagger, automatic sentence completion etc.)
  • Matlab: 2 (1 years experience, mostly for small, personal projects)
  • Java: 2.5 (4 years experience, mostly for research and industrial projects)
  • Eigen3: 2 (used for entrance tasks in shogun)
  • Shogun: 2 (used for one of my course projects and a few entrance tasks)
  • I installed shogun and set up the development platform in a desktop with Intel Core i7-2600K CPU @ 3.40GHz running Fedora 16 x86_64, with 16GB RAM in my lab. I also have access to a server with 128GB of RAM and 32 Intel Xeon CPUs E7520  @ 1.87GHz running Ubuntu 11.04 x86_64
  • I did courses on machine learning, natural language processing and statistical relational learning. So far I have mainly done course projects in machine learning, including a data mining contest in Kaggle (required performing classification using different families of classifiers), implementing a subsequence string kernel (I used shogun for this project), and using it for classification of protein data

Me and shogun

  • Apart from course projects, I got involved with shogun in few entrance tasks related to this project and submitted patches. I also added a few utilities that might be useful for this project later
  • The main target for this project is to make it memory efficient and make it ready for parallel computation for solving a number of linear systems through modularizing the subtasks carefully. I am also planning to aim for an implementation in which further extensions and/or performance tuning would be easy to integrate
  • I did courses in probability and statistics, linear algebra, graph theory in my bachelors and machine learning in masters. I studied selected topics in numerical linear algebra on my own interest last semester and I'll be able to pick up specific concepts/algorithms that are required for this project on the way

My Project

  • I got interested in this project mostly because of the scope of this project. I'm really interested in this area and working on this project would give me a hand on experience in dealing with many different topics and methods on a large scale. I'm hoping to learn a lot during the development phase and plan to continue working on this topic in future on possible extensions
  • I think I am a suitable candidate because I have the overview of the tasks that has to be done in this project. I'm familiar with most of the concepts involved and fairly comfortable with the implementation part. I'm already familiar with shogun, eigen3, git/gitflow, valgrind, gtest and latex
  • I'll be free from my course loads after first week of May, so I am planning to start a bit early. I have to perform few routine tasks as a system administrator in my lab but it will not consume much time (5-10h a week max). I have no major engagement this summer and no courses next semester, so I'll be able to work 40+h/week from mid-May to July. I may have to manage a few other things in August and Sept, but I'll still be able to manage working full-time. After GSoC, I'll be a bit occupied till Decemeber, but from January onward I'll be working on various performance tuning related to this project
Schedule

May 7th - May 15th (preparation)
  • going through the paper [1] in more detail and submit an extended schedule as required
  • discussing about the implementation and alternative ideas with mentors, designing the basic structure
  • studying Eigen3 sparse module intensively
May 16th - May 31st (issue#873: importing graph coloring library for probing vectors)
  • checking out various graph coloring libraries, testing
  • integrating an external graph coloring library to shogun
  • writing unit tests
This will be needed for computing probing vectors using greedy graph coloring in the approximation of matrix log determinant
 
June 1st - June 15th (issue#949: importing elliptic curve functions for computing complex integration shifts and weights)
  • importing/rewriting (as required) related modules from existing implementation (Krylstat) for this
  • writing unit tests
This will be needed for approximating matrix log times vector expression in later stage.
 
June 16th - July 15th (importing conjugate gradient solver for systems in rational approximation of matrix log times vector)
  • reading up related algorithms, discussion with mentors, designing classes
  • integrating eigenvalue computation module from existing implementation
  • integrating the preconditioned CG solver from existing implementation
  • writing unit tests
July 16th - Aug 10th (approximation of the matrix log times vector expression)
  • discussing and designing about efficient implementation for solving bunch of systems (that appears in the expression for rational approximation in [1])
  • writing the approximation of matrix log times vector expression using integration weight and shift coefficients (computation of which will require elliptic curve functions and extremal eigenvalues, as described in [2]) and the CG solver
Aug 11th - Sept 10th (combining with probing vectors evaluated by graph coloring)
  • discussing ideas on incorporating parallelization targetting for running parallely on a cluster
  • writing the log determinant by combining all the tasks
  • write integration tests
Sept 11th - Sept 23rd (finishing)
  • cleaning up the code, fixing bugs, documentation
  • addition of more tests, examples
  • getting the code ready for publication

I'll be blogging about my weekly progress and further plans here.

References

[1] Erlend Aune, Daniel Simpson and Jo Eidsvik, Parameter estimation in high dimensional Gaussian distributions, Preprint in Statistics, No. 5/2012, NTNU, 2012

[2] Hale, N., Higham, N. J., & Trefethen, L. N. (2008). Computing $A^{\alpha}$, $log(A)$, and related matrix functions by contour integrals. SIAM Journal on Numerical Analysis, 46(5), 2505-2523.