Google Summer of Code 2009 The Apache Software Foundation

[Mahout] Distributed Latent Dirichlet Allocation

by David Hall for The Apache Software Foundation

Latent Dirichlet Allocation (Blei et al., 2003) is a powerful learning algorithm for automatically and jointly clustering words into "topics" and documents into mixtures of topics. It has been successfully applied to model how scientific fields change over time (Griffiths and Steyvers, 2004; Hall et al., 2008). In this project, I propose to implement distributed LDA using MapReduce, and to investigate extensions of LDA and possibly more efficient algorithms for distributed inference.