Distribuited mailbox indexing over HBase/HDFS

Mihai Soloi

Short description: Currently, James mailbox supports email indexing over Lucene, the directory implementation of the Lucene search and indexing relies on relational databases, or file-system storing. As the number of indexes increases with the number of clients using the mailbox so does the performance of the indexing degrade, thus an implementation over a noSQL database like HBase would solve this problem by distributing the indexes and documents on a system designed for high amounts of data.

Additional info: https://issues.apache.org/jira/browse/MAILBOX-173

Name:
Mihai Soloi

Contact/Email:
Email: mihai.soloi@gmail.com
Google Talk: mihai.soloi xmpp:gmail.com
IRC: s0l0@irc.freenode.net

Background:
I'm a computer science student in my first year of M.Sc. at the Polytechnic University in Bucharest studying Parallel and Distributed Computer Systems. My coding skills are composed mainly of the Java language family including the enterprise side of it. My research project, BaTS(Bag-of-Tasks Scheduling under Budget Constraints) is part of the Contrail project[1], and is conducted under the supervision of dr. Ana Oprescu and dr. Florin Pop and is based on the cloud environment, i am working on a bag of task price optimization mechanism that allows users to compare cloud providers and automatically choose which instance is most suitable for their application in particular, and optimizing this over the make-span of the application,i.e. if another instance of a server is better suited for the task at hand while the operations are running, weighing all the options it will choose the best one.

I've also worked on the open-source Torque project[2], implementing a new algorithm[3] along with my colleagues on top of the resource manager PBS, with the purpose to improve the service quality of the scheduling in the Grid environment, we didn't use an svn but code can be provided, it is yet to be incorporated into Torque.

Project details:
The Apache James project is a set of libraries written in java that provides users upon build with an enterprise mail server. James Mailbox is a library integrated in James via the mailbox api and corresponding adapter, and allows the use of flexible mailbox storage via known protocols such as IMAP, POP, SMTP, Microsoft Exchange and others. It is used primarily for the mail persistence into databases, for that it uses SQL databases via Apache openJPA, JCR(Java Content Repository based on Apache Jackrabbit) and since last summer of code, Apache HBase. It can also be used as independent mailbox storage for implementation into your own application.

Currently, James mailbox currently supports email indexing over Lucene. A Lucene Directory implementation that stores the index and allows searching has to be provided for the actual email indexing. Lucene being a powerful text search engine library also comes with a JDBC based file system directory to store the indexes in a database as well as other file systems implementations in order to store it directly or even memory based[4]. The problem with this directory implementation is that it degrades overall performance due to the ever increasing number of indexes, the problem had to be handled by splitting indexes between Lucene instances but this approach complicates the architecture and expects previous knowledge on the indexed emails. An implementation on a noSQL database would solve this issue by allowing the index to be distributed over all the James mailbox directories available, thus by using the Apache Hbase we could corectly implement the mailbox directory[5] to support distributed indexing.

As cited before[5] a Lucene implementation based on HBase has already been formulated by Boris Lublinksy and Mike Segel. They propose the following approach of porting a limited memory index support from Lucene contributions module: Lucandra and Hbasene which overwrote classes higher up in Lucene's class hierarchy i.e.:

  1. IndexReader - reads indexes from the index searcher and also from HBase/HDFS
  2. IndexWriter - writes indexes to the noSQL database.


The implementation proposed has two benefits, one being that of a performance boost by using an in-memory cache to limit the amount of reads and writes to the noSQL database and the other of scalability increase thus being able to run multiple Lucene instances as required by the growing number of mailboxes/clients. But both these benefits are conflicting due to the fact that scalability supposes that we should have only one version of the content all around, thus the implementation should have a time to live for the cache on each Lucene instance and when it is reached a full cache reset will be done.

Right now the LuceneMessageSearchIndex java class handles all the writing and reading to and respectively to the directory but with the proposed implementation we will be able to decouple these actions from and we'll be able to store directly into in-memory cache and also in the distributed noSQL database.

The data model for the in-memory cache is very important in this implementation due to the fact that the reads are done through the cache, and if the data is not in the cache or is very old the memory model will have to query the noSQL instance and update the requested index information, on the other hand the writes will be done directly into the database which will cause a delay in index information, but the cache having an appropriate time to live this delay should be greatly reduced.

Lucene basically has two data set models, i.e. Indexes and Documents. For the in-memory cache to be feasible these should be implemented as memory model thus having two:

  1.     IndexMemoryModel: LuceneIndexMemoryModel, FieldTermDocuments, TermDocuments, TermDocument
  2.     DocumentMemoryModel: LuceneDocumentMemoryModel, DocumentTermFrequency, DocumentStructure, FieldData

each in turn consisting of 4 classes which are used by the main LuceneMemoryModel(Index and Document).


The LuceneDocumentNormMemoryModel is proposed as optional for better search results ranking but using high amounts of memory, because the implementation relies heavily on using maps of maps. Due to the fact that norm information is keyed by field name it can be appended to LuceneIndexMemoryModel class inside the IndexMemoryModel.

The IndexWriter and IndexReader have to be overridden and basically rewritten implementing all the methods from the standard reader and writer.

The implementation of HBase data tables should be based on three tables, one of indexes, one of documents and the optional one for norms. The first table has a row for each field/term combination known to a Lucene instance and contains one column family i.e. the document, the documents table stores the documents and back references to indexes/norms having a row for each document, last but not least the optional table for norms contains as keys the fields(document id) and as column family the norms.

Benefits to James:
James would have the first and only, at the moment, distributed mailbox implementation. James would also be having a faster search over the distributed index due to in-memory caching and also a single mailbox will be able to support multiple James instances searching over it.

Deliverables:
A distributed Lucene directory implementation, along with a minimum of 3 HBase tables, 11 classes for the architecture implementation, and most probably Apache AVRO integration for the data formats.

Project schedule/Roadmap:
The below schedule is what will be ready for the mid-term evaluation will be the preliminary implementation of the IndexMemoryModel and the DocumentMemoryModel along with the overriden INdexWriter and IndexReader as well as preliminary tables for the Hbase data structure.

Below are the steps i am planning to take on this project:

Community Bonding Period (April 24th -> May 20th):
During this period, i will be involved in fixing bugs and writing patches for the mailbox project as well as reading and learning more details on HBase and Lucene, i plan on reading "Lucene in Action" and "Hbase in Action" in order to know how to use the APIs, as well as running all the tests individually to understand how the current Lucene indexing and searching works and the Hbase mailbox integration. This will give me the chance to become more familiar with the intricacies of the project.

Project start until mid-term evaluation (May 21st -> July 9th):
Beginning the development of the MemoryModel architecture as well as implementing the HBase storage. Will be able to use the tests already in the project in order to do my work in a test driven development manner. I will first start with trying to index a document and make sure that all the tests pass, after that i will start working on redoing the default directory implementation, IndexReader and IndexWriter to support Hbase, as well as creating the needed tables in the noSQL database.

Mid-term until project end (July 10th -> August 13th):
Implement the final classes and data model as well as the optional LuceneDocumentNormMemoryModel, write final tests.

Progress Reports:
The project progress and problems will be continuously reported on my twitter account #mihaisoloi. I am currently studying the James mailbox architecture, and the way it's implemented. Have followed tutorials on HDFS [6] also read the documentation on the architecture guide[7], have read the presentation on Lucene basics[8], and followed the jira issues that led to the current implementation of Lucene searching, have also created Hadoop clusters using Apache WHIRR[9]

Exams and other commitments:
I will probably have a partial exam during the month of may, but i do not know when exactly, as well as the finals in June. I am currently employed as a Java enterprise developer as i have to currently pay my living expenses but am planning to quit(or take a summer leave) the job and work full-time on the GSoC project(approx. 35-40 hours/week), i have raised enough money in preparation for this and living expenses will not be an issue. I do not have any summer plans, but i may be away during a weekend as i enjoy hiking, and the mountains are close by.

Mentorship and why James:
Some months ago I met Ioan Eugen Stan on a local open-source foundation[10] mailing list and emailed him, and consequently met, he was very encouraging in taking up an open-source project and guided me through James code, and explaining development details. He has inspired me in becoming his mentee on James, and consider him a friend.

James mailbox is also attractive from the point of view of the technologies used and the great opportunity this has for me to have a way to practice new skills in the domain of distributed open-source development.

[1] http://contrail-project.eu/
[2] http://www.adaptivecomputing.com/products/open-source/torque/
[3] http://florinpop.ro/map/index.php/Torque_Extension
[4] http://wiki.apache.org/lucene-java/PoweredBy
[5] http://www.infoq.com/articles/LuceneHbase
[6] http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
[7] http://hadoop.apache.org/common/docs/current/hdfs_design.html
[8] http://www.slideshare.net/nitin_stephens/lucene-basics
[9] http://whirr.apache.org/docs/0.7.1/quick-start-guide.html
[10] http://ceata.org/