LUCENE-3892: Add a useful intblock postings format
Han Jiang
Short description: I aim to improve codec performance by integrating PForDelta with lucene 4.0. The proposed work includes implementation on PForDelta postings format, modifications on branch codec, and optimization. I'll also compare the performance between PForDelta postings format and current approaches.
Additional info: https://issues.apache.org/jira/browse/LUCENE-3892
Project title: LUCENE-3892: Add a useful intblock postings format
Name: Han Jiang
Email: jianghan08@gmail.com
Phone number: +8615120090338
Organization/Project: The Apache Software Foundation (ASF)
Description of project goals
I.Background
A. PForDelta algorithm
Currently, two codecs are supported in lucene 4.0 for productive use: 1) Standard Codec, i.e. Lucene40Codec, which behaves quite similarly to previous 3.x versions; 2) Pulsing Codec, which instead stores low-frequency terms directly in term dictionary, in order to reduce extra lookups in posting files[1]. However, both approaches encode postings with variable byte format, which relatively costs more CPU cycles during the decode time, from the perspective of current computer architecture.
PForDelta [3] (Patched frame-of-reference, Delta version) is a block-based compressing algorithm, designed for super-scalar capabilities of modern CPUs. Compared to most state-of-art algorithms like S9,S16,VB,Rice, it shows better decompressing performance and fairly good compression ratio [2, 4]. Previous experiments on JVM also showed promising results [5, 6]. Generally speaking, the performance of PForDelta seems to benefit from three aspects: 1) removing control hazards in loops during the decompressing phase; 2) avoiding data hazards in code section by compressing values into codes without data dependency; 3) keeping fine-grained access speed by encoding values into fixed-length codes.
B. Flexible Indexing in lucene 4.0
Following the introduction in [7] and the codes from trunk, we can generally figure out how Codecs API integrate with high-level IW and IR interface.
During the indexing time, IndexWriter will finally get an instance of FreqProxTermsWriterPerField to iterate each termid, and get corresponding PostingConsumer to save information of each document. During query time, IndexSearcher will get an instance of *Weight to handle the iteration on *Enum interface. Hence, by implementing the two interfaces: FieldsConsumer and FieldsProducer, and their corresponding APIs, we'll make Codec handle flexible postings formats.
II.Project Goals
A. Implementation of PForDelta postings format
The PForDelta postings format will be implemented by subclassing intblock classes, following the patterns in oal.index.codecs.pfordelta.PatchedFrameOfRefCodec.java in bulk-branch, and oal.codecs.MockFixedIntBlockPostingsFormat.java in trunk. In fact, the two interfaces: FieldsConsumer and FieldsProducer have already been implemented with the help of BlockTreeTermsWriter and BlockTreeTermsReader. Therefore, our task is to define a new IntStreamFactory, pass it to a PostingsWriter/ReaderBase, and get BlockTreeTermsWriter/Reader customized with the new postings format.
The new postings format will be used by wrapping default codec.
B. Further tests and performance optimization on current implementations.
There are at least three possible optimizations for PForDelta implementations, regardless of current backend:
1. Inspired by the termsCache mechanism in BlockTermsReader.java and SimpleTextFieldsReader.java, it's a good idea to cache the most recent looked-up <field,term> pair in RAM.
2. As is mentioned in [3], architecture with incremental on-demand decompression will hold more data in RAM. Since PForDelta supports fine-grained access, caching compressed chunks with certain strategy(e.g. LRU,MRU) may improve the performance.
3. Also, combination between PForDelta postings format and Pulsing mechanism seems to be promising. Since intblock postings formats relatively take much time decoding a whole block, when responsing skip-intensive queries, we'll try to inline low-frequency postings into term dictionary to reduce disk seek.
Unit and integration tests will be performed during all periods.
C. Performance Comparison among current postings formats.
There are two parameters for PForDelta to be tuned: code section length, and exception percentage. Comparison on different hardware will also be required. It is necessary to write scripts to cover as many aspects as possible to test the backend speed, compression ratio, query speed, etc.
III. What will be achieved by the end of the projects(-), and the future plan(*)
- A mature PForDelta implementation with all known issues fixed
- An efficient PForDelta postings format integrating with lucene 4.0. (If the new format comes out to be slower than current approaches, a detailed report will be provided about potential bottlenecks)
- A detailed report concerning performance comparison between implemented approaches.
- Full documentation on new components.
* Introducing another intblock postings format into lucene 4.0.(Simple64, VSEncoding[8], etc)
* Improving the codec interface to be more coder-friendly
[1] http://blog.mikemccandless.com/2010/06/lucenes-pulsingcodec-on-primary-key.html
[5] http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
[7] http://wiki.apache.org/lucene-java/FlexibleIndexing
[8] http://integerencoding.isti.cnr.it/
Project timeline:
Note: I'll have an exam on June 11th, and have to prepare an oral defense on my graduation thesis during the last two weeks of May. Therefore less work will be arranged during those weeks.
Apr 7 ~ Apr 23:
Read more codes and docs about flexible indexing, prepare programming environment, and prepare part of the testcases.
Apr 24 ~ May 5:
Tidy related patches and branch codes, discuss with the mentor and make a determination on the design.
May 6 ~ May 12:
Understand the pfordelta and pfordelta2 implementation in bulk-branch.
May 13 ~ May 19 & May 20 ~ May 26:
Integrate pfordelta implementation with trunk version.
May 27 ~ Jun 2 & Jun 3 ~ Jun 16:
Test initial implementation, and compare backend performance among versions of PForDelta, Standard, and Standard+Pulsing postings format.
Jun 17 ~ Jun 23
Combine PForDelta postings format with Pulsing format, and performance comparison.
Jun 24 ~ Jun 30 & Jul 1 ~ Jul 7:
Cache support for pfordelta implementation, and performance comparison.
Jul 8 ~ Jul 14:
MID-TERM evaluation. Check all the codes and testcases, and improve documentation.
Jul 15 ~ Jul 21 & Jul 22 ~ Jul 28:
Tune parameters and compare performance on large datasets.
Jul 29 ~ Aug 4:
Final check of all the codes.
Aug 5 ~ Aug 13:
Suggested pencil down: Improve documentation, and wiki page.
Aug 14 ~ Aug 20:
Firm pencil down: All codes and documentations will be submitted.
Additional resources as hardware or software needed:
My own laptop is sufficient for coding and debugging, but I will need community's help during testing and benchmarking phase.
About Me
I am a senior undergraduate student from Peking University, China. I major in Computer Science and currently work at Search Engine and Web Mining Group. I am very interested in technologies on information retrieval, and am willing to implement state-of-art algorithms to improve the practical work.
Knowledge and skills related to the project:
-
Principles of IR system. (And some simple/classic implementations on several parts of search engine.)
-
Proficient in Java. (I had implemented a toy Java compiler targeting on MIPS architecture, by following the UCLA CS 132 project. And once I organized a small group to implement a Java IDE integrated with Jikes compiler)
-
Used to work under Linux (ArchLinux on my laptop)
