LUCENE-3312 Break out StorableField from IndexableField

Nikola Tanković

Short description: Lucene is an open-source full-text search library written in Java and ported to many other languages. It relies on the concept of a document, the core Lucene unit of information to be indexed or stored. This project will decouple the indexing and storing operations on a document's fields, and will separate the document classes used at index time from those used at search time.

Additional info: https://issues.apache.org/jira/browse/LUCENE-3312

Summary

A document in Lucene consists of fields, which can have different properties affecting the way Lucene indexes, stores, or analyzes them.
Efforts have already been made to decouple the classes that manage documents and fields from the indexing mechanisms, which enables custom document-handling classes.
A problem remains in the mixing of the two separate operations that Lucene provides over a document's fields: indexing and storing.
As a solution, instead of using IndexableField for both storing and indexing, a separate interface called StorableField can be introduced.
In addition, this project aims to separate the document classes used at index time from those used at search time, because the indexing and search mechanisms consume different information from documents.

Introduction and Motivation

Lucene is used in many search engines and projects that need text analysis and indexing. It works by adding documents to a full-text index. This index can then be searched to return results ranked by different criteria, such as relevance to the query or the values of an arbitrary field.

Lucene's logical architecture is based on the document, the unit of search and indexing, which contains fields, each a simple key-value pair. Documents in Lucene should be thought of as collections of fields holding different values, which makes them universal and general, unlike a document in the common English sense of the word.

Fields in Lucene can be indexed, which is the basic Lucene functionality, but they can also be stored. The benefit of a stored field is that you do not have to query the original data source for additional information about a record. For example, you can store the first paragraph of a large text, or certain important dates, to be displayed after searching for a document.

This project aims to refactor the part of the Lucene API that deals with the two general types of field: storable and indexable. It would be much cleaner if the indexer had separate access to a document's storable and indexable fields. This is why a new StorableField interface is needed, one that breaks the information regarding stored fields out of IndexableField.
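The intended split can be illustrated with a minimal Java sketch. All type and method names below (SketchIndexableField, SketchStorableField, SketchIndexDocument and so on) are hypothetical and chosen only for illustration; they are not the existing Lucene API, and the final class design is expected to come out of the design milestone.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only: none of these types exist in Lucene under these names.

/** What an indexer would need in order to invert a field into the index. */
interface SketchIndexableField {
    String name();
    String textToIndex(); // simplified stand-in for an analyzed token stream
}

/** What a stored-fields writer would need in order to persist a field value. */
interface SketchStorableField {
    String name();
    String storedValue();
}

/** A field that is both indexed and stored implements both interfaces. */
final class SketchTextField implements SketchIndexableField, SketchStorableField {
    private final String name;
    private final String value;

    SketchTextField(String name, String value) {
        this.name = name;
        this.value = value;
    }

    public String name() { return name; }
    public String textToIndex() { return value; }
    public String storedValue() { return value; }
}

/** A field that is only stored and never seen by the inverting part of the indexer. */
final class SketchStoredOnlyField implements SketchStorableField {
    private final String name;
    private final String value;

    SketchStoredOnlyField(String name, String value) {
        this.name = name;
        this.value = value;
    }

    public String name() { return name; }
    public String storedValue() { return value; }
}

/** Index-time document that exposes the two kinds of fields through separate accessors. */
final class SketchIndexDocument {
    private final List<SketchIndexableField> indexable = new ArrayList<>();
    private final List<SketchStorableField> storable = new ArrayList<>();

    void addStored(SketchStorableField field) { storable.add(field); }

    void addIndexedAndStored(SketchTextField field) {
        indexable.add(field);
        storable.add(field);
    }

    Iterable<SketchIndexableField> indexableFields() { return indexable; }
    Iterable<SketchStorableField> storableFields() { return storable; }
}
```

In this sketch the indexer would iterate indexableFields() and the stored-fields writer would iterate storableFields(), so neither has to inspect flags that belong to the other operation.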

Another thing to look into is DocValues, which provide a faster way of getting stored information from documents by reducing the number of disk seeks. This is achieved by storing field values in a column-stride fashion.
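The difference between per-document storage and column-stride storage can be sketched in plain Java. This is an in-memory illustration only: it does not model Lucene's on-disk format or the DocValues API, and the class and method names are hypothetical.

```java
import java.util.List;
import java.util.Map;

// Hypothetical in-memory illustration; not Lucene's DocValues implementation.
final class ColumnStrideSketch {

    /** Row-oriented access: every document's field map is visited to read one field. */
    static long sumFromRows(List<Map<String, Long>> rows, String field) {
        long sum = 0;
        for (Map<String, Long> row : rows) {
            Long value = row.get(field);
            if (value != null) {
                sum += value;
            }
        }
        return sum;
    }

    /** Builds a column: one value per document id for a single field. */
    static long[] toColumn(List<Map<String, Long>> rows, String field, long missing) {
        long[] column = new long[rows.size()];
        for (int docId = 0; docId < rows.size(); docId++) {
            Long value = rows.get(docId).get(field);
            column[docId] = (value != null) ? value : missing;
        }
        return column;
    }

    /** Column-stride access: the values of one field are read from one contiguous array. */
    static long sumFromColumn(long[] column) {
        long sum = 0;
        for (long value : column) {
            sum += value;
        }
        return sum;
    }
}
```

With the column layout, reading one field for many documents touches a single contiguous structure instead of one record per document, which is the intuition behind the reduced number of seeks.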

Another modification in this project would be the separation of the document class used at search time from the one used at index time. This means that indexing details in a document's fields will no longer be exposed at search time, which has proven to be confusing and a source of further problems.
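A rough sketch of this separation follows, assuming a simplified model in which every field has a string value plus indexed and stored flags. Both classes are hypothetical and do not correspond to existing Lucene classes.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; these are not Lucene classes.

/** Index-time view of a field: carries indexing details as well as the value. */
final class IndexTimeFieldSketch {
    final String name;
    final String value;
    final boolean indexed;
    final boolean stored;

    IndexTimeFieldSketch(String name, String value, boolean indexed, boolean stored) {
        this.name = name;
        this.value = value;
        this.indexed = indexed;
        this.stored = stored;
    }
}

/** Search-time view of a document: stored values only, no indexing details. */
final class SearchTimeDocumentSketch {
    private final Map<String, String> storedValues = new LinkedHashMap<>();

    /** Simulates what would come back at search time: only fields that were stored. */
    static SearchTimeDocumentSketch fromIndexTime(List<IndexTimeFieldSketch> fields) {
        SearchTimeDocumentSketch doc = new SearchTimeDocumentSketch();
        for (IndexTimeFieldSketch field : fields) {
            if (field.stored) {
                doc.storedValues.put(field.name, field.value);
            }
        }
        return doc;
    }

    /** Returns the stored value, or null if the field was not stored. */
    String get(String name) { return storedValues.get(name); }

    int size() { return storedValues.size(); }
}
```

Because the search-time class has no notion of "indexed", code that handles search results cannot mistakenly rely on indexing flags that are not actually available after retrieval.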

Proposed Solution and Rationale

The proposal includes the following:

  • Separation of IndexableField and StorableField
  • Separation of the index-time document class from the search-time document class
  • Necessary changes to indexing and searching API and implementation
  • Necessary changes to FieldType classes
  • Possible unification of DocValues with StorableField


Although this seems like an easy refactoring task, there are some important aspects to take into consideration:

  • The API passed to the indexer should remain narrow
  • Performance drops for fields that are both indexed and stored should be minimal


Work Plan

Milestone: Detailed introduction to the Lucene indexing and searching API
Due: May 1st
Description: Study of internal indexing and searching details.

Milestone: Class design of the proposed solution
Due: May 23rd
Description: This is an important milestone that includes extensive discussion with the community and the presentation of several possible solutions. The deliverable of this phase is a class design candidate for implementation, together with test scenarios.

Milestone: First-round implementation
Due: June 10th
Description: Implementation of the proposed solution is completed and ready to be tested, but not fully integrated with the rest of Lucene and Solr.

Milestone: Evaluation and redesign
Due: June 17th
Description: Sharing of testing results and a decision about a possible redesign.

Milestone: Second-round implementation
Due: July 25th
Description: Implementation of the second, redesigned model.

Milestone: Final evaluation and delivery
Due: August 18th
Description: Finalizing implementation and evaluation, fixing bugs, integration, and delivery of the final source code.





About Author and Related Work

Nikola Tanković was born in 1986 in Pula, Croatia. He developed a passion for computer programming in middle school and started competing in many regional competitions and the ACSL league. At the age of 20, he started his own web company called TrueStudio (http://www.truestudio.net), where he developed many web applications for various industrial needs, especially custom CMS solutions. During college, he also worked as a software developer at the Faculty of Electrical Engineering and Computing in Zagreb, maintaining and extending the faculty's E-Campus CMS solution.
During the course "Distributed Software Development" he led an international team of students in the SCORE competition, which took place at the ICSE '09 conference in Vancouver, Canada. His team won first prize.

He is currently a PhD student at the Faculty of Electrical Engineering and Computing, studying executable models over graph databases. He aspires to make the development of data-centric desktop and mobile applications much easier through the modelling of semantically rich executable models. He has advanced knowledge of many programming languages, including C# and PHP, but primarily Java and JavaScript, which he uses throughout the development of the modelling tool for his doctoral thesis.

He encountered Lucene while using the Neo4J graph database to store application models and data, and he uses it heavily to fetch desired graph nodes quickly.