Bi-gram Language Modeling

Gaurav Arora

Abstract

The bi-gram language modeling approach to information retrieval has been shown to outperform the three traditional IR approaches. Apart from better retrieval performance, a bi-gram language model yields a rich resource, the bi-grams extracted from the collection, which can be used for phrase searching, diversifying search results, and suggesting query reformulations to the user. A bi-gram language model would make Xapian a more powerful library for research in information retrieval.

Additional Information

The bi-gram language model deviates from traditional ranking models: a language model treats each document as a language sample and ranks documents by the probability of generating the query from the document's language model.

Given a relevant document, a query is assumed to be generated by explicitly drawing important terms and unimportant terms. The important terms are assumed to be drawn at random from the document. The unimportant terms are assumed to be drawn at random from the full collection.
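
One way to read this generation process is as a two-component mixture: each query term comes from the document with some probability and from the collection otherwise. The sketch below is only an illustration under that assumption. The function name, the mixing weight `lam`, and the whitespace-tokenized toy data are hypothetical and are not part of Xapian.

```python
import math
from collections import Counter


def mixture_query_log_likelihood(query, doc, collection, lam=0.5):
    """Log-probability of generating `query` from `doc`.

    Each query term is assumed to be "important" (drawn from the document)
    with probability `lam` and "unimportant" (drawn from the collection)
    with probability 1 - lam. `lam` is an illustrative value, not a tuned one.
    """
    doc_tf = Counter(doc)
    coll_tf = Counter(collection)
    score = 0.0
    for term in query:
        p_doc = doc_tf[term] / len(doc) if doc else 0.0
        p_coll = coll_tf[term] / len(collection) if collection else 0.0
        p = lam * p_doc + (1.0 - lam) * p_coll
        if p == 0.0:
            # The term occurs nowhere in the collection.
            return float("-inf")
        score += math.log(p)
    return score


# Toy data: two tiny documents that together form the collection.
doc_a = ["language", "model", "retrieval"]
doc_b = ["urn", "balls", "colored"]
collection = doc_a + doc_b
query = ["language", "model"]

score_a = mixture_query_log_likelihood(query, doc_a, collection)
score_b = mixture_query_log_likelihood(query, doc_b, collection)
```

With this toy data, `doc_a` receives the higher score because both query terms can be drawn from it directly, while `doc_b` can only explain them through the collection component.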

The probabilities of drawing terms from the document are calculated by a simple procedure, often explained with an urn containing colored balls.
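
The urn picture amounts to relative-frequency estimation: the chance of drawing a color is the number of balls of that color divided by the total number of balls. The following sketch shows that estimate for single terms and, as an assumed extension to bi-grams, the same counting applied to adjacent term pairs. The helper names are hypothetical.

```python
from collections import Counter


def unigram_probabilities(tokens):
    """P(w) = count(w) / total tokens, i.e. balls of one color over all balls."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


def bigram_probabilities(tokens):
    """P(w2 | w1) = count(w1 w2) / count(w1 appearing as the left term of a pair)."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    left_counts = Counter(tokens[:-1])
    return {
        (w1, w2): count / left_counts[w1]
        for (w1, w2), count in pair_counts.items()
    }


# An "urn" with two red balls, one blue and one green, in document order.
urn = ["red", "red", "blue", "green"]
unigrams = unigram_probabilities(urn)
bigrams = bigram_probabilities(urn)
```

Here `unigrams["red"]` is 0.5, and `bigrams[("red", "blue")]` is also 0.5 because "red" is followed once by "red" and once by "blue".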

To handle query terms missing from the document and to normalize for document length, Dirichlet smoothing and collection smoothing are applied.
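
As a minimal sketch of Dirichlet smoothing in its standard textbook form, the document's term count is combined with `mu` pseudo-counts distributed according to the collection model, so a term absent from the document still gets a non-zero probability and longer documents rely less on the collection. The function name and the value of `mu` are illustrative assumptions, not Xapian defaults.

```python
from collections import Counter


def dirichlet_probability(term, doc, collection, mu=2000.0):
    """P(term | doc) = (tf(term, doc) + mu * P(term | collection)) / (|doc| + mu)."""
    if not collection:
        raise ValueError("collection must not be empty")
    tf = Counter(doc)[term]
    p_coll = Counter(collection)[term] / len(collection)
    return (tf + mu * p_coll) / (len(doc) + mu)


# Toy data; mu is kept tiny so the arithmetic is easy to follow by hand.
doc = ["a", "b"]
collection = ["a", "b", "c", "c"]

p_seen = dirichlet_probability("a", doc, collection, mu=2.0)
p_missing = dirichlet_probability("c", doc, collection, mu=2.0)
```

In this toy case the missing term "c" gets probability (0 + 2 * 0.5) / (2 + 2) = 0.25 instead of zero, and the smoothed probabilities over the vocabulary still sum to one.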

Code samples