Apropos replacement based on mandoc and SQLite's FTS

Abhinav Upadhyay

Short description: The best thing about NetBSD is that it ships with a lot of useful documentation, which should obviate the need to look for outside help. However, it has lacked a good search utility that would make this documentation easily available to users. In the age of Google, it seems necessary to rethink this. This project aims to develop a replacement for apropos that provides full text search capabilities using the FTS engine of sqlite3.

Problem: NetBSD ships with a great deal of good documentation in the form of man pages. But that isn't good enough if users can't easily find the right documentation. In the Google era, it is quite ironic that there is no effective way to search the documentation available on the local system.

Historically, the "Name" section of the man pages has been indexed, and the apropos utility (and man -k) has been available to search it, but it is very limited in scope and effectiveness. The user has to specify the exact keywords to get the right results.

Proposed Solution:

The idea is to provide a modern replacement for apropos using the Full Text Search Engine of Sqlite3 [11].

Proposed Features:

    1. An easy to use command-line interface for searching
    2. A CGI interface (if time permits)
    3. Snippets and line numbers of the matched results
    4. A ranking algorithm for better results
    5. A tool to automatically update the database when new packages are installed (which isn't available in the current system).
    6. Use of the database for other purposes, e.g. automatic management of man page aliases.

Accessibility:

It will be implemented as an easily accessible library, so that different kinds of clients can be built on top of it.

A CLI will be provided.

A CGI will also be provided if time permits.

A desktop GUI client can also be built in a similar manner.

Implementation Plan:

I have figured out the following broad outline for implementation:

    1. Design a database schema to index the data in an optimized manner for best results.
    2. Write a utility to parse the man pages using the mandoc library and feed them to sqlite to create a full text index.
    3. Design and implement the search interface.
    4. Implement and fine-tune a ranking algorithm to get more relevant results.
    5. Integrate with the NetBSD base system.

Deliverables:

Must Have Deliverables:

  1. A basic implementation of full text search by simply indexing the complete output of mandoc(1) as a single column in the FTS virtual table of sqlite3. This will get things started, and further improvements can be made later.
  2. Snippets of the matched results
  3. Better language support using the Porter Stemming Tokenizer [1].
  4. Integration with the existing man(1).
  5. A utility to update the index whenever new man pages are installed.
  6. Documentation.
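
As a rough illustration of deliverables 1 to 3, the sketch below builds a small full text index with the Porter stemming tokenizer and returns snippets of the matches. Python is used only for brevity. The table name mandb, the column names and the sample pages are hypothetical, and the sketch assumes an SQLite build with FTS4 enabled.

```python
import sqlite3


def build_index(pages):
    """Create an in-memory FTS4 table and load (name, body) pairs into it."""
    conn = sqlite3.connect(":memory:")
    # tokenize=porter applies Porter stemming to indexed text and to queries.
    conn.execute(
        "CREATE VIRTUAL TABLE mandb USING fts4(name, body, tokenize=porter)"
    )
    conn.executemany("INSERT INTO mandb(name, body) VALUES (?, ?)", pages)
    return conn


def search(conn, query):
    """Return (name, snippet) rows for pages whose text matches the query."""
    sql = (
        "SELECT name, snippet(mandb, '[', ']', '...', 1, 8) "
        "FROM mandb WHERE mandb MATCH ?"
    )
    return conn.execute(sql, (query,)).fetchall()


# Hypothetical sample data standing in for rendered mandoc(1) output.
SAMPLE_PAGES = [
    ("apropos", "search the complete content of all man pages"),
    ("ls", "list directory contents"),
]

if __name__ == "__main__":
    conn = build_index(SAMPLE_PAGES)
    # Thanks to stemming, "searching" also matches pages containing "search".
    for name, snip in search(conn, "searching"):
        print(name, "-", snip)
```

Because of the stemmer, the user does not have to type the exact form of a keyword, which addresses the main limitation of the current apropos.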
                                                                                    
Should Have Deliverables: 
  1. A ranking algorithm to improve the quality of results
  2. The algorithm might require more information about the man pages which can be obtained by using the mandoc(3) parser to parse the different sections of the man pages and store them in separate columns as required by the algorithm. 
     For example, we can parse the man pages and store the name and description sections in two separate columns. More weight can be given to matches found in the name section. The "See Also" section can also be put to good use.
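
The idea of giving more weight to matches in the name section can be sketched as follows. This is only an assumption about how such a scheme might look: the weights, the section keys and the sample pages are hypothetical, and the final algorithm may well be different.

```python
# Hypothetical per-section weights: a match in the name counts the most.
SECTION_WEIGHTS = {"name": 3.0, "description": 1.0, "see_also": 0.5}


def score(page, query_terms, weights=SECTION_WEIGHTS):
    """Sum the weighted counts of the query terms over the sections of a page."""
    total = 0.0
    for section, weight in weights.items():
        words = page.get(section, "").lower().split()
        total += weight * sum(words.count(t.lower()) for t in query_terms)
    return total


def rank(pages, query_terms):
    """Return page titles from most to least relevant, dropping non-matches."""
    scored = [(score(p, query_terms), p["title"]) for p in pages]
    ordered = sorted(scored, key=lambda item: -item[0])
    return [title for s, title in ordered if s > 0]


# Hypothetical sample pages, already split into sections.
SAMPLE = [
    {"title": "sed(1)", "name": "sed", "description": "stream editor",
     "see_also": "grep awk"},
    {"title": "grep(1)", "name": "grep", "description": "file pattern searcher",
     "see_also": "sed"},
]

if __name__ == "__main__":
    print(rank(SAMPLE, ["grep"]))
```

Here a query for "grep" ranks the page whose name matches above the page that only mentions grep under "See Also".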
                                                           
Would Be Nice to Have Deliverables:
  1. Use the database to directly manage man page aliases by extracting the .Nm macros from the man pages.
  2. Support for a synonym list to improve the search.
  3. A web-based interface through CGI.
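
To illustrate the alias idea from item 1, here is a minimal sketch that collects the names listed with .Nm in the NAME section of an mdoc source. It is a naive line scanner written for illustration only, not the mandoc(3) parser that the project would use, and the sample page fragment is made up.

```python
def extract_names(mdoc_source):
    """Collect the first argument of each .Nm macro in the NAME section."""
    names = []
    in_name_section = False
    for line in mdoc_source.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == ".Sh":
            # A section header decides whether we are inside NAME.
            in_name_section = fields[1:] == ["NAME"]
        elif in_name_section and fields[0] == ".Nm" and len(fields) > 1:
            if fields[1] not in names:
                names.append(fields[1])
    return names


# Illustrative fragment of an mdoc page that documents two names.
SAMPLE_PAGE = """\
.Dt STRCPY 3
.Sh NAME
.Nm strcpy ,
.Nm strncpy
.Nd copy strings
.Sh SYNOPSIS
.Nm
.Op Fl x
"""

if __name__ == "__main__":
    print(extract_names(SAMPLE_PAGE))
```

Each extracted name could then be stored as an alias that points to the same page in the database.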

Timeline:

April 26 - May 22:
  • Get acquainted with the community members and my mentor.
  • Learn about the components of NetBSD which are relevant to the project.
  • Learn more about sqlite FTS.
  • Investigate the use cases of the project to get a clear understanding of the requirements.
  • Prototype some implementation ideas.
 
May 23 - June 4:
  • Investigate the user interfaces and other features that can be provided.
  • Design the database schema keeping in mind the future requirements.
(I will be less active during this period as I might have semester exams. I will be around and will try to balance my time between work and studies.)
 
June 5 - June 18:
  • Start implementing the basic search interface (as per point 1 of the Must Have Deliverables).
  • Add more features like snippets and position indicators to the search results.
  • Get feedback from the community and figure out what improvements are required.
 
June 19 - July 2:
  • Study ranking algorithms and figure out which one would be the best fit for our use case.
  • Modify the parsing routine in the previous implementation to use the mandoc(3) parser and store extra information in the database.
  • Redesign and rebuild the database as required.
  • Implement the ranking algorithm.
                                                                   
July 3 - July 11:
  • Test out the search utility with the new ranking algorithms.
  • Get community feedback.
  • Figure out improvements and bugs.
  • Create documentation for midterm evaluation.
 
July 12 - July 25:
  • Discuss further improvements/features/requirements with the mentor.
  • Give finishing touches to the UI by implementing the various command line options (as discussed during the requirements investigation).
  • Bug fixes.

July 26 - August 8:
  • Integration with man(1) and other parts of NetBSD userland as necessary.
  • Implement an automatic update mechanism to update the index whenever new man pages are installed.
  • More testing and community feedback.
                                                                   
August 9 - August 15:
  • Prepare documentation.
  • Add the feature to manage man page aliases (if time permits).
  • Add a CGI interface (if time permits).

The remaining days are kept in reserve, so that we have some time if something goes wrong.
                                                                   
About the Project and NetBSD:

  • The project will be integrated in the NetBSD base system as a userland utility.
  • It will involve reimplementation, modification and integration of several portions of the base system, including apropos(1), whatis(1) and man(1) (details yet to be figured out).
 
My familiarity with NetBSD:
  • I have installed NetBSD in a virtual machine (QEMU) as well as on my hard disk, where it runs on bare metal.
  • I learnt about various system configurations (networking, X11, ssh, etc.). I also installed several packages like Gnome, CVS, Git, etc.
  • I also checked out the NetBSD current trunk using CVS and built the kernel.
  • I have started experimenting with Sqlite's FTS engine and the man pages. I wrote a shell script which traverses the man page directories and stores the pages in an sqlite database.
    • Interesting observations: size of man pages stored: ~55 MB, size of database: 76 MB
                                                                   
My Interactions With The NetBSD Community:
I have been interested in this project from the day it was posted for GSoC. I immediately sent an email to the tech-userlevel mailing list. I received encouraging and helpful replies from the developers and community members [2].
 
I have also been interacting with the NetBSD developers and asking questions on the #netbsd-code IRC channel. I have been in contact with nbjoerg. Other developers on the channel have also been very helpful and kind.
                                                                   
Requirements For The Project:

Interfaces and APIs Required by the Project:
1. Sqlite3 FTS Engine
2. mdocml parser and library
3. A Ranking algorithm
 
My familiarity with Sqlite3: 
I have compiled the sqlite3 amalgamation with FTS support enabled. I played around with sqlite3 by building simple databases and running SQL queries.
 
I also built some FTS virtual tables and tried out FTS queries. I have gone through the Sqlite3 FTS documentation [5]. Besides this, I have strong experience of working with SQL databases in projects.
 
My familiarity with mdocml:
I was able to build the latest mdocml package. I tried out mandoc(1) with its various options. I also went through the documentation of mandoc(3) and tried to understand the API. However, I will need to spend more time with mandoc(3) and try out some code.
                                                                   
Ranking Algorithm:
Apart from getting familiar with the relevant interfaces and APIs, a good ranking algorithm will have to be identified or developed.
I am currently studying ranking algorithms on my own from Stanford University's IR book [12]. I think the concept of inverse document frequency [13] and the vector space model may be used (provided sqlite3 provides enough interesting data).
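
To make these two concepts concrete, the sketch below computes inverse document frequency as log(N / df) and ranks documents by cosine similarity between tf-idf vectors. It is a textbook-style illustration under my own assumptions (plain term counts, natural logarithm, tiny made-up documents), not the algorithm that will finally be chosen.

```python
import math
from collections import Counter


def idf(term, docs):
    """Inverse document frequency: log(N / df), or 0.0 for an unseen term."""
    df = sum(1 for words in docs if term in words)
    return math.log(len(docs) / df) if df else 0.0


def tfidf_vector(words, docs):
    """Map each term in `words` to its term frequency times its IDF."""
    counts = Counter(words)
    return {term: count * idf(term, docs) for term, count in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_by_similarity(query, docs):
    """Return document indexes sorted by decreasing similarity to the query."""
    q = tfidf_vector(query, docs)
    sims = [(cosine(q, tfidf_vector(d, docs)), i) for i, d in enumerate(docs)]
    return [i for s, i in sorted(sims, key=lambda item: (-item[0], item[1]))]


# Made-up documents, already split into words.
DOCS = [
    ["copy", "files"],
    ["list", "files"],
    ["copy", "copy", "memory"],
]

if __name__ == "__main__":
    print(rank_by_similarity(["copy"], DOCS))
```

A rare term gets a higher IDF than a common one, so documents that match on rare query terms are pushed up in the ranking.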
                                                                   
About Me:

My name is Abhinav Upadhyay and I am a 4th year Bachelor of Technology (Information Technology) student at Northern India Engineering College, Lucknow (India).
                                                                   
I have been programming for almost seven years now. I started with Visual Basic in school. In college I switched to C and Java. Recently I also learnt Python. I also know Shell Scripting, JavaScript, SQL, etc.
                                                                   
Relevant Projects: 

1. Web Search Engine: Last year I developed a web search engine using the Apache Lucene API in Java. I learnt about the various components required in building a search engine. 
  
Project Duration: 2 Months
Language and APIs: Java, Apache Lucene
  
2. Apache Module: I developed an Apache module during my summer internship last year. The module was developed in C and it required learning the Apache modules API and the Apache Portable Runtime [5].
I also had to learn the Git version control system.
Project Duration: 3 Months
Language and Tools: C, APR, Git 
                                                                                        
I have done several other projects which are mentioned in my resume [3].
                                                                   
Experience With Open Source:
 
1. Contributed 4 patches for different projects to the Ubuntu Community
   
* Two patches for Tomcat6 ([6], [7]):
Language: Shell Script
Status: Merged for next release.
* One patch for Apport (the crash reporting tool) ([8]):
Language: Python
Status: Merged for next release
* One patch for Tomboy ([9])
Language: C#
Status: Yet to be merged, because of UI freeze in Gnome.
 
You can check out my Launchpad profile [4].
                                                                   
2. The Apache module that I developed during my summer internship is also open sourced under the Apache License and hosted on Github [10].
                                                                   
My Interest in The Project:
I have a great deal of interest in the field of Information Retrieval; in fact, I want to do my masters in this field. So this project is a natural favorite for me. I believe it will give me the necessary exposure and boost my prospects of further education in IR.
                                                                   
Also, I have been interested in systems programming from the beginning. I have (partially) gone through Richard Stevens' famous book (Advanced Programming in the UNIX Environment), and doing a project in NetBSD will be a great way to get some practical experience.
                                                                   
Contact Details:

Email: er.abhinav.upadhyay@gmail.com
Mobile: +91-9453853619
Landline Number: +91-522-4060898
Address: A-156, Indira Nagar, Lucknow-226016, INDIA
                                                                          
References: