Apropos replacement based on mandoc and SQLite's FTS
Abhinav Upadhyay
Short description: NetBSD ships with a large body of useful documentation, which should remove the need to look for outside help. However, it lacks a good search utility that makes this documentation easily accessible to users. In the age of Google, it seems necessary to rethink this. This project aims to develop a replacement for apropos that provides full text search capabilities using the FTS engine of sqlite3.
Problem: NetBSD comes with a great deal of good documentation in the form of man pages, but that is not enough if users cannot easily find the right page. In the Google era, it is ironic that there is no effective way to search the documentation available on the local system.
Historically, the "Name" section of the man pages has been indexed, and the apropos utility (and man -k) has been available to search it, but it is very limited in scope and effectiveness. The user has to specify the keywords exactly to get the right results.
Proposed Solution:
The idea is to provide a modern replacement for apropos using the full text search (FTS) engine of SQLite3 [11].
Proposed Features:
- An easy to use command-line interface (CLI) for searching
- A CGI interface (if time permits)
- Snippets and line numbers of the matched results
- A ranking algorithm for better results
- A tool to automatically update the database when new packages are installed (which isn't available in the current system).
- Use of the database for other purposes, e.g. automatic management of man page aliases.
Accessibility:
It will be implemented as an easy to access library, so that different kinds of clients can be built on top of it.
A CLI will be provided.
A CGI interface will also be provided if time permits.
A desktop GUI client can also be built in a similar manner.
Implementation Plan:
I have figured out the following broad outline for implementation:
- Design a database schema to index the data in an optimized manner for best results.
- Write a utility to parse the man pages using the mandoc library and feed them to sqlite to create a full text index.
- Design and implement the search interface
- Implement and fine tune a ranking algorithm to get more relevant results.
- Integrate with the NetBSD base system.
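The first two steps of this outline can be sketched roughly as follows. This is only an illustration in Python, not the planned implementation, which would be written against the mandoc library. The table names `pages` and `pages_fts`, the helper names, and the choice of FTS4 are assumptions made for the example; the sketch also assumes that mandoc(1) is on the PATH and that SQLite was built with FTS4 enabled.

```python
import os
import re
import sqlite3
import subprocess

def render_with_mandoc(path):
    """Render one man page to plain text (assumes mandoc(1) is on PATH)."""
    out = subprocess.run(["mandoc", "-Tascii", path],
                         capture_output=True, text=True, check=True).stdout
    # Drop the backspace overstrikes used for bold/underline in terminal output.
    return re.sub(r".\x08", "", out)

def build_index(conn, man_root, render=render_with_mandoc):
    """Walk man_root and store every rendered page in a one-column FTS table."""
    # Hypothetical schema: an ordinary table for the path, an FTS4 table for the text.
    conn.execute("CREATE TABLE IF NOT EXISTS pages(id INTEGER PRIMARY KEY, path TEXT)")
    conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS pages_fts USING fts4(body)")
    count = 0
    for dirpath, _dirs, files in os.walk(man_root):
        for fname in sorted(files):
            # Only pick up files that end in a man section suffix such as .1 or .3p.
            if not re.search(r"\.[1-9]\w*$", fname):
                continue
            path = os.path.join(dirpath, fname)
            cur = conn.execute("INSERT INTO pages(path) VALUES (?)", (path,))
            conn.execute("INSERT INTO pages_fts(docid, body) VALUES (?, ?)",
                         (cur.lastrowid, render(path)))
            count += 1
    conn.commit()
    return count

def search(conn, query):
    """Return the paths of all pages whose text matches the FTS query."""
    sql = ("SELECT path FROM pages WHERE id IN "
           "(SELECT docid FROM pages_fts WHERE pages_fts MATCH ?) ORDER BY id")
    return [row[0] for row in conn.execute(sql, (query,))]
```

With a database opened via `sqlite3.connect("man.db")`, calling `build_index(conn, "/usr/share/man")` and then `search(conn, "directory")` would exercise the whole pipeline. The `render` parameter is there so that the renderer can be swapped out, for example when mandoc is not installed.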
Deliverables:
Must Have Deliverables:
- A basic implementation of full text search by simply indexing the complete output of mandoc(1) as a single column in the FTS virtual table of sqlite3. This will get things started; further improvements can be made later.
- Snippets of the matched results
- Better language support using the Porter Stemming Tokenizer [1].
- Integration with the existing man(1).
- A utility to update the index whenever new man pages are installed.
- Documentation.
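To make the first three deliverables more concrete, here is a minimal sketch of a single-column FTS table that uses the Porter stemming tokenizer and returns snippets of the matched text. It assumes an SQLite build with FTS4 enabled; the table name `man`, the sample page texts, and the `[`/`]` highlight markers are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One column holding the whole rendered page; the Porter tokenizer stems each term.
conn.execute("CREATE VIRTUAL TABLE man USING fts4(body, tokenize=porter)")
conn.executemany("INSERT INTO man(body) VALUES (?)", [
    ("ls - list directory contents. For each operand that names a directory, "
     "ls displays the files it contains.",),
    ("cp - copy files. The cp utility copies the contents of the source file "
     "to the target file.",),
])

def search(conn, query):
    """Return one highlighted snippet per matching page."""
    # snippet(table, start, end, ellipsis, column, approximate number of tokens)
    sql = "SELECT snippet(man, '[', ']', '...', -1, 8) FROM man WHERE man MATCH ?"
    return [row[0] for row in conn.execute(sql, (query,))]

results = search(conn, "listing")
for text in results:
    print(text)
```

Because of stemming, the query "listing" is reduced to the same stem as the word "list" in the first page, so the user does not have to type the exact form of the word that appears in the man page.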
Should Have Deliverables:
- A ranking algorithm to improve the quality of results
- The algorithm might require more information about the man pages which can be obtained by using the mandoc(3) parser to parse the different sections of the man pages and store them in separate columns as required by the algorithm.
- Use the database to directly manage man page aliases by extracting the .Nm macros from the man pages.
- Support for synonyms list to improve the search.
- A web based interface through CGI
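One possible shape for the ranking deliverable is sketched below, assuming that the sections of a page are stored in separate columns and that a match in the name or one-line summary should count for more than a match in the body. The weighting scheme follows the simple example ranking function in the SQLite FTS documentation and is not necessarily the algorithm this project will end up with. The column names and the `WEIGHTS` values are hypothetical.

```python
import sqlite3
import struct

# Hypothetical per-column weights for (name, summary, body).
WEIGHTS = (10.0, 5.0, 1.0)

def rank(matchinfo_blob):
    """Score one row from FTS4's default matchinfo() output ('pcx' format)."""
    # The blob is an array of 32-bit unsigned integers in machine byte order.
    ints = struct.unpack("=%dI" % (len(matchinfo_blob) // 4), matchinfo_blob)
    n_phrases, n_cols = ints[0], ints[1]
    score = 0.0
    for p in range(n_phrases):
        for c in range(n_cols):
            # Three integers per (phrase, column) pair; the first two are used here.
            base = 2 + 3 * (p * n_cols + c)
            hits_in_row, hits_in_all_rows = ints[base], ints[base + 1]
            if hits_in_row:
                score += WEIGHTS[c] * hits_in_row / hits_in_all_rows
    return score

conn = sqlite3.connect(":memory:")
conn.create_function("rank", 1, rank)
conn.execute("CREATE VIRTUAL TABLE man USING fts4(name, summary, body)")
conn.executemany("INSERT INTO man(name, summary, body) VALUES (?, ?, ?)", [
    ("tar", "tape archiver", "tar can copy files into and out of an archive."),
    ("cp", "copy files", "The cp utility copies the source file to the target file."),
])

def search(conn, query):
    """Return page names ordered from most to least relevant."""
    sql = "SELECT name FROM man WHERE man MATCH ? ORDER BY rank(matchinfo(man)) DESC"
    return [row[0] for row in conn.execute(sql, (query,))]

print(search(conn, "copy"))
```

Here cp is listed before tar for the query "copy", because the term appears in the summary of cp but only in the body of tar.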
- Get acquainted with the community members and my mentor.
- Learn about the components of NetBSD which are relevant to the project.
- Learn more about sqlite FTS.
- Investigate the project's use cases further to get a clear understanding of the requirements.
- Prototype some implementation ideas.
- Investigate the user interfaces and other features that can be provided.
- Design the database schema keeping in mind the future requirements.
- Start implementing the basic search interface (as per point 1 of the Must Have Deliverables).
- Add more features like snippets and position indicators to the search results.
- Get feedback from the community and figure out what improvements are required.
- Study ranking algorithms and figure out which one would be the best fit for our use case.
- Modify the parsing routine in the previous implementation to use the mandoc(3) parser and store extra information in the database.
- Redesign and rebuild the database as required.
- Implement the ranking algorithm.
- Test out the search utility with the new ranking algorithms.
- Get community feedback.
- Figure out improvements and bugs.
- Create documentation for midterm evaluation.
- Discuss further improvements, features, and requirements with the mentor.
- Give finishing touches to the UI by implementing the various command line options (as discussed during the requirements investigation).
- Bug fixes
- Integration with man(1) and other parts of NetBSD userland as necessary.
- Implement an automatic update mechanism to update the index whenever new man pages are installed.
- More testing and community feedback.
- Prepare documentation.
- Add the feature to manage man page aliases (if time permits).
- Add CGI interface (if time permits).
- The project will be integrated in the NetBSD base system as a userland utility.
- It will involve reimplementation, modification and integration of several portions of the base system, including apropos(1), whatis(1) and man(1) (details yet to be figured out).
- I have installed NetBSD on a virtual machine (QEMU) as well as on my hard disk, where it runs on bare metal.
- I learnt about various system configurations (networking, X11, ssh, etc.). I also installed several packages like Gnome, CVS, Git, etc.
- I also checked out the NetBSD current trunk using CVS and built the kernel.
- I have started experimenting with SQLite's FTS engine and the man pages. I wrote a shell script which traverses the man page directories and stores the pages in the sqlite database.
- Interesting observations: size of man pages stored: ~55 MB, size of database: 76 MB
