CPAN search for the modern web
Moritz Onken
Short description: The goal of this project is to build a sophisticated search engine for the CPAN that knows how to return relevant results, by incorporating metadata from sources such as user ratings, user-added tags, dependency graphs, freshness, and test results. This service will provide both a front-end website and an open REST web API, which will allow others to mash up the data and innovate.
Moritz Onken
onken@netcubed.de
Google ID: mo
Abstract
The Comprehensive Perl Archive Network (CPAN) is huge: over 90,000 modules in over 22,000 distributions. Unfortunately it lacks proper accessibility: CPAN search is a naive, text-only search, which means that it works only if you already know what you are looking for. This is fine for experienced users, but creates a barrier for newcomers.
The goal of this project is to build a sophisticated search engine that knows how to return relevant results, by incorporating metadata from sources such as user ratings, user-added tags, dependency graphs, freshness, and test results. This service will provide both a front-end website and an open REST web API, which will allow others to mash up the data and innovate. The project is based on the MetaCPAN prototype, which will be extended to fit the needs of the search engine and various other aspects.
Benefits to the Perl/Open Source Community
This will benefit both newcomers, who will be presented with relevant modules that are valued by the Perl community, and experienced users, who will not be forced to wade through broken and deprecated distributions but will still be exposed to new distros that may turn out to be the stars of the future. Also, the extended search syntax makes searches inside a specific namespace possible. Modern web technologies will lower the barrier to interaction, which will encourage user feedback, which in turn will improve the search results.
As a by-product, the MetaCPAN API will be extended, to make experimentation and innovation easier for others.
Deliverables
- Rework the MetaCPAN data structures
- Refine the backend API which will be used by the front-end and made accessible via the REST web interface
- Incorporate data from the following sources and include it in the ranking algorithm:
  - CPAN Ratings
  - CPAN Testers
  - cpanvote
  - dependency chain
- Tagging functionality and remote authentication against providers like GitHub and BitCard
- A front-end that makes the search and MetaCPAN accessible
Project Details
There have been many attempts at trying to improve the CPAN experience, and none has succeeded. There are a few reasons for this:
- It is a big undertaking which requires dedication. It is difficult for somebody with a full time job to maintain the impetus required for their project to gain traction with the wider community.
- Inaccessible data. Until recently, CPAN module data has been difficult to access. Thanks to CPAN::Meta, File::Rsync::Mirror::Recent and related modules this is no longer the case and the CPAN is more accessible than ever before.
- Too many designers and not enough implementors. Everybody has opinions, often diametrically opposed. These projects tend to end with little to show but arguments.
GSoC would give me the opportunity to dedicate four months to transforming MetaCPAN from a working prototype to a search platform which would be useful to the wider Perl community. I am supported by the small but talented, focussed and enthusiastic team that developed MetaCPAN in the first place.
Experienced users should be able to find the module they are looking for as fast as possible while newcomers with little experience should get meaningful results if they search for an "OOP framework". To improve the relevance of search results, I will devise an algorithm to generate a PageRank-like score for each distribution, which integrates data from sources such as CPAN Ratings, CPAN Testers, cpanvote and the dependency chain.
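To make the scoring idea concrete, here is a minimal sketch in Python (chosen only for illustration; it is not the project's implementation). It runs a plain power-iteration PageRank over a dependency graph, treating each dependency as a vote for the distribution depended upon, and then blends the result with ratings and test pass rates. The distribution names, the weights, the 1-to-5 rating scale and the neutral defaults for missing data are all assumptions made for this example.

```python
from typing import Dict, List, Tuple


def pagerank(deps: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Power-iteration PageRank over a dependency graph.

    deps maps a distribution to the distributions it depends on, so a
    dependency acts as a "vote" for the distribution being depended upon.
    """
    nodes = set(deps)
    for targets in deps.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by distributions without dependencies is spread evenly.
        dangling = sum(rank[node] for node in nodes if not deps.get(node))
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n
                    for node in nodes}
        for node, targets in deps.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


def combined_score(rank: Dict[str, float],
                   avg_rating: Dict[str, float],
                   pass_rate: Dict[str, float],
                   weights: Tuple[float, float, float] = (0.5, 0.3, 0.2),
                   ) -> Dict[str, float]:
    """Blend PageRank with ratings and test results into one score in [0, 1].

    Assumptions: avg_rating is on a 1-to-5 scale, pass_rate is a fraction
    between 0 and 1, and distributions without data get a neutral value.
    """
    w_rank, w_rating, w_tests = weights
    max_rank = max(rank.values())
    scores = {}
    for dist, value in rank.items():
        rating = (avg_rating.get(dist, 3.0) - 1.0) / 4.0  # map 1..5 to 0..1
        passes = pass_rate.get(dist, 0.5)
        scores[dist] = (w_rank * value / max_rank
                        + w_rating * rating
                        + w_tests * passes)
    return scores


# Hypothetical distributions, used only to exercise the functions.
deps = {
    "App-Foo": ["Core-Util", "Web-Bar"],
    "Web-Bar": ["Core-Util"],
    "Core-Util": [],
}
rank = pagerank(deps)
scores = combined_score(rank,
                        avg_rating={"Core-Util": 4.5, "Web-Bar": 3.0},
                        pass_rate={"Core-Util": 0.98, "App-Foo": 0.60})
for dist in sorted(scores, key=scores.get, reverse=True):
    print(dist, round(scores[dist], 3))
```

In this toy graph the distribution that everything else depends on ends up with the highest rank. The weights are exactly the kind of parameter that would need tuning against community feedback.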
The intention is to create a positive feedback loop for real-world users by allowing them to customise their CPAN experience and to flag modules that are important to them or that they don't like. This benefits the search engine because it provides more data to improve relevance, which in turn improves the user experience.
I will add remote authentication with popular backends such as GitHub, Twitter and Facebook to lower the barrier to interaction, and the ability for users to tag distributions on the page. Tags will help both with classification (e.g. "OOP", "ORM", "Database") and with filtering (e.g. "Deprecated", "Not maintained").
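The two uses of tags can be illustrated with a short sketch, again in Python and purely as an assumption about how such a filter might behave: classification tags are required, filtering tags are excluded. The function name, the case-insensitive matching and the sample distributions are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def normalise(tag: str) -> str:
    """Compare tags case-insensitively and ignore surrounding whitespace."""
    return tag.strip().lower()


def filter_by_tags(results: List[str],
                   tags: Dict[str, Set[str]],
                   require: Iterable[str] = (),
                   exclude: Iterable[str] = ()) -> List[str]:
    """Keep results that carry every required tag and no excluded tag.

    The original order of the results (e.g. by relevance) is preserved.
    """
    required = {normalise(t) for t in require}
    excluded = {normalise(t) for t in exclude}
    kept = []
    for dist in results:
        dist_tags = {normalise(t) for t in tags.get(dist, ())}
        if required <= dist_tags and not (excluded & dist_tags):
            kept.append(dist)
    return kept


# Hypothetical search results and user-added tags.
results = ["Old-ORM", "New-ORM", "Web-Kit"]
tags = {
    "Old-ORM": {"ORM", "Deprecated"},
    "New-ORM": {"ORM", "Database"},
    "Web-Kit": {"OOP"},
}
print(filter_by_tags(results, tags, require=["ORM"], exclude=["Deprecated"]))
```

Here a search restricted to "ORM" that hides "Deprecated" distributions keeps only the maintained one, while untagged distributions simply never match a required tag.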
Project Schedule
Community Bonding Period:
Make further contact with the authors and maintainers of CPAN Ratings, cpanvote and CPAN Testers to figure out the best way to include their data in the index.
Discuss and collect ideas from the toolchain gurus at the QA Hackathon in Amsterdam.
Get to know OAuth and related protocols for authentication.
Dig deeper into ElasticSearch and think about a good scheme for storing the data.
Week #1:
Implement a basic front-end and a very basic full-text search that queries MetaCPAN. Use this set-up to collect feedback from the community and log search requests for further evaluation.
Week #3:
Extend the MetaCPAN indexer to include data from CPAN Testers, cpanvote and CPAN Ratings.
Week #5:
Extend MetaCPAN to accept a PageRank value for each distribution.
Implement a scoring script that calculates the PageRank vector and applies the score to the distributions.
Week #7:
Evaluate the scores. Reach out to the community and get feedback on the search results.
Rerun the scoring script with different parameters and reevaluate. This is an ongoing process and will be repeated throughout the rest of the project.
Week #9 (mid-term evaluation):
Implement authentication and tagging.
Extend MetaCPAN to accept tags for each distribution and add an API for that.
Week #11:
Add tagging functionality to the search. Users can search for distributions with certain tags or exclude them from the results.
Week #13:
Wrap things up. Finish the documentation and clean up the code.
References and Likely Mentors
Clinton Gormley (clinton@iannounce.co.uk, author of ElasticSearch.pm)
Florian Ragwitz (rafl@debian.org)
MetaCPAN (Website, Repository)
License
Same terms as Perl itself, where possible.
Bio
I am 25 years old and have been developing in Perl for over ten years. Other areas of interest are JavaScript and web development in general. I utilize and contribute to many Perl open source projects (HTML::FormFu, Catalyst, DBIx::Class, various MooseX modules).
Eligibility
I am currently a full-time student at the Karlsruhe Institute of Technology (KIT) and can provide documentation upon request.
