GSoC/GCI Archive
Google Summer of Code 2015 Apache Software Foundation

Support Sitemap Crawler in Nutch 2.x

by Cihad Guzel for Apache Software Foundation

The Nutch crawler currently discovers URLs only from pages it has already crawled. This approach is expensive. In addition, the importance and change frequency of these URLs are not known and can only be estimated. An up-to-date sitemap file, however, can list all of a site's URLs. For this reason, a website's sitemap files should be crawled. This project adds sitemap crawler support to Nutch.
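To illustrate what a sitemap provides that ordinary link discovery does not, the following is a minimal sketch in Python that reads the URL, change frequency, and priority of each entry in a sitemap. It is not the Nutch implementation (Nutch is written in Java). The function name `parse_sitemap` and the sample data are hypothetical, and the default priority of 0.5 follows the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample sitemap; the URLs and values are made up for illustration.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/</loc>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://example.com/about</loc>
  </url>
</urlset>
"""


def parse_sitemap(xml_text):
    """Return one dict per <url> entry with its loc, changefreq, and priority."""
    entries = []
    root = ET.fromstring(xml_text)
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        if not loc:
            # <loc> is required by the protocol; skip entries without it.
            continue
        # <changefreq> is optional; None means the site did not state it.
        changefreq = url.findtext("sm:changefreq", namespaces=NS)
        # <priority> is optional; the protocol's default is 0.5.
        priority_text = url.findtext("sm:priority", namespaces=NS)
        priority = float(priority_text) if priority_text else 0.5
        entries.append(
            {"loc": loc, "changefreq": changefreq, "priority": priority}
        )
    return entries


if __name__ == "__main__":
    for entry in parse_sitemap(SAMPLE_SITEMAP):
        print(entry["loc"], entry["changefreq"], entry["priority"])
```

A crawler could use these fields as hints, for example by fetching higher-priority URLs first or by revisiting frequently changing pages more often, rather than guessing those values from past crawls.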