Google Code-in 2014 Apertium

Make Apertium's scraper framework write the tree to disk every n seconds instead of every article

completed by: Joonas Kylmälä

mentors: Jonathan

Currently, Apertium's scraper framework writes the entire XML tree to the output file every time it adds an additional page's worth of content to the tree, which makes it quite slow. However, writing the entire tree to disk only once, after all pages have been read, is a bad strategy too: a properly configured scraper can take days(!) to run, and if anything interrupts the script (poor handling of bad HTML, loss of the internet connection, etc.), all the content scraped up to that point is lost. The best strategy is probably to have the scraper write the tree to disk repeatedly, at intervals of a predefined number of seconds (e.g. 60), so that at most that much scraping time is lost.

Modify the current scraper to do this, with the number of seconds settable by an optional switch and defaulting to something sane, like 60. Make sure there are no issues involving race conditions, and that you don't break any aspect of the scraper as it is currently written.
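To make the intended behaviour concrete, here is a minimal sketch, assuming a Python scraper built around xml.etree.ElementTree (this is not Apertium's actual scraper code): the tree is flushed to disk at most once per interval, the write goes to a temporary file that is then renamed over the output so an interrupted write never leaves a half-written file, and one final write happens when the loop finishes. The option name --write-interval and the helper fetch_and_clean are illustrative, not part of the existing scraper.

    #!/usr/bin/env python3
    # Sketch only: interval-based writing of the scraped XML tree.
    import argparse
    import os
    import tempfile
    import time
    import xml.etree.ElementTree as ET


    def write_tree(tree, path):
        """Write atomically: dump to a temp file in the same directory,
        then rename it over the target, so a crash mid-write never
        corrupts the existing output file."""
        dirname = os.path.dirname(os.path.abspath(path))
        fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
        try:
            with os.fdopen(fd, "wb") as f:
                tree.write(f, encoding="utf-8", xml_declaration=True)
            os.replace(tmp, path)  # atomic rename on POSIX
        except Exception:
            os.unlink(tmp)
            raise


    def fetch_and_clean(url):
        """Hypothetical stand-in for the real per-page scraping logic."""
        return ""


    def scrape(pages, out_path, interval):
        root = ET.Element("corpus")
        tree = ET.ElementTree(root)
        last_write = time.monotonic()

        for url in pages:
            entry = ET.SubElement(root, "entry", source=url)
            entry.text = fetch_and_clean(url)

            # Flush at most once per `interval` seconds instead of
            # after every single page.
            if time.monotonic() - last_write >= interval:
                write_tree(tree, out_path)
                last_write = time.monotonic()

        # Always write once more at the end so the last pages survive.
        write_tree(tree, out_path)


    if __name__ == "__main__":
        ap = argparse.ArgumentParser()
        ap.add_argument("pages", nargs="*", help="URLs to scrape")
        ap.add_argument("-o", "--output", default="scraped.xml")
        ap.add_argument("--write-interval", type=int, default=60,
                        help="seconds between writes of the XML tree (default: 60)")
        args = ap.parse_args()
        scrape(args.pages, args.output, args.write_interval)

The temp-file-plus-rename step is one simple way to address the race-condition concern: any reader of the output file only ever sees either the previous complete tree or the new complete tree, never a partially written one.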
For further information and guidance on this task, you are encouraged to come to our IRC channel.