GSoC/GCI Archive
Google Code-in 2012 Apertium

Investigate how to extract parallel text from Armenian-Russian-English website

completed by: Ulysses

mentors: Francis Tyers, Jonathan

This site appears to have many articles in Armenian, English and Russian:

http://www.aravot.am/

The task is to investigate the multilingual structure of the site and recommend how the articles can best be identified, downloaded and aligned.

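For example, consider the following three URLs, which share the same date and article number and differ only in the language prefix (none, ru, en):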

http://www.aravot.am/2012/12/04/136979/

http://www.aravot.am/ru/2012/12/04/136979/

http://www.aravot.am/en/2012/12/04/136979/
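
The URL pattern in the three examples suggests one possible way to identify candidate article triples. The following is a minimal Python sketch, not a tested crawler. It assumes that every article URL follows the layout /[lang/]YYYY/MM/DD/ID/, that Armenian uses no prefix, and that the same ID refers to the same article in all three languages. None of these assumptions has been verified beyond the examples above, and the names ARTICLE_PATH, LANG_PREFIXES and language_variants are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Assumed layout, inferred only from the three example URLs:
#   /[lang/]YYYY/MM/DD/ID/   (no language prefix for Armenian)
ARTICLE_PATH = re.compile(
    r"^/(?:(?P<lang>ru|en)/)?(?P<date>\d{4}/\d{2}/\d{2})/(?P<id>\d+)/?$"
)

# Language code -> path prefix; "hy" is the ISO 639-1 code for Armenian.
LANG_PREFIXES = {"hy": "", "ru": "ru/", "en": "en/"}


def language_variants(url):
    """Return {lang: candidate_url} for an article URL, or None if it does not match."""
    parts = urlsplit(url)
    match = ARTICLE_PATH.match(parts.path)
    if match is None:
        return None  # not an article page under the assumed layout
    base = f"{parts.scheme}://{parts.netloc}/"
    tail = f"{match['date']}/{match['id']}/"
    return {lang: base + prefix + tail for lang, prefix in LANG_PREFIXES.items()}


if __name__ == "__main__":
    for lang, candidate in language_variants(
        "http://www.aravot.am/ru/2012/12/04/136979/"
    ).items():
        print(lang, candidate)
```

The returned URLs are only candidates: each one would still need to be downloaded and checked, since a translation may not exist for every article and the shared ID is an assumption rather than a confirmed property of the site.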