GSoC/GCI Archive
Google Code-in 2010: The Apertium project

Train part-of-speech taggers for Dutch and Afrikaans

completed by: AureiAnimus

mentors: Francis Tyers

The aim of this task is to train part-of-speech taggers for Dutch and Afrikaans as part of the apertium-af-nl MT system. This involves writing a TSX file for each language,[1] then running the unsupervised training process as described on the Wiki.[2] As the Afrikaans corpus, use the Afrikaans Wikipedia;[3] for Dutch, use the EuroParl corpus.[4] You should also write 5 to 10 forbid/enforce rules for each tagger, based on a brief survey of disambiguation errors.
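One possible starting point for the survey of disambiguation errors is to count which ambiguity classes (sets of competing part-of-speech tags) occur most often in the analysed corpus, since frequent classes are the likeliest places for a forbid or enforce rule to pay off. The sketch below is not part of the task description. It assumes the input is morphologically analysed text in the Apertium stream format (`^surface/analysis1/analysis2$`), looks only at the first tag of each analysis, and ignores escaped characters and joined analyses. The function name `ambiguity_classes` and the sample Dutch analyses are made up for illustration.

```python
import re
from collections import Counter

# One lexical unit in the Apertium stream format: ^surface/analysis1/analysis2$
# Simplification (assumption): escaped "/", "^" and "$" are not handled.
LEXICAL_UNIT = re.compile(r"\^([^$]*)\$")
# First tag of an analysis, e.g. "n" in "werk<n><nt><sg>"
FIRST_TAG = re.compile(r"<([^>]+)>")


def ambiguity_classes(analysed_text):
    """Count how often each set of competing first tags occurs."""
    counts = Counter()
    for unit in LEXICAL_UNIT.findall(analysed_text):
        analyses = unit.split("/")[1:]  # drop the surface form
        tags = set()
        for analysis in analyses:
            if analysis.startswith("*"):  # unknown word, carries no tags
                continue
            match = FIRST_TAG.search(analysis)
            if match:
                tags.add(match.group(1))
        if len(tags) > 1:  # only words the tagger actually has to choose for
            counts[tuple(sorted(tags))] += 1
    return counts


# Hypothetical analyser output, invented for this example.
sample = (
    "^de/de<det><def><mf><sg>$ "
    "^werk/werk<n><nt><sg>/werken<vblex><pri><p1><sg>$ "
    "^is/zijn<vbser><pri><p3><sg>$ "
    "^af/af<adv>/af<pr>$ "
    "^werk/werk<n><nt><sg>/werken<vblex><pri><p1><sg>$"
)

if __name__ == "__main__":
    for tags, freq in ambiguity_classes(sample).most_common():
        print("/".join(tags), freq)
```

The most frequent classes can then be checked by hand against the tagger's output. A sequence that never occurs in correct text (for example, a determiner directly followed by a finite verb) is a candidate for a forbid rule in the TSX file.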

 

1. http://wiki.apertium.org/wiki/TSX_format

2. http://wiki.apertium.org/wiki/Unsupervised_tagger_training

3. http://download.wikimedia.org/afwiki/20101104/

4. http://www.statmt.org/europarl/v5/nl-en.tgz