GSoC/GCI Archive
Google Code-in 2010 The Apertium project

Write an Apertium aware wrapper for Hunspell's 'analyze'

completed by: cristipiticul

mentors: Francis Tyers

Currently the Hunspell morphological analyser 'analyze' accepts input only in a one-word-per-line format: each token to be analysed morphologically must be followed by a newline. This is problematic for many purposes, for example in Apertium, where we want to be able to analyse whole lines of running text. The task of this project is to write a wrapper which reads its input character by character and does naïve tokenisation based on whitespace.
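The tokenisation step described above could be sketched roughly as follows. This is a minimal illustration and not the supplied prototype: the names `is_ascii_space`, `tokenise` and `tokenise_line` are hypothetical, and the sketch assumes UTF-8 input, where every byte of a multi-byte character is 0x80 or above and so can never be confused with an ASCII whitespace byte. It also simply discards the whitespace, which a real wrapper would probably need to pass through to its output.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// True for ASCII whitespace only. All bytes of a multi-byte UTF-8
// character are >= 0x80, so they are never treated as separators.
static bool is_ascii_space(char c) {
    return c == ' ' || c == '\t' || c == '\n' ||
           c == '\r' || c == '\v' || c == '\f';
}

// Read the stream one byte at a time and split on runs of whitespace.
// (Hypothetical helper; the whitespace itself is dropped in this sketch.)
std::vector<std::string> tokenise(std::istream& in) {
    std::vector<std::string> tokens;
    std::string current;
    char c;
    while (in.get(c)) {
        if (is_ascii_space(c)) {
            // A separator ends the current token, if there is one.
            if (!current.empty()) {
                tokens.push_back(current);
                current.clear();
            }
        } else {
            current.push_back(c);
        }
    }
    // Flush the last token when the input does not end in whitespace.
    if (!current.empty()) {
        tokens.push_back(current);
    }
    return tokens;
}

// Convenience overload for tokenising a string held in memory.
std::vector<std::string> tokenise_line(const std::string& line) {
    std::istringstream in(line);
    return tokenise(in);
}
```

Calling `tokenise(std::cin)` would tokenise standard input; each resulting token could then be handed to the analyser one at a time.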

The code should also be able to relabel tags and convert the output format (see below).

Current format:

  $ ./analyze hu_HU.aff hu_HU.dic foo
  > születésnap
  analyze(születésnap) =  st:születésnap po:noun ts:NOM
  stem(születésnap) = születésnap

Desired format:

  $ cat foo | ./hun-proc hu_HU.aff hu_HU.dic
  ^születésnap/születésnap<N><Nom>$
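One way to picture the relabelling and format conversion is the sketch below. It is only an illustration under stated assumptions: the function `to_lexical_unit` and the table `tag_table` are hypothetical names, the table contains just the two mappings visible in the example above (`po:noun` to `N`, `ts:NOM` to `Nom`), and the handling of unlisted fields and of words without a stem is a guess at reasonable behaviour, not a specification.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical relabelling table. Only these two entries come from the
// example in the task description; a real table would be much larger.
static const std::map<std::string, std::string>& tag_table() {
    static const std::map<std::string, std::string> table = {
        {"po:noun", "N"},
        {"ts:NOM", "Nom"},
    };
    return table;
}

// Build one lexical unit from a surface form and the whitespace-separated
// morphological fields printed by 'analyze', e.g.
// "st:születésnap po:noun ts:NOM".
std::string to_lexical_unit(const std::string& surface,
                            const std::string& fields) {
    std::istringstream in(fields);
    std::string field, lemma, tags;
    while (in >> field) {
        if (field.compare(0, 3, "st:") == 0) {
            // The "st:" field carries the stem, used here as the lemma.
            lemma = field.substr(3);
        } else {
            // Assumption: fields missing from the table are passed
            // through unchanged inside angle brackets.
            auto it = tag_table().find(field);
            tags += "<" + (it != tag_table().end() ? it->second : field) + ">";
        }
    }
    if (lemma.empty()) {
        // Assumption: a word with no analysis is marked as unknown
        // with a leading '*'.
        return "^" + surface + "/*" + surface + "$";
    }
    return "^" + surface + "/" + lemma + tags + "$";
}
```

With the fields from the example, `to_lexical_unit("születésnap", "st:születésnap po:noun ts:NOM")` yields the line shown under "Desired format".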

You must have experience with programming in either C or C++ to accept this task. A prototype implementation will be supplied, along with functions for reading UTF-8 characters from a stream, and you will be given help with specific implementation issues.

It is recommended that you join the #apertium IRC channel on irc.freenode.net before claiming this task.

==Code==

* 'analyze.cxx': http://hunspell.cvs.sourceforge.net/viewvc/hunspell/hunspell/src/tools/analyze.cxx?revision=1.1.1.1&view=markup

* 'foma_proc.c': http://apertium.svn.sourceforge.net/viewvc/apertium/branches/foma/src/foma_proc.c?revision=23108&view=markup

* prototype: http://pastebin.com/YVDXj29r

==Documentation==

* Apertium stream format: http://wiki.apertium.org/wiki/Apertium_stream_format