Google Code-in 2010: The Apertium project

Evaluate efficacy of decompounding algorithm in Dutch--Afrikaans MT

completed by: AureiAnimus

mentors: Francis Tyers

The aim of this task is to annotate two files of 1,000 lines each. The lines are in the following format:

 ; ;       ^1/1<num>$ ^basismetaal/basis<n><sg><cmp>+metaal<n><sg>$              1 basismetaal
 ; ;       ^1/1<num>$ ^basisbestuur/basis<n><sg><cmp>+bestuur<n><sg>$            1 basisbeheer
 ; ;       ^1/1<num>$ ^basinstrumente/bas<n><sg><cmp>+instrument<n><pl>$         1 basinstrumenten
 ; ;       ^1/1<num>$ ^Bariumverbindings/Barium<n><sg><cmp>+verbinding<n><pl>$   1 Bariumverbindingen
 ; ;       ^1/1<num>$ ^Bariumsulfaat/Barium<n><sg><cmp>+sulfaat<n><sg>$          1 Bariumsulfaat
 ; ;       ^1/1<num>$ ^Bariumpoeier/Barium<n><sg><cmp>+poeier<n><sg>$            1 Bariumpoeder
 ; ;       ^1/1<num>$ ^bariumnitraat/barium<n><sg><cmp>+nitraat<n><sg>$          1 bariumnitraat
 ; ;       ^1/1<num>$ ^amateurpogings/amateur<n><sg><cmp>+poging<n><pl>$         1 amateurpogingen
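To make the line format concrete, here is a minimal Python sketch that splits one line into its parts. It rests on assumptions not stated in the task: that the last `^surface/analysis$` unit on the line is the compound, that the constituents are joined by '+', and that the final whitespace-separated token is the translation. The name `parse_line` is made up for illustration.

```python
import re

# One lexical unit in the form ^surface/analysis$
LEXICAL_UNIT = re.compile(r"\^([^/^$]+)/([^$]+)\$")


def parse_line(line):
    """Split one line into annotation slots, compound analysis and translation."""
    # The two ';' delimit the slots for the analysis and translation labels.
    analysis_tag, translation_tag, rest = line.split(";", 2)

    # Assumption: the last ^...$ unit is the compound under evaluation.
    surface, analysis = LEXICAL_UNIT.findall(rest)[-1]

    # Drop the <tag> sequences and split on '+' to get the constituents.
    parts = [re.sub(r"<[^>]+>", "", part) for part in analysis.split("+")]

    # Assumption: the last token after the final '$' is the translation.
    translation = rest.rsplit("$", 1)[1].split()[-1]

    return {
        "analysis_tag": analysis_tag.strip(),
        "translation_tag": translation_tag.strip(),
        "surface": surface,
        "parts": parts,
        "translation": translation,
    }
```

For the first sample line above, this yields the surface form `basismetaal`, the constituents `basis` and `metaal`, and the translation `basismetaal`, with both annotation slots still empty.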

You first need to check the compound word segmentation/analysis. Then check the translation.

Place a 'GA' before the first ';' if the analysis/segmentation is good.

Place a 'BA' before the first ';' if the analysis/segmentation is bad.

Place a 'GT' after the first ';' if the translation is good.

Place a 'BT1' after the first ';' if the translation is bad because the constituent words are bad.

Place a 'BT2' after the first ';' if the translation is bad because the constituent words are good but an epenthetic (linking) element is missing or wrong.
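Once a file is annotated, the labels can be counted to summarise the evaluation. The following is a minimal Python sketch, not part of the original task. It assumes that an annotated line looks like `GA;GT; ^1/1<num>$ ...`, that is, the analysis label sits before the first ';' and the translation label between the first and second ';'. The names `read_tags` and `tally` are made up for illustration.

```python
from collections import Counter

ANALYSIS_TAGS = {"GA", "BA"}
TRANSLATION_TAGS = {"GT", "BT1", "BT2"}


def read_tags(line):
    """Return the (analysis, translation) labels of one annotated line."""
    before, after, _ = line.split(";", 2)
    return before.strip(), after.strip()


def tally(lines):
    """Count the labels over all lines, rejecting missing or unknown labels."""
    analysis, translation = Counter(), Counter()
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        a_tag, t_tag = read_tags(line)
        if a_tag not in ANALYSIS_TAGS or t_tag not in TRANSLATION_TAGS:
            raise ValueError(f"line {number}: unexpected labels {a_tag!r}, {t_tag!r}")
        analysis[a_tag] += 1
        translation[t_tag] += 1
    return analysis, translation
```

With a hypothetical file name, `tally(open("annotated.txt", encoding="utf-8"))` returns two counters, one for GA/BA and one for GT/BT1/BT2. A line that has not been annotated yet raises an error, which helps to catch lines that were skipped.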

Contact a mentor on IRC (#apertium on irc.freenode.net) to get the necessary files.