GSoC/GCI Archive
Google Code-in 2014 Apertium

Implement a feature extractor for a MaxEnt POS tagger

completed by: Stan K.

mentors: Kevin Brubeck Unhammer, Francis Tyers

The objective of this task is to write a Python program that extracts features from a tagged corpus.

The analysed corpus, in which ambiguous tokens keep all of their candidate readings, looks like:

https://svn.code.sf.net/p/apertium/svn/languages/apertium-eng/texts/turing1.tagged.txt

The tagged corpus, in which each token has a single hand-selected reading, looks like:

https://svn.code.sf.net/p/apertium/svn/languages/apertium-eng/texts/turing1.handtagged.spectie.txt

The feature patterns should be described as regular expressions, e.g.

surface = \^([A-Za-z]+)/

lemma = /([A-Za-z]+)<

number = (<sg>|<pl>)

gender = (<f>|<m>|<nt>)

pos = (<n>|<adj>|<vblex>|<pr>)
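As a minimal sketch of how such patterns could be applied, the Python snippet below runs each regular expression over one lexical unit and collects the first match per feature. The function name match_patterns is hypothetical. The caret in the surface pattern is escaped here, on the assumption that the "^" at the start of each unit is a literal character rather than a regex anchor.

```python
import re

# Patterns from the task description. The caret in "surface" is escaped
# (an assumption): every lexical unit starts with a literal "^".
PATTERNS = {
    "surface": re.compile(r"\^([A-Za-z]+)/"),
    "lemma": re.compile(r"/([A-Za-z]+)<"),
    "number": re.compile(r"(<sg>|<pl>)"),
    "gender": re.compile(r"(<f>|<m>|<nt>)"),
    "pos": re.compile(r"(<n>|<adj>|<vblex>|<pr>)"),
}


def match_patterns(unit):
    """Return {feature name: first match} for one lexical unit (hypothetical helper)."""
    found = {}
    for name, pattern in PATTERNS.items():
        m = pattern.search(unit)
        if m:
            found[name] = m.group(1)
    return found


print(match_patterns("^day/day<n><sg>$"))
# {'surface': 'day', 'lemma': 'day', 'number': '<sg>', 'pos': '<n>'}
```

Features whose pattern does not match (gender, in this case) are simply left out of the result.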

 

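So in the example below, consider the token "this". The first block is the analysed input, where "this" is ambiguous; the second block is the tagged output: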


^to/to<pr>$
^this/this<det><dem><sg>/this<prn><tn><mf><sg>$
^day/day<n><sg>$

 

^to/to<pr>$
^this/this<det><dem><sg>$
^day/day<n><sg>$

 

Your feature table might be:

 

Output             | Input   | Features

this<det><dem><sg> | this    | (-1, "lemma", "to") (-1, "pos", "<pr>") (0, "number", "<sg>") (0, "lemma", "this") (1, "lemma", "day") (1, "pos", "<n>") (1, "number", "<sg>")
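One way to produce a row like the one above is sketched here in Python. It assumes a context window of one token on each side and reads the features from the analysed lines, taking only the first match of each pattern. The names features_at and table_row are hypothetical, and the order of features within a position may differ from the table.

```python
import re

# Feature patterns from the task description, as (name, regex) pairs.
PATTERNS = [
    ("lemma", re.compile(r"/([A-Za-z]+)<")),
    ("pos", re.compile(r"(<n>|<adj>|<vblex>|<pr>)")),
    ("number", re.compile(r"(<sg>|<pl>)")),
    ("gender", re.compile(r"(<f>|<m>|<nt>)")),
]


def features_at(analysed, index, window=1):
    """Collect (offset, name, value) triples around analysed[index]."""
    feats = []
    for offset in range(-window, window + 1):
        pos = index + offset
        if 0 <= pos < len(analysed):  # skip positions outside the sentence
            for name, pattern in PATTERNS:
                m = pattern.search(analysed[pos])
                if m:
                    feats.append((offset, name, m.group(1)))
    return feats


def table_row(analysed, tagged, index, window=1):
    """Return (output, input, features) for the token at index."""
    unit = tagged[index].strip()                  # "^this/this<det><dem><sg>$"
    surface, reading = unit[1:-1].split("/", 1)   # drop "^" and "$", then split
    return reading, surface, features_at(analysed, index, window)


analysed = [
    "^to/to<pr>$",
    "^this/this<det><dem><sg>/this<prn><tn><mf><sg>$",
    "^day/day<n><sg>$",
]
tagged = [
    "^to/to<pr>$",
    "^this/this<det><dem><sg>$",
    "^day/day<n><sg>$",
]

output, surface, feats = table_row(analysed, tagged, 1)
print(output, "|", surface, "|", feats)
```

Reading context features from the analysed side is a design assumption: at tagging time only the ambiguous analysis is available, so training on the same view keeps the features consistent.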

--------------------------------------------

Untagged:

^Turing/Turing<np><cog><sg>$
^machines/machine<n><pl>$
^are/be<vbser><pres>$
^to/to<pr>$
^this/this<det><dem><sg>/this<prn><tn><mf><sg>$
^day/day<n><sg>$
^a/a<det><ind><sg>$
^central/central<adj>$
^object/object<n><sg>/object<vblex><inf>/object<vblex><pres>$
^of/of<pr>$
^study/study<n><sg>/study<vblex><inf>/study<vblex><pres>$
^in/in<pr>$
^theory/theory<n><sg>$
^of/of<pr>$
^computation/computation<n><sg>$
^./.<sent>$

Tagged:

^Turing/Turing<np><cog><sg>$
^machines/machine<n><pl>$
^are/be<vbser><pres>$
^to/to<pr>$
^this/this<det><dem><sg>$
^day/day<n><sg>$
^a/a<det><ind><sg>$
^central/central<adj>$
^object/object<n><sg>$
^of/of<pr>$
^study/study<n><sg>$
^in/in<pr>$
^theory/theory<n><sg>$
^of/of<pr>$
^computation/computation<n><sg>$
^./.<sent>$
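The two listings above line up token by token, so the training examples can be found by walking them in parallel. The Python sketch below does this under the assumptions that both files have one lexical unit per line and that no surface form or reading contains an escaped "/". The names parse_unit and ambiguous_tokens are hypothetical.

```python
def parse_unit(line):
    """Split "^surface/reading1/reading2$" into (surface, [readings])."""
    body = line.strip()
    if not (body.startswith("^") and body.endswith("$")):
        raise ValueError("not a lexical unit: %r" % line)
    surface, *readings = body[1:-1].split("/")
    return surface, readings


def ambiguous_tokens(untagged, tagged):
    """Yield (index, surface, candidates, chosen) for each ambiguous token."""
    for i, (u, t) in enumerate(zip(untagged, tagged)):
        surface, candidates = parse_unit(u)
        _, chosen = parse_unit(t)
        if len(candidates) > 1:  # more than one reading: the tagger must choose
            yield i, surface, candidates, chosen[0]


untagged = [
    "^central/central<adj>$",
    "^object/object<n><sg>/object<vblex><inf>/object<vblex><pres>$",
    "^of/of<pr>$",
]
tagged = [
    "^central/central<adj>$",
    "^object/object<n><sg>$",
    "^of/of<pr>$",
]

for index, surface, candidates, chosen in ambiguous_tokens(untagged, tagged):
    print(index, surface, len(candidates), chosen)
# 1 object 3 object<n><sg>
```

Unambiguous tokens are skipped here because they give the classifier nothing to decide; they still serve as context for their neighbours.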

 

 

 

Background reading on maximum entropy POS tagging: http://www.aclweb.org/anthology/W96-0213