GSoC/GCI Archive
Google Code-in 2012 Apertium

scrape Mongolian noun paradigms into yaml file

completed by: Richard Tynan

mentors: Francis Tyers, Jonathan

There are charts of Mongolian (= Khalkha) noun paradigms at the following URL:

http://wiki.firespeaker.org/Khalkha_noun_classes

Your job is to write a script (preferably in python3) that scrapes those paradigms into YAML files (for testing morphological transducers) like those at https://apertium.svn.sourceforge.net/svnroot/apertium/incubator/apertium-cv-tr/tests/

The script should produce files according to the following guidelines:

  • each sub-paradigm type should be a separate file, named e.g. "normal nouns - ending with consonants.yaml" and "normal nouns - ending with vowels.yaml" (it would be good to case-convert to all-lowercase),
  • each word should be a section in the Tests section of the file, e.g. "гар = time:",
  • transcriptions (in []s) should be ignored,
  • empty case forms should be skipped (e.g., no "Pl" form for classroom),
  • case forms highlighted in blue should be skipped,
  • all formatting of individual letters should be ignored (e.g., bolded н is common; the ''' markers around the character should be stripped),
  • variable forms should include all (and only) the forms given (no "—"s) (this will probably be the hardest part of designing this script),
  • the script should be able to deal with new sub-paradigms, but it may (and does not have to) ignore the "to sort" section,
  • the entries for the forms should be tagged as <n> with other tags coming from the form given, and the base form should be the Nom form for each noun; e.g.:
    • гар<n><nom> : гар
    • гар<n><gen> : гарын
    • гар<n><dat> : гарт
    • гар<n><nom><pl> : гарууд
  • note that the Pl form is actually <nom><pl>
  • the header of the yaml files should point to ../khk.autogen.hfst for Gen and ../khk.automorf.hfst for Morph.