GSoC/GCI Archive
Google Summer of Code 2013

Apertium

Web Page: http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code

Mailing List: https://lists.sourceforge.net/lists/listinfo/apertium-stuff

The Apertium project develops a free/open-source platform for machine translation and language technology. We try to focus our efforts on lesser-resourced and marginalised languages, but also work with more widely spoken languages.

The platform, including data for a large number of language pairs, a translation engine and auxiliary tools, is being developed around the world, largely in universities and companies (e.g. Prompsit Language Engineering), but independent free-software developers also play a huge role.

There are currently 33 published language pairs within the project (including a number of "firsts", such as Aragonese-Spanish, Spanish-Occitan, Breton-French and Basque-Spanish), and several more in development.

Apertium has a special focus on lowering the barrier to creating linguistic resources for any language, ideally to be used for MT, but also reusable for other purposes (e.g. grammar checking, morphological analysis, PoS tagging, etc.).

Projects

  • A Sliding-Window Drop-in Replacement for the HMM Part-of-Speech Tagger in Apertium: This is a proposal for the Apertium organization in GSoC 2013. The goal of the project is to implement a new part-of-speech tagger, the Sliding-Window Part-of-Speech Tagger (SWPoST), to serve as a drop-in replacement for the current HMM part-of-speech tagger in Apertium. The new tagger can achieve higher-quality tagging and is easier to understand and modify. The proposal mainly consists of the following parts: 1) Title and contact information. 2) My general interest in machine translation and the Apertium project. 3) An explanation of the tagger's mathematical model in my own words. First, mathematical descriptions and a simple example are used together to show how the new tagger's training and tagging procedures work. Second, two solutions are proposed for implementing the FORBID and ENFORCE restrictions in the tagger, using a more complex model, the Light SWPoST (LSWPoST). 4) A description of the work plan, including the coding challenge, the community bonding period, and the detailed weekly plan. 5) A list of my skills and qualifications that will help me implement the tagger. An online version of the proposal is provided in "Additional info"; it uses wiki markup and LaTeX for a better display.
  • Apertium Turkish-Uzbek: Apertium Turkish-Uzbek is an MT system based on the Apertium framework that automatically translates from Turkish to Uzbek (tr->uz).
  • Application for "Interface for creating tagged corpora" GSoC 2013: Most released Apertium pairs, and those still to come, need better part-of-speech (POS) taggers. Training supervised taggers has never been a waste of time, quite the opposite, but with a better UI for tagging corpora we will be able to get more people working on the project as disambiguators.
  • Chinese-to-Spanish Apertium System: We will introduce Chinese-to-Spanish translation into the Apertium open-source project. We will carry out this integration using the GPL-licensed tools available for this language pair.
  • Danish-Norwegian (Bokmål) language pair: I will use the Apertium platform to work on the Danish-Norwegian (Bokmål) language pair, which is currently in the nursery stage. I will write a set of transfer rules and work on a constraint grammar to improve the tagging process. I will mainly work on the nb-da direction, as I can make reliable grammaticality judgments, Danish being my first language. I also plan to extend support to nn-da (Nynorsk-Danish).
  • Hindi-English Language Pair: The Hindi-English language pair is currently in the incubator stage of Apertium, with a lot of work still to be done on the dictionaries and transfer rules. I would like to make this language pair ready for release by the end of the coding period.
  • Improvements in lexical-selection module: The lexical selection module in Apertium is currently a prototype. There are many optimisations that could be made to make it faster and more efficient. There are a number of scripts which can be used for learning lexical-selection rules, but the scripts are not particularly well written. Part of the task will be to rewrite the scripts taking into account all possible corner cases, as well as to extend the functionality of the module itself.
  • Rule-based finite-state disambiguation: Implement a disambiguation framework for Apertium that can be expressed as a finite-state transducer (FST). The framework will be based on the constraint grammar (CG) formalism, which is already supported by Apertium. A proof-of-concept compiler that converts CG rules to FSTs already exists: fomacg. This project will extend fomacg to handle all CG constructs and implement a program that runs the rule FSTs on the output of Apertium's morphological analyzer component.
  • Ukrainian-Russian language pair: Create a machine translation system based on the Apertium platform for unidirectional translation from Ukrainian into Russian.
  • Visual interface for editing transfer rules: Develop an application which reads t1x, t2x and t3x files and lets users edit their content in a visual drag-and-drop interface. Transfer rules are represented as Scratch-like structural diagrams.