GSoC/GCI Archive
Google Summer of Code 2012

Apertium

Web Page: http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code

Mailing List: apertium-stuff@lists.sourceforge.net

  • The Apertium project develops a free/open-source platform for machine translation and language technology. We try to focus our efforts on lesser-resourced and marginalised languages, but also work with more widely-spoken languages.
  • The platform, including data for a large number of language pairs, a translation engine, and auxiliary tools, is being developed around the world, largely in universities and companies (e.g. Prompsit Language Engineering), but independent free-software developers also play a huge role.
  • There are currently 30 published language pairs within the project (including a number of "firsts" — for example Aragonese—Spanish, Turkish—Kyrgyz, Spanish—Occitan, Breton—French, and Basque—Spanish among others), and several more in development.

Projects

  • Apertium id-ms: Indonesian-Malaysian machine translation. The Indonesian-Malaysian language pair in Apertium currently has no active maintainers. The objective of this project is to develop a release-quality version of the Apertium id-ms language pair: the morphological analyzers for both Indonesian and Malaysian will be improved, and the Indonesian and Malaysian dictionaries will be completed.
  • Apertium on your mobile. Provide users with the services of Apertium on their mobile devices, with added features such as:
      • Different keypads for different languages
      • Translation of SMS and other text content such as contacts, addresses, and memos
      • A basic form for translating text
  • Apertium-kaz-tat: machine translation between Kazakh and Tatar. Creating a rule-based shallow-transfer machine translation system for translating between Kazakh and Tatar.
  • apertium-quz-spa: Machine Translation between Cuzco Quechua and Spanish. A translation pair between Spanish (lttoolbox) and Cuzco Quechua (hfst) for Apertium.
  • Apertium-sl-sh: machine translation between Slovene and Serbo-Croatian. Apertium does not currently have a release-quality translation system for the Slovene and Serbo-Croatian language pair. The goal of this project is to bring the apertium-sl-sh language pair to release quality.
  • Corpus-based lexicalised feature transfer. This project will deal with setting additional lexical features, taking context into account. The main idea is to extend the Apertium pipeline by placing a new module after the POS tagging process and before the transfer process. The module will set additional tags that can later be used in the transfer module. Examples of such tags include noun definiteness and verb aspect. The goal of the project is both to improve the existing sh-mk pair and to serve as a prototype for similar corpus-based modules.
  • Make lttoolbox-java embeddable. Currently, lttoolbox-java is only usable from the command line, and it relies on external resources of the language pair to be translated (which must be downloaded and compiled by the user separately). The aim of this task is to overcome these limitations so that self-contained JAR files for translating a language pair can easily be integrated into larger Java projects.
  • New Maltese-Arabic language pair. A new Apertium language pair providing Maltese-to-Arabic translation.
  • Rule-based finite-state disambiguation. Design of an XML formalism for writing disambiguation rules, along with a validator for it, the upgrades to lttoolbox needed to represent the rules as a finite-state transducer, a compiler, and a processor that applies the rules to an Apertium input stream.
  • Turkish-Turkmen Machine Translation-Apertium. This project will create a machine translation system for the Turkish-Turkmen language pair, with Turkmen as the target language.
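
To make the "Corpus-based lexicalised feature transfer" idea more concrete, here is a minimal sketch of a module that sits between the POS tagger and the transfer stage and appends an extra tag to each noun based on its left context. It is an illustration only, not the project's implementation. It assumes the tagger emits lexical units in the form `^lemma<tag1><tag2>$`; the function name `add_definiteness`, the appended `<def>`/`<ind>` tags, and the hand-written rule are all hypothetical. The project proposes deriving such decisions from a corpus, so the fixed rule below only stands in for a learned one.

```python
import re

# One disambiguated lexical unit, e.g. ^house<n><sg>$ (assumed stream format).
LEXICAL_UNIT = re.compile(r"\^([^<$^]+)((?:<[^>]+>)+)\$")


def add_definiteness(stream: str) -> str:
    """Append a hypothetical definiteness tag to every noun in the stream.

    Stand-in rule (an assumption, not learned from a corpus): a noun that
    directly follows a definite determiner gets <def>; any other noun gets
    <ind>. All other lexical units pass through unchanged.
    """
    previous_tags = ""  # tags of the lexical unit seen just before this one

    def annotate(match):
        nonlocal previous_tags
        lemma, tags = match.group(1), match.group(2)
        if "<n>" in tags:
            extra = "<def>" if "<det><def>" in previous_tags else "<ind>"
            result = f"^{lemma}{tags}{extra}$"
        else:
            result = match.group(0)
        previous_tags = tags
        return result

    # re.sub visits matches left to right, so previous_tags tracks context.
    return LEXICAL_UNIT.sub(annotate, stream)


if __name__ == "__main__":
    tagged = "^the<det><def><sp>$ ^house<n><sg>$ ^be<vbser><pri><p3><sg>$ ^big<adj>$"
    print(add_definiteness(tagged))
```

Because the extra tag is simply appended to the unit, later transfer rules could match on it without any change to the earlier pipeline stages, which is the design the project description suggests.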