Google Code-in 2013 KDE

Create language trigrams for language detection in Sonnet

completed by: Levente Kurusa

mentors: Martin Sandsmark

In Sonnet, the spell checking/language library for KDE, we're currently working on automatic language detection. To implement this, we need data files generated from a text corpus in each relevant language.
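To give a feel for why these data files are needed, here is a minimal sketch of one common way trigram data can be used to guess a language: ranked trigram profiles are compared with an "out-of-place" distance, and the closest profile wins. This is an illustration under assumptions, not Sonnet's actual detection code; the function names (`profile`, `distance`, `guess_language`) and the profile size of 300 are hypothetical.

```python
from collections import Counter


def profile(text, size=300):
    """Return the most frequent trigrams of text, most common first."""
    # Assumption: lowercase and collapse whitespace before counting.
    normalized = " ".join(text.lower().split())
    counts = Counter(normalized[i:i + 3] for i in range(len(normalized) - 2))
    return [trigram for trigram, _ in counts.most_common(size)]


def distance(sample, reference):
    """Sum of rank differences between two ranked trigram lists.

    A trigram missing from the reference costs a fixed penalty equal to
    the reference length (an assumption made for this sketch).
    """
    ranks = {trigram: rank for rank, trigram in enumerate(reference)}
    penalty = len(reference)
    return sum(
        abs(rank - ranks[trigram]) if trigram in ranks else penalty
        for rank, trigram in enumerate(sample)
    )


def guess_language(text, references):
    """Pick the language whose reference profile is closest to the text.

    references maps a language code to a ranked trigram list.
    """
    sample = profile(text)
    return min(references, key=lambda lang: distance(sample, references[lang]))
```

The more text a reference profile is built from, the more stable its ranking becomes, which is why a large corpus per language is asked for.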

Here is a list of the trigram files we already have, so don't recreate any of these: http://quickgit.kde.org/?p=sonnet.git&a=tree&hb=8087b9c1154bf0a63384ccdc5f9b3f321b48c1ed&f=data%2Ftrigrams

The first step is to generate a plain text file containing a large amount of text in the language we're going to analyze. A good explanation of how to do this is here: http://blog.afterthedeadline.com/2009/12/04/generating-a-plain-text-corpus-from-wikipedia/ (or here, up to and including step three: http://trulymadlywordly.blogspot.ru/2011/03/creating-text-corpus-from-wikipedia.html)

The second step is to create a trigram file for Sonnet, using the following tool: http://quickgit.kde.org/?p=scratch%2Fsandsmark%2Fgentrigrams.git. To build it, open a terminal, check it out with git (git clone git://anongit.kde.org/scratch/sandsmark/gentrigrams.git), enter the folder (cd gentrigrams), run qmake to prepare for building (qmake), and finally build it (make). Then run it (./gentrigram ../path/to/corpus.txt language.trigram, where you replace "../path/to/corpus.txt" with the path to the text file you generated in the first step). This will generate a text file with trigrams and their scores, which you then send to me (the mentor).
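If you are curious what a trigram count looks like, the sketch below counts overlapping three-character sequences in a piece of text. It is only an illustration: it does not reproduce how gentrigrams normalizes text, computes scores, or formats its output file, and the names `count_trigrams` and `top_trigrams` are made up for this example.

```python
from collections import Counter


def count_trigrams(text):
    """Count every overlapping three-character sequence in text."""
    # Assumption: lowercase the text and collapse runs of whitespace
    # into single spaces; the real tool may normalize differently.
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + 3] for i in range(len(normalized) - 2))


def top_trigrams(text, limit=300):
    """Return the `limit` most frequent trigrams with their counts."""
    return count_trigrams(text).most_common(limit)


if __name__ == "__main__":
    # Tiny inline sample; a real corpus would be read from a file.
    for trigram, count in top_trigrams("the cat and the hat", limit=5):
        print(repr(trigram), count)
```

For example, `count_trigrams("abcabc")["abc"]` is 2, because "abc" occurs at two positions in the string.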

To complete this task, you will need to do this for five languages we don't already have; check the list at the first link above. It should be fairly straightforward once you've done it for one language.