Extended unicode support
Dmitry Olshansky
Short description: Make complex Unicode stuff a piece of cake in the D programming language. Major points are: normalization; faster and always up to date implementation of common isXXX character properties; a proper case insensitive comparators for strings; grapheme clusters. Possible extras: support for legacy encodings, generic mutli-level lookup tables, i.e. Trie for std.container.
Extended Unicode support
Make complex Unicode stuff a piece of cake in the D programming language.
Rationale
The D programming language had built-in support for Unicode the day one and it proved to be a wise decision over the years. However as time passes it steadily becomes obvious that in-language foundation of encoding aware types and UTF decoding alone is not enough.
Nowadays user expectations go higher than this, for instance, the following tasks should be easy to get right: case-agnostic string comparison, dealing with potentially non-normalized texts, operating on user-perceived character level.
Having it all nicely packed in standard library, D can hold its grounds at the top of the list of Unicode-aware languages.
Main points of proposal are presented roughly in the order of priority.
Data structures for Unicode
This is a core foundation of the library. It's provided so that not only std, but other 3rd party libraries (e.g. regex, hyphenation etc.) can built upon it.
In fact we can cover this need with 2 datatypes: CodepointSet based on inversion lists[1] and a flexible n-level Trie[2] templated type.
CodepointSet enables convenient and efficient way for handling sets of characters.
Trie!n provides various levels of trading flexibility and size for sheer lookup speed.
Secondary goal is to allow building tries for specified encoding like UTF-16, UTF-8 thus avoiding a decoding step, while doing direct lookup on encoded strings.
All of relevant isXxx functions of std.uni are rebuild on this foundation to provide an optimum trade-off of size vs speed on a case by case basis.
Normalization
Should be as easy as:
auto normalized = normalize(mystring);
That would do what 99.9*% of people expect - normalized is in pure NFC. More over no allocation happens if mystring is already NFC.
The focus is getting NFC correct and making it as fast we can get. Other normalization forms considered a secondary goal, but the order of priority is roughly: NFKD**, NFKC, NFD. Another rationale for NFC first can be found here: http://www.w3.org/TR/charmod-norm/#sec-TextNormalization
*Some may not even know what exactly NFC and Unicode is about, but in fact their expectations are that of NFC normalization.
**If I'm not mistaken NFKD provides some very nice fuzzy matching capabilities.
Case conversions and case-agnostic operations
Unicode standard specifies two ways of case folding - a simple and a full version. Simple version substitutes codepoints 1:1, full has some 1:n mappings.
Thus the idea is to enhance current icmp with 'full' understanding, and provide 'simple' version for performance reasons (aka sicmp, now that's a name!).
Current toUpperCase, toLowerCase should be revised, generalized and updated. Look at them now, they are outdated and a performance disaster in the making. A course of action is to use trie-based lookup, it is proved to be faster than hardcoded branchy if-else or binary search on modern CPUs[2].
User perceived Character or graphemes
Manipulations on grapheme level were commonly recognized as an important task on D forums around a year ago. I can't find the exact link, but this seems like a follow-up:
http://forum.dlang.org/thread/iggkip$gq8$1@digitalmars.com?page=1#post-igvh1h:24lmh:241:40digitalmars.com
(be warned this was a long and heated debate, that mixed d newbies, codepoints, graphemes and a new range concept)
This is provided to underline a point that it's more of a clean design problem then technical difficulty. The proposed solution is:
Provide a byGrapheme(x) or graphemes(x) helpers that return range of Grapheme.
Grapheme itself is a concrete value type holding the grapheme cluster, that is a slice of (decoded) codepoints.
The challenge here is to avoid memory allocation overhead, presumably by reusing a scratch memory block internal to std.uni.
Another utility artifact would be to provide a graphemeStride as a complement to utf.stride, for application that are willing to, say, wrap text while preserving grapheme boundaries.
Automation
It would be a shame to miss an important point of updating to the next revisions of Unicode. As part of project a D script should be produced that takes care of downloading, processing and updating Phobos character table files.
Bonus: tackle legacy encodings
It's a secondary goal and a complimentary project. The goals are to replace current std.encoding (too good a name) with an easy to use and more general alternative. The usage should be as simple as:
auto legacy = encode("Привет!", "CP-1251");
assert(legacy[0] == 0xCF);
writeln(decode(legacy)); //ok UTF-8 string sent to stdout
A tweak at work: encode should return an thin aliased-this wrapper over ubyte[], to allow decode to recognize original encoding. Of course, encoding can be hinted otherwise.
Another use case is streaming on-the-fly transcoding.
Again, fast recoding of UTF string to a legacy format and back involves having a fast Unicode lookup tables - Tries.
Bonus: generalized Trie data structure
Make a truly generic Trie that encompasses multistage tables, finger trees and other simillar data structures not limited to codepoint storage.
In essence this can be tackled via a policy based design, thus exposing all of relevant tradeoffs. A common application of Tries is database indexes and Aho-Corassic derived algorithms.
Bottom line, it’s a nice addition to std.container.
Time commitment
Honestly, the amount of time per week might vary, especially in August, as I'm going to be busy with important real life things.
The plan is to spend around 20-25 hours a week for first part of GSOC. And, realistically, no more then 15-20 hours for the second part.
The project is intentionally partitioned in primary, secondary goals and bonus points so that I can wrap up earlier if short on time, while still delivering major features, or if I get more spare time I wouldn't have to "invent" extras on the fly.
Current progress
During my work on regular expressions last summer, I had to spent a part of time on adding basic Unicode support. Even in this limited shape it amounted to 3 weeks of work at least. Honestly, the complexity of it turned out to be totally unexpected.
The net gain is that now I know ins and outs of Unicode far better. More over I have done an implementation of CodepointSet and 1-level tries[3]. Currently it's more of a hack then a general solution. As my understanding improved over time, I observe that these data structures require a redesign to provide proper user interface, flexibility and performance.
Another thing I've done as part of regex project is code generation script[4], that could serve as starting point for a proper automation.
I have put together a synopsis page for a would be std.uni:
http://blackwhale.github.com/phobos/std_uni.html
Challenges & Impact
The main challenge is creating a robust and flexible Trie abstraction. Once it's been done, everything else follows naturally from it. Still getting correct normalization is no easy feat, same goes for grapheme boundaries.
The benefit of fully realized project is multifold:
- A flexible foundation for fast Unicode-aware text processors that leaves no place for reinventing the wheel.
- The complete gentleman’s set of Unicode utilities, with NFC normalization and case insensitive comparison.
- At least NFKD paves a way for fuzzy string comparison this is a great feature for search engines.
- Grapheme clusters allow proper text segmentation; primarily graphemes are used in font rendering systems, nothing prevents the next one to be written in D.
- An automation script that future-proofs implementation against upcoming changes in character database details.
Two major points of secondary objectives (marked as bonus) add a pair of compelling items to the table:
- generalized lookup tables, specializations (and poor ones) of it are being reinvented on daily basis;
- finally get easy support of legacy encodings in Phobos; it's hard to stress enough the fact that they are still legacy in the name only, for one I see tons of CP-1251 only software in Russia.
Milestones
Some estimates.
22 May
Have the foundation built and working, have some provisions for fully generalized Trie.
10 June
Get both case insensitive comparisons working, revisit and update isXXX properties.
20 June
Get correct NFC normalization with reasonable speed.
Midterm
Grapheme ranges, better NFC, probably NFKD.
Ideally midterm marks completion of primary goals.
Midtrem – August
Ideally by August all of important goals are covered, documentation is written and automation script works flawlessly.
Till final
I’m resorting to polishing and fixing, and strictly additive features.
Links
1. Bits of Unicode, slides on Inversion lists and maps ad Unicode, by one of the core Unicode developers http://macchiato.com/slides/Bits_of_Unicode.ppt
2. Unicode implementation guidelines, page 2 (numbered 106)http://Unicode.org/uni2book/ch05.pdf
3. Internal hacky data types, https://github.com/D-Programming-Language/phobos/blob/master/std/internal/uni.d
4. Generated Unicode tables, https://github.com/D-Programming-Language/phobos/blob/master/std/internal/uni_tab.d
Technical reports
http://www.Unicode.org/reports/tr15/
http://Unicode.org/reports/tr29/
