Making the Perl Core UTF-8 clean.
Hugmeir
Short description: Clean handling of Unicode throughout the Perl core, rather than in certain selected areas.
Brian Neil Fraser
fraserbn@gmail.com
Abstract
Clean handling of Unicode throughout the Perl core, rather than in certain selected areas.
Benefits to the Perl/Open Source Community
Any Perl developer who has ever needed reliable handling of Unicode in identifier names would benefit from this project, as would the Perl core developers, for whom sane parsing of UTF-8 would create the opportunity to rework the ‘use encoding’ pragma into something meaningful. Meanwhile, modules that extend Perl’s syntax would have fewer hoops to jump through once ASCII semantics are relinquished.
Currently, someone may use the utf8 pragma to gain a certain level of Unicode support in their programs, such as with string and regex literals (or, alternatively, through the “use feature ‘unicode_strings’;” pragma), and superficially with lexical variables. However, internally, lexicals are still stored as raw bytes rather than by code point, and the same holds true for package variables. Meanwhile, certain subsystems of the lexer and the parser disregard both SVf_UTF8 and the UTF-8 hints flags.
Deliverables
The modified core files, as a patch or a git branch, plus the accompanying test suite.
Project Details
The project is broad in the sense that it touches fairly different areas of the core. The scratchpads (the places where Perl saves lexical variables for any given scope) are implemented as more or less normal arrays and deal with SVs, so for the most part the issue is finding a good way to detect or pass the HINT_UTF8 flag or its derivatives when dealing with char * parameters. Thus, this should be relatively easy to implement without far-reaching complications.
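To illustrate why a bare char * is not enough, here is a minimal sketch of a name comparison that carries an encoding flag alongside each buffer. Everything in it (NAME_IS_UTF8, next_cp(), name_eq()) is hypothetical and invented for this example; it is not the pad API, and it assumes well-formed UTF-8 input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag, standing in for whatever the real API would pass. */
#define NAME_IS_UTF8 0x1

/* Read one code point from s (n >= 1 bytes left) and report the bytes used.
 * Without the flag every byte is one Latin-1 code point; with it the bytes
 * are decoded as UTF-8, which is assumed to be well-formed. */
static uint32_t next_cp(const unsigned char *s, size_t n, int flags,
                        size_t *used)
{
    uint32_t cp;
    size_t len, i;

    assert(n > 0);
    if (!(flags & NAME_IS_UTF8) || s[0] < 0x80) {
        *used = 1;
        return s[0];
    }
    if ((s[0] & 0xE0) == 0xC0)      { len = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; }
    else                            { len = 4; cp = s[0] & 0x07; }
    if (len > n)
        len = n;                    /* truncated input: stop at the end */
    for (i = 1; i < len; i++)
        cp = (cp << 6) | (s[i] & 0x3F);
    *used = len;
    return cp;
}

/* Compare two names code point by code point, honouring each side's flag. */
int name_eq(const char *a, size_t alen, int aflags,
            const char *b, size_t blen, int bflags)
{
    const unsigned char *pa = (const unsigned char *)a;
    const unsigned char *pb = (const unsigned char *)b;

    while (alen > 0 && blen > 0) {
        size_t ua, ub;
        if (next_cp(pa, alen, aflags, &ua) != next_cp(pb, blen, bflags, &ub))
            return 0;
        pa += ua; alen -= ua;
        pb += ub; blen -= ub;
    }
    return alen == 0 && blen == 0;
}
```

With the flag, the Latin-1 byte 0xE9 and the UTF-8 sequence 0xC3 0xA9 compare equal because both denote U+00E9, while the same four bytes read once as UTF-8 and once as Latin-1 compare unequal. A comparison on raw bytes alone gets both cases wrong.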
Stashes and GVs present similar issues. hv_name_set(), which is used for stashes, already has a flags parameter, but nothing ever passes a non-zero argument. Thankfully, from my admittedly limited research, it would appear that most parts of the HV API already respect Unicode, or are at least primed to do so. Meanwhile, GVs accessed through symbolic references make their portion of the problem quite egregious: $běh, ${"b\x{c4}\x{9b}h"}, ${"b\x{11b}h"}, ${"b\N{U+011b}h"} and ${"b\304\233h"} all refer to the same variable, even though the middle parts of the second and fifth versions have nothing to do with the original. (Internally, lexicals have the same problem, but this is a non-issue: without symbolic references, the only way to trigger this behaviour is by using an identifier with a grapheme whose bytes encode into a valid ASCII identifier, which may not even exist.) Certain important considerations will have to be taken into account during the implementation, as sane handling of Unicode in package names could facilitate the future support of Unicode in filenames and paths, hopefully amongst other things.
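The collision between those spellings comes down to byte arithmetic: U+011B encodes in UTF-8 as the two bytes 0xC4 0x9B (octal 304 233), so a name stored as raw bytes cannot tell the character ě apart from the two Latin-1 characters with those values. A small sketch of that arithmetic follows; utf8_encode() and same_bytes_as_beh() are names made up for this example and are not part of the Perl core.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Encode one code point (up to U+10FFFF) as UTF-8 into out, which must
 * have room for 4 bytes. Returns the number of bytes written. Surrogates
 * are not rejected; this is only a sketch. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Build the stored form of the identifier 'b', U+011B, 'h' and report
 * whether the given raw byte string is indistinguishable from it. */
int same_bytes_as_beh(const char *bytes, size_t len)
{
    unsigned char name[6];
    size_t n = 0;

    name[n++] = 'b';
    n += utf8_encode(0x11B, name + n);
    name[n++] = 'h';
    return n == len && memcmp(name, bytes, n) == 0;
}
```

Both the hex spelling "b\xC4\x9Bh" and the octal spelling "b\304\233h" pass this check, which is the aliasing described above seen at the byte level.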
While the lexer function lex_bufutf8() (and the UTF macro) tells whether the lexer buffer is to be interpreted as UTF-8 or Latin-1, some parts of the lexer ignore it, or perform operations that aren’t sensitive to a Unicode context. Initially I’d be going through these and related subparts, changing things to pass the flag and respect it where needed.
It’s useful to note that, should any particular part of this project become impossible to implement, the decoupling of the different areas makes it viable to take what is done and add it to the core separately, so gains from this project aren’t necessarily tied to full completion.
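As a rough illustration of what respecting the flag means for one lexer task, the sketch below measures how many bytes of a buffer belong to an identifier under each interpretation. scan_ident() is a hypothetical helper, not a function from toke.c, and it makes two loud simplifications: any well-formed non-ASCII sequence counts as a word character (a real lexer would consult the Unicode properties), and a leading digit is not rejected.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Length in bytes of the identifier at the start of buf.
 * is_utf8 == 0: ASCII rules only, so any high byte ends the identifier.
 * is_utf8 != 0: a well-formed multi-byte sequence is accepted whole.
 * Simplification: every non-ASCII code point is treated as a word
 * character, and a leading digit is not rejected. */
size_t scan_ident(const char *buf, size_t len, int is_utf8)
{
    const unsigned char *s = (const unsigned char *)buf;
    size_t i = 0;

    assert(buf != NULL);
    while (i < len) {
        unsigned char c = s[i];

        if (c < 0x80) {
            if (c != '_' && !isalnum(c))
                break;
            i++;
        } else {
            size_t need, k;

            if (!is_utf8)
                break;              /* bytes mode: stop at the high byte */
            need = (c & 0xE0) == 0xC0 ? 2
                 : (c & 0xF0) == 0xE0 ? 3
                 : (c & 0xF8) == 0xF0 ? 4 : 0;
            if (need == 0 || i + need > len)
                break;              /* stray continuation or truncation */
            for (k = 1; k < need; k++)
                if ((s[i + k] & 0xC0) != 0x80)
                    break;
            if (k < need)
                break;              /* malformed sequence */
            i += need;
        }
    }
    return i;
}
```

For the buffer "běh = 1" stored as UTF-8, the scan covers all four bytes of the name when the flag is set, but stops after the "b" when it is not, so a caller that drops the flag silently sees a different identifier.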
Project Schedule
This is a rough sketch of the project’s idealised schedule. Beyond hacking on pad.c during the community period, which I intend to use to flush out any personal misconceptions I may have of the core’s handling of Unicode, the rest is entirely malleable, and the first part is only slightly less so. Each step also includes an implicit “run make test and see what breaks.”
Community Bonding Period: Continue immersing myself in the core, learn git, and attempt to fully understand the problem space so as to avoid the mistakes of utf8.pm. Most of the work on pad.c will take place here, so as to get the most feedback possible from my mentor and #p5p early on.
May 23 - 30: Wrap up pad.c. Identify unclean spots in hv.c and gv.c, then discuss a sensible roadmap for fixing things with my mentor, and resume hacking.
May 31 - June 6: Assuming that I dedicated the previous week to hv.c, I should be working on gv.c by now.
June 7 - June 27: I have most of my midterms during these three weeks, so productivity will probably go down by quite a bit. The plan remains to continue hacking until both stashes and typeglobs handle things cleanly. Beyond the normal testing, it should now be possible to run, and trivial to create, an alternative version of the test suite with all variables swapped for Unicode facsimiles, though I’d have to confer with my mentor as to whether testing this is a good use of the project’s time.
June 28 - July 10: Should nothing else surface that requires tweaking, start walking through everything that looks at the lexer buffer and check that the flags are being passed appropriately.
July 11 - July 15: Midterm evaluation.
July 16 - August 15: Continue working on the lexer and the parser. If possible, fix two known bugs: the UTF-8-ness of PL_rsfp being ignored, and string eval under utf8.pm.
References and Likely Mentors
The entire #soc-help channel was beyond helpful, but a special mention goes to t0m, who answered Far More Than Everything I’ve Ever Wanted to Know about GSOC, and rafl, who not only helped me from day one, but probably would’ve done his own FMTEYEWTK had I asked. Although we have only recently gotten in touch, Zefram’s advice has been invaluable in getting this project going. rafl could also be a possible mentor.
License
Same terms as Perl itself.
Bio
I’m a 20-year-old student in Buenos Aires, Argentina. I’m also currently working full-time as a Ruby developer and, while Ruby is a fairly acceptable Perl, being one is something I very much look forward to ceasing. Unfortunately I don’t have any experience working on Open Source projects; however, my job initially had me take over a moderately big home-grown web testing application and add support for a sizable portion of the site that was untested for no good reason. As such, I’ve had some experience in adding functionality while leaving everything else untouched. Zefram noted, and I can’t but agree wholeheartedly, that this is not a realistic expectation here; I can only say that I’ll do my best to wear a raincoat in someone else’s core.
If this project is selected, I’ll quit my job during the community bonding period. So while I can’t assure you that I’m the best person for this project, I definitely am the most committed.
I’ve been learning Perl for a little over a year now, and it quickly grew to become my favorite computer-related activity. My previous encounters with programming (C and Java, thanks to college) had left me disillusioned about, well, a lot. It’s odd what a single night with Learning Perl can do to someone! Ever since, Perl has been part of my daily routine, and GSOC gives me a chance to give back to the community from which I’ve been so gladly leeching. Though I’ll admit that the prospect of fattening my resume is a small ulterior motive!
Eligibility
I am currently a student at the Universidad de Palermo (University of Palermo) in Buenos Aires, Argentina, and can provide documentation upon request.
