A Look-Ahead(1) Left-to-Right Parser Generator for Parrot

Tyler Curtis

Short description: A LALR parser generator will be developed that will convert declarative specifications into Winxed code to parse an input stream of tokens.

Personal Information

Name: Tyler Curtis
Email: tyler.l.curtis@gmail.com
IRC: tcurtis

Abstract

       A LALR parser generator will be developed that will convert declarative specifications into Winxed code to parse an input stream of tokens.

Benefits to the Parrot VM and Open Source Community

       Parrot currently has a very convenient to use set of tools for the development of compilers. However, the Perl 6 grammars used by PCT have some performance disadvantages compared to LALR(1) grammars, when they can be expressed as LALR(1) grammars. They are also less familiar to users of conventional parsing tools, such as lex and yacc. The creation of more conventional parsing tools will make it easier for compiler writers to apply their prior knowledge to targetting Parrot. They could even potentially easily adapt portions of their existing parsers to use the developed tools. LALR(1) parser generators also make expressing some grammars easier. It will also be provider a more language-agnostic option for compiler development on Parrot.

Deliverables

       A LALR parser generator library which generates classes written in Winxed from objects specifying a grammar.
       An executable which uses the parser generator library to produce parsers from a DSL specifying a grammar.
       Some example grammars.
       A test suite.
       Documentation.

Project Details

       LALR parsers are a variation of LR parsers which require much smaller parsing tables. They merge states from an LR parsing table with identical sets of first components. They take time linear in the size of their input to run, which is quite fast. However, they are also very complicated. Writing them by hand is not practical. Instead, a number of LALR parser generators have been written, which convert a declarative specification of an LALR grammar (possibly including semantic actions to build a parse tree or perform other actions in the process of parsing) into a LALR parser.

       This is a rather convenient way of writing parsers. There are LALR parser generators producing parsers in a great number of programming languages now. Parrot has a compiler toolkit which uses Perl 6 regexes to express a language's grammar; however, Parsing expression grammar (PEGs, which are analogous to Perl 6 regexes) parsers require either more memory or more time to parse input than LALR parsers. They also do not handle left recursive grammars as well as LALR parsers.

       My project would consist of developing a parser generator that generates parsers as Parrot classes. The parsers will present an interface similar to HLL::Grammar objects(with parse and parsefile methods). The primary focus will be on developing the code generation from a grammar specification represented as Parrot objects. However, if time is available, a domain-specific language for specifying grammars will be developed. This DSL will be implemented using a parser that builds a grammar specification object corresponding to its
input.

       Some sources I intend to read about the theory of LALR(1) parsing:
               The Dragon Book
               Parsing Techniques - A Practical Guide
               "Practical LR(k) Translators" by Frank DeRemer.
               "Efficient Computation of LALR(1) Look-Ahead Sets" by Frank DeRemer
and Thomas Pennello.
               "On the Translation of Languages from Left to Right" by Donald Knuth.

Project Schedule

       The portion of the schedule prior to June 11 is intended to be less
demanding in time because I will still be in classes until then.

       During any schedule item involving the implementation of any component, tests and documentation are planned to be written concurrently with the implementation, in addition to any explicitly specified times.


       April 8 - May 22: Study the theory behind LALR(1) parsing. Look at
the code of existing LALR parser generators. Begin designing the
representation of the grammars.

       May 23 - May 31: Implement the set of classes representing a grammar, writing appropriate tests concurrently.

       May 30 - 31: Document these classes.

       May 31 - June 11: Implement the creation of LR(0) parsing tables from grammars.

            May 31 - June 1: Write initial tests.

            June 10 - June 11: Document this facility.

       June 12 - June 18: Implement the "Algorithm Digraph" from "Efficient
Computation of LALR(1) Look-Ahead Sets", which is used in the generation of the LALR(1) parsing table.

            June 12 - June 13: Write initial tests.

            June 17 - June 18: Document this algorithm and the interfaces it requires of its parameters.


       June 19 - July 2: Implement the creation of the LALR(1) parsing table.

            June 19 - June 20: Write initial tests.

            June 25: Draft documentation.

            July 1 - July 2: Update documentation for any changes.

       July 3 - July 10: Begin implementing the code generation.

            July 3 - July 4: Write simple test cases.

            July 9 - July 10: Draft documentation for the generated parser interface.


       July 11 - July 15: Work on mid-term evaluation.

       July 16 - July 30: Complete implementation of the code generator.

              July 20: Update documentation.

              July 25 - 26: Write more advanced tests for more complicated grammars.

              July 30: Update documentation.


       July 31 - August 6: If not running behind, implement a DSL for
specifying grammars for the parser generator. This should be implemented using the parser generator.

              July 31: Write a few test grammars.

              August 5 - 6: Write documentation.

       August 7 - August 22: Write more tests and documentation. Fix bugs. Explore optimizations, both in the generated code and in simplifying the grammar before code generation.

       August 22: Hard "pencils down" date.

       August 22 - August 26: Work on final evaluation.

References and Likely Mentors

       I have spoken with Whiteknight, and he is interested in mentoring.

License

       Artistic License 2.0

Bio

       I am a first-year student at the University of Chicago studying Computer Science and Mathematics. I did Google Summer of Code last
year for Parrot. Development tools, particularly compilers, parsers, and object systems, are my primary interests in Computer Science. I do
not have much previous knowledge of the theory of LALR parsing, but, if I am accepted, I plan to spend the remaining time before GSoC begins studying it.

       Eligibility: I am an eligible student.