Tuesday, 22 January 2013

Parsing Adventures with Marpa

As a semi-professional developer for developers (oh, and those end-user thingies out there) I've written a few parsers in my time. My work has always been based on the published papers and books of proper computer science boffins - the usual suspects you've probably heard of - but I could never claim to completely understand the hard-core mathematical background of parsing.

We have been blessed with so much industrious research and so many powerful tools from these people. The classics are Lex and Yacc, their popular replacements Flex and Bison, and a whole slew of alternatives such as the powerful Parse::RecDescent for Perl - not to mention the classic ad-hoc top-down parser that any man and his dog likes to write.

Speaking of industrious research and powerful tools, take a look at Marpa. The core parsing engine is written in C, and it is also hooked into Perl for those of us who prefer higher-level coding.

I happened to see it mentioned on a page at Perl.com (article) just at the right time. Following up quickly showed how promising it could be for my semi-amateur efforts, and I had a big project in mind at the time. Although Marpa is still a work in progress, that work seems to be fairly flying along, driven by its Chief Engineer and a crew of keen users on the user group. The product is certainly ready for real work, and the advice from the user group has been prompt and helpful.

A few years ago now, I started reading Higher-Order Perl by Mark Jason Dominus - something every keen Perl programmer should do. Of particular note is the work on Iterators, which has inspired the style of lexer I've been applying lately to all my work. I have a little suite of iterators that I can chain together to perform various duties, and these tokenizing iterators have been immediately useful for feeding tokens into my Marpa parser. The first few real postings on this Marpa subject will describe the Iterators and present source code - without too much blather, if you're lucky!
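To make that concrete, here is a minimal sketch of the closure-based iterator style that Higher-Order Perl describes: each iterator is a closure you call for the next item, and filters/transformers take an iterator and return a new one, so they chain freely. The names (tokens_from_string, igrep, imap) and the toy tokenizer are mine for illustration, not the actual suite from the book or from my lexer:

```perl
use strict;
use warnings;

# An iterator is just a closure: call it for the next item, undef when done.
sub tokens_from_string {
    my ($text) = @_;
    return sub {
        return undef unless $text =~ /\G\s*(\w+|\S)/gc;
        return $1;
    };
}

# A filter takes an iterator and returns a new one - they chain freely.
sub igrep {
    my ($pred, $it) = @_;
    return sub {
        while (defined(my $x = $it->())) {
            return $x if $pred->($x);
        }
        return undef;
    };
}

# A transformer, likewise.
sub imap {
    my ($f, $it) = @_;
    return sub {
        my $x = $it->();
        return defined $x ? $f->($x) : undef;
    };
}

# Chain them: tokenize, drop punctuation, upper-case what's left.
my $it = imap( sub { uc $_[0] },
         igrep( sub { $_[0] =~ /^\w+$/ },
         tokens_from_string("let x = 10") ) );

my @out;
while (defined(my $t = $it->())) { push @out, $t }
print "@out\n";   # LET X 10
```

The pay-off is that each stage stays tiny and testable, and the parser's main loop only ever sees one interface: "give me the next token".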

Once the Lexer side is out of the way, the next interesting subject is the Marpa grammar itself. My project is to create a complete Informix 4GL parser which will emit a fully parsed syntax tree. From there, various activities can be applied, such as converting traditional 4GL to suit FourJs Genero version, or maybe pretty-printing, analysis, quality control (4GL lint, anyone?) whatever your imagination desires.

Apart from the usual learning curve you'd expect with any powerful module, I have to deal with one of the peculiar "features" of 4GL - the complete lack of reserved words. Designed as a programmable wrapper around SQL, the 4GL language follows the principle of not actually reserving any of its keywords. Strictly speaking, some (idiot) could write the following:
main
define
    char char(10),
    function char(10)

    call function(20)
end main

function function(int)
    define int int

    let int = 10
    let char = int
    let function = char
    let char = function
    let function = char(30)    # calling the function defined below...
    display function
end function

function char(define)
    define define float    # yet another abuse of a special identifier

    display "char() called with ", define
    return define
end function
and expect it to run. Similar crimes can be committed in SQL on all the major database engines, and in my career I HAVE seen several cases where a supposed keyword was used as an identifier. So handling this properly is not merely of academic interest.
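One consequence for the lexer: since no word is reserved, the lexer alone can never decide whether "char" is a keyword or an identifier - the best it can do is report every plausible reading and let the parser's context choose. A pure-Perl sketch of that idea (the token names and the keyword list here are mine, purely for illustration):

```perl
use strict;
use warnings;

# With no reserved words, a bare word may be a keyword, an identifier,
# or both at once. The lexer's job shrinks to listing the candidates.
# (Token names and keyword list are illustrative, not the real lexer's.)
my %keywords = map { $_ => 1 }
    qw(main define function call let display return end char int float);

sub token_alternatives {
    my ($word) = @_;
    my @alts = ('IDENTIFIER');                      # any word can name something
    unshift @alts, uc($word) . '_KW' if $keywords{ lc $word };
    return @alts;
}

print join(" ", token_alternatives("char")), "\n";  # CHAR_KW IDENTIFIER
print join(" ", token_alternatives("x")),    "\n";  # IDENTIFIER
```

A conventional yacc-style lexer would have to pick one token type here and hope; deferring the choice is exactly the capability described next.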

Very early in my reading of the Marpa literature, I came across a mention of being able to feed "alternative" tokens into the parser at any particular point in the token stream. This immediately piqued my interest, and when I presented my case to the user group, the replies quickly came back (including from Jeffrey) that it should indeed work, although nobody had done exactly that yet. I guess the intention was to handle natural language, where a word can be a verb, a noun or whatever depending on the context? I'm not sure of the original intention, to tell the truth, but when all is said and done, it has worked admirably.
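For the curious, the token-level interface looks roughly like this. This is only a sketch against the Marpa::R2 recognizer's alternative()/earleme_complete() calls - the toy grammar, the symbol names and the hard-coded token stream are all invented for the example, and in the real parser the lexer side decides which alternatives to offer:

```perl
use strict;
use warnings;
use Marpa::R2;

# Sketch only: a toy grammar in which the word "char" could be read
# either as a keyword or as an identifier. Symbol names are invented.
my $grammar = Marpa::R2::Grammar->new({
    start => 'stmt',
    rules => [
        [ stmt => [qw(LET var EQUALS var)] ],
        [ var  => ['IDENTIFIER'] ],
        [ var  => ['CHAR_KW'] ],
    ],
});
$grammar->precompute();

my $recce = Marpa::R2::Recognizer->new({ grammar => $grammar });

# An unambiguous token: read() is alternative() + earleme_complete().
$recce->read( 'LET', 'let' );

# The word "char": the lexer cannot decide, so offer the parser both
# readings at the same location, then close off that earleme.
$recce->alternative( 'IDENTIFIER', \'char', 1 );
$recce->alternative( 'CHAR_KW',    \'char', 1 );
$recce->earleme_complete();

$recce->read( 'EQUALS',     '='  );
$recce->read( 'IDENTIFIER', 'x'  );

# The recognizer keeps whichever alternatives the grammar can use here.
my $value_ref = $recce->value();
```

The pleasant surprise is how little ceremony this needs: a rejected alternative is a soft failure, so you can shovel in every plausible reading and let the Earley machinery sort it out.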

After a few questions and a lot of hours spent transcribing the 4GL syntax diagrams, I now have a fully functioning recognizer for Informix/FourJs 4GL. It has successfully processed all the 4GL I can get my hands on. A few mistakes turned up, but nothing sinister.

Getting to the game plan now, the next bunch of blog pages will cover, roughly:
  1. the Perl module I've created for holding lexical tokens
  2. one posting each for the Iterators, covering the ones used in the 4GL parser, and maybe one or two others that are interesting in their own right
  3. the main loop of the parser showing how it feeds alternative types of tokens into the parse. This relates closely to some of the fiddling the Iterators get up to.
  4. a few rule-wrappers I've used to simplify the grammar. These have been inspired slightly by the Higher-Order Perl book: they are not quite higher-order functions, but they do value-add and 'modify' the results of the main rule-specifying function.
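As a taste of what item 4 might look like, here is a hedged sketch of the rule-wrapper idea: plain functions that take a rule spec (LHS plus RHS symbols) and return the original rule together with derived rules - an optional form, a comma-separated list form, and so on. The helper names and the rule format are my own guesses for illustration, not the actual code from the parser:

```perl
use strict;
use warnings;

# Hypothetical rule-wrappers: each takes a Marpa-style rule spec and
# returns extra rules that value-add the original. Names are mine.
sub optional_rule {
    my ($lhs, @rhs) = @_;
    my $opt = $lhs . '_opt';
    return (
        [ $opt => []      ],     # empty alternative ...
        [ $opt => [$lhs]  ],     # ... or the real thing
        [ $lhs => \@rhs   ],
    );
}

sub list_rule {
    my ($lhs, @rhs) = @_;
    my $list = $lhs . '_list';
    return (
        [ $list => [$lhs]                   ],
        [ $list => [$list, 'COMMA', $lhs]   ],   # left-recursive list
        [ $lhs  => \@rhs                    ],
    );
}

# Building a grammar fragment becomes a matter of composing wrappers:
my @rules = ( optional_rule('where_clause', 'WHERE', 'condition'),
              list_rule('column', 'IDENTIFIER') );
print scalar(@rules), " rules\n";   # 6 rules
```

The win is the same one HOP keeps demonstrating: the repetitive shapes (optional X, list of X) get written once, and the grammar proper reads almost like the syntax diagrams it was transcribed from.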
