Schizophrenic Parentheses and Lemon Flavored Documentation

When is an open parentheses not an open parentheses? When it is an OPEN parentheses. Confused yet? My parser certainly was.

For regular languages, ambiguity in a statement can usually be cleared up by a little context. You might be able to mix around some prepositional phrases in a fun way to make it sound silly, use a homonym that someone misreads, or place an adverb in just the wrong place, but ultimately a human brain is parsing it. What you were trying to convey will probably come across just fine.

This is not the case with a programming language parser. The “context” is typically one “token”, which can be a single word or number, or could possibly be a large collection of tokens that has been “reduced” into a single thing. It is that reduction that makes it all possible, because it effectively boils down large amounts of context into a single, obvious nugget of data. For example, if you started reading a sentence from somewhere in the middle, and the first word you read was “wind”, it might be enough to know the previous word in order to properly understand it. Being able to discern whether or not it is acting as a verb or a noun might be enough to guess if you’re talking about a watch or a sailboat, but it isn’t as good as knowing that the entire chapter you’ve been reading is about Mahjong. However, if the whole context you received was “wind up”, you’d be lost. You might be able to prioritize watch over sailboat, but you couldn’t be sure it wasn’t somehow talking about an airplane’s lift.

Alright, enough rambling; time for more rambling.

The best and worst part about this whole programming language business up to this point has been the grammar; good ol’ yapParser.y (don’t be fooled by the extension; the format is Lemon, not yacc). It pushes your programmerbrain in directions it most likely hasn’t been before, which typically offers lots of really frustrating lows and doing-a-victory-lap highs. I can’t really explain what it is about building complicated grammar that makes me feel like that, only that the last time I can remember when I felt the same way might have been when I learned about recursion a million years ago. It offers this opportunity for terseness and elegance, and plenty of rope to hang yourself with. I’m sure it has to do with the feeling that I am solving a really hard puzzle.

This is where the dreaded “parsing conflict” error comes in. You sit down to add in a few “minor” features, which might require a tweak of your grammar a little bit. You add in a seemingly innocuous line to your grammar file, only for Lemon to spit back at you “18 parsing conflicts”, which stops you dead in your tracks. You look over the output file (which is meaningless the first time you read it, but so wonderful once you “get it”) and search for “conflict” to find the bad states. You then think about why the parser could possibly be confused when it reads in a comma, or a parentheses, or everything (if you’ve really screwed it up). Sometimes you get the Eureka moment you were hoping for and adjust your grammar accordingly, and sometimes you put the file down and walk away for a few months.

I did the latter.

The grammar tweak I wanted sounded pretty simple. In many places in my grammar, I allowed for a block of code to exist, such as the body of an if or while statement. However, unlike Every Other Language On The Planet, I didn’t allow for a single statement to take the place of a block. This can lead to an abundance of very-unnecessary braces, or angry programmers that are used to being able to do this in all languages other than Perl (postcondition doesn’t count!). I figured I’d add in this by making an intermediate token that was defined as either a statement or a statement_block.

Noooooooope. A billion zillion million parsing conflicts. I studied the output and quickly realized that by doing this, I’ve done the worst possible thing to the parser; created an ambiguity that touched pretty much every rule in the grammar. I ended up commenting out the connection in the code along with a pretty weak TODO to come back to it “later”.

The grammar line I had commented out for months:

// TODO: Reorganize grammar to allow for arbitrary scope creation and single-statement if contents
//statement_block(B) ::= statement(S).
//    { B = yapSyntaxCreateList(YST_STATEMENTLIST, S); }

A sample of the output state with this line enabled (note the conflicts):

State 11:
          statement ::= WHILE * expr_list statement_block
          expr_list ::= * expr_list COMMA expression
          expr_list ::= * expression
     (16) expr_list ::= *
          expression ::= * INT expression
          expression ::= * STRING expression
          expression ::= * NOT expression
          expression ::= * expression PLUS expression
          expression ::= * expression DASH expression
          expression ::= * expression STAR expression
          expression ::= * expression SLASH expression
          expression ::= * expression AND expression
          expression ::= * expression OR expression
          expression ::= * expression MOD LEFTPAREN expr_list RIGHTPAREN
          expression ::= * LEFTPAREN expr_list RIGHTPAREN
          expression ::= * lvalue EQUALS expression
          expression ::= * lvalue INHERITS expression
          expression ::= * lvalue
          expression ::= * INTEGER
          expression ::= * LITERALSTRING
          expression ::= * NULL
          expression ::= * FUNCTION LEFTPAREN ident_list RIGHTPAREN statement_block
          lvalue ::= * lvalue_indexable
          lvalue ::= * VAR IDENTIFIER
          lvalue_indexable ::= * lvalue_indexable LEFTPAREN expr_list RIGHTPAREN
          lvalue_indexable ::= * lvalue_indexable OPENBRACKET expression CLOSEBRACKET
          lvalue_indexable ::= * lvalue_indexable COLON IDENTIFIER LEFTPAREN expr_list RIGHTPAREN
          lvalue_indexable ::= * lvalue_indexable PERIOD IDENTIFIER
          lvalue_indexable ::= * IDENTIFIER

                     LEFTPAREN shift  12
                     LEFTPAREN reduce 16  ** Parsing conflict **
                           INT shift  31
                           INT reduce 16  ** Parsing conflict **
                        STRING shift  28
                        STRING reduce 16  ** Parsing conflict **
                           NOT shift  23
                           NOT reduce 16  ** Parsing conflict **
                      FUNCTION shift  66
                      FUNCTION reduce 16  ** Parsing conflict **
                    IDENTIFIER shift  94
                    IDENTIFIER reduce 16  ** Parsing conflict **
                       INTEGER shift  95
                       INTEGER reduce 16  ** Parsing conflict **
                 LITERALSTRING shift  79
                 LITERALSTRING reduce 16  ** Parsing conflict **
                          NULL shift  82
                          NULL reduce 16  ** Parsing conflict **
                           VAR shift  72
                           VAR reduce 16  ** Parsing conflict **
                     expr_list shift  2
                    expression shift  39
                        lvalue shift  50
              lvalue_indexable shift  45
                     {default} reduce 16

The reason these kinds of problems are so difficult is that you’ve most likely made a misstep somewhere completely different than the place you’re modifying, and it is coming back to bite you now. In my case, it turns out that the Big Stupid Mistake I was making was that I allowed my expression_list token to be “empty”. This seemed really convenient at the time; the grammar that handled function calls magically took zero arguments peacefully, my empty statements (just a semicolon) magically worked, amongst other things. What I didn’t realize was that I was adding an incredible amount of ambiguity to the parser, and I was just really, really lucky that I hadn’t been burnt too hard up until this point. I mean, how could I expect the parser to function properly if every possible location during the parse might be an empty list of expressions?

This fixed most of my problems, but not all of them. I still had an issue with parentheses. Let’s take a line of code as a starting point for the issue:

var v = someFunc(a, b, c) * 7 * (3 + 4);

There are two parenthesized sets of tokens in that line of code, and they have wildly different meanings. The first parens serve as a packaging for the arguments to a function call, and (in fact) cause the function call to occur in the first place. Removing the first set of parens would just cause that code to cite a reference to a function instead of actually executing it. The second set of parens is merely there to provide grouping and implicitly poke at the order of operations. Other than the fact that the parser orders the nodes in the syntax tree a little differently, the actual nodes themselves are the same.

The issue is with reduction. If you set up your grammar to reduce (3 + 4) into a single token of type “expression_list” or perhaps even “paren_expr_list”, that reduced token might not actually parse into a function call anymore, that is, unless your function call grammar was crafted with this reduction in mind (which mine wasn’t). You end up with a layering of bad grammar on bad grammar, and months away from your codebase to “recuperate”.

The fix ended up being a lucky series of googlings mixed with some more luck, along with the Lemon parser’s %fallback token. I decided that the only way I was going to solve these grammar problems was if I could see an example Lemon grammar of a language that actually had some complexity to it (read: not grammar to make a calculator), and actually looked more like a regular programming language (read: not SQL). Also, the Lemon parser generator is not too wonderful with its documentation, no offense to the authors. It is a wonderful, lean-and-mean piece of code, and there are plenty of comments in the few lemon grammars out there, but there are some features that can only be found out about via word-of-mouth-or-newsgroup-or-irc, and it just so happens that %fallback is one of them.

I saw on the Lua wiki that someone had took a shot at making a Lemon grammar for Lua (Listing 1). As Yap looks like it is going to end up as some bizarre mashup of Lua with hints of Javascript in there, this was quite an exciting read. However, one of the lines in the grammar was this line:

%fallback OPEN '(' .

… and the token OPEN being used later:

prefixexp ::= OPEN exp ')' .

The magic of this rule was completely lost on me, so I did what any sane geek would do, and I googled it. I ended up here, which is just some old sqlite mailing list discussion about parsing ambiguity and the fix being to use the %fallback token. It was then that I had my Eureka moment and could fix things.

What the %fallback token actually does is allow for a safety net / second chance for a token when the parser chokes on it. In the case of the mailing list question, the author wanted to attempt to parse the second token in his rule as a string if the rule failed doing the “right” thing. However, in the Lua grammar example, the fallback token was “OPEN”, which is a new token! The big magic trick here is that his parentheses needed to be two things at once; schizophrenic. Sometimes it just needed to be a boring grouping mechanism, and sometimes it needed to be more than that (such as a function call). Since a fallback action doesn’t occur until a parse failure is imminent, you can actually “fall back” to an alternate name for your own token, and implicitly set precedence on two usages of the same token, thus eliminating ambiguity.

This tool seems to convert Lemon grammar into some other grammar, and explicitly doesn’t support %fallback. The docs state “…then you can manually define an nonterminal that does what the %fallback directive would have done“. This comment suggests that the %fallback mechanism is simply a clever way to perform something that can actually be done by rearranging your grammar’s nonterminals differently, which I completely believe. It made me think of how you can convert recursive algorithms to iterative by just dropping in a stack, but that you might lose some of the “elegance”. It also made me realize that I am probably not experienced enough in the ways of programming language grammar syntax to perform this without %fallback, and that is okay. I have sweet, sweet single statement blocks now, and am blissfully ignorant about all of the parsing conflicts I am to have in the future. I already know I am hosed on the ternary operator in the same way Lua is (and for the same reason), but other than that, I think it might be time to burn through the Yap Language Issues and try to 1.0/release this codebase.

Programming is almost as fun as rambling.

Posted in Blog