Turns out that constructing a hand-written C-style parser has a few parts that were very difficult for me.

The first thing was realizing that Backus Naur Form (BNF) largely sucks if you want to hand-write your own parser. BNF is really verbose and expressing simple things like optional terminals or lists is difficult. BNF is also poor for expressing operator precedence, as many intermediate and redundant non-terminals are required to be evaluated during parse-tree derivation. As an alternative Extended Backus Naur Form is perfect for languages that plan to use hand-written parsers instead of parsers created by parser generators. Left-factoring a BNF for LL parsing is also not very useful since handling infinite recursion with hand-written code is trivial.

The second thing is that parsing expressions with various types of operators can be really difficult, especially if there’s a lack of confidence in recursion (like myself). Creating a parse tree given a string representing an expression is a very recursive problem.

In C expressions consist of *atoms* and operators. An atom would be a literal, constant value, identifier, or an expression wrapped in parentheses. Operators are the usual + or – kind of tokens.

If each operator has an associated precedence value there are a few different algorithms out there with references for learning. I ended up face in the dirt and in the end derived what is known as “precedence climbing“. According to Eli Bendersky precedence climbing is what is currently used by Clang to parse C++ expressions! That ought to instill some perceived merit. From what I can tell Lua 5.3 uses (well, very close to) the same method.

The idea of precedence climbing is to think of two major recursive operations:

- Compute righthand-side node
- Make a binary operator node and connect lefthand-side and righthand-side children nodes

The first point is the complex one (that is, conceptually complex). The algorithm starts given a lefthand-side node, however, righthand-side nodes do not come in through the input stream in tree format; the next token can represent a node that should be much deeper in the tree — this means that computing the righthand-side node ought to be the main recursive path.

Realizing that the righthand-side node computation is the recursive path led me to notice a key observation that tipped me off to a working algorithm.

Say we have the following input string as an expression: A 2 B 1 C 4 D 3 E 7 F

Numbers are operators, and the number itself is precedence (higher number is higher precedence), letters are atoms (like a const int variable). Here’s the valid parse tree:

The lowest leaves are evaluated first. It’s easy to see that the tree itself encodes the operator precedence.

If we begin parsing our input string the first atom is A, which will always be a lefthand-side node for most any parsing algorithm used, and will likely be the left-most node in the tree. The next token is the 2 operator followed by B. It’s easy enough to construct the subtree of node 2 pointing to A and B.

The next input token is the operator 1 and atom C. C is bound by operator precedence to the operator 4, though the current state of the algorithm has yet to even read in the token 4. Studying this scenario is what tipped me off to a working solution; C must be treated as a lefthand-side node, though at the current state is considered a potential righthand-side node.

Wikipedia, and this link from earlier, both show great pseudo code for the precedence climbing algorithm. The main difference between the two links is wikipedia includes a nested for-loop in favor of less overall recursive calls. My own code ended up looking something like after I cleaned it up from influences of previous links:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 |
ast_expr* sub_expr( ast_expr* lhs, int min_prec ) { while ( 1 ) { binop op = tk_to_binop( token ); int prec = BINARY_PRECEDENCE[ op ]; if ( prec < min_prec ) { break; } next( ); ast_expr* rhs = atom( ); int next_prec; while ( 1 ) { binop next_op = tk_to_binop( token ); next_prec = BINARY_PRECEDENCE[ next_op ]; if ( next_prec <= prec ) { break; } rhs = sub_expr( rhs, next_prec ); } ast_expr* op_node = make_binop( op ); op_node->child = lhs; lhs->next = rhs; lhs = op_node; } return lhs; } |

In the end I’m quite happy with the result, and even hooked up a nice ascii-tree printer courtesy of a random stack-overflow user. Here’s a dot product and initialization trees in ascii:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
.-(=)-. (x) (2) .-(=)-. (y) (4) .-(=)-. (z) (6) .--(=)----------------------. (dot) .----------(+)----. .----(+)----. .-(*)-. .-(*)-. .-(*)-. (z) (z) (x) (x) (y) (y) |

My favorite part about the operator precedence climbing algorithm is how it handles parentheses and prefix unary operators: parentheses can be considered an atom, and when the atom function finds a parentheses is just calls the *expression* parsing function directly and returns the result! The same can be done for prefix unary operators (if they have really high precedence). The algorithm also trivially handles right-associativity. I haven’t yet thought about unary postfix operators, so if any reader has thoughts on this topic please do comment!

Here’s psuedo-y *atom* code:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 |
ast_expr* expr( ) { return sub_expr( atom( ), 0 ); } ast_expr* atom( ) { ast_expr* exp; switch ( token ) { // parse single token expressions like // ints, floats, id, literals (string) // etc. here case '(': next( ); exp = expr( ); match( ')' ); break; default: exp = NULL; syntax_error( ); break; } return exp; } |