Parsing Expressions

If we design the syntax for expressions in a particular way:

     <expr> = number | variable | <dyadic>
     <dyadic> = "(" <expr> operator <expr> ")"

it becomes very easy to write a parser. Each of the three possible forms of an expression begins with a different kind of token (number, identifier, or open-parenthesis), so all you have to do is ask the lexical analyser for the next token, and you know exactly what is coming.

Unfortunately that syntax design is unacceptably restrictive for most purposes. Everybody expects to be able to write things like:

     print 2*x+3*y;
     print 2*(x+3)*y;
     y=y+1;

but we are forced to write:

     print ((2*x)+(3*y));
     print ((2*(x+3))*y);
     y=(y+1);

It is still possible to do everything, but too annoying, and hard to read the results.

Clearly we should be able to make parentheses optional. People can use them if they want to, or if they really are needed, but in other cases the natural priorities of the operators should suffice.

The trick comes in two parts. First we'll see how to dispense with parentheses when they aren't needed, and have operations work strictly from left to right (i.e. ignoring priorities).

The grammar is slightly redefined to make things more understandable:

     <expr0>     = number | variable | <bracketed>
     <bracketed> = "(" <expr> ")"
     <expr>      = <expr0> ( operator <expr0> )*

(remember that * represents repetition any number of times, including no times at all).

Note that expr0 represents the basic ("atomic" in a way) kinds of expressions. They are impervious to any future additions needed to handle priorities. Simple numbers, variables, and things inside parentheses are above all operators' priorities.

Note also that a <bracketed> is exceptionally easy to parse: call the lexical analyser and demand an "(", then call the parser for a general <expr>, then call the lexical analyser and demand a ")", and that's it. An <expr0> is also very easy to parse: each of its three possibilities begins with a different token.

The way to parse the main <expr>s is the only place where anything special is required: the repetition represented by the * in the grammar is matched by a loop in the parser. First there must be an <expr0>, so use the already-written parser for that. Then, so long as the next token is an operator (ask the lexical analyser), read that operator and the <expr0> that must follow it.

The parsing function should look something like this: (note how exactly it matches the descitpion above)

     ... parse_expr(void)
     { tree *left = parse_expr0();
       get next token
       while (the token represents an operator)
       { op = that operator;
         tree *right = parse_expr0();
         left = new dyadic_expr_node(left, op, right); }
       unread the last token;
       return left; }

You should be able to get a good picture of exactly how and why it works by following through the execution when the input is something like 4-3-2-1.

Note one important thing: when the input has more than one operator in it, such as 4-2-1, the tree is built in such a way that forces evaluation to be left-to-right. The tree we get from 4-2-1 looks like this:


               +---------------+
 ------------->|  DYADIC_EXPR  |
               +---------------+
               |   OP_MINUS    |
               +---------------+
               |   @   |   @   |
               +---+-------+---+
                   |       |                    +-------------+
                   |       +------------------->|  CONST_INT  |
                   |                            +-------------+
                   |                            |      1      |
                   |                            +-------------+
                   |   +---------------+
                   +-->|  DYADIC_EXPR  |
                       +---------------+
                       |   OP_MINUS    |
                       +---------------+
                       |   @   |   @   |
                       +---+-------+---+
                           |       | 
                           |       | 
                           |       | 
     +-------------+       |       |       +-------------+
     |  CONST_INT  |<------+       +------>|  CONST_INT  |
     +-------------+                       +-------------+
     |      4      |                       |      2      |
     +-------------+                       +-------------+

With that structure it is hard to imagine any kind of evaluator or interpreter that could possibly subtract the 1 from the 2, then subtract the result from 4. It is very clear that (4-2) exists as a self-contained subexpression, whereas (2-1) has no existence at all.

Now we just need to allow natural priorities of operators to come into play, and everything is done.

Another slight change to the grammar is needed:

     <expr0>      = number | variable | <bracketed>
     <bracketed>  = "(" <expr> ")"
     <expr1>      = <expr0> ( <mul-or-div> <expr0> )*
     <mul-or-div> = "*" | "/"

This is a trivial change. All that has happened is that something called <expr1> is being defined instead of <expr>, and the choice of operators has been restricted to multiplication and division. Everything else remains exactly the same, so converting the parsing functiona to this new scheme is bound to be a trivial, almost zero-time, operation.

So, calling the parsing function parse_expr1() will read any sequence of operands (i.e. expr0s) separated by high priority operators, and return a parse tree that correctly forces left-to-right evaluation.

Now we extend the grammar by adding just two simple lines (no changes, just an addition):

     <expr0>      = number | variable | <bracketed>
     <bracketed>  = "(" <expr> ")"
     <expr1>      = <expr0> ( <mul-or-div> <expr0> )*
     <mul-or-div> = "*" | "/"
     <expr2>      = <expr1> ( <add-or-sub> <expr1> )*
     <add-or-sub> = "+" | "-"

Notice the very strong similarities: the relationship between expr2 and expr1 is exactly the same as the relationship between expr1 and expr0. So, if an expr1 is a sequence of expr0s separated by * and / operators, then an expr2 will be a sequence of expr1s separated by + and - operators.

The parse-tree produced for an expr2 may contain any of the four operators +, -, *, and /, but * and / may only occur in expr1s, sub-expressions of the expr2s that contain the + and - operations. Meaning that the high priority operators have been forced to appear lower down in the tree than the low priority operators, and will therefore have to be evaulated first.

The parser for an expr2 is exactly the same as the parser for an expr1, but with suitable substitutions made: "expr0" becomes "expr1", "*" becomes "+", and "/" becomes "-".

A third level of priorities may be added simply by adding another layer to the grammar. There is no limit to the number of levels of priority that can be supported:

     <expr0>      = number | variable | <bracketed>
     <bracketed>  = "(" <expr> ")"
     <expr1>      = <expr0> ( <mul-or-div> <expr0> )*
     <mul-or-div> = "*" | "/" | "%"
     <expr2>      = <expr1> ( <add-or-sub> <expr1> )*
     <add-or-sub> = "+" | "-"
     <expr3>      = <expr2> ( <relational> <expr2> )*
     <relational> = "<" | ">" | "<=" | ">=" | "==" | "!="
     <expr4>      = <expr3> ( && <expr3> )*
     <expr5>      = <expr4> ( || <expr4> )*
     <expr6>      = <expr5> ( <update> <expr5> )*
     <update>     = "=" | "+=" | "-=" | "*=" | "/=" | "%=" | "&&=" | "||="
     <expr>       = <expr6>

And that really is all it takes.