Parsing Expressions
If we design the syntax for expressions in a particular way:
<expr> = number | variable | <dyadic>
<dyadic> = "(" <expr> operator <expr> ")"
it becomes very easy to write a parser. Each of the three possible
forms of an expression begins with a different kind of token
(number, identifier, or open-parenthesis), so all you have to
do is ask the lexical analyser for the next token, and you
know exactly what is coming.
Unfortunately that syntax design is unacceptably restrictive
for most purposes. Everybody expects to be able to write things
like: print 2*x+3*y;
print 2*(x+3)*y;
y=y+1;
but we are forced to write:
print ((2*x)+(3*y));
print ((2*(x+3))*y);
y=(y+1);
It is still possible to do everything, but too annoying, and hard
to read the results.
Clearly we should be able to make parentheses optional. People can
use them if they want to, or if they really are needed, but in other
cases the natural priorities of the operators should suffice.
The trick comes in two parts. First we'll see how to dispense
with parentheses when they aren't needed, and have operations
work strictly from left to right (i.e. ignoring priorities).
The grammar is slightly redefined to make things more understandable:
<expr0> = number | variable | <bracketed>
<bracketed> = "(" <expr> ")"
<expr> = <expr0> ( operator <expr0> )*
(remember that * represents repetition any number of times, including
no times at all).
Note that expr0 represents the basic ("atomic" in a way) kinds of
expressions. They are impervious to any future additions needed to
handle priorities. Simple numbers, variables, and things inside
parentheses are above all operators' priorities.
Note also that a <bracketed> is exceptionally easy to parse:
call the lexical analyser and demand an "(", then call the parser
for a general <expr>, then call the lexical analyser
and demand a ")", and that's it. An <expr0> is also very
easy to parse: each of its three possibilities begins with a different
token.
The way to parse the main <expr>s is the only place where
anything special is required: the repetition represented by the *
in the grammar is matched by a loop in the parser. First there
must be an <expr0>, so use the already-written parser for
that. Then, so long as the next token is an operator (ask the
lexical analyser), read that operator and the <expr0> that
must follow it.
The parsing function should look something like this: (note how exactly
it matches the descitpion above)
... parse_expr(void)
{ tree *left = parse_expr0();
get next token
while (the token represents an operator)
{ op = that operator;
tree *right = parse_expr0();
left = new dyadic_expr_node(left, op, right); }
unread the last token;
return left; }
You should be able to get a good picture of exactly how and why it
works by following through the execution when the input is something
like 4-3-2-1.
Note one important thing: when the input has more than one operator in
it, such as 4-2-1, the tree is built in such a way that forces
evaluation to be left-to-right. The tree we get from 4-2-1 looks like
this:
+---------------+
------------->| DYADIC_EXPR |
+---------------+
| OP_MINUS |
+---------------+
| @ | @ |
+---+-------+---+
| | +-------------+
| +------------------->| CONST_INT |
| +-------------+
| | 1 |
| +-------------+
| +---------------+
+-->| DYADIC_EXPR |
+---------------+
| OP_MINUS |
+---------------+
| @ | @ |
+---+-------+---+
| |
| |
| |
+-------------+ | | +-------------+
| CONST_INT |<------+ +------>| CONST_INT |
+-------------+ +-------------+
| 4 | | 2 |
+-------------+ +-------------+
With that structure it is hard to imagine any kind of evaluator
or interpreter that could possibly subtract the 1 from the 2, then
subtract the result from 4. It is very clear that (4-2) exists as
a self-contained subexpression, whereas (2-1) has no existence at all.
Now we just need to allow natural priorities of operators to come
into play, and everything is done.
Another slight change to the grammar is needed:
<expr0> = number | variable | <bracketed>
<bracketed> = "(" <expr> ")"
<expr1> = <expr0> ( <mul-or-div> <expr0> )*
<mul-or-div> = "*" | "/"
This is a trivial change. All that has happened is that something called
<expr1> is being defined instead of <expr>, and the choice
of operators has been restricted to multiplication and division. Everything
else remains exactly the same, so converting the parsing functiona
to this new scheme is bound to be a trivial, almost zero-time, operation.
So, calling the parsing function parse_expr1() will read any sequence of operands
(i.e. expr0s)
separated by high priority operators, and return a parse tree that correctly
forces left-to-right evaluation.
Now we extend the grammar by adding just two simple lines (no changes, just an addition):
<expr0> = number | variable | <bracketed>
<bracketed> = "(" <expr> ")"
<expr1> = <expr0> ( <mul-or-div> <expr0> )*
<mul-or-div> = "*" | "/"
<expr2> = <expr1> ( <add-or-sub> <expr1> )*
<add-or-sub> = "+" | "-"
Notice the very strong similarities: the relationship between expr2 and expr1 is
exactly the same as the relationship between expr1 and expr0. So, if an expr1
is a sequence of expr0s separated by * and / operators, then an expr2 will be
a sequence of expr1s separated by + and - operators.
The parse-tree produced for an expr2 may contain any of the four operators
+, -, *, and /, but * and / may only occur in expr1s, sub-expressions of the
expr2s that contain the + and - operations. Meaning that the high priority
operators have been forced to appear lower down in the tree than the low
priority operators, and will therefore have to be evaulated first.
The parser for an expr2 is exactly the same as the parser for an expr1, but
with suitable substitutions made: "expr0" becomes "expr1", "*" becomes "+",
and "/" becomes "-".
A third level of priorities may be added simply by adding another layer to
the grammar. There is no limit to the number of levels of priority that can
be supported:
<expr0> = number | variable | <bracketed>
<bracketed> = "(" <expr> ")"
<expr1> = <expr0> ( <mul-or-div> <expr0> )*
<mul-or-div> = "*" | "/" | "%"
<expr2> = <expr1> ( <add-or-sub> <expr1> )*
<add-or-sub> = "+" | "-"
<expr3> = <expr2> ( <relational> <expr2> )*
<relational> = "<" | ">" | "<=" | ">=" | "==" | "!="
<expr4> = <expr3> ( && <expr3> )*
<expr5> = <expr4> ( || <expr4> )*
<expr6> = <expr5> ( <update> <expr5> )*
<update> = "=" | "+=" | "-=" | "*=" | "/=" | "%=" | "&&=" | "||="
<expr> = <expr6>
And that really is all it takes.