Lecture: Grammars and Parsing
Dear students:
Over the last two weeks we've been building an interpreter for Vex programs. In particular, we wrote model classes and a lexer. Today we continue building that interpreter: we're going to turn a Vex program into a tree using a parser. Together the lexer and parser form the bulk of your work for milestone 2. Our discussion today is the last on Vex and designing our own language.
Grammar
The lexer returns a flat list of program pieces, of tokens. Our goal is to build a tree out of the pieces. The machine that turns a list into a tree is the parser. Grammar rules tell the parser how to shape the tree. The grammar encodes the precedence and the associativity of the operators, so we have to carefully write the rules. All valid programs may be generated from this one set of rules.
Programming language grammars start with a rule that describes the structure an entire program. The starting rule often looks like this:
program = statement* EOF
program = statement* EOF
Which of these is a terminal? Which is a non-terminal? EOF
is a fake token that the lexer inserts to make a visible marker of the program's end.
Each non-terminal must have its own rules or productions describing its possible structures. The example program shows print statements and assignment statements, so statement
might have these productions:
statement = IDENTIFIER EQUALS level0
| PRINT LEFT_PARENTHESIS level0 RIGHT_PARENTHESIS
statement = IDENTIFIER EQUALS level0 | PRINT LEFT_PARENTHESIS level0 RIGHT_PARENTHESIS
These productions introduce a new non-terminal: level0
. In this class, any non-terminal that starts with level
describes an expression, something that produces a value. An expression can be a literal or an operation like addition or multiplication. Each level has a precedence. We'll arrange them in what I like to think of as the Ladder of Predecence. The Ladder has three rules:
-
You enter the ladder at the top. If you need to parse an arbitrary expression, you invoke the
level0
rule, always. - At each rung, you can parse a substructure at the current rung (recurse) or drop to the next rung (descend).
- Only at the bottom rung can you parse a substructure from back at the top of the ladder.
At the bottom is levelN
, with the highest precedence, which has these productions:
levelN = LEFT_BRACKET INTEGER COMMA INTEGER RIGHT_BRACKET
| INTEGER
| IDENTIFIER
levelN = LEFT_BRACKET INTEGER COMMA INTEGER RIGHT_BRACKET | INTEGER | IDENTIFIER
Not only do these expressions have really high precedence, but they also don't appear directly next to each other. We don't see 5 2
or v 7
. They are non-associative. I've tried out various alternate names for these, like island
or unit
or atom
. I'm sticking with levelN
for now.
The next rung up has the second highest precedence. Let's say that multiplication (*
) occupies this level. What is the structure of one of these expressions? Using our first two rules, we might write this as our first draft:
level1 = levelN * levelN
| levelN
level1 = levelN * levelN | levelN
These productions don't quite describe the structures we want to express. What if we want to multiply three vectors? Something like a * b * c
? That's not legal currently. We fix the problem by allowing rules to be recursive.
In tree form, what should a * b * c
look like? If we want a
and b
to be multiplied first, the tree should look this:
*
/ \
* c
/ \
a b
* / \ * c / \ a b
This picture shows that we need to allow recursion in the left operand:
level1 = level1 * levelN
| levelN
level1 = level1 * levelN | levelN
If we allow both operands to be recursive, then we'd have an ambiguity in our grammar. We could interpret the expression as either (a * b) * c
or a * (b * c)
. Ambiguities aren't good. Different compilers will produce different trees.
The rule for level0
has a similar structure, but it sits atop the level1
expressions:
level0 = level0 + level1
| level1
level0 = level0 + level1 | level1
On a ladder, we drop from one rung only to the one right below. We follow the same principle in our grammar. The productions either stay at their level or drop one level below to a higher precedence level.
We generally don't climb back up the Ladder, because that would allow a lower precedence operation to sneak in under a higher-precedence operation. However, we make an exception when there's a syntactic boundary that safely resets the precedence. These boundaries generally appear in levelN
. For example, levelN
jumps back up to level0
on parentheses because the parentheses let the precedence to start over:
levelN = ...
| ( level0 )
levelN = ... | ( level0 )
This grammar might seem like just an inconsequential thinking exercise. Where's the code? But it's important. This is how programming languages are specified. The stack of nonterminals determines operator precedence. The placement of recursion within a production determines operator associativity. Grammars are included in a language's technical documentation. This grammar very nearly writes the parsing code for us.
Parser
We write a grammar to inform our implementation of a recursive descent parser, just as state machines informed our implementation of a lexer. Each non-terminal of a grammar becomes a method in our parser. I like to break off implementing a parser one rung at a time, starting at the bottom. But before we parse levelN
, let's have a look at our utility methods:
- a constructor that receives the tokens
-
has(type)
, which looks to see if the next token has a particular type -
advance
, which moves ahead to the next token but returns the one -
parse
, which runs through all the tokens and recursively builds a tree
Once we've implemented these universal parsing methods, then we are ready to implement the methods for our particular grammar's non-terminals. One production of levelN
translates into this:
class Parser
# ...
def levelN
if has(:integer)
token = advance
Ast::Integer.new(token.text.to_i)
# ...
else
raise "Unexpected token: #{@tokens[@i]}"
end
end
end
class Parser # ... def levelN if has(:integer) token = advance Ast::Integer.new(token.text.to_i) # ... else raise "Unexpected token: #{@tokens[@i]}" end end end
While we're testing, we'll have the parse
method call whatever method we're testing. Eventually we'll have it point at our starting non-terminal program
. At this point, our parser can only turn integer literals into trees. Once we've fleshed out levelN
, we'll implement level1
so it reflects the grammar. It has two productions. How do we choose between them? That seems a little tricky. Let's focus on just the first production. Our first draft might look like this:
def level1
left = level1
advance # eat *
right = levelN
Ast::Multiply.new(left, right)
end
def level1 left = level1 advance # eat * right = levelN Ast::Multiply.new(left, right) end
This code has a problem. The very first statement is a recursive call to the method we are trying to write. Our program will quickly reach a stack overflow. How can we fix this? Well, eventually, that recursion is going to bottom out to a call to levelN
. Let's make that call first:
def level1
left = levelN
advance # eat *
right = levelN
Ast::Multiply.new(left, right)
end
def level1 left = levelN advance # eat * right = levelN Ast::Multiply.new(left, right) end
This code also has a problem. Since we lost the recursion, we can't parse more than one multiplication. We replace the recursion with iteration that builds up the root of the entire sequence of multiplies in left
:
def level1
left = levelN
while has(:asterisk)
advance
right = levelN
left = Ast::Multiply.new(left, right)
end
left
end
def level1 left = levelN while has(:asterisk) advance right = levelN left = Ast::Multiply.new(left, right) end left end
We craft level0
, statement
, and program
in a similar manner.
TODO
Here's your list of things to do before we meet next:
See you next time.
Sincerely,
P.S. It's time for a haiku!