Lecture: Parsing
Dear students:
Over the last two weeks we've been building an interpreter for a fraction language. In particular, we wrote model classes and a lexer. Today we continue building that interpreter: we're going to turn a fraction program into a tree using a parser. Together the lexer and parser form the bulk of your work for milestone 2.
Grammar
Last week we explored how to build a grammar that honors precedence and associativity. The key idea is to build a ladder. At the top of the grammar are the least greedy operators, the ones with lowest precedence. At the bottom are the highest precedence operators. The very bottom rule describes our primitives. Within a rule, we have just two choices of how to expand our subtracts: we recurse on the same level or we descend just one rung to the next level. If we recurse on the left, we have a left-associative operator. If we recurse on the right, we have a right-assocative operator.
We end up with a grammar that looks like this for our fraction language:
level0 = level0 + level1
| level0 - level1
| level1
level1 = level1 * level2
| level1 / level2
| level2
levelN = INTEGER / INTEGER
level0 = level0 + level1 | level0 - level1 | level1 level1 = level1 * level2 | level1 / level2 | level2 levelN = INTEGER / INTEGER
We don't have nodes for all of these operations yet, so we'll need to add them. While we're at it, we'll put all the nodes in their own namespace to avoid name collisions.
Once we get a parser going for this much of our language, we'll expand it to include a few more things.
Parser
We write a grammar to inform our implementation of a recursive descent parser, just as state machines informed our implementation of a lexer. Each non-terminal of a grammar becomes a method in our parser. I like to break off implementing a parser one rung at a time, starting at the bottom. But before we parse levelN
, let's have a look at our utility methods:
- a constructor that receives the tokens
-
has(type)
, which looks to see if the next token has a particular type -
advance
, which moves ahead to the next token but returns the one -
parse
, which runs through all the tokens and recursively builds a tree
Once we've implemented these universal parsing methods, then we are ready to implement the methods for our particular grammar's non-terminals. Our levelN
translates into this:
class Parser
# ...
def levelN
if has(:integer)
top_token = advance
raise "Expected /" if not has(:forward_slash)
bottom_token = advance
Ast::Fraction.new(top_token.text.to_i, bottom_token.text.to_i)
# ...
else
raise "Unexpected token: #{@tokens[@i]}"
end
end
end
class Parser # ... def levelN if has(:integer) top_token = advance raise "Expected /" if not has(:forward_slash) bottom_token = advance Ast::Fraction.new(top_token.text.to_i, bottom_token.text.to_i) # ... else raise "Unexpected token: #{@tokens[@i]}" end end end
While we're testing, we'll have the parse
method call whatever method we're testing. At this point, our parser can only turn fraction primitives into trees. Once we've fleshed out levelN
, we'll implement level1
so it reflects the grammar. It has two productions. How do we choose between them? That seems a little tricky. Let's focus on just the first production. Our first draft might look like this:
def level1
left = level1
advance # eat *
right = levelN
Ast::Multiply.new(left, right)
end
def level1 left = level1 advance # eat * right = levelN Ast::Multiply.new(left, right) end
This code has a problem. The very first statement is a recursive call to the method we are trying to write. Our program will quickly reach a stack overflow. How can we fix this? Well, eventually, that recursion is going to bottom out to a call to levelN
. Let's make that call first:
def level1
left = levelN
advance # eat *
right = levelN
Ast::Multiply.new(left, right)
end
def level1 left = levelN advance # eat * right = levelN Ast::Multiply.new(left, right) end
This code also has a problem. Since we lost the recursion, we can't parse more than one multiplication. We replace the recursion with iteration that builds up the root of the entire sequence of multiplies in left
:
def level1
left = levelN
while has(:asterisk)
advance
right = levelN
left = Ast::Multiply.new(left, right)
end
left
end
def level1 left = levelN while has(:asterisk) advance right = levelN left = Ast::Multiply.new(left, right) end left end
We craft level0
in a similar manner. We only have to turn recursion into iteration for left-associative operators. Right-associative operators can be implemented with recursion. Because they recurse on the right, the recursion won't be infinite.
Print and Program
So far we've only been able to match a single expression. Let's grow our language to allow multiple print statements.
program = (statement LINEBREAK)* EOF
statement = PRINT level0
program = (statement LINEBREAK)* EOF statement = PRINT level0
To support this, we'll need to add some new tokens to our lexer. Linebreaks can be matched like any character. We'll all add a rule to match all identifiers. Then we'll see if the identifier is a keyword like print
. After the lexing is finished, we'll append a fake EOF
token.
We'll add a Program
and Print
node to our AST hierarchy. That also means the visitors will get some new visit_
methods.
Then we add methods to our parser.
Typechecking
Right now our language has only one type: fraction. Let's add in booleans. They are primitives and go in levelN
of our grammar:
levelN = INTEGER / INTEGER
| BOOLEAN
levelN = INTEGER / INTEGER | BOOLEAN
We add nodes and parsing logic for these. But something bigger also emerges: the possibility of type errors. Someone might try to add booleans. We can't let this happen. When and where do check for type errors? It depends on the language. If we're C or Java, we check right after we build the tree but before we execute it. If we're Ruby or Python, we check during execution. That's what we'll do with our fraction language.
In the evaluator, we add logic like this:
def visit_add(node)
left_primitive = node.left_node.visit(self)
raise "+ expects fractions" if !left_primitive.is_a?(Ast::Fraction)
# ...
end
def visit_add(node) left_primitive = node.left_node.visit(self) raise "+ expects fractions" if !left_primitive.is_a?(Ast::Fraction) # ... end
From here we could keep growing our language. We could add parentheses. We could add variables. But that's a job for milestones 1 and 2.
TODO
Here's your list of things to do before we meet next:
See you next time.
Sincerely,
P.S. It's time for a haiku!