Lecture: Parsing
Dear students:
Over the last two weeks we've been building an interpreter for a fraction language. In particular, we wrote model classes and a lexer. Today we continue building that interpreter: we're going to turn a fraction program into a tree using a parser. Together the lexer and parser form the bulk of your work for milestone 2.
Grammar
Last week we explored how to build a grammar. Our grammar in where a language's precedence and associativity are born. The key idea is to build a ladder. At the top of the grammar are the least greedy operators, the ones with lowest precedence. At the bottom are the highest precedence operators. The very bottom rule describes our primitives. Within a rule, we have just two choices of how to expand our non-terminal operands: we recurse on the same level or we descend just one rung to the next level. If we recurse on the left, we have a left-associative operator. If we recurse on the right, we have a right-assocative operator.
We end up with a grammar that looks like this for our fraction language:
level0 = level0 + level1
| level0 - level1
| level1
level1 = level1 * levelN
| levelN
levelN = INTEGER / INTEGER
| INTEGER
level0 = level0 + level1
| level0 - level1
| level1
level1 = level1 * levelN
| levelN
levelN = INTEGER / INTEGER
| INTEGERWe don't have nodes for all of these operations yet, so we'll need to add them. While we're at it, we'll put all the nodes in their own namespace to avoid name collisions.
Once we get a parser going for this much of our language, we'll expand it to include a few more things.
Parser
We write a grammar to inform our implementation of a recursive descent parser, just as state machines informed our implementation of a lexer. Each non-terminal of a grammar becomes a method in our parser. I like to break off implementing a parser one rung at a time, starting at the bottom. But before we parse levelN, let's have a look at our utility methods:
- a constructor that receives the tokens
-
has(type), which looks to see if the next token has a particular type -
advance, which moves ahead to the next token but returns the one -
parse, which runs through all the tokens and recursively builds a tree
Once we've implemented these universal parsing methods, then we are ready to implement the methods for our particular grammar's non-terminals. Our levelN translates into this:
class Parser
# ...
def levelN
if has(:integer)
top_token = advance
if has(:forward_slash)
raise "Expected /" if not has(:integer)
bottom_token = advance
Ast::Fraction.new(top_token.text.to_i, bottom_token.text.to_i)
else
Ast::Integer.new(top_token.text.to_i)
end
# ...
else
raise "Unexpected token: #{@tokens[@i]}"
end
end
end
class Parser
# ...
def levelN
if has(:integer)
top_token = advance
if has(:forward_slash)
raise "Expected /" if not has(:integer)
bottom_token = advance
Ast::Fraction.new(top_token.text.to_i, bottom_token.text.to_i)
else
Ast::Integer.new(top_token.text.to_i)
end
# ...
else
raise "Unexpected token: #{@tokens[@i]}"
end
end
endWhile we're testing, we'll have the parse method call whatever method we're testing. At this point, our parser can only turn primitives into trees. Once we've fleshed out levelN, we'll implement level1 so it reflects the grammar. It has one production. How do we choose between them? That seems a little tricky. Let's focus on just the first production. Our first draft might look like this:
def level1
left = level1
advance # eat *
right = levelN
Ast::Multiply.new(left, right)
end
def level1 left = level1 advance # eat * right = levelN Ast::Multiply.new(left, right) end
This code has a problem. The very first statement is a recursive call to the method we are trying to write. Our program will quickly reach a stack overflow. How can we fix this? Well, eventually, that recursion is going to bottom out to a call to levelN. Let's make that call first:
def level1
left = levelN
advance # eat *
right = levelN
Ast::Multiply.new(left, right)
end
def level1 left = levelN advance # eat * right = levelN Ast::Multiply.new(left, right) end
This code also has a problem. Since we lost the recursion, we can't parse more than one multiplication. We replace the recursion with iteration that builds up the root of the entire sequence of multiplies in left:
def level1
left = levelN
while has(:asterisk)
advance
right = levelN
left = Ast::Multiply.new(left, right)
end
left
end
def level1
left = levelN
while has(:asterisk)
advance
right = levelN
left = Ast::Multiply.new(left, right)
end
left
endWe craft level0 in a similar manner. We only have to turn recursion into iteration for left-associative operators. Sometimes we see grammar rules written explicitly with recursion, like this:
level0 = level1 (+ level1)*
level0 = level1 (+ level1)*
The advantage of this form is that it translates directly into the code that we ultimately wrote. The disadvantage is that the associativity is not baked into the grammar. It's not clear how the many additions should be grouped. That's why I and Treeformer use the original BNF. Quantifiers like * appeared in EBNF.
Right-associative operators can be implemented with recursion. Because they recurse on the right, the recursion won't be infinite. Suppose we had this grammar rule:
level5 = level6 : level5
| level6
level5 = level6 : level5
| level6We'd write this recursive code to parse it:
def level5
left = level6
if has(:colon)
right = level5
left = new Colonize(left, right)
end
left
end
def level5
left = level6
if has(:colon)
right = level5
left = new Colonize(left, right)
end
left
endPrint and Program
So far we've only been able to match a single expression. Let's grow our language to allow multiple print statements.
program = statement program
| EOF
statement = PRINT level0
program = statement program
| EOF
statement = PRINT level0To support this, we'll need to add some new tokens to our lexer. Linebreaks can be matched like any character. We'll all add a rule to match all identifiers. Then we'll see if the identifier is a keyword like print. After the lexing is finished, we'll append a fake EOF token.
We'll add a Program and Print node to our AST hierarchy. That also means the visitors will get some new visit_ methods.
Then we add methods to our parser. If we have extra time, we'll add variables.
TODO
Here's your list of things to do before we meet next:
See you next time.
Sincerely,
P.S. It's time for a haiku!