Lecture: Visitor, Lexer, Grammar
Dear students:
In this class, we continue to learn about programming languages in two ways: we're building an interpreter for a small language, and we're building it in Ruby, a language that few of you have used before. The first step of working with programs in an invented language is to translate the program into a tree structure in a language we already know. The second step is to operate on that tree. We've been thinking about this process in reverse. Last week we started modeling the tree and operating on it. We'll continue expanding the model, but today we want to build the tools that turn source code into a tree.
What we talk about today is meant to mirror what you will need to do for milestones 1 and 2 in the spreadsheet project. You should take copious notes because everything we do today is on purpose.
Visiting
Last class we started writing a language for performing fraction arithmetic. We ended with only two different kinds of nodes in our tree representation: fractions and addition operations. Each has an evaluate method. There may be other operations we'd like to perform on ASTs, like these:
- execute the program
- reformat the code
- translate to a different language
- check the types
- rename a variable in a particular scope
- generate documentation
- lint the code
- generate machine code
In fact, let's translate our AST to Ruby real quick. We'll add methods:
class Fraction
# ...
def translate
"Rational(#{@top}, #{@bottom})"
end
end
class Add
# ...
def translate
"#{@leftNode.translate} + #{@rightNode.translate}"
end
end
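To see the translation in action, here is a runnable sketch. The constructors are assumptions, since the lecture hasn't shown them; only the translate methods come from above.

```ruby
# Hypothetical constructors added so the sketch runs on its own;
# the translate methods match the ones defined in the lecture.
class Fraction
  def initialize(top, bottom)
    @top = top
    @bottom = bottom
  end

  def translate
    "Rational(#{@top}, #{@bottom})"
  end
end

class Add
  def initialize(left_node, right_node)
    @leftNode = left_node
    @rightNode = right_node
  end

  def translate
    "#{@leftNode.translate} + #{@rightNode.translate}"
  end
end

# Build the tree for 1/4 + 5/6 and translate it to Ruby source.
tree = Add.new(Fraction.new(1, 4), Fraction.new(5, 6))
puts tree.translate  # => Rational(1, 4) + Rational(5, 6)
```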
Okay, these weren't too bad. But let's look further ahead. By the end, our language might have 30 different node types, and maybe 8 different tree operations we want to perform. This approach makes for some very scattered logic. If there are \(n\) classes in our hierarchy, each class holds \(\frac{1}{n}\) of every tree algorithm. Wouldn't it be nice if we could put all the evaluate logic together? And all the serialization logic together in a separate structure? We can, using a little indirection.
Visitor Pattern
Let's take all the translate methods out of the individual classes and put them into one Translator class that is focused entirely on translation. As we extract the methods, we rename them so they are unique. Since we're not in the classes anymore, the nodes must come in as a parameter. We end up with something like this:
class Translator
def visit_fraction(node)
# ...
end
def visit_add(node)
# ...
end
end
Likewise, all the evaluate methods go in a class named Evaluator:
class Evaluator
def visit_fraction(node)
# ...
end
def visit_add(node)
# ...
end
end
We'll make a class for each tree operation. The classes will be focused and easier to maintain. We will like our code and our smart selves.
We have lost something by moving these methods away from the objects. No longer can we ask a node to go translate itself or evaluate itself. To translate a node now, we have to know its type so we can call the appropriate method.
# The old way was flexible.
# node.evaluate
# The new way is not flexible.
evaluator = Evaluator.new
evaluator.visit_add(node)
However, we are not done. All our tree processors have the same interface. They visit nodes. They are visitors. We can give each object exactly one method that tells it how to be visited:
class Fraction
# ...
def visit(visitor)
visitor.visit_fraction(self)
end
end
class Add
# ...
def visit(visitor)
visitor.visit_add(self)
end
end
With these bridging methods, we again put the authority back in the hands of the objects themselves. This node knows how to get itself visited:
node.visit(Translator.new)
This way of organizing objects without polluting the hierarchy with fragments of their operations is called the visitor pattern. In some languages where types are checked early, like Java, we would need to make a Visitor interface or abstract class. Not so in Ruby. A visitor visits if it has the right visit methods.
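Putting the pieces together, here is a minimal end-to-end sketch of the pattern. The constructors and attr_readers are assumptions (a visitor outside the class needs reader access to the node's data); the visit methods match the lecture's.

```ruby
# A self-contained sketch of the visitor pattern. Because the
# Translator lives outside the node classes, the nodes expose
# their data through attr_reader (an assumption, not lecture code).
class Fraction
  attr_reader :top, :bottom

  def initialize(top, bottom)
    @top = top
    @bottom = bottom
  end

  def visit(visitor)
    visitor.visit_fraction(self)
  end
end

class Add
  attr_reader :left_node, :right_node

  def initialize(left_node, right_node)
    @left_node = left_node
    @right_node = right_node
  end

  def visit(visitor)
    visitor.visit_add(self)
  end
end

class Translator
  def visit_fraction(node)
    "Rational(#{node.top}, #{node.bottom})"
  end

  def visit_add(node)
    # Recurse by asking each child to get itself visited.
    "#{node.left_node.visit(self)} + #{node.right_node.visit(self)}"
  end
end

tree = Add.new(Fraction.new(1, 4), Fraction.new(5, 6))
puts tree.visit(Translator.new)  # => Rational(1, 4) + Rational(5, 6)
```

Adding a new tree operation now means adding a new visitor class; the node classes never change.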
Object-oriented programming is neat, but it makes objects the organizing principle of our code. Sometimes we want to organize by behavior rather than class. That's when the visitor pattern comes in handy. It lets us factor out all the behaviors into their own class.
Lexing
Recall this fraction program from last time:
1/4 + 5/6
We built a tree representation of this program by hand. Now we want to build a tool that turns text into a tree. The problem is easier if we build two tools, one to break the text into small chunks and another to organize the chunks into structures. Our brain does something similar when we read. We chunk the letters into words, and then organize the words into grammatical structures.
The words of a program are its tokens. What are the tokens of the program above? Here are a few:
- 1
- /
- 4
- +
- 5
- /
- 6
Most of them are just a single character, but that's a coincidence. Operators could be multiple symbols, like >=. We need to build a machine that eats characters and reports when it has digested a token. Such a machine is called a lexer. Let's build one for our language.
Lexers are made out of many state machines. We design a state machine for each type of token in our language. Many machines will have just a single accepting state and a single transition. The integer machine will have two states with a loop.
Ultimately, we need these machines running in code. Here's how we might write a routine for lexing a forward-slash:
if has('/')
capture
emit_token(:forward_slash)
end
Those little machines recognize only one type of token. We need a mega-machine that recognizes all of them, so we glue them together. Once they are glued, we write the entire lexing routine, which might look like this:
if has('/')
  capture
  emit_token(:forward_slash)
elsif has('+')
  capture
  emit_token(:plus)
elsif has_digit
  while has_digit
    capture
  end
  emit_token(:integer)
end
This code belongs in a lexer abstraction. We'll write it together and also define the helper methods like has, has_digit, capture, and emit_token. Once it can lex simple expressions, we'll expand it to support negative integers and other operators.
Grammar
Lexers only know what individual pieces of a program look like. We also need rules that describe the structures of a program. For that, we write a grammar. The grammatic rule for a fraction might look like this:
fraction = INTEGER / INTEGER
In a program, an expression is a structure that yields a value. This is in contrast to a statement, which doesn't produce a value but does achieve some side effect, like printing. We might try writing the grammar of expressions with this rule:
expr = expr + expr
| fraction
There's a problem with this rule, however. It allows 1/4 + 5/6 + 3/8 to be parsed in two different ways. We know that addition is left-associative, so we tweak the rule to allow recursion only in the left operand:
expr = expr + fraction
| fraction
fraction = INTEGER / INTEGER
Suppose now we want to add multiplication to the language. We might try adding a production to expr:
expr = expr + fraction
| expr * fraction
| fraction
fraction = INTEGER / INTEGER
But this lets us parse a + b * c as (a + b) * c. Multiplication is supposed to have higher precedence. We achieve that by making a ladder of rules. Low-precedence operators show up higher in the grammar and therefore higher in the AST.
level0 = level0 + level1
| level1
level1 = level1 * levelN
| levelN
levelN = INTEGER / INTEGER
At each level, we have two possible actions: descend to the next level or recurse on the same level.
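To see the ladder do its work, here is a hypothetical sketch of how it might drive a parser someday. Everything below is an assumption, not lecture code; note that the grammar's left recursion becomes a loop, since a function that immediately calls itself would never terminate. To keep the sketch short, levelN is simplified to a bare token, and the result is a parenthesized string rather than a tree.

```ruby
# Each grammar level becomes a method. Left recursion
# (level0 = level0 + level1) is rewritten as a loop, which
# also makes the operators group to the left.
class Parser
  def initialize(tokens)
    @tokens = tokens
    @i = 0
  end

  # level0 = level0 + level1 | level1
  def level0
    left = level1
    while @tokens[@i] == '+'
      @i += 1
      left = "(#{left} + #{level1})"
    end
    left
  end

  # level1 = level1 * levelN | levelN
  def level1
    left = level_n
    while @tokens[@i] == '*'
      @i += 1
      left = "(#{left} * #{level_n})"
    end
    left
  end

  # levelN: simplified to a single token for this sketch
  def level_n
    token = @tokens[@i]
    @i += 1
    token
  end
end

puts Parser.new(%w[a + b * c]).level0  # => (a + (b * c))
```

Because + lives at level 0 and * at level 1, the multiplication is built first and ends up deeper in the result, exactly the precedence we wanted.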
TODO
Here's your list of things to do before we meet next:
See you next time.
Sincerely,
P.S. It's time for a haiku!