Lecture: Visitor, Lexer, Grammar

Dear Computer

Chapter 1: Programming Language

Lecture: Visitor, Lexer, Grammar

Dear students:

In this class, we continue to learn about programming languages in two ways: we're building an interpreter for a small language, and we're building it in Ruby, a language that few of you have used before. The first step of working with programs in an invented language is to translate the program into a tree structure in a language we already know. The second step is to operate on that tree. We've been thinking about this process in reverse. Last week we started modeling the tree and operating on it. We'll continue expanding the model, but today we want to build the tools that turn source code into a tree.

What we talk about today is meant to mirror what you will need to do for milestones 1 and 2 in the spreadsheet project. You should take copious notes because everything we do today is on purpose.

Visiting

Last class we started writing a language for performing fraction arithmetic. We ended with only two different kinds of nodes in our tree representation: fractions and addition operations. Each has an evaluate method. There may be other operations we'd like to perform on ASTs, like these:

In fact, let's translate our AST to Ruby real quick. We'll add methods:

Ruby
class Fraction
  # ...

  def translate
    "Rational(#{@top}, #{@bottom})"
  end
end

class Add
  # ...

  def translate
    "#{@leftNode.translate} + #{@rightNode.translate}"
  end
end
class Fraction
  # ...

  def translate
    "Rational(#{@top}, #{@bottom})"
  end
end

class Add
  # ...

  def translate
    "#{@leftNode.translate} + #{@rightNode.translate}"
  end
end

Okay, these weren't too bad. But let's look farther forward. By the end, our language might have 30 different node types, and maybe 8 different tree operations we want to perform. This approach makes for some very scattered logic. If there are \(n\) classes in our hierarchy, we'll have \(\frac{1}{n}\) of each tree algorithm in each class. Wouldn't it be nice if we could put all the evaluate logic together? And all the serialization logic together in a separate structure? We can, using a little indirection.

Visitor Pattern

Let's take all the translate methods out of the individual classes and put them into one Translator class that is focused entirely on translaton. As we extract the methods, we rename them so they are unique. Since we're not in the classes anymore, the nodes must come in as a parameter. We end up with something like this:

Ruby
class Translator
  def visit_fraction(node)
    # ...
  end

  def visit_add(node)
    # ...
  end
end
class Translator
  def visit_fraction(node)
    # ...
  end

  def visit_add(node)
    # ...
  end
end

Likewise, all the evaluate methods go in a class named Evaluator:

Ruby
class Evaluator
  def visit_fraction(node)
    # ...
  end

  def visit_add(node)
    # ...
  end
end
class Evaluator
  def visit_fraction(node)
    # ...
  end

  def visit_add(node)
    # ...
  end
end

We'll make a class for each tree operation. The classes will be focused and easier to maintain. We will like our code and our smart selves.

We have lost something by moving these methods away from the objects. No longer can we ask a node to go translate itself or evaluate itself. To translate a node now, we have to know its type so we can call the appropriate method.

Ruby
# The old way was flexible.
# node.evaluate

# The new way is not flexible.
evaluator = Evaluator.new
evaluator.visit_add(node)
# The old way was flexible.
# node.evaluate

# The new way is not flexible.
evaluator = Evaluator.new
evaluator.visit_add(node)

However, we are not done. All our tree processors have the same interface. They visit nodes. They are visitors. We can give each object exactly one method that tells it how to be visited:

Ruby
class Fraction
  # ...
  def visit(visitor)
    visitor.visit_fraction(self)
  end
end

class Add
  # ...
  def visit(visitor)
    visitor.visit_add(self)
  end
end
class Fraction
  # ...
  def visit(visitor)
    visitor.visit_fraction(self)
  end
end

class Add
  # ...
  def visit(visitor)
    visitor.visit_add(self)
  end
end

With these bridging methods, we again put the authority back in the hands of the objects themselves. This node knows how to get itself visited:

Ruby
node.visit(Translator.new)
node.visit(Translator.new)

This way of organizing objects without polluting the hierarchy with fragments of their operations is called the visitor pattern. In some languages where types are checked early, like Java, we would need to make a Visitor interface or abstract class. Not so in Ruby. A visitor visits if it has the right visit methods.

Object-oriented programming is neat, but it makes objects the organizing principle of our code. Sometimes we want to organize by behavior rather than class. That's when the visitor pattern comes in handy. It lets us factor out all the behaviors into their own class.

Lexing

Recall this fraction program from last time:

Ruby
1/4 + 5/6
1/4 + 5/6

We built a tree representation of this program by hand. Now we want to build a tool that turns text into a tree. The problem is easier if we build two tools, one to break the text into small chunks and another to organize the chunks into structures. Our brain does something similar when we read. We chunk the letters into words, and then organize the words into grammatical structures.

The words of a program are its tokens. What are the tokens of the program above? Here are a few:

Most of them are just a single character, but that's just a coincidence. Operators could be multiple symbols, like >=. We need to build a machine that eats characters and reports when it has digested a token. Such a machine is called a lexer. Let's build one for our language.

Lexers are made out of many state machines. We design a statement machine for each type of token in our language. Many machines will have just a single accepting state and a single transition. The integer machine will have two states with a loop.

Ultimately, we need these machines running in code. Here's how we might write a routine for lexing a forward-slash:

Ruby
if has('/')
  capture
  emit_token(:forward_slash)
end
if has('/')
  capture
  emit_token(:forward_slash)
end

Those little machines recognize only one type of token. We need a mega-machine that recognizes all of them, so we glue them together. Once they are glued, we write the entire lexing routine, which might look like this:

Ruby
if has('/')
  capture
  emit_token(:forward_slash)
elsif has('+')
  capture
  emit_token(:plus)
elsif has_digit
  while has_digit
    capture
  emit_token(:integer)
end
if has('/')
  capture
  emit_token(:forward_slash)
elsif has('+')
  capture
  emit_token(:plus)
elsif has_digit
  while has_digit
    capture
  emit_token(:integer)
end

This code belongs in a lexer abstraction. We'll write it together and also define the helper methods like has, has_digit, capture, and emit_token. Once it can lex simple expressions, we'll expand it to support negative integers and other operators.

Grammar

Lexers only know what individual pieces of a program look like. We also need rules that describe the structures of a program. For that, we write a grammar. The grammatic rule for a fraction might look like this:

fraction = INTEGER / INTEGER
fraction = INTEGER / INTEGER

In a program, an expression is a structure that yields a value. This is in contrast to a statement, which doesn't produce a value but does achieve some side effect, like printing. We might trying writing the grammar of expressions with this rule:

expr = expr + expr
     | fraction
expr = expr + expr
     | fraction

There's a problem with this rule, however. It allows 1/4 + 5/6 + 3/8 to be parsed in two different ways. We know that addition is left-associative, so we tweak the rule to only allow for recursion in the left-operand:

expr = expr + fraction
     | fraction

fraction = INTEGER / INTEGER
expr = expr + fraction
     | fraction

fraction = INTEGER / INTEGER

Suppose now we want to add multiplication to the language. We might try adding a production to expr:

expr = expr + fraction
     | expr * fraction
     | fraction

fraction = INTEGER / INTEGER
expr = expr + fraction
     | expr * fraction
     | fraction

fraction = INTEGER / INTEGER

But this lets us parse a + b * c as (a + b) * c. Multiplication is supposed to have higher precedence. We achieve that by making a ladder of rules. Low-precedence operators show up higher in the grammar and therefore higher in the AST.

level0 = level0 + level1
       | level1

level1 = level1 * levelN
       | levelN

levelN = INTEGER / INTEGER
level0 = level0 + level1
       | level1

level1 = level1 * levelN
       | levelN

levelN = INTEGER / INTEGER

At each level, we have two possible actions: descend to the next level or recurse on the same level.

TODO

Here's your list of things to do before we meet next:

Complete the middle quiz as desired.
Start working on milestone 1. Devote a little time each day, and it will be done. Commit to your GitHub repository after every work session.
Keep learning Ruby through YouTube or a book.
Default to #general in Discord for messages. I am not the star of this show. We are a community. Use a DM if you need to share personal matters. Don't share graded work like screenshots of book questions in #general.
Tell me through a DM if you are partnering with someone on the project.

See you next time.

Sincerely,

P.S. It's time for a haiku!

I learned English first But not till I learned Spanish Did I learn language
← Lecture: Hello LanguageLab: Lexing and Parsing →