Lecture: Grammars and Parsing

Dear students:

Over the last two weeks we've been building an interpreter for Vex programs. In particular, we wrote model classes and a lexer. Today we continue building that interpreter: we're going to turn a Vex program into a tree using a parser. Together the lexer and parser form the bulk of your work for milestone 2. Our discussion today is the last on Vex and designing our own language.

Grammar

The lexer returns a flat list of program pieces, of tokens. Our goal is to build a tree out of the pieces. The machine that turns a list into a tree is the parser. Grammar rules tell the parser how to shape the tree. The grammar encodes the precedence and the associativity of the operators, so we have to carefully write the rules. All valid programs may be generated from this one set of rules.

Programming language grammars start with a rule that describes the structure an entire program. The starting rule often looks like this:

program = statement* EOF

program = statement* EOF

Which of these is a terminal? Which is a non-terminal? EOF is a fake token that the lexer inserts to make a visible marker of the program's end.

Each non-terminal must have its own rules or productions describing its possible structures. The example program shows print statements and assignment statements, so statement might have these productions:

statement = IDENTIFIER EQUALS level0
          | PRINT LEFT_PARENTHESIS level0 RIGHT_PARENTHESIS

statement = IDENTIFIER EQUALS level0
          | PRINT LEFT_PARENTHESIS level0 RIGHT_PARENTHESIS

These productions introduce a new non-terminal: level0. In this class, any non-terminal that starts with level describes an expression, something that produces a value. An expression can be a literal or an operation like addition or multiplication. Each level has a precedence. We'll arrange them in what I like to think of as the Ladder of Predecence. The Ladder has three rules:

You enter the ladder at the top. If you need to parse an arbitrary expression, you invoke the level0 rule, always.
At each rung, you can parse a substructure at the current rung (recurse) or drop to the next rung (descend).
Only at the bottom rung can you parse a substructure from back at the top of the ladder.

At the bottom is levelN, with the highest precedence, which has these productions:

levelN = LEFT_BRACKET INTEGER COMMA INTEGER RIGHT_BRACKET
       | INTEGER
       | IDENTIFIER

levelN = LEFT_BRACKET INTEGER COMMA INTEGER RIGHT_BRACKET
       | INTEGER
       | IDENTIFIER

Not only do these expressions have really high precedence, but they also don't appear directly next to each other. We don't see 5 2 or v 7. They are non-associative. I've tried out various alternate names for these, like island or unit or atom. I'm sticking with levelN for now.

The next rung up has the second highest precedence. Let's say that multiplication (*) occupies this level. What is the structure of one of these expressions? Using our first two rules, we might write this as our first draft:

level1 = levelN * levelN
       | levelN

level1 = levelN * levelN
       | levelN

These productions don't quite describe the structures we want to express. What if we want to multiply three vectors? Something like a * b * c? That's not legal currently. We fix the problem by allowing rules to be recursive.

In tree form, what should a * b * c look like? If we want a and b to be multiplied first, the tree should look this:

    *
   / \
  *   c
 / \
a   b

    *
   / \
  *   c
 / \
a   b

This picture shows that we need to allow recursion in the left operand:

level1 = level1 * levelN
       | levelN

level1 = level1 * levelN
       | levelN

If we allow both operands to be recursive, then we'd have an ambiguity in our grammar. We could interpret the expression as either (a * b) * c or a * (b * c). Ambiguities aren't good. Different compilers will produce different trees.

The rule for level0 has a similar structure, but it sits atop the level1 expressions:

level0 = level0 + level1
       | level1

level0 = level0 + level1
       | level1

On a ladder, we drop from one rung only to the one right below. We follow the same principle in our grammar. The productions either stay at their level or drop one level below to a higher precedence level.

We generally don't climb back up the Ladder, because that would allow a lower precedence operation to sneak in under a higher-precedence operation. However, we make an exception when there's a syntactic boundary that safely resets the precedence. These boundaries generally appear in levelN. For example, levelN jumps back up to level0 on parentheses because the parentheses let the precedence to start over:

levelN = ...
       | ( level0 )

levelN = ...
       | ( level0 )

This grammar might seem like just an inconsequential thinking exercise. Where's the code? But it's important. This is how programming languages are specified. The stack of nonterminals determines operator precedence. The placement of recursion within a production determines operator associativity. Grammars are included in a language's technical documentation. This grammar very nearly writes the parsing code for us.

Parser

We write a grammar to inform our implementation of a recursive descent parser, just as state machines informed our implementation of a lexer. Each non-terminal of a grammar becomes a method in our parser. I like to break off implementing a parser one rung at a time, starting at the bottom. But before we parse levelN, let's have a look at our utility methods:

a constructor that receives the tokens
has(type), which looks to see if the next token has a particular type
advance, which moves ahead to the next token but returns the one
parse, which runs through all the tokens and recursively builds a tree

Once we've implemented these universal parsing methods, then we are ready to implement the methods for our particular grammar's non-terminals. One production of levelN translates into this:

Ruby

class Parser
  # ...

  def levelN
    if has(:integer)
      token = advance
      Ast::Integer.new(token.text.to_i)
    # ...
    else
      raise "Unexpected token: #{@tokens[@i]}"
    end
  end
end

class Parser
  # ...

  def levelN
    if has(:integer)
      token = advance
      Ast::Integer.new(token.text.to_i)
    # ...
    else
      raise "Unexpected token: #{@tokens[@i]}"
    end
  end
end

While we're testing, we'll have the parse method call whatever method we're testing. Eventually we'll have it point at our starting non-terminal program. At this point, our parser can only turn integer literals into trees. Once we've fleshed out levelN, we'll implement level1 so it reflects the grammar. It has two productions. How do we choose between them? That seems a little tricky. Let's focus on just the first production. Our first draft might look like this:

Ruby

def level1
  left = level1
  advance # eat *
  right = levelN
  Ast::Multiply.new(left, right)
end

def level1
  left = level1
  advance # eat *
  right = levelN
  Ast::Multiply.new(left, right)
end

This code has a problem. The very first statement is a recursive call to the method we are trying to write. Our program will quickly reach a stack overflow. How can we fix this? Well, eventually, that recursion is going to bottom out to a call to levelN. Let's make that call first:

Ruby

def level1
  left = levelN
  advance # eat *
  right = levelN
  Ast::Multiply.new(left, right)
end

def level1
  left = levelN
  advance # eat *
  right = levelN
  Ast::Multiply.new(left, right)
end

This code also has a problem. Since we lost the recursion, we can't parse more than one multiplication. We replace the recursion with iteration that builds up the root of the entire sequence of multiplies in left:

Ruby

def level1
  left = levelN
  while has(:asterisk)
    advance
    right = levelN
    left = Ast::Multiply.new(left, right)
  end
  left
end

def level1
  left = levelN
  while has(:asterisk)
    advance
    right = levelN
    left = Ast::Multiply.new(left, right)
  end
  left
end

We craft level0, statement, and program in a similar manner.

TODO

Here's your list of things to do before we meet next:

Keep working on milestone 1 for the spreadsheet project. Commit to GitHub after every work session. The first ready date is Sunday.

Next Tuesday is Assessment Day, so there's no class. It's also the Career Fair, which you should attend to learn more about the world. A reading will still go out on Thursday and be due early Tuesday. We'll have a lab on Thursday.

Complete the middle quiz as desired.

See you next time.

Sincerely,

P.S. It's time for a haiku!

I love grammar jokes Where was the chicken going? To the terminal

Dear Computer

Lecture: Grammars and Parsing

Grammar

Parser

TODO