Lecture: Parsing

Dear Computer

Chapter 2: Naming Things

Lecture: Parsing

Dear students:

Over the last two weeks we've been building an interpreter for a fraction language. In particular, we wrote model classes and a lexer. Today we continue building that interpreter: we're going to turn a fraction program into a tree using a parser. Together the lexer and parser form the bulk of your work for milestone 2.

Grammar

Last week we explored how to build a grammar. Our grammar in where a language's precedence and associativity are born. The key idea is to build a ladder. At the top of the grammar are the least greedy operators, the ones with lowest precedence. At the bottom are the highest precedence operators. The very bottom rule describes our primitives. Within a rule, we have just two choices of how to expand our non-terminal operands: we recurse on the same level or we descend just one rung to the next level. If we recurse on the left, we have a left-associative operator. If we recurse on the right, we have a right-assocative operator.

We end up with a grammar that looks like this for our fraction language:

BNF
level0 = level0 + level1
       | level0 - level1
       | level1

level1 = level1 * levelN
       | levelN

levelN = INTEGER / INTEGER
       | INTEGER
level0 = level0 + level1
       | level0 - level1
       | level1

level1 = level1 * levelN
       | levelN

levelN = INTEGER / INTEGER
       | INTEGER

We don't have nodes for all of these operations yet, so we'll need to add them. While we're at it, we'll put all the nodes in their own namespace to avoid name collisions.

Once we get a parser going for this much of our language, we'll expand it to include a few more things.

Parser

We write a grammar to inform our implementation of a recursive descent parser, just as state machines informed our implementation of a lexer. Each non-terminal of a grammar becomes a method in our parser. I like to break off implementing a parser one rung at a time, starting at the bottom. But before we parse levelN, let's have a look at our utility methods:

Once we've implemented these universal parsing methods, then we are ready to implement the methods for our particular grammar's non-terminals. Our levelN translates into this:

Ruby
class Parser
  # ...

  def levelN
    if has(:integer)
      top_token = advance
      if has(:forward_slash)
        raise "Expected /" if not has(:integer)
        bottom_token = advance
        Ast::Fraction.new(top_token.text.to_i, bottom_token.text.to_i)
      else
        Ast::Integer.new(top_token.text.to_i)
      end
    # ...
    else
      raise "Unexpected token: #{@tokens[@i]}"
    end
  end
end
class Parser
  # ...

  def levelN
    if has(:integer)
      top_token = advance
      if has(:forward_slash)
        raise "Expected /" if not has(:integer)
        bottom_token = advance
        Ast::Fraction.new(top_token.text.to_i, bottom_token.text.to_i)
      else
        Ast::Integer.new(top_token.text.to_i)
      end
    # ...
    else
      raise "Unexpected token: #{@tokens[@i]}"
    end
  end
end

While we're testing, we'll have the parse method call whatever method we're testing. At this point, our parser can only turn primitives into trees. Once we've fleshed out levelN, we'll implement level1 so it reflects the grammar. It has one production. How do we choose between them? That seems a little tricky. Let's focus on just the first production. Our first draft might look like this:

Ruby
def level1
  left = level1
  advance # eat *
  right = levelN
  Ast::Multiply.new(left, right)
end
def level1
  left = level1
  advance # eat *
  right = levelN
  Ast::Multiply.new(left, right)
end

This code has a problem. The very first statement is a recursive call to the method we are trying to write. Our program will quickly reach a stack overflow. How can we fix this? Well, eventually, that recursion is going to bottom out to a call to levelN. Let's make that call first:

Ruby
def level1
  left = levelN
  advance # eat *
  right = levelN
  Ast::Multiply.new(left, right)
end
def level1
  left = levelN
  advance # eat *
  right = levelN
  Ast::Multiply.new(left, right)
end

This code also has a problem. Since we lost the recursion, we can't parse more than one multiplication. We replace the recursion with iteration that builds up the root of the entire sequence of multiplies in left:

Ruby
def level1
  left = levelN
  while has(:asterisk)
    advance
    right = levelN
    left = Ast::Multiply.new(left, right)
  end
  left
end
def level1
  left = levelN
  while has(:asterisk)
    advance
    right = levelN
    left = Ast::Multiply.new(left, right)
  end
  left
end

We craft level0 in a similar manner. We only have to turn recursion into iteration for left-associative operators. Sometimes we see grammar rules written explicitly with recursion, like this:

EBNF
level0 = level1 (+ level1)*
level0 = level1 (+ level1)*

The advantage of this form is that it translates directly into the code that we ultimately wrote. The disadvantage is that the associativity is not baked into the grammar. It's not clear how the many additions should be grouped. That's why I and Treeformer use the original BNF. Quantifiers like * appeared in EBNF.

Right-associative operators can be implemented with recursion. Because they recurse on the right, the recursion won't be infinite. Suppose we had this grammar rule:

BNF
level5 = level6 : level5
       | level6
level5 = level6 : level5
       | level6

We'd write this recursive code to parse it:

Ruby
def level5
  left = level6
  if has(:colon)
    right = level5
    left = new Colonize(left, right)
  end
  left
end
def level5
  left = level6
  if has(:colon)
    right = level5
    left = new Colonize(left, right)
  end
  left
end

So far we've only been able to match a single expression. Let's grow our language to allow multiple print statements.

BNF
program = statement program
        | EOF

statement = PRINT level0
program = statement program
        | EOF

statement = PRINT level0

To support this, we'll need to add some new tokens to our lexer. Linebreaks can be matched like any character. We'll all add a rule to match all identifiers. Then we'll see if the identifier is a keyword like print. After the lexing is finished, we'll append a fake EOF token.

We'll add a Program and Print node to our AST hierarchy. That also means the visitors will get some new visit_ methods.

Then we add methods to our parser. If we have extra time, we'll add variables.

TODO

Here's your list of things to do before we meet next:

Keep working on milestone 1 for the project. Commit to GitHub after every work session. The first ready date is Sunday.
If you're working with a partner, message me on Discord.
Complete the middle quiz as desired.

See you next time.

Sincerely,

P.S. It's time for a haiku!

The math of parsing Put 2 and 2 together Until you get 1