Lecture: Parsing

Dear students:

Over the last two weeks we've been building an interpreter for a fraction language. In particular, we wrote model classes and a lexer. Today we continue building that interpreter: we're going to turn a fraction program into a tree using a parser. Together the lexer and parser form the bulk of your work for milestone 2.

Grammar

Last week we explored how to build a grammar that honors precedence and associativity. The key idea is to build a ladder. At the top of the grammar are the least greedy operators, the ones with lowest precedence. At the bottom are the highest precedence operators. The very bottom rule describes our primitives. Within a rule, we have just two choices of how to expand our subtracts: we recurse on the same level or we descend just one rung to the next level. If we recurse on the left, we have a left-associative operator. If we recurse on the right, we have a right-assocative operator.

We end up with a grammar that looks like this for our fraction language:

BNF

level0 = level0 + level1
       | level0 - level1
       | level1

level1 = level1 * level2
       | level1 / level2
       | level2

levelN = INTEGER / INTEGER

level0 = level0 + level1
       | level0 - level1
       | level1

level1 = level1 * level2
       | level1 / level2
       | level2

levelN = INTEGER / INTEGER

We don't have nodes for all of these operations yet, so we'll need to add them. While we're at it, we'll put all the nodes in their own namespace to avoid name collisions.

Once we get a parser going for this much of our language, we'll expand it to include a few more things.

Parser

We write a grammar to inform our implementation of a recursive descent parser, just as state machines informed our implementation of a lexer. Each non-terminal of a grammar becomes a method in our parser. I like to break off implementing a parser one rung at a time, starting at the bottom. But before we parse levelN, let's have a look at our utility methods:

a constructor that receives the tokens
has(type), which looks to see if the next token has a particular type
advance, which moves ahead to the next token but returns the one
parse, which runs through all the tokens and recursively builds a tree

Once we've implemented these universal parsing methods, then we are ready to implement the methods for our particular grammar's non-terminals. Our levelN translates into this:

Ruby

class Parser
  # ...

  def levelN
    if has(:integer)
      top_token = advance
      raise "Expected /" if not has(:forward_slash)
      bottom_token = advance
      Ast::Fraction.new(top_token.text.to_i, bottom_token.text.to_i)
    # ...
    else
      raise "Unexpected token: #{@tokens[@i]}"
    end
  end
end

class Parser
  # ...

  def levelN
    if has(:integer)
      top_token = advance
      raise "Expected /" if not has(:forward_slash)
      bottom_token = advance
      Ast::Fraction.new(top_token.text.to_i, bottom_token.text.to_i)
    # ...
    else
      raise "Unexpected token: #{@tokens[@i]}"
    end
  end
end

While we're testing, we'll have the parse method call whatever method we're testing. At this point, our parser can only turn fraction primitives into trees. Once we've fleshed out levelN, we'll implement level1 so it reflects the grammar. It has two productions. How do we choose between them? That seems a little tricky. Let's focus on just the first production. Our first draft might look like this:

Ruby

def level1
  left = level1
  advance # eat *
  right = levelN
  Ast::Multiply.new(left, right)
end

def level1
  left = level1
  advance # eat *
  right = levelN
  Ast::Multiply.new(left, right)
end

This code has a problem. The very first statement is a recursive call to the method we are trying to write. Our program will quickly reach a stack overflow. How can we fix this? Well, eventually, that recursion is going to bottom out to a call to levelN. Let's make that call first:

Ruby

def level1
  left = levelN
  advance # eat *
  right = levelN
  Ast::Multiply.new(left, right)
end

def level1
  left = levelN
  advance # eat *
  right = levelN
  Ast::Multiply.new(left, right)
end

This code also has a problem. Since we lost the recursion, we can't parse more than one multiplication. We replace the recursion with iteration that builds up the root of the entire sequence of multiplies in left:

Ruby

def level1
  left = levelN
  while has(:asterisk)
    advance
    right = levelN
    left = Ast::Multiply.new(left, right)
  end
  left
end

def level1
  left = levelN
  while has(:asterisk)
    advance
    right = levelN
    left = Ast::Multiply.new(left, right)
  end
  left
end

We craft level0 in a similar manner. We only have to turn recursion into iteration for left-associative operators. Right-associative operators can be implemented with recursion. Because they recurse on the right, the recursion won't be infinite.

Print and Program

So far we've only been able to match a single expression. Let's grow our language to allow multiple print statements.

BNF

program = (statement LINEBREAK)* EOF

statement = PRINT level0

program = (statement LINEBREAK)* EOF

statement = PRINT level0

To support this, we'll need to add some new tokens to our lexer. Linebreaks can be matched like any character. We'll all add a rule to match all identifiers. Then we'll see if the identifier is a keyword like print. After the lexing is finished, we'll append a fake EOF token.

We'll add a Program and Print node to our AST hierarchy. That also means the visitors will get some new visit_ methods.

Then we add methods to our parser.

Typechecking

Right now our language has only one type: fraction. Let's add in booleans. They are primitives and go in levelN of our grammar:

BNF

levelN = INTEGER / INTEGER
       | BOOLEAN

levelN = INTEGER / INTEGER
       | BOOLEAN

We add nodes and parsing logic for these. But something bigger also emerges: the possibility of type errors. Someone might try to add booleans. We can't let this happen. When and where do check for type errors? It depends on the language. If we're C or Java, we check right after we build the tree but before we execute it. If we're Ruby or Python, we check during execution. That's what we'll do with our fraction language.

In the evaluator, we add logic like this:

Ruby

def visit_add(node)
  left_primitive = node.left_node.visit(self)
  raise "+ expects fractions" if !left_primitive.is_a?(Ast::Fraction)
  # ...
end

def visit_add(node)
  left_primitive = node.left_node.visit(self)
  raise "+ expects fractions" if !left_primitive.is_a?(Ast::Fraction)
  # ...
end

From here we could keep growing our language. We could add parentheses. We could add variables. But that's a job for milestones 1 and 2.

TODO

Here's your list of things to do before we meet next:

Keep working on milestone 1 for the project. Commit to GitHub after every work session. The first ready date is Sunday.

If you're working with a partner, message me on Discord.

Complete the middle quiz as desired.

See you next time.

Sincerely,

P.S. It's time for a haiku!

The math of parsing Put 2 and 2 together Until you get 1

← Namespaces Lab: Scope and Ruby →

Dear Computer

Lecture: Parsing

Grammar

Parser

Print and Program

Typechecking

TODO