Parsing

Dear Computer

Chapter 2: Translation

Parsing

Just as we used state diagrams to think about how lexers operate, we use grammars, derivations, and parse trees to think about how parsers operate. Now we apply those ideas as we code up a parser that turns grammatical forms into abstract syntax trees. The parser takes us from syntax to semantic meaning.

There are many algorithms for implementing a parser. One of the more human-friendly algorithms is recursive descent. We build a recursive descent parser by translating the grammar into a Parser class, with each non-terminal becoming a method that inches forward through the tokens according to its productions. Every call to these non-terminal methods produces an AST node. We choose between productions with a conditional statement that peeks at the next tokens. For example, the level2 method of our recent example might be implemented like this:

Ruby
def level2
  if has(:integer)
    token = advance
    Ast::Integer.new(token[:text].to_i)
  elsif has(:identifier)
    token = advance
    Ast::Identifier.new(token[:text])
  else
    raise "Unexpected token: #{@tokens[@i]}"
  end
end
def level2
  if has(:integer)
    token = advance
    Ast::Integer.new(token[:text].to_i)
  elsif has(:identifier)
    token = advance
    Ast::Identifier.new(token[:text])
  else
    raise "Unexpected token: #{@tokens[@i]}"
  end
end

This method is part of the larger Parser class with utility methods has and advance.

The level0 method from our example grammar must return either an Add or Subtract node. Implementing it is more involved, but it demonstrates how recursive descent earned its name. Recall the rule for level0 from our grammar:

level0 = level0 + level1
       | level0 - level1
       | level1
level0 = level0 + level1
       | level0 - level1
       | level1

Our parser traverses the precedence ladder just as we did by hand. Since the methods are usually arranged in the source file from lowest precedence to highest, each climb is a “descent” to the next level. Each remain is a recursive call to the current method. So how do we translate the level0 rule to code? Somehow we have to choose which of the three productions to apply. To simplify the problem, let's pretend for a moment that the third production doesn't exist. With it gone, we know what to do: parse a left0 operand as the left operand. That turns into this code:

Ruby
def level0
  left = level0
  # ...
end
def level0
  left = level0
  # ...
end

Here we see the recursive nature of the parser, but there's a problem with this. The very first thing the method does is call itself. It's called left-recursion, and it'll never stop. To fix this, we need some descent. We bring the third production back into play; it is our base case. Let's go ahead and parse the level1 operand that must eventually occur:

Ruby
def level0
  left = level1
  # ...
end
def level0
  left = level1
  # ...
end

After parsing the left operand, we might encounter + or -. If we do, we need to advance past the operator, parse the right operand, and build an appropriate node. That gives us this conditional logic for the three productions:

Ruby
def level0
  left = level1
  if has(:plus)
    advance
    right = level1
    Ast::Add.new(left, right)
  elsif has(:hyphen)
    advance
    right = level1
    Ast::Subtract.new(left, right)
  else
    left
  end
end
def level0
  left = level1
  if has(:plus)
    advance
    right = level1
    Ast::Add.new(left, right)
  elsif has(:hyphen)
    advance
    right = level1
    Ast::Subtract.new(left, right)
  else
    left
  end
end

One issue remains: by removing the problematic recursion, we have lost the ability to parse more than one operator at level0. We can get that ability back by looping around the conditional statement. On each iteration, the node we build gets fed along as the new left operand for the next operation. Step through to see how the tree grows with each iteration:

We accumulate an ever larger tree with this code:

Ruby
def level0
  left = level1

  while has(:plus) || has(:hyphen)
    if has(:plus)
      advance
      right = level1
      left = Ast::Add.new(left, right)
    else
      advance
      right = level1
      left = Ast::Subtract.new(left, right)
    end
  end

  left
end
def level0
  left = level1

  while has(:plus) || has(:hyphen)
    if has(:plus)
      advance
      right = level1
      left = Ast::Add.new(left, right)
    else
      advance
      right = level1
      left = Ast::Subtract.new(left, right)
    end
  end

  left
end

All left-associative operators suffer this problem of left-recursion, and we must recast them as iteration. Right-associative operators can be very naturally implemented with recursion. Their left operand is a descent rather than a recursion.

Many translators for mainstream languages use similar handwritten recursive descent parsers. However, there are a number of parser generators that take in a grammar and automatically construct a parser using less intuitive algorithms. We won't examine them. If translating source code seems interesting to you, consider taking a course on compilers.

Summary

In this chapter, we explored the challenges of getting a machine to understand us when we write programs. We philosophized on why we study programming languages and why we have so many. Then we investigated the machinery of recognizing the language in which we express programs: a lexer that emits a sequence of tokens and a parser that assembles an abstract syntax tree. This machinery is packaged up either in a compiler, which translates a program to a different form in a language that the computer already knows how to execute, or in an interpreter, which executes the code immediately. A language's grammar is the protocol that shapes a program. Humans follow it when writing code, and the computer follows it when reading code.

← Ladder of Precedence