Parsing

Dear Computer

Chapter 1: Programming Language

Parsing

Just as we used state diagrams to think about how lexers operate, we use grammars to think about how parsers operate. Now we apply those ideas as we code up a parser that recognizes grammatical forms.

There are various algorithms for implementing a parser. One of the more intuitive algorithms produces a parser that looks a lot like the grammar. You start by writing abstractions for each program structure, like IfStatement, ForLoop, Assignment, Multiply, or Program. For example, you might implement Program like this in Java:

Java
class Program {
  private Statement[] statements;

  public Program(Statement[] statements) {
    this.statements = statements;
  }
}
class Program {
  private Statement[] statements;

  public Program(Statement[] statements) {
    this.statements = statements;
  }
}

Then you write a Parser abstraction. A parser walks through the list of tokens from the lexer from left to right and tries to organize them into grammatical structures. Transforming a flat list into a tree is easier if you give the parser a method for each non-terminal of the grammar. A non-terminal's method returns an instance of a corresponding program structure. For example, the method for parsing program would return a Program, and the method for parsing expression might return a Multiply or Negate.

The body of a non-terminal's method uses conditional statements to choose which production to apply and loops to repeat parsing actions. Non-terminals within a production are parsed by recursively calling their methods. In pseudocode, the routine for handling the program non-terminal, whose production was defined as program = statement* EOF, might look something like this:

def program()
  statements = []
  while token isn't EOF
    s = statement()
    append s to statements
  return new Program(statements)
def program()
  statements = []
  while token isn't EOF
    s = statement()
    append s to statements
  return new Program(statements)

The method returns a Program instance. It is essentially the root of a tree representation of the parsed program.

The statement method needs to choose between its two possible productions with a conditional statement:

def statement()
  if current token is IDENTIFIER
    id token = current token
    assert ASSIGN token
    e = expression()
    return new Assignment(id token, e)
  else if token is PUTS
    e = expression()
    return new Puts(e)
def statement()
  if current token is IDENTIFIER
    id token = current token
    assert ASSIGN token
    e = expression()
    return new Assignment(id token, e)
  else if token is PUTS
    e = expression()
    return new Puts(e)

The expression method needs to choose between its three possible productions:

def expression()
  if current token is INTEGER
    return new Integer(current token)
  else if current token is IDENTIFIER
    return new Variable(current token)
  else
    a = expression()
    assert PLUS token
    b = expression()
    return new Add(a, b)
def expression()
  if current token is INTEGER
    return new Integer(current token)
  else if current token is IDENTIFIER
    return new Variable(current token)
  else
    a = expression()
    assert PLUS token
    b = expression()
    return new Add(a, b)

Do you see how the parser has the exact same structure as the grammar? This method of building a parser that mirrors the grammar is called recursive descent because it recursively descends from the starting terminal through the productions down to terminals. A recursive descent parser effectively builds a left-most derivation to match the tokens it encounters. Compared to other parsing algorithms, it's human-friendly. Many translators for mainstream languages use similar handwritten recursive descent parsers.

Recursive descent parsing does put the burden of advancing through the productions on the developer. There are a number of helpful parser generators that will take in a grammar and automatically construct a parser using less intuitive algorithms. We won't examine them. If translating source code seems interesting to you, consider taking a course on compilers.

The tree representation of the program that the parser assembles is an abstract syntax tree (AST). Some program translators might walk the tree and evaluate it directly in tree form. Others will turn the tree into machine code or bytecode.

Summary

In this chapter, we explored the challenges of getting an unthinking machine to understand us when we write programs. We philosophized on why we study programming languages and why we have so many. Then we investigated the machinery of recognizing the language of a program: a lexer that emits a sequence of tokens and a parser that assembles an abstract syntax tree. This machinery is packaged up either in a compiler, which translates a program to a different form in a language that the computer already knows how to execute, or in an interpreter, which executes the code immediately. A language's grammar is the protocol that shapes a program. Humans follow it when writing code, and the computer follows it when reading code.

← Parse TreesLecture: Hello Ruby →