Parsing

Just as we used state diagrams to think about how lexers operate, we use grammars, derivations, and parse trees to think about how parsers operate. Now we apply those ideas as we code up a parser that recognizes grammatical forms.

There are various algorithms for implementing a parser. One of the more intuitive algorithms produces a parser that looks a lot like the grammar. We start by writing abstractions for each node that might appear in a tree representation of a program, like IfStatement, ForLoop, Assignment, Multiply, or Program. For example, we might implement Program like this in Java:

Java

class Program {
  private Statement[] statements;

  public Program(Statement[] statements) {
    this.statements = statements;
  }
}

class Program {
  private Statement[] statements;

  public Program(Statement[] statements) {
    this.statements = statements;
  }
}

And we might implement an Add node with this class:

Java

class Add {
  private Expression leftNode;
  private Expression rightNode;

  public Add(Expression leftNode, Expression rightNode) {
    this.leftNode = leftNode;
    this.rightNode = rightNode;
  }
}

class Add {
  private Expression leftNode;
  private Expression rightNode;

  public Add(Expression leftNode, Expression rightNode) {
    this.leftNode = leftNode;
    this.rightNode = rightNode;
  }
}

Once we have these model classes, we write a Parser abstraction. A parser walks through the list of tokens from the lexer from left to right and tries to organize them into grammatical structures. Each non-terminal in the grammar gets its own method. That method returns an instance of a corresponding program structure. For example, the method for parsing program would return a Program, and the method for parsing expression might return an Integer, a Identifier, or an Add.

The body of a non-terminal's method uses conditional statements to choose which production to apply and loops to repeat parsing actions. Non-terminals within a production are parsed by recursively calling their methods. In pseudocode, the routine for handling the program non-terminal, whose production was defined as program = statement* EOF, might look something like this:

def program()
  statements = []
  while token isn't EOF
    s = statement()
    append s to statements
  return new Program(statements)

def program()
  statements = []
  while token isn't EOF
    s = statement()
    append s to statements
  return new Program(statements)

The method returns a Program instance. It is essentially the root of a tree representation of the parsed program.

The statement method needs to choose between its two possible productions with a conditional statement:

def statement()
  if current token is IDENTIFIER
    id token = current token
    assert ASSIGN token
    e = expression()
    return new Assignment(id token, e)
  else if token is PUTS
    e = expression()
    return new Puts(e)

def statement()
  if current token is IDENTIFIER
    id token = current token
    assert ASSIGN token
    e = expression()
    return new Assignment(id token, e)
  else if token is PUTS
    e = expression()
    return new Puts(e)

The expression method needs to choose between its three possible productions:

def expression()
  if current token is INTEGER
    return new Integer(current token)
  else if current token is IDENTIFIER
    return new Variable(current token)
  else
    a = expression()
    assert PLUS token
    b = expression()
    return new Add(a, b)

def expression()
  if current token is INTEGER
    return new Integer(current token)
  else if current token is IDENTIFIER
    return new Variable(current token)
  else
    a = expression()
    assert PLUS token
    b = expression()
    return new Add(a, b)

Do you see how the parser has the exact same structure as the grammar? This method of building a parser that mirrors the grammar is called recursive descent because it recursively descends from the starting terminal through the productions down to terminals. A recursive descent parser effectively builds a left-most derivation to match the tokens it encounters. Compared to other parsing algorithms, it's human-friendly. Many translators for mainstream languages use similar handwritten recursive descent parsers.

Recursive descent parsing does put the burden of advancing through the productions on the developer. There are a number of helpful parser generators that will take in a grammar and automatically construct a parser using less intuitive algorithms. We won't examine them. If translating source code seems interesting to you, consider taking a course on compilers.

The tree representation of the program that the parser assembles is an abstract syntax tree (AST). Abstract syntax trees are simplified versions of parse trees. They express the structure of the parsed program, but they don't include details about how substructures are derived from the grammar rules. For example, the expression TRUE && FALSE || TRUE is reflected in this abstract syntax tree, which is simpler than the parse tree we saw earlier:

Some program translators might walk an abstract syntax tree and evaluate it directly in tree form. Others will turn the tree into machine code or bytecode.

Summary

In this chapter, we explored the challenges of getting a machine to understand us when we write programs. We philosophized on why we study programming languages and why we have so many. Then we investigated the machinery of recognizing the language in which we express programs: a lexer that emits a sequence of tokens and a parser that assembles an abstract syntax tree. This machinery is packaged up either in a compiler, which translates a program to a different form in a language that the computer already knows how to execute, or in an interpreter, which executes the code immediately. A language's grammar is the protocol that shapes a program. Humans follow it when writing code, and the computer follows it when reading code.

← Parse Trees Lecture: Hello Language →

Dear Computer

Parsing

Summary