Modeling a Program

In this class, we approach our subject from two directions: we study several different mainstream languages, and we design and implement a language of our very own. In studying several languages, we'll see how many different choices there are and how our options can't be ranked in a simple manner from right to wrong. Each option has tradeoffs. Making something easy, fast, or safe generally makes something else hard, slow, or dangerous. In developing our own language, we'll see how humans make choices about the programming experience. You may develop some empathy for the folks who chose that tedious syntax or wrote those obscure error messages.

There are many steps to developing a new language. We start now, but we'll spread the rest across the semester.

The first step is deciding how to model a program written in our language. The computer doesn't know our language. We must translate a program written in our unknown language into a program written in a known language. For now, our known language is Ruby. The computer does know Ruby because someone taught it how to translate Ruby into machine code.

Abstract Syntax Tree

We need a Ruby data structure flexible enough represent any legal program. Since programs are both ordered and made up of subprograms, we'll use a tree. The nodes of a tree have ordered children and the children themselves may be the roots of subtrees.

a diagram showing the cells of an array packed tightly in memory

An AST is syntactic in the sense that it mirrors the structure of the original program—before anything has been evaluated. It is abstract in the sense that not every single syntactic detail is represented. For instance, an AST likely doesn't have nodes for whitespace, semi-colons, comments, or parentheses. It is a tree because it is made up of parent and child nodes.

We've been talking about naming data, but we name other things in programs, like classes. A class is essentially a bundle of names. We model the nodes of an abstract syntax tree with classes that lets us give names to a program's parts.

Expression Nodes

An expression is a program part that produces a value. Consider the expression 2 * x + 1. It has this AST representation:

The tree reveals the four model classes we need to write in Ruby:

Integer, for integer literals
Identifier, for variable names
Multiply
Add

Here's how we might implement and instantiate Integer:

Ruby

class Integer
  def initialize(raw_value)
    @raw_value = raw_value
  end
end

two = Integer.new(2)

class Integer
  def initialize(raw_value)
    @raw_value = raw_value
  end
end

two = Integer.new(2)

This code is dangerous, however. Ruby has a builtin class named Integer. We must either pick a different name or put our Integer in a separate namespace. Let's go with the namespace option. All our model classes will be placed in the Ast module:

Ruby

module Ast
  class Integer
    def initialize(raw_value)
      @raw_value = raw_value
    end
  end
end

two = Ast::Integer.new(2)

module Ast
  class Integer
    def initialize(raw_value)
      @raw_value = raw_value
    end
  end
end

two = Ast::Integer.new(2)

Instance variables in Ruby are not public. If we are operating outside the class scope, we can't access @raw_value. However, we can add a getter method. In Ruby, methods are public by default.

Getters tend to add brainless noise to programs. Ruby cuts down on this noise with attr_reader, a syntactic shortcut that generates getters with a single line of code. It accepts one or more instance variable names in symbol form.

Currently we've got two nodes for our language, and they both represent data. Let's add an operator that will make new data from old. Here's how we might define the Multiply node:

Ruby

module Ast
  # ...

  class Multiply
    attr_reader :left_node, :right_node
    def initialize(left_node, right_node)
      @left_node = left_node
      @right_node = right_node
    end
  end
end

double_x = Multiply.new(
  Integer.new(2),
  Identifier.new('x'),
)

module Ast
  # ...

  class Multiply
    attr_reader :left_node, :right_node
    def initialize(left_node, right_node)
      @left_node = left_node
      @right_node = right_node
    end
  end
end

double_x = Multiply.new(
  Integer.new(2),
  Identifier.new('x'),
)

Observe how the children of Multiply are other nodes, not raw Ruby values.

Many nodes in an abstract syntax tree have the same shape. Add and Multiply are both binary operators. They have two children, and there are many other operators like them. Instead of repeating ourselves, let's factor out a superclass:

Ruby

module Ast
  # ...

  class BinaryOperator
    attr_reader :left_node, :right_node
    def initialize(left_node, right_node)
      @left_node = left_node
      @right_node = right_node
    end
  end

  class Multiply < BinaryOperator
  end

  class Add < BinaryOperator
  end
end

module Ast
  # ...

  class BinaryOperator
    attr_reader :left_node, :right_node
    def initialize(left_node, right_node)
      @left_node = left_node
      @right_node = right_node
    end
  end

  class Multiply < BinaryOperator
  end

  class Add < BinaryOperator
  end
end

This refactoring leaves the subclasses empty and seemingly useless. In later steps, we'll add methods that give them different behavior.

Statement Nodes

Programs contain many expressions, but these are usually glued together in statements. A statement is a program part that achieves some side effect, like assigning a variable, running a loop, or performing I/O. Unlike expressions, they do not produce a value.

Here's how we might define a node for a print statement:

Ruby

module Ast
  # ...

  class Print
    attr_reader :expression_node
    def initialize(expression_node)
      @expression_node = expression_node
    end
  end
end

show = Ast::Print.new(Ast::Integer.new(0))
p show

module Ast
  # ...

  class Print
    attr_reader :expression_node
    def initialize(expression_node)
      @expression_node = expression_node
    end
  end
end

show = Ast::Print.new(Ast::Integer.new(0))
p show

The child of a print node is a single expression. It has to be an expression rather than a statement because we need a value to print. If this were Java, we'd use a type to ensure an expression node. In Ruby, we expect the client to read the documentation.

We now have a way to represent programs in an unknown language as trees in a known language. That's great, but also underwhelming for a couple of reasons. First, making trees by hand is tedious. We'll fix that next chapter. Second, trees represent programs, but we can't yet run the tree. We'll fix that in class.

Summary

Modern computers still have at their core a processor and a bank of mutable memory cells—a von Neumann computer. High-level programming languages make it possible for us to attach names to those memory cells. Good names make code easier for humans to understand and use correctly. As we write code that operates on names, we have to consider whether we care about a name's location (its lvalue) or the value stored in its memory cell (its rvalue). Languages provide two different mechanisms for locking memory. We can lock the name so it can't be associated with a different value, or we can lock the value so that it can't be mutated. Memory cells are also protected by limiting their scope to just the parts of a program that need access. To help organize scopes, we employ various encapsulation strategies like functions, classes, and namespaces. The first step in developing our own programming language is naming its parts using model classes.

← Namespaces

Dear Computer

Modeling a Program

Abstract Syntax Tree

Expression Nodes

Statement Nodes

Summary