Modeling a Program
In this class, we approach our subject from two directions: we study several different mainstream languages, and we design and implement a language of our very own. In studying several languages, we'll see how many different choices there are and how our options can't be ranked in a simple manner from right to wrong. Each option has tradeoffs. Making something easy, fast, or safe generally makes something else hard, slow, or dangerous. In developing our own language, we'll see how humans make choices about the programming experience. You may develop some empathy for the folks who chose that tedious syntax or wrote those obscure error messages.
There are many steps to developing a new language. We start now, but we'll spread the rest across the semester.
The first step is deciding how to model a program written in our language. The computer doesn't know our language. We must translate a program written in our unknown language into a program written in a known language. For now, our known language is Ruby. The computer does know Ruby because someone taught it how to translate Ruby into machine code.
Abstract Syntax Tree
We need a Ruby data structure flexible enough represent any legal program. Since programs are both ordered and made up of subprograms, we'll use a tree. The nodes of a tree have ordered children and the children themselves may be the roots of subtrees.


An AST is syntactic in the sense that it mirrors the structure of the original program—before anything has been evaluated. It is abstract in the sense that not every single syntactic detail is represented. For instance, an AST likely doesn't have nodes for whitespace, semi-colons, comments, or parentheses. It is a tree because it is made up of parent and child nodes.
We've been talking about naming data, but we name other things in programs, like classes. A class is essentially a bundle of names. We model the nodes of an abstract syntax tree with classes that lets us give names to a program's parts.
Expression Nodes
An expression is a program part that produces a value. Consider the expression 2 * x + 1. It has this AST representation:


The tree reveals the four model classes we need to write in Ruby:
-
Integer, for integer literals -
Identifier, for variable names -
Multiply -
Add
Here's how we might implement and instantiate Integer:
class Integer
def initialize(raw_value)
@raw_value = raw_value
end
end
two = Integer.new(2)
class Integer
def initialize(raw_value)
@raw_value = raw_value
end
end
two = Integer.new(2)This code is dangerous, however. Ruby has a builtin class named Integer. We must either pick a different name or put our Integer in a separate namespace. Let's go with the namespace option. All our model classes will be placed in the Ast module:
module Ast
class Integer
def initialize(raw_value)
@raw_value = raw_value
end
end
end
two = Ast::Integer.new(2)
module Ast
class Integer
def initialize(raw_value)
@raw_value = raw_value
end
end
end
two = Ast::Integer.new(2)Instance variables in Ruby are not public. If we are operating outside the class scope, we can't access @raw_value. However, we can add a getter method. In Ruby, methods are public by default.
Getters tend to add brainless noise to programs. Ruby cuts down on this noise with attr_reader, a syntactic shortcut that generates getters with a single line of code. It accepts one or more instance variable names in symbol form.
Currently we've got two nodes for our language, and they both represent data. Let's add an operator that will make new data from old. Here's how we might define the Multiply node:
module Ast
# ...
class Multiply
attr_reader :left_node, :right_node
def initialize(left_node, right_node)
@left_node = left_node
@right_node = right_node
end
end
end
double_x = Multiply.new(
Integer.new(2),
Identifier.new('x'),
)
module Ast
# ...
class Multiply
attr_reader :left_node, :right_node
def initialize(left_node, right_node)
@left_node = left_node
@right_node = right_node
end
end
end
double_x = Multiply.new(
Integer.new(2),
Identifier.new('x'),
)Observe how the children of Multiply are other nodes, not raw Ruby values.
Many nodes in an abstract syntax tree have the same shape. Add and Multiply are both binary operators. They have two children, and there are many other operators like them. Instead of repeating ourselves, let's factor out a superclass:
module Ast
# ...
class BinaryOperator
attr_reader :left_node, :right_node
def initialize(left_node, right_node)
@left_node = left_node
@right_node = right_node
end
end
class Multiply < BinaryOperator
end
class Add < BinaryOperator
end
end
module Ast
# ...
class BinaryOperator
attr_reader :left_node, :right_node
def initialize(left_node, right_node)
@left_node = left_node
@right_node = right_node
end
end
class Multiply < BinaryOperator
end
class Add < BinaryOperator
end
endThis refactoring leaves the subclasses empty and seemingly useless. In later steps, we'll add methods that give them different behavior.
Statement Nodes
Programs contain many expressions, but these are usually glued together in statements. A statement is a program part that achieves some side effect, like assigning a variable, running a loop, or performing I/O. Unlike expressions, they do not produce a value.
Here's how we might define a node for a print statement:
module Ast
# ...
class Print
attr_reader :expression_node
def initialize(expression_node)
@expression_node = expression_node
end
end
end
show = Ast::Print.new(Ast::Integer.new(0))
p show
module Ast
# ...
class Print
attr_reader :expression_node
def initialize(expression_node)
@expression_node = expression_node
end
end
end
show = Ast::Print.new(Ast::Integer.new(0))
p showThe child of a print node is a single expression. It has to be an expression rather than a statement because we need a value to print. If this were Java, we'd use a type to ensure an expression node. In Ruby, we expect the client to read the documentation.
We now have a way to represent programs in an unknown language as trees in a known language. That's great, but also underwhelming for a couple of reasons. First, making trees by hand is tedious. We'll fix that next chapter. Second, trees represent programs, but we can't yet run the tree. We'll fix that in class.
Summary
Modern computers still have at their core a processor and a bank of mutable memory cells—a von Neumann computer. High-level programming languages make it possible for us to attach names to those memory cells. Good names make code easier for humans to understand and use correctly. As we write code that operates on names, we have to consider whether we care about a name's location (its lvalue) or the value stored in its memory cell (its rvalue). Languages provide two different mechanisms for locking memory. We can lock the name so it can't be associated with a different value, or we can lock the value so that it can't be mutated. Memory cells are also protected by limiting their scope to just the parts of a program that need access. To help organize scopes, we employ various encapsulation strategies like functions, classes, and namespaces. The first step in developing our own programming language is naming its parts using model classes.