Lecture: Language to Language
Dear students:
In this class, we approach our subject from two directions: we implement a programming language and we study several different mainstream languages. In implementing a language, we'll see how humans make choices about the programming experience. You may develop some empathy for the folks who wrote those error messages. In studying several languages, we'll see how many different choices there are and how our options can't be ranked in a simple manner from right to wrong. Each option has a tradeoff. Making some computation easy, fast, or safe generally makes something other computation hard, slow, or dangerous.
Today we begin work on both of these fronts. We're going to start implementing an interpreter for Settle, a little language for manipulating sets—like colors, vegetables, and mammals. The computer doesn't know Settle. Our job as language designers is to translate the language it doesn't know into a language it does know. That target language is Ruby.
What we talk about today is meant to mirror what you will need to do for milestones 1 and 2 in the spreadsheet project. You should take copious notes. Translating is a big task, and we'll spend at least two lectures talking about it.
Abstract Syntax Tree
Language tools like compilers and interpreters first translate a program from the unknown language into a tree. This tree is called an abstract syntax tree (AST). We'll draw a tree for this Settle program:
coverings = {fur feathers scales}
coverings = coverings + {tshirt armor}
print(coverings)
print(size(coverings))
An AST is syntactic in the sense that represents the structure of the original program—before anything has been evaluated. It is abstract in the sense that not every single syntactic detail is represented. For instance, an AST likely doesn't track the location of whitespace, semi-colons, or parentheses.
A tree is made of nodes. In an object-oriented language, each node is an instance of a class in a hierarchy of program anatomy. There's a class for set literals, another for union operations, and yet another for variable statements. These classes form the model of the interpreted program. Before we attend to the model, let's chunk the letters of source code up into tokens.
Lexing
What are the kinds of tokens we expect to see in a Settle program? Tokens are the smallest meaningful unit of a program. Individuals characters are generally too small. The r
in return has no meaning on its own. However, a ;
does have meaning on its own.
From the example above, I identify these tokens:
-
{
-
}
-
=
-
+
-
(
-
)
-
print
-
size
- identifiers
We need little machines that can tell us when we've encountered one of these tokens. The punctuation machines just have a single accepting state and a single transition. The idenitifier machine can also have just two states with a loop. We'll draw these machines together.
Ultimately, however, we need these machines running in code. Here's how we might write a routine for lexing a left curly brace:
if has('{')
capture
emit_token(:left_curly)
end
Those little machines recognize only one type of token. We need a mega-machine that recognizes all of them, so we glue them together. Once they are glued, we write the entire lexing routine, which might look like this:
if has('{')
capture
emit_token(:left_curly)
elsif has('}')
capture
emit_token(:right_curly)
elsif has('=')
capture
emit_token(:equal_sign)
elsif has('+')
capture
emit_token(:plus)
elsif has('(')
capture
emit_token(:left_parenthesis)
elsif has(')')
capture
emit_token(:right_parenthesis)
elsif has_alphabetic()
while has_alphabetic()
capture
emit_token(:identifier)
end
This code belongs in a lexer abstraction. We'll write it together and also define the helper methods like has
, has_alphabetic
, capture
, and emit_token
.
Model
Software is divided roughly into two concerns: model and view. The view is the interface, which is likely to change frequently, and the model is the data and behaviors that are independent of any particular view. The view of a language is enshrined in the lexer and parser. Before we can write the parser, we need the model.
Let's build the classes that represent the model of the Settle language. Most every anatomical part of a Settle program needs a model representation. We'll start with Set
. Ruby has a builtin Set
class, but we're going to build this worse one that wraps around an array:
module Ast
class Set
attr_reader :items
def initialize(items)
@items = items
end
end
end
We place our model classes in a module. A module is an organizational tool that helps us cluster related classes and functions together and avoid name collisions. The attr_reader
bit is a method call. We pass it the name of an instance variable—as a symbol—and it defines a getter for that instance variable.
The data in a Settle program could be a set. Suppose someone writes size(proteins)
. What other kind of data do we need? An integer, for which we write this model class:
class Integer
attr_reader :value
def initialize(value)
@value = value
end
end
The data are the leaf nodes of an AST. The internal nodes represent operations or structures. We need a model class for each of these:
- add
- subtract
- intersect
- xor
- size
- assignment
- program
Now that we have these classes, we can build Settle programs manually.
require_relative './ast.rb'
meats = Ast::Set.new(['pork', 'chicken', 'beef'])
plants = Ast::Set.new(['tofu', 'lentils', 'edamame'])
print_set = Ast::Print.new(meats)
print_size = Ast::Print.new(Ast::Size.new(meats))
program = Ast::Program.new([
meats,
plants,
print_set,
print_size,
])
Building ASTs manually is not exactly fun, but it's a step. Next week we'll write a parser that operationalizes the grammar to build up a program directly from Settler source.
TODO
Here's your list of things to do before we meet next:
See you for our first, in which we continue exploring lexers, grammars, and precedence.
Sincerely,
P.S. It's time for a haiku!