Lecture: Language to Language

Dear Computer

Chapter 1: Programming Language

Lecture: Language to Language

Dear students:

In this class, we approach our subject from two directions: we implement a programming language and we study several different mainstream languages. In implementing a language, we'll see how humans make choices about the programming experience. You may develop some empathy for the folks who wrote those error messages. In studying several languages, we'll see how many different choices there are and how our options can't be ranked in a simple manner from right to wrong. Each option has a tradeoff. Making some computation easy, fast, or safe generally makes something other computation hard, slow, or dangerous.

Today we begin work on both of these fronts. We're going to start implementing an interpreter for Settle, a little language for manipulating sets—like colors, vegetables, and mammals. The computer doesn't know Settle. Our job as language designers is to translate the language it doesn't know into a language it does know. That target language is Ruby.

What we talk about today is meant to mirror what you will need to do for milestones 1 and 2 in the spreadsheet project. You should take copious notes. Translating is a big task, and we'll spend at least two lectures talking about it.

Abstract Syntax Tree

Language tools like compilers and interpreters first translate a program from the unknown language into a tree. This tree is called an abstract syntax tree (AST). We'll draw a tree for this Settle program:

coverings = {fur feathers scales}
coverings = coverings + {tshirt armor}
print(coverings)
print(size(coverings))
coverings = {fur feathers scales}
coverings = coverings + {tshirt armor}
print(coverings)
print(size(coverings))

An AST is syntactic in the sense that represents the structure of the original program—before anything has been evaluated. It is abstract in the sense that not every single syntactic detail is represented. For instance, an AST likely doesn't track the location of whitespace, semi-colons, or parentheses.

A tree is made of nodes. In an object-oriented language, each node is an instance of a class in a hierarchy of program anatomy. There's a class for set literals, another for union operations, and yet another for variable statements. These classes form the model of the interpreted program. Before we attend to the model, let's chunk the letters of source code up into tokens.

Lexing

What are the kinds of tokens we expect to see in a Settle program? Tokens are the smallest meaningful unit of a program. Individuals characters are generally too small. The r in return has no meaning on its own. However, a ; does have meaning on its own.

From the example above, I identify these tokens:

We need little machines that can tell us when we've encountered one of these tokens. The punctuation machines just have a single accepting state and a single transition. The idenitifier machine can also have just two states with a loop. We'll draw these machines together.

Ultimately, however, we need these machines running in code. Here's how we might write a routine for lexing a left curly brace:

Ruby
if has('{')
  capture
  emit_token(:left_curly)
end
if has('{')
  capture
  emit_token(:left_curly)
end

Those little machines recognize only one type of token. We need a mega-machine that recognizes all of them, so we glue them together. Once they are glued, we write the entire lexing routine, which might look like this:

Ruby
if has('{')
  capture
  emit_token(:left_curly)
elsif has('}')
  capture
  emit_token(:right_curly)
elsif has('=')
  capture
  emit_token(:equal_sign)
elsif has('+')
  capture
  emit_token(:plus)
elsif has('(')
  capture
  emit_token(:left_parenthesis)
elsif has(')')
  capture
  emit_token(:right_parenthesis)
elsif has_alphabetic()
  while has_alphabetic()
    capture
  emit_token(:identifier)
end
if has('{')
  capture
  emit_token(:left_curly)
elsif has('}')
  capture
  emit_token(:right_curly)
elsif has('=')
  capture
  emit_token(:equal_sign)
elsif has('+')
  capture
  emit_token(:plus)
elsif has('(')
  capture
  emit_token(:left_parenthesis)
elsif has(')')
  capture
  emit_token(:right_parenthesis)
elsif has_alphabetic()
  while has_alphabetic()
    capture
  emit_token(:identifier)
end

This code belongs in a lexer abstraction. We'll write it together and also define the helper methods like has, has_alphabetic, capture, and emit_token.

Model

Software is divided roughly into two concerns: model and view. The view is the interface, which is likely to change frequently, and the model is the data and behaviors that are independent of any particular view. The view of a language is enshrined in the lexer and parser. Before we can write the parser, we need the model.

Let's build the classes that represent the model of the Settle language. Most every anatomical part of a Settle program needs a model representation. We'll start with Set. Ruby has a builtin Set class, but we're going to build this worse one that wraps around an array:

Ruby
module Ast
  class Set
    attr_reader :items

    def initialize(items)
      @items = items
    end
  end
end
module Ast
  class Set
    attr_reader :items

    def initialize(items)
      @items = items
    end
  end
end

We place our model classes in a module. A module is an organizational tool that helps us cluster related classes and functions together and avoid name collisions. The attr_reader bit is a method call. We pass it the name of an instance variable—as a symbol—and it defines a getter for that instance variable.

The data in a Settle program could be a set. Suppose someone writes size(proteins). What other kind of data do we need? An integer, for which we write this model class:

Ruby
class Integer
  attr_reader :value

  def initialize(value)
    @value = value
  end
end
class Integer
  attr_reader :value

  def initialize(value)
    @value = value
  end
end

The data are the leaf nodes of an AST. The internal nodes represent operations or structures. We need a model class for each of these:

Now that we have these classes, we can build Settle programs manually.

Ruby
require_relative './ast.rb'

meats = Ast::Set.new(['pork', 'chicken', 'beef'])
plants = Ast::Set.new(['tofu', 'lentils', 'edamame'])
print_set = Ast::Print.new(meats)
print_size = Ast::Print.new(Ast::Size.new(meats))
program = Ast::Program.new([
  meats,
  plants,
  print_set,
  print_size,
])
require_relative './ast.rb'

meats = Ast::Set.new(['pork', 'chicken', 'beef'])
plants = Ast::Set.new(['tofu', 'lentils', 'edamame'])
print_set = Ast::Print.new(meats)
print_size = Ast::Print.new(Ast::Size.new(meats))
program = Ast::Program.new([
  meats,
  plants,
  print_set,
  print_size,
])

Building ASTs manually is not exactly fun, but it's a step. Next week we'll write a parser that operationalizes the grammar to build up a program directly from Settler source.

TODO

Here's your list of things to do before we meet next:

Complete the middle quiz as desired.
Read the introduction to the spreadsheet project and milestone 1.
Start working on milestone 1. Devote a little time each day, and it will be done. Commit to your GitHub repository after every work session.
Keep learning Ruby through YouTube or a book.

See you for our first, in which we continue exploring lexers, grammars, and precedence.

Sincerely,

P.S. It's time for a haiku!

She speaks French; I don't There we were on our big day When she said "Adieu"
← Lecture: Hello RubyLab: Lexing and Parsing →