Milestone 1: Model

Dear Computer

Project: Spreadsheet

Milestone 1: Model

In this first milestone, you'll develop the model of your spreadsheet. A model is the set of abstractions that represent the core data and tasks of an application. Models are meant to be separate from any particular user interface, making them reusable and easy to test. In the case of your spreadsheet, your model is a set of abstractions representing a two-dimensional grid of cells and formulas and literal values that can go in the cells.

Your spreadsheet will eventually have its own programming language for expressing formulas. Milestone 2 is about lexing, parsing, and translating that language into an executable form. This milestone doesn't touch on the language itself, but it does have you build the abstractions that your someday interpreter will use to build and evaluate abstract syntax trees. You'll design the interface for this grid in milestone 3—with hopefully no changes to the model.

Expression Hierarchy

A model of a program is built out of abstractions for assignment statements, conditionals, loops, the many different kinds of expressions, and so on. For this milestone, write abstractions for just these expression structures:

If you were writing your spreadsheet in Java, you'd define each of these as a class and make them all implementations of an interface named Expression. Choose an appropriate tool in your chosen language to define these abstractions. Give each abstraction the following interface:

Do not add any evaluation methods to these classes. That code will be placed elsewhere.

The constructor's job is to build the root node of a tree representation of the expression. In milestone 2, these constructors will be called by the parser to construct an abstract syntax tree of a spreadsheet program.

The constructors for the primitive abstractions accept primitives from your target language. In a Java-like language, you might create instances of the integer primitives 5 and 3 with this code:

a = new IntegerPrimitive(5);
b = new IntegerPrimitive(3);
a = new IntegerPrimitive(5);
b = new IntegerPrimitive(3);

A cell address primitive represents a column and row pair and would be instantiated with integers like this:

address = new CellAddress(9, 5);
address = new CellAddress(9, 5);

The constructors for all the operation abstractions accept one or two values that could be of any expression type. For example, you might create an instance of a modulo operation between a sum and a cell with this code:

c = new Add(a, b);
d = new CellRvalue(
  new IntegerPrimitive(4),
  new IntegerPrimitive(7)
);
e = new Modulo(c, d);
c = new Add(a, b);
d = new CellRvalue(
  new IntegerPrimitive(4),
  new IntegerPrimitive(7)
);
e = new Modulo(c, d);

The rvalue and lvalue abstractions accept arbitrary expressions for their column and row values. That means the column and row can be computed. For example:

f = new CellRvalue(
  new Add(new IntegerPrimitive(1), new IntegerPrimitive(4)),
  new IntegerPrimitive(7)
);
f = new CellRvalue(
  new Add(new IntegerPrimitive(1), new IntegerPrimitive(4)),
  new IntegerPrimitive(7)
);

In contrast, the cell address abstraction is a primitive. Its column and row are raw integers that do not need to be evaluated.

Operator classes do not accept primitives of the target language. This code, for example, is illegal:

e = new Modulo(5, 3);  // bad: operands must be expressions
e = new Modulo(5, 3);  // bad: operands must be expressions

The statistical functions all accept two cell lvalues representing the top-left cell and the bottom-right cell of a rectangular area over which the value is computed. For example, consider this tree representing a Maximum expression:

max = new Maximum(
  new CellLvalue(
    new IntegerPrimitive(2),
    new IntegerPrimitive(1)
  ),
  new CellLvalue(
    new IntegerPrimitive(6),
    new IntegerPrimitive(5)
  )
);
max = new Maximum(
  new CellLvalue(
    new IntegerPrimitive(2),
    new IntegerPrimitive(1)
  ),
  new CellLvalue(
    new IntegerPrimitive(6),
    new IntegerPrimitive(5)
  )
);

When evaluated, this tree will compute maximum value found across the 5×5 rectangle of cells between columns 2 and 6 and rows 1 and 5.

Compound operations expressions are built out of simpler subexpressions, like this:

product = new Multiply(
  new Add(
    new IntegerPrimitive(1),
    new IntegerPrimitive(2)
  ),
  new Add(
    new IntegerPrimitive(3),
    new IntegerPrimitive(4)
  )
);
product = new Multiply(
  new Add(
    new IntegerPrimitive(1),
    new IntegerPrimitive(2)
  ),
  new Add(
    new IntegerPrimitive(3),
    new IntegerPrimitive(4)
  )
);

Such complex expressions are painful to assemble by hand. The interpreter you write in the next milestone translates the more compact and conventional (1 + 2) * (3 + 4) into the unwieldy target representation shown above.

Though you know that Add should only accept number expressions, Or should only accept boolean expressions, and CellLvalue should only accept integer expressions, you should only assume that the operand parameters are of some abstract expression type. The constructors should not typecheck their operands. Your spreadsheet uses a dynamically-typed language, and typechecking requires information that is only available at evaluation time. Suppose one of the operands is a cell rvalue or a variable reference. You won't know the type of value it represents until the tree is evaluated, and you therefore cannot typecheck at construction time.

Nor should any constructor evaluate its operands. Some of these operations require access to other cells of the grid to figure out their value, and these constructors know nothing about the grid. Evaluation will happen later. At the constructor stage, you are only building a tree, not reducing it down to a primitive.

Here's some advice: start by implementing just a few of these abstractions, like integer primitives and a couple of arithmetic operations. Complete the serialization and evaluation steps below for this small set. Once these steps are working, go back and implement the remaining abstractions.

Visitor

There are two behaviors that we want abstract syntax trees to offer: serialization (which builds a string representation of the tree) and evaluation (which computes a formula's primitive value). If we are red-blooded object-oriented programmers, we would be inclined to implement these behaviors with methods scattered across our class hierarchy:

interface Expression {
  // Emit a string representation.
  String serialize();

  // Evaluate to an int, float, boolean, string, or address.
  Primitive evaluate(Runtime runtime);
}

class IntegerPrimitive implements Expression {
  // ...

  String serialize() {
    // ...
  }

  Primitive evaluate(Runtime runtime) {
    // ...
  }
}

class Add implements Expression {
  // ...

  String serialize() {
    // ...
  }

  Primitive evaluate(Runtime runtime) {
    // ...
  }
}
interface Expression {
  // Emit a string representation.
  String serialize();

  // Evaluate to an int, float, boolean, string, or address.
  Primitive evaluate(Runtime runtime);
}

class IntegerPrimitive implements Expression {
  // ...

  String serialize() {
    // ...
  }

  Primitive evaluate(Runtime runtime) {
    // ...
  }
}

class Add implements Expression {
  // ...

  String serialize() {
    // ...
  }

  Primitive evaluate(Runtime runtime) {
    // ...
  }
}

This implementation strategy would work, but it will result in code that is messy and hard to maintain. Suppose you have \(n\) abstractions in your expression hierarchy. If you spread the serialize and evaluate operations across the hierarchy, you'll have \(\frac{1}{n}\) of their solutions in one file, \(\frac{1}{n}\) in another, and so on. If something needs to change, you'll have to navigate through all the files. And a full-fledged compiler will support many more operations than just serialization and evaluation.

Suppose we could take all those tiny fragments of the solution and group them together in their own file? We can—by following the visitor design pattern. A visitor is an abstraction of objects that traverse a tree and processes its nodes. We'll make one visitor that serializes the tree and another that evaluates it. In Java, they'd have this common interface:

interface Visitor<P, R> {
  R visit(IntegerPrimitive node, P payload);
  R visit(Add node, P payload);
}
interface Visitor<P, R> {
  R visit(IntegerPrimitive node, P payload);
  R visit(Add node, P payload);
}

Type parameter R represents the return type of the visitor, and type parameter P represents some arbitrary bundle of data that gets carried through the tree traversal.

Then we'd change our expression hierarchy to have a universal traverse method that works for any Visitor. Each node defers the processing to the method of the visitor appropriate to its type:

interface Expression {
  <P, R> R traverse(Visitor<P, R> visitor, P payload);
}

class IntegerPrimitive implements Expression {
  // ...
  <P, R> R traverse(Visitor<P, R> visitor, P payload) {
    return visitor.visit(this, payload);
  }
}

class Add implements Expression {
  // ...
  <P, R> R traverse(Visitor<P, R> visitor, P payload) {
    return visitor.visit(this, payload);
  }
}
interface Expression {
  <P, R> R traverse(Visitor<P, R> visitor, P payload);
}

class IntegerPrimitive implements Expression {
  // ...
  <P, R> R traverse(Visitor<P, R> visitor, P payload) {
    return visitor.visit(this, payload);
  }
}

class Add implements Expression {
  // ...
  <P, R> R traverse(Visitor<P, R> visitor, P payload) {
    return visitor.visit(this, payload);
  }
}

This reusable traverse method means that our hierarchy won't get polluted by a bunch of fragments of tree operations. For instance, all the serialization code would be tidily organized in this single class:

class Serializer implements Visitor<Void, String> {
  String visit(IntegerPrimitive node, Void nothing) {
    // ...
  }

  String visit(Add node, Void nothing) {
    // visit left expression
    // visit right expression
    // combine results
  }

  // visit methods for other expression types
}
class Serializer implements Visitor<Void, String> {
  String visit(IntegerPrimitive node, Void nothing) {
    // ...
  }

  String visit(Add node, Void nothing) {
    // visit left expression
    // visit right expression
    // combine results
  }

  // visit methods for other expression types
}

To serialize a tree under this visitor pattern, we call its traverse method with a new instance of the Serializer class:

Expression tree = new Add(...);
String text = tree.traverse(new Serializer(), null);
Expression tree = new Add(...);
String text = tree.traverse(new Serializer(), null);

Evaluation is implemented and performed similarly. Unlike serialization, it will require a payload, as described later. For serialization, the payload is null.

If you are using a statically-typed language, you can likely implement the visitor pattern just as described for Java. If you are using a dynamically-typed language, you've got a problem: dynamically-typed languages don't generally support overloading of methods. You can't give a visitor class \(n\) methods named visit with different type signatures. However, you can move the static types into the method names:

class Serializer 
  function visitIntegerPrimitive(node, payload)
    # ...

  function visitAdd(node, payload)
    # traverse left
    # traverse right
    # combine results
class Serializer 
  function visitIntegerPrimitive(node, payload)
    # ...

  function visitAdd(node, payload)
    # traverse left
    # traverse right
    # combine results

The gotcha is that the expression classes can't just all have the same traverse method. They need to explicitly call their renamed method:

class IntegerPrimitive
  function traverse(visitor, payload)
    visitor.visitIntegerPrimitive(this, payload)

class Add
  function traverse(visitor, payload)
    visitor.visitAdd(this, payload)
class IntegerPrimitive
  function traverse(visitor, payload)
    visitor.visitIntegerPrimitive(this, payload)

class Add
  function traverse(visitor, payload)
    visitor.visitAdd(this, payload)

The visitor pattern is at the heart of many translation tools. If you are using an object-oriented programming language like Ruby, use it in your implementation. If you are using a functional language like Haskell or Rust, you are likely already organizing your code by operation, so the visitor pattern emerges naturally.

Serialize

Implement serialization using the visitor pattern. The end result is a string representation of the tree. You will primarily serialize trees as you test your code.

The format is not specified, but the string should show the operations and operands in a readable manner. Consider this expression and its serialization:

product = new Multiply(
  new Add(
    new IntegerPrimitive(1),
    new IntegerPrimitive(2)
  ),
  new IntegerPrimitive(3),
);
text = product.traverse(new Serializer(), null);
print(text);
product = new Multiply(
  new Add(
    new IntegerPrimitive(1),
    new IntegerPrimitive(2)
  ),
  new IntegerPrimitive(3),
);
text = product.traverse(new Serializer(), null);
print(text);

The text ((1 + 2) * 3) is a reasonable result. Feel free to use parentheses liberally to show precedence. If you have time and interest, add logic to parenthesize only when a child operation has lower precedence than its parent.

Evaluate

Define a second visitor that evaluates an abstract syntax tree. Each node in the expression hierarchy must evaluate to one of your primitive abstractions—not one of your target language's primitives. An add operation will yield either your float or integer primitive. A comparison will yield a boolean primitive. And so on.

If an expression has operands, the first thing you must do in the visit method is evaluate them. Your whole expression is really a tree, and the evaluate operation recurses through it to produce a primitive. After evaluating an operation's operands, you must perform type checking. Assert that the arithmetic operations have numeric operands that are compatible. Assert that the logical operations have boolean operands. Assert that the bitwise operations have integer operands. Assert that the builtin functions have cell lvalue operands. And so on.

Use your target language's error mechanism to bail from the evaluation on a type error and return control to the caller, who can either handle the error or ignore it. In a later milestone you will have the caller present an error message to the user via the interface.

All told, the method for evaluating an add operation will follow this pseudocode structure:

class Evaluator implements Visitor<Runtime, Primitive> {
  Primitive visitAdd(Add node, Runtime runtime) {
    // evaluate left node to model primitive
    // if left's model primitive has a bad type
    //   raise an exception
    // evaluate right node to model primitive
    // if right's model primitive has a bad type
    //   raise an exception
    // return new float or int primitive for sum
  }
}
class Evaluator implements Visitor<Runtime, Primitive> {
  Primitive visitAdd(Add node, Runtime runtime) {
    // evaluate left node to model primitive
    // if left's model primitive has a bad type
    //   raise an exception
    // evaluate right node to model primitive
    // if right's model primitive has a bad type
    //   raise an exception
    // return new float or int primitive for sum
  }
}

For many operations, what type of primitive you return depends on the types of the operands. Also, logical operators should short-circuit. If the first operand determines your answer, don't evaluate the second.

The visit methods for evaluating primitives have an easy job. The nodes are already primitives, so they don't need to be evaluated further.

Grid

Some of the operations, like cell rvalues and statistical functions, require access to other cells. That means you need to store the cells in some sort of collection. Define a grid abstraction that manages all the cells. Choose a data structure that makes sense to you. Model each cell as a bundle of three pieces of state:

Give the grid the following operations:

Create a grid and populate it with abstract syntax trees.

Runtime

The grid will persist indefinitely as your program runs, and its job is only to manage the collection of cells. In future milestones, we'll need a separate place to put transient state like local variables or global data that isn't attached to any particular cell. We need something other than the grid for temporary execution data.

Define a Runtime abstraction to manage an executing program's runtime environment. Have it hold a reference to the grid. Pass it along as the payload to the evaluation visitor. When an operation needs to look up a cell, it queries the runtime, which in turn queries the grid. Thread this Runtime object through all the visit calls. Most expressions will not access the runtime directly, but they may lead to a child node that does. Do not forget to pass the runtime along. A statically-typed language will not let you forget, but a dynamically-typed language like Ruby will not check your work.

Test that you can place expressions and evaluate them. If a cell rvalue or statistical function accesses a cell that hasn't been defined, raise an error to which you can respond later on when you build the user interface.

Submission

To submit your milestone, follow the instructions. In particular, do these three things:

In your video, demonstrate adding several expressions to a grid and serializing and evaluating them. Include at least these expressions that are expressed in a hypothetical syntax:

The syntax isn't relevant yet since you're building trees manually out of your model classes. #[...] is a cell rvalue, and [...] is a cell lvalue. Ensure that evaluating these produces the correct results. Show also some expressions that fail to typecheck.

In your video, don't comment on or show every single line of code in your video. Select a few representative snippets. Do comment on programming language ideas that interest you or challenged you.

← Recording and SubmittingMilestone 2: Interpreter →