Milestone 1: Model

Dear Computer

Project: Spreadsheet

Milestone 1: Model

In this first milestone, you'll develop the model of your spreadsheet. A model is the set of abstractions that represent the core structures of an application. Models are meant to be separate from any particular user interface, making them reusable and easy to test. In the case of your spreadsheet, your model is a set of abstractions representing a two-dimensional grid of cells and formulas and data that can go in the cells.

Your spreadsheet will eventually have its own programming language for expressing formulas. Milestone 2 is about lexing, parsing, and translating that language into an executable form. This milestone doesn't touch on the language itself, but it does have you build the abstractions that your someday interpreter will use to build and evaluate abstract syntax trees. You'll design the interface for this grid in milestone 3—with hopefully no changes to the model.

Expression Hierarchy

A model of a program is built out of abstractions for assignment statements, conditionals, loops, the many different kinds of expressions, and so on. For this milestone, write abstractions for these expression structures:

If you were writing your spreadsheet in Java (which isn't an option), you'd define each of these as a class and make them all implementations of an interface named Expression. Choose an appropriate tool in your chosen language to define these abstractions. Give each abstraction the following interface:

Do not add any serialization or evaluation methods to these classes. That code will be placed elsewhere.

The constructor's job is to build the root node of a tree representation of the expression. In milestone 2, these constructors will be called by the parser to construct an abstract syntax tree of a spreadsheet program.

The constructors for the primitive abstractions accept primitives from your target language. In a Java-like language, you might create instances of the integer primitives 5 and 3 with this code:

a = new IntegerPrimitive(5);
b = new IntegerPrimitive(3);
a = new IntegerPrimitive(5);
b = new IntegerPrimitive(3);

A cell address primitive represents a column and row pair and would be instantiated with integers like this:

address = new CellAddress(9, 5);
address = new CellAddress(9, 5);

Only the five model primitive constructors may accept raw values like this.

Every other node class will accept other nodes as parameters. These parameter nodes may be of any expression type. For example, you might create an instance of a modulo operation between a sum and a cell with this code:

c = new Add(a, b);
d = new CellRvalue(
  new IntegerPrimitive(4),
  new IntegerPrimitive(7)
);
e = new Modulo(c, d);
c = new Add(a, b);
d = new CellRvalue(
  new IntegerPrimitive(4),
  new IntegerPrimitive(7)
);
e = new Modulo(c, d);

The rvalue and lvalue abstractions accept arbitrary expressions for their column and row values. That means the column and row may be computed rather than literal. For example:

f = new CellRvalue(
  new Add(new IntegerPrimitive(1), new IntegerPrimitive(4)),
  new IntegerPrimitive(7)
);
f = new CellRvalue(
  new Add(new IntegerPrimitive(1), new IntegerPrimitive(4)),
  new IntegerPrimitive(7)
);

In contrast, the cell address abstraction is a primitive. Its column and row are raw integers that do not need to be evaluated.

Operator classes do not accept primitives of the target language. This code, for example, is illegal:

e = new Modulo(5, 3);  // bad: operands must be nodes
e = new Modulo(5, 3);  // bad: operands must be nodes

The statistical functions all accept two cell lvalues representing the top-left cell and the bottom-right cell of a rectangular area over which the value is computed. For example, consider this tree representing a Maximum expression:

max = new Maximum(
  new CellLvalue(
    new IntegerPrimitive(2),
    new IntegerPrimitive(1)
  ),
  new CellLvalue(
    new IntegerPrimitive(6),
    new IntegerPrimitive(5)
  )
);
max = new Maximum(
  new CellLvalue(
    new IntegerPrimitive(2),
    new IntegerPrimitive(1)
  ),
  new CellLvalue(
    new IntegerPrimitive(6),
    new IntegerPrimitive(5)
  )
);

When evaluated, this tree will compute the maximum value found across the 5×5 rectangle of cells between columns 2 and 6 and rows 1 and 5. It considers all 25 numbers in the rectangle, not just the two corners.

Compound operations expressions are built out of simpler subexpressions, like this:

product = new Multiply(
  new Add(
    new IntegerPrimitive(1),
    new IntegerPrimitive(2)
  ),
  new Add(
    new IntegerPrimitive(3),
    new IntegerPrimitive(4)
  )
);
product = new Multiply(
  new Add(
    new IntegerPrimitive(1),
    new IntegerPrimitive(2)
  ),
  new Add(
    new IntegerPrimitive(3),
    new IntegerPrimitive(4)
  )
);

Such complex expressions are painful to assemble by hand. The interpreter you write in the next milestone translates the more compact and conventional (1 + 2) * (3 + 4) into the unwieldy tree representation shown above.

Though you know that Add should only accept number expressions, Or should only accept boolean expressions, and CellLvalue should only accept integer expressions, you should only assume that the operand parameters are of some abstract expression type. The constructors should not typecheck their operands. Your spreadsheet uses a dynamically-typed language, and typechecking requires information that is only available at evaluation time. Suppose one of the operands is a cell rvalue or a variable reference. You won't know the type of value it represents until the tree is evaluated, and you therefore cannot typecheck at construction time.

Nor should any constructor evaluate its operands. Some of these operations require access to other cells of the grid to figure out their value, and these constructors know nothing about the grid. Evaluation will happen later. At the constructor stage, you are only building a tree, not executing it.

Here's some advice: start by implementing just a few of these abstractions, like integer primitives and a couple of arithmetic operations. Complete the serialization and evaluation steps below for this small set. Once these steps are working, go back and implement the remaining abstractions.

Visitor

There are two behaviors that we want abstract syntax trees to offer: serialization (which builds a string representation of the tree) and evaluation (which computes a formula's primitive value). If we are red-blooded object-oriented programmers, we would be inclined to implement these behaviors with methods scattered across our class hierarchy:

interface Expression {
  // Emit a string representation.
  String serialize();

  // Evaluate to an int, float, boolean, string, or address.
  Primitive evaluate(Runtime runtime);
}

class IntegerPrimitive implements Expression {
  // ...

  String serialize() {
    // ...
  }

  Primitive evaluate(Runtime runtime) {
    // ...
  }
}

class Add implements Expression {
  // ...

  String serialize() {
    // ...
  }

  Primitive evaluate(Runtime runtime) {
    // ...
  }
}
interface Expression {
  // Emit a string representation.
  String serialize();

  // Evaluate to an int, float, boolean, string, or address.
  Primitive evaluate(Runtime runtime);
}

class IntegerPrimitive implements Expression {
  // ...

  String serialize() {
    // ...
  }

  Primitive evaluate(Runtime runtime) {
    // ...
  }
}

class Add implements Expression {
  // ...

  String serialize() {
    // ...
  }

  Primitive evaluate(Runtime runtime) {
    // ...
  }
}

This implementation strategy would work, but it will result in code that is messy and hard to maintain. Suppose you have \(n\) abstractions in your expression hierarchy. If you spread the serialize and evaluate operations across the hierarchy, you'll have \(\frac{1}{n}\) of their solutions in one file, \(\frac{1}{n}\) in another, and so on. If something needs to change, you'll have to navigate through all the files. And a full-fledged compiler will support many more operations than just serialization and evaluation.

Suppose we could take all those tiny fragments of the solution and group them together in their own file? We can—by following the visitor design pattern. A visitor is an abstraction of objects that traverse a tree and processes its nodes. We'll make one visitor that serializes the tree and another that evaluates it. In Java, they'd have this common interface:

interface Visitor<R> {
  R visit(IntegerPrimitive node);
  R visit(Add node);
}
interface Visitor<R> {
  R visit(IntegerPrimitive node);
  R visit(Add node);
}

Type parameter R represents the return type of the visitor.

Then we'd change our expression hierarchy to have a universal visit method that works for any Visitor. Each node defers the processing to the method of the visitor appropriate to its type:

interface Expression {
  <R> R visit(Visitor<R> visitor);
}

class IntegerPrimitive implements Expression {
  // ...
  <R> R visit(Visitor<R> visitor) {
    return visitor.visit(this);
  }
}

class Add implements Expression {
  // ...
  <R> R visit(Visitor<R> visitor) {
    return visitor.visit(this);
  }
}
interface Expression {
  <R> R visit(Visitor<R> visitor);
}

class IntegerPrimitive implements Expression {
  // ...
  <R> R visit(Visitor<R> visitor) {
    return visitor.visit(this);
  }
}

class Add implements Expression {
  // ...
  <R> R visit(Visitor<R> visitor) {
    return visitor.visit(this);
  }
}

This reusable visit method in the node classes means that our hierarchy won't get polluted by a bunch of fragments of tree operations. For instance, all the serialization code would be tidily organized in this single class:

class Serializer implements Visitor<String> {
  String visit(IntegerPrimitive node) {
    // ...
  }

  String visit(Add node) {
    // visit left expression
    // visit right expression
    // combine results
  }

  // visit methods for other expression types
}
class Serializer implements Visitor<String> {
  String visit(IntegerPrimitive node) {
    // ...
  }

  String visit(Add node) {
    // visit left expression
    // visit right expression
    // combine results
  }

  // visit methods for other expression types
}

To serialize a tree under this visitor pattern, we call its visit method with a new instance of the Serializer class:

Expression tree = new Add(...);
String text = tree.visit(new Serializer());
Expression tree = new Add(...);
String text = tree.visit(new Serializer());

Evaluation is implemented and performed similarly, but we pass an Evaluator visitor to visit.

If you are using a statically-typed language, you can likely implement the visitor pattern just as described for Java. If you are using a dynamically-typed language, you've got a problem: dynamically-typed languages don't generally support overloading of methods. You can't give a visitor class \(n\) methods named visit with different type signatures. However, you can move the static types into the method names:

class Serializer 
  function visitIntegerPrimitive(node)
    # ...

  function visitAdd(node)
    # visit left
    # visit right
    # combine results
class Serializer 
  function visitIntegerPrimitive(node)
    # ...

  function visitAdd(node)
    # visit left
    # visit right
    # combine results

A consequence of this renaming is that the node classes can't all share an inherited visit method. They each need their own visit method that explicitly calls their renamed method in the visitor:

class IntegerPrimitive
  function visit(visitor)
    visitor.visitIntegerPrimitive(this)

class Add
  function visit(visitor)
    visitor.visitAdd(this)
class IntegerPrimitive
  function visit(visitor)
    visitor.visitIntegerPrimitive(this)

class Add
  function visit(visitor)
    visitor.visitAdd(this)

The visitor pattern is at the heart of many translation tools. If you are using an object-oriented programming language like Ruby, use it in your implementation. If you are using a functional language like Haskell or Rust, you are likely already organizing your code by operation, so the visitor pattern emerges naturally.

Serialize

Implement serialization using the visitor pattern. The end result is a string representation of the tree. You will primarily serialize trees so you can test and debug your code. You won't need to serialize trees for the actual spreadsheet.

The exact format of the serialization is not specified, but the string should show the operations and operands in a readable manner. Consider this expression and its serialization:

product = new Multiply(
  new Add(
    new IntegerPrimitive(1),
    new IntegerPrimitive(2)
  ),
  new IntegerPrimitive(3),
);
text = product.visit(new Serializer(), null);
print(text);
product = new Multiply(
  new Add(
    new IntegerPrimitive(1),
    new IntegerPrimitive(2)
  ),
  new IntegerPrimitive(3),
);
text = product.visit(new Serializer(), null);
print(text);

The text ((1 + 2) * 3) is a reasonable result. Feel free to use parentheses liberally to show precedence. If you have time and interest, add logic to parenthesize only when a child operation has lower precedence than its parent.

Evaluate

Define a second visitor that evaluates an abstract syntax tree. Each node in the expression hierarchy must evaluate to one of your primitive nodes—not one of your target language's primitives. An add operation will yield either your float or integer node. A comparison will yield a boolean node. And so on.

If an expression has operands, the first thing you must do in the visit method is evaluate them. Your whole expression is really a tree, and the evaluate operation recurses through it to produce a primitive. After evaluating an operation's operands, you must perform type checking. Assert that the arithmetic operations have numeric operands that are compatible. Assert that the logical operations have boolean operands. Assert that the bitwise operations have integer operands. Assert that the builtin functions have cell lvalue operands. And so on.

Use your target language's error mechanism to bail from the evaluation on a type error and return control to the caller, which can either handle the error or ignore it. In a later milestone you will have the caller present an error message to the user via the interface.

All told, the method for evaluating an add operation will follow this pseudocode structure:

class Evaluator implements Visitor<Primitive> {
  Primitive visitAdd(Add node) {
    // evaluate left node to model primitive
    // if left's model primitive has a bad type
    //   raise an exception
    // evaluate right node to model primitive
    // if right's model primitive has a bad type
    //   raise an exception
    // return new float or int primitive for sum
  }
}
class Evaluator implements Visitor<Primitive> {
  Primitive visitAdd(Add node) {
    // evaluate left node to model primitive
    // if left's model primitive has a bad type
    //   raise an exception
    // evaluate right node to model primitive
    // if right's model primitive has a bad type
    //   raise an exception
    // return new float or int primitive for sum
  }
}

For many operations, what type of primitive you return depends on the types of the operands. Also, logical operators should short-circuit. If the first operand determines your answer, don't evaluate the second.

The visit methods for evaluating primitives have an easy job. The nodes are already primitives, so they don't need to be evaluated further.

Grid

Some of the operations, like cell rvalues and statistical functions, need to access other cells. That means you need to store the cells in some sort of collection. Define a grid abstraction that manages all the cells. Choose a data structure that makes sense to you. Model each cell as a bundle of three pieces of state:

Give the grid the following operations:

Create a grid and populate it with abstract syntax trees.

Runtime

The grid will persist indefinitely as your program runs, and its job is only to manage the collection of cells. In future milestones, we'll need a separate place to put transient state like local variables or global data that isn't attached to any particular cell. We need something other than the grid for temporary execution data.

Define a Runtime abstraction to manage an executing program's runtime environment. Have it hold a reference to the grid. Pass an instance of Runtime along to the Evaluator constructor, and have Evaluator hang on to it. When an operation's visit method needs to look up a cell, it queries the runtime, which in turn queries the grid. Most expressions will not need to access the runtime.

Test that you can place expressions and evaluate them. If a cell rvalue or statistical function accesses a cell that hasn't been defined, raise an error to which you can respond later on when you build the user interface.

Submission

To submit your milestone, follow the instructions. In particular, do these three things:

In your video, demonstrate adding several expressions to a grid and serializing and evaluating them. Include at least these expressions that are expressed in a hypothetical syntax:

The syntax isn't relevant yet since you're building trees manually out of your model classes. #[...] is a cell rvalue, and [...] is a cell lvalue. Ensure that evaluating these produces the correct results. Show also some expressions that fail to typecheck.

In your video, don't comment on or show every single line of code in your video. Select a few representative snippets. Do comment on programming language ideas that interest you or challenged you.

← Recording and SubmittingMilestone 2: Interpreter →