Milestone 2: Interpreter
In this second milestone, you'll develop an interpreter for your spreadsheet. The interpreter will translate arbitrary source code from the spreadsheet programmer into the model classes that you wrote for milestone 1.
Grammar
Write out a grammar for your spreadsheet's expression syntax in a plain text file. Use the EBNF syntax discussed in the readings and lecture. Treat this grammar with tender devotion because it will help you identify the tokens identified by your lexer and the routines that structure your parser.
There should be five or more levels of precedence in your grammar. Your starting non-terminal should be expression
. It expands directly to a non-terminal (level0
) for the operations at the top rung of your precedence ladder. That non-terminal leads to a non-terminal of the next precedence level. And so on. The bottom rung of the grammar (levelN
) includes the primitives, cell lvalues and rvalues, parenthesized expressions, and statistical functions.
In your grammar, include all the operators for which you created models in the first milestone. In a later milestone, you'll introduce non-terminals for blocks, loops, conditionals, assignments, and so on.
Token
Define a token abstraction. Model an individual token as its type, source text, and starting and ending indices in the source code. You'll need the text and indices to produce useful error messages like I found undeclared variable "slar" at index 7.
Lexer
Define a lexer that accepts an expression in text form and tokenizes it into a list of tokens. For example, lexing the expression 5 <= 32.0
yields this list in pseudocode:
[
(:integer_literal, "5", 0, 0),
(:less_than_or_equal, "<=", 2, 3),
(:float_literal, "32.0", 5, 8)
]
[ (:integer_literal, "5", 0, 0), (:less_than_or_equal, "<=", 2, 3), (:float_literal, "32.0", 5, 8) ]
The integers are the indices of the token in the source code.
Lexers can get messy and hard to maintain without good abstractions. Follow the strategy for designing lexers described in the reading and in lecture using the vocabulary of has
, capture
, emit_token
, and so on. Do not assume that tokens are separated by whitespace. Calling a split method to divide up the source into tokens is not an appropriate alternative to the algorithm we discussed in class. Both 7+2
and 7 + 2
may be lexed with the algorithm with discussed, but not with a split.
Ensure that your lexer doesn't overstep its authority. It doesn't try to make sense of the tokens, doesn't ensure that they are in a particular order, doesn't reference any of your model classes, and doesn't construct or evaluate any abstract syntax trees. It just chunks characters into tokens.
Parser
Define a parser that accepts a list of tokens and assembles an abstract syntax tree using the model abstractions you wrote in milestone 1. For example, parsing the expression 5 <= 32.0
yields this pseudocode structure:
new LessThanOrEqual(
new IntegeralLiteral(5, (0, 0)),
new FloatLiteral(32.0, (5, 8)),
(0, 8)
)
new LessThanOrEqual( new IntegeralLiteral(5, (0, 0)), new FloatLiteral(32.0, (5, 8)), (0, 8) )
Note that the model abstractions must now track an additional piece of state: the indices of the source that correspond to the node. You'll need to store these locations so that you can generate error messages when the tree is evaluated.
Include these helper methods in your parser:
parse
to parse the complete list of tokens and return the root node of the ASThas
to see if the next token has a certain typeadvance
to move forward in the tokens listDefine a method for each non-terminal in your grammar. The parsing starts at expression
and works its way down through all precedence levels to the bottom run. As you implement these methods, model the sequences and alternatives of the productions with conditional statements, loops, and calls to other non-terminal methods.
Each non-terminal method should return an instance of one of your model abstractions. After all the tokens have been gobbled up by your non-terminal methods, the final value returned by parse
is the root of the abstract syntax tree.
Recall from the lecture that left-associative grammar rules translate directly into methods that will crash your interpreter. Consider for example, this pair of productions for binary operators @
and #
, which are of equal precedence:
lower = lower @ higher
| lower # higher
| higher
lower = lower @ higher | lower # higher | higher
A direct translation would look like this:
function lower()
left = lower()
if has(Token::At)
advance
right = higher()
return new Ast::At(left, right)
else has(Token::Hash)
advance
right = higher()
return new Ast::Hash(left, right)
function lower() left = lower() if has(Token::At) advance right = higher() return new Ast::At(left, right) else has(Token::Hash) advance right = higher() return new Ast::Hash(left, right)
This implementation leads to infinite recursion on the first line of the method. We fix the issue by replacing the recursive call with its eventual base case, which is a call to higher
, replacing the recursion with iteration, and accumulating the tree into left
:
function lower()
left = higher()
while has(Token::At) or has(Token::Hash)
if has(Token::At)
advance
right = higher()
left = new Ast::At(left, right)
else has(Token::Hash)
advance
right = higher()
left = new Ast::Hash(left, right)
return left
function lower() left = higher() while has(Token::At) or has(Token::Hash) if has(Token::At) advance right = higher() left = new Ast::At(left, right) else has(Token::Hash) advance right = higher() left = new Ast::Hash(left, right) return left
Right-associative operators like exponentiation do not have this problem of infinite recursion. Their productions are translated directly into recursive code. They are not expressed as cleanly with a loop.
Parsers sit at the boundary between the messy human and the exacting computer, so your implementation should detect and gracefully handle errors in the programmer's source code. If you encounter an unexpected token or run out of tokens early, throw an exception with an explanatory message and the location in the source code where the error occurred. In milestone 3, you'll catch that exception and show the message to the user.
Write tests that lex and parse the source code for a variety of expressions, evaluate the ASTs, and assert that they produce the expected primitive values. Ensure that your parser doesn't overstep its authority. It doesn't try to perform any type checking or evaluate any expressions. It only builds trees.
Submission
To submit your milestone, follow the instructions. In particular, do these three things:
- Submit your project to the official GitHub repository that your instructor made for you.
- Record a 3–5 minute screencast of you walking through your code and commenting on it. Do not exceed 5 minutes. Your instructor has many videos to watch. Post it to Canvas Studio according to the instructions.
- Complete the ready date submission form on Canvas.
In your video, demonstrate lexing, parsing, and evaluating several different kinds of expressions that include casting, cell references, and operators from different precedence levels. Include at least these expressions:
-
Arithmetic:
(5 + 2) * 3 % 4
-
Rvalue lookup and shift:
#[0, 0] + 3
-
Rvalue lookup and comparison:
#[1 - 1, 0] < #[1 * 1, 1]
-
Logic and comparison:
(5 > 3) && !(2 > 8)
-
Sum:
1 + sum([0, 0], [2, 1])
-
Casting:
float(10) / 4.0
-
Exponentiation:
2 ** 3 ** 2
-
Negation and bitwise and:
45 & ---(1 + 3)
Your syntax might differ from these examples.
Show also that malformed code yields meaningful error messages that the programmer can use to fix the problem.
Don't comment on or show every single line of code in your video. Select a few representative snippets. Do comment on programming language ideas that interest you or challenged you.