Lab: Lexing and Parsing
Your goal in this lab is to learn more about lexing and parsing. This exercise is designed to help you understand what the interpreters and compilers you use every day do once they receive your source code and to help prepare you for writing your own lexer and parser for the spreadsheet project, which will be part of milestone 2.
Alchemoji
You are designing an alchemy notation system for a client. The base materials are emoji. A material may be joined with another using the ↑
operator. Or a material may be removed from another using the ↓
operator. The joining and removing yields a single new material, which is marked by the →
operator. For example:
-
🏠 ↓ 💰 → 🏚️
-
🦨 ↑ 📦 ↑ 📫 → 🚓
-
🧛♂️ ↑ 🙋♀️ ↓ 🩸 → 🧛♀️
Complete the following exercises to help you reason about how a grammar defines the structure of alchemoji equations. Write your answers on paper or in an electronic document.
-
Ignore the emoji language for a moment, and imagine a different language that allows integers and a binary
@
operator. You see the expression5 @ 2 @ 9
. Draw the abstract syntax tree of this expression if the language has this grammar:Note that an abstract syntax tree is simpler than a parse tree. The AST's internal nodes should be operators (expression = expression @ INT | INT
expression = expression @ INT | INT
@
for this language), and its leaf nodes are integer terminals. Verify the correctness of your tree using Treeformer. Paste the grammar in the textbox, click abstract syntax tree to place the root node, and expand nodes to match your solution. If you can't generate a tree whose shape matches your own, your solution has a problem. -
Draw the abstract syntax tree of
5 @ 2 @ 9
if the language has this grammar:Use Treeformer to verify the correctness of your tree.expression = INT @ expression | INT
expression = INT @ expression | INT
-
Consider the arithmetic expression
a + b + c
. Given the conventions of mathematics, what does its abstract syntax tree look like? -
Consider the arithmetic expression
2 ^ 3 ^ 2
. That's exponentiation, not xor. Given the conventions of mathematics, what does its abstract syntax tree look like? Verify your understanding with a calculator like this one. -
Investigate the terms left-associative and right-associative. How do these terms apply to the arithmetic
+
and-
operators? How about arithmetic^
? -
Draw the abstract syntax tree of
1 @ 2 @ 3 # 4 # 5
if the language has this grammar:Verify the correctness of your tree using Treeformer. What are the associativities ofatty = atty @ hashy | hashy hashy = INT # hashy | INT
atty = atty @ hashy | hashy hashy = INT # hashy | INT
@
and#
? Which operator has higher precedence? - We now return to the alchemical emoji language. Draw abstract syntax trees for the three example expressions above.
- Did you make the alchemoji operators left- or right-associative?
-
Write a grammar that describes the structure of an alchemical equation based on the examples given. Use
equation
,join
, andremove
as non-terminals. Make→
have the lowest precedence,↓
middle precedence, and↑
highest precedence. Revisit the logical operations grammar to consider a similar grammar. - Paste your grammar into Treeformer. Adjust it until you can generate the trees for the example alchemoji equations.
-
Craft your own alchemoji expression that has a mix of at least three materials left of the yield. Use at least one
↑
and at least one↓
. Tell a story with your expression. - Draw an abstract syntax tree for your expression. Verify that you can generate it in Treeformer.
Range Lexing
Consider a language in which numeric ranges like these can be expressed:
-5..7
8.2..15
3..-5
-5..7 8.2..15 3..-5
Complete the following exercises to help you reason about how a lexer recognizes the tokens of this language.
- The tokens of this language are either integer literals, float literals, or the range operator. Draw three separate state machines for these three types of tokens. Note that numeric literals may be negative. Assume that floats have numbers on both sides of the decimal point.
- Draw a new a state machine that merges the three separate machines into one. On any given run, the machine matches only a single token, not an entire range sequence. It should have three different accepting states.
-
Translate your unified state machine into pseudocode for a
lex
function that returns a list of all valid tokens found in a string. Follow the approach we used in lecture for Settle. Assume that routineshas(char)
,has_digit
,capture
,emit_token(type)
, anderror
have already been written.
Submission
To receive credit for this lab, you must submit a digital version of your answers on Canvas by Monday noon. Late labs or forgot-to-submits are not accepted because Monday at noon is when your instructor has time to grade.