Lecture: Grammars and Lexing

Dear students:

Last week we turned programs written in an unknown language into tree representations in a known language. We also looked at how to operate on such trees. In particular, we wrote visitors to translate and evaluate the trees. So far we've been dealing with small trees that we can build by hand. Since we don't generally write small programs, we need a machine that turns big programs into big trees. That machine must turn a flat sequence of text into a tree of arbitrary depth. How do we write such a machine? Today we'll find out.

Grammar

To build a machine that understands programs written in some language, we need the rules that describe the structure of a program. These rules are our language's grammar.

The rule for a fraction might look like this:

fraction = INT / INT

Capitalized words aren't broken down further. We call them terminals. Uncapitalized words are therefore non-terminals. Some non-terminals can expand to one of several possible forms. The possible expansions are productions. In the rule above, INT is a terminal, and fraction is a non-terminal with a single production.

In a program, an expression is a structure that yields a value. This is in contrast to a statement, which doesn't produce a value but does achieve some side effect, like printing. We might try writing the grammar of expressions with this rule:

expression = expression + expression
           | fraction

There's a problem with this rule, however. It allows 1/4 + 5/6 + 3/8 to be parsed in two different ways: as (1/4 + 5/6) + 3/8 or as 1/4 + (5/6 + 3/8). We know that addition is left-associative, so we tweak the rule to only allow recursion in the left operand:

expression = expression + fraction
           | fraction

fraction = INT / INT

Suppose now we want to add multiplication to the language. We might try adding a production to expression:

expression = expression + fraction
           | expression * fraction
           | fraction

fraction = INT / INT

Do you see a problem here? Think of an expression that would not form the expected AST given these rules.

These rules force us to parse a + b * c as (a + b) * c. Multiplication is supposed to have higher precedence. We achieve that by making a ladder of rules. Low-precedence operators show up higher in the grammar and therefore higher in the AST.

level0 = level0 + level1
       | level1

level1 = level1 * levelN
       | levelN

levelN = INT / INT
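
Check that this ladder fixes our earlier problem: 1/2 + 3/4 * 5/6 now parses as (1/2) + ((3/4) * (5/6)), because the multiplication must be built inside a level1, which hangs below the + of level0.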

At each level, we have two possible actions: descend to the next level or recurse on the same level.

Where do we put integer literals? On the bottom rung. Where do we put parenthesized expressions? On the bottom rung. Everything on the bottom rung can be thought of as having really high precedence, like literals, parenthesized groupings, function calls, variables, and so on.
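
For instance, we might hang parentheses on the bottom rung like this, assuming LEFT_PAREN and RIGHT_PAREN terminals from the lexer:

levelN = INT / INT
       | LEFT_PAREN level0 RIGHT_PAREN

Notice how the production jumps back up to level0, restarting the precedence ladder inside the parentheses.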

We can add a few more rules to describe other non-expression structures, like Program and Statement:

program = statement program
        | EOF

statement = PRINT expression
          | ID EQUALS expression
          | expression

A grammar gives shape to a language. It's also a plan for the code that we're about to write. Consider this fraction program:

Ruby
print 5/10 + 28/4

By fitting the source code to the grammar, we see that we have a program with one statement that prints an expression in which two fractions are added.
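
To see what the parser will eventually hand us, here's that same tree built by hand in Ruby. The node classes are hypothetical stand-ins; the ones we wrote last week may be shaped differently.

Ruby
# Hypothetical node classes, just for illustration.
Fraction = Struct.new(:numerator, :denominator)
Add = Struct.new(:left, :right)
Print = Struct.new(:expression)

# The tree for "print 5/10 + 28/4": one statement printing a sum.
tree = Print.new(
  Add.new(
    Fraction.new(5, 10),
    Fraction.new(28, 4)
  )
)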

We could build a tree representation of this program by hand, but that doesn't scale. We want to build a tool that turns text into a tree. The problem is easier if we build two tools, a lexer to break the text into small chunks and a parser to organize the chunks into structures. Our brain does something similar when we read. We chunk the letters into words, and then organize the words into grammatical structures. We'll start writing a lexer today and a parser next week.

Lexing

The words of a program are its lexemes or tokens. They correspond to the terminal types found in our grammar. What are the tokens of the program above? Here are a few: print, 5, /, 10, and +.

Many of them are just a single character, but that's not a rule. Operators could be multiple symbols, like >=. We need to build a machine that eats characters and reports when it has digested a token. Such a machine is called a lexer. Let's build one for our language. But let's think before we code.

State Machines

Lexing is accomplished by feeding characters through a state machine, which hops from state to state as it consumes characters and accepts when it has seen a complete token. Together we'll draw state machines that recognize a few word-like tokens.

With your neighbors, pick one of the token types below and draw a state machine that recognizes it:

  1. A textual description of a musical note (for readers of music only; how do you communicate its possible qualities?)
  2. T-shirt sizes (XS, S, M, L, and XL with an arbitrary number of extra Xs)
  3. Integral dimensions like 6×4×12 or 640×480
  4. ASCII single-ended or double-ended arrows with a shaft of at least one hyphen, like <----, ->, <----->, <->, or --->
  5. Parenthesized numbers between (15) and (130)
  6. A sequence of moves in cardinal directions (N, E, S, W) but with no more than a single move south

Be prepared to draw your machine on the board.
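
To make the exercise concrete, here's a minimal sketch in Ruby of the simplest machine of this kind: a two-state recognizer for unsigned integers, driven by an explicit transition table. Take the encoding as one assumed possibility, not the code we'll write together.

Ruby
# The transition table maps [state, input class] to the next state.
# A missing entry means there's no legal transition, so we reject.
TRANSITIONS = {
  [:start, :digit]   => :integer,
  [:integer, :digit] => :integer,
}

def integer?(text)
  state = :start
  text.each_char do |c|
    kind = c =~ /[0-9]/ ? :digit : :other
    state = TRANSITIONS[[state, kind]]
    return false if state.nil?
  end
  state == :integer   # accept only if we end in the accepting state
end

puts integer?('640')   # true
puts integer?('6x40')  # false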

Lexer

Lexers are made out of many state machines. We design a state machine for each type of token in our language. Many machines will have just a single accepting state and a single transition. The integer machine will have two states with a loop.

Ultimately, we need these machines running in code. Here's how we might write a routine for lexing a forward-slash:

Ruby
if has('/')
  capture
  emit_token(:forward_slash)
end

Those little machines recognize only one type of token. We need a mega-machine that recognizes all of them, so we glue them together. Once they are glued, we write the entire lexing routine, which might look like this:

Ruby
if has('/')
  capture
  emit_token(:forward_slash)
elsif has('+')
  capture
  emit_token(:plus)
elsif has_digit
  while has_digit
    capture
  end
  emit_token(:integer)
end

This code belongs in a lexer abstraction. We'll write it together and also define the helper methods like has, has_digit, capture, and emit_token. Once it can lex simple expressions, we'll expand it to support negative integers and other operators.
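
Here's a minimal sketch of what that abstraction might look like, with helper names matching the routines above. Treat it as an assumed shape rather than the class we'll build together.

Ruby
# A bare-bones lexer. It walks the source one character at a time,
# capturing characters into a buffer and emitting tokens from it.
class Lexer
  attr_reader :tokens

  def initialize(source)
    @source = source
    @i = 0          # index of the next unread character
    @buffer = ''    # characters captured for the token in progress
    @tokens = []
  end

  # True if the next character is c.
  def has(c)
    @i < @source.length && @source[@i] == c
  end

  # True if the next character is a digit.
  def has_digit
    @i < @source.length && @source[@i] =~ /[0-9]/
  end

  # Consume the next character into the buffer.
  def capture
    @buffer += @source[@i]
    @i += 1
  end

  # Package the buffered text up as a token and reset the buffer.
  def emit_token(type)
    @tokens << [type, @buffer]
    @buffer = ''
  end
end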

TODO

Here's your list of things to do before we meet next:

Complete the middle quiz as desired.
Keep working on milestone 1. Devote a little time each day, and it will be done. Commit to your GitHub repository after every work session.
Keep learning Ruby through YouTube or a book.

See you next time.

Sincerely,

P.S. It's time for a haiku!

Machines grasp language
Does this mean that they're human?
Or that we're machine?