Grammars
Write a love note to a lexer, and you'll be disappointed. All it will see of your passion is a sequence of tokens. For a love note to do its wooing, the recipient must be understand what it says. Lexers do not attach any meaning to the tokens. You, on the other hand, make sense of text by determining the role of each token. Your brain assembles the tokens into coherent sentences that follow the grammar of your language. You parse out the subject, the action, and direct and indirect objects. Likewise, if a compiler or interpreter wishes to understand the tokens of a program given to it by the lexer, it must parse the tokens according to the grammar of the programming language.
Grammars are important to both language designers and language users. In this course, you will be both. Let's examine how we express grammars. Soon we'll use the grammar to structure the code of the parser.
Dog Grammar
Humans and dogs share a language of short commands. Its grammar looks something like this:
command = SIT
| STAY
| DOWN
| PAW
| WAIT
command = SIT | STAY | DOWN | PAW | WAIT
The =
may be read as “is”. The |
marks a choice and is read as “or”. The notation we use to describe a grammar is called Extended Backus-Naur Form (EBNF).
The identifier command
represents an abstract structure in the grammar. In fact, it is the only abstract structure in this grammar. It can be expanded or made concrete in five ways. To give a command
is to say SIT
, STAY
, DOWN
, PAW
, or WAIT
.
The casing of the terms is significant. Uppercase terms are terminals, which means that they can't be broken down into further grammatical forms. Terminals correspond to tokens from the lexer. Lowercase terms are non-terminals, which are abstract structures that must be expanded into other grammatical forms. The rules that describe how a non-terminal may be expanded are called productions. This grammar has 1 non-terminal, 5 terminals, and 5 productions.
Band Name Grammar
Grammars generally have many non-terminals, terminals, and productions. Not all productions expand immediately to terminals. They may lead to further non-terminals, which must themselves be expanded. For example, consider this grammar for generating the name of a band:
name = THE number adjective noun
| number adjective noun
number = TWO | THREE | FOUR | FIVE
adjective = BEAUTIFUL | FRAGILE | EMPTY | BUSY | LONELY
noun = SIRS | THEORIES | MONTHS
name = THE number adjective noun | number adjective noun number = TWO | THREE | FOUR | FIVE adjective = BEAUTIFUL | FRAGILE | EMPTY | BUSY | LONELY noun = SIRS | THEORIES | MONTHS
By convention, expansion starts with the first non-terminal listed in the grammar. In this case, any utterance that follows this grammar may be expanded from name
.
Programming Grammar
Grammars for programming languages tend to be even more complex, with non-terminals for all the different program constructs like assignments, functions, loops, and expressions. Let's examine a small and imaginary language in which you can store, add, and print integers. Here's an example program written in this language:
a = 10
puts a
puts a + 1
a = a + 1
b = a
puts a + b
a = 10 puts a puts a + 1 a = a + 1 b = a puts a + b
Let's work backward from this program to build the language's grammar. The grammar will describe the rules that any valid program written in this language must follow. A program is 0 or more statements right up to the end of the file, which leads us to this EBNF rule:
program = statement* EOF
program = statement* EOF
The *
is a quantifier that means that its predecessor occurs 0 or more times. Thus a program is 0 or more statements followed by an end-of-file token, which lexers often artificially insert to prevent index-out-bounds errors after all real tokens have been consumed.
A statement is either an assignment or a puts
statement, which leads us to this compound rule:
statement = IDENTIFIER ASSIGN expression
| PUTS expression
statement = IDENTIFIER ASSIGN expression | PUTS expression
An expression is an integer literal, a variable name, or an addition, which leads us to this rule:
expression = INTEGER
| IDENTIFIER
| expression PLUS expression
expression = INTEGER | IDENTIFIER | expression PLUS expression
There are 3 three non-terminals, 7 terminals, and 6 productions in this grammar. Contrast this simple grammar with Python's.