Grammars
Write a love note to a lexer, and you'll be disappointed. All it will see of your passion is a sequence of tokens. For a love note to do its wooing, the recipient must understand what it says. Lexers do not attach any meaning to the tokens. You, on the other hand, make sense of text by assembling the tokens into coherent sentences according to the grammar of your language. You parse out the subject, the action, and direct and indirect objects. Likewise, if a compiler or interpreter wishes to understand the tokens of a program given to it by the lexer, it must parse the tokens according to the programming language's grammar.
Grammars are important to both language designers and language users. They communicate a language's syntax, and they effectively outline write the parser code.
Dog Grammar
Humans and dogs share a language of short commands. Its grammar looks something like this:
command = SIT
| STAY
| DOWN
| PAW
| WAIT
command = SIT
| STAY
| DOWN
| PAW
| WAITThe = may be read as “is”. The | marks a choice and is read as “or”. The notation we use to describe a grammar is a simplified version of Backus-Naur Form (BNF).
The identifier command represents an abstract structure in the grammar. In fact, it is the only abstract structure in this grammar. To express a command is to say one of SIT, STAY, DOWN, PAW, or WAIT.
The casing of the grammar's terms is significant. Uppercase terms—and terms made of punctuation—are terminals, which means that they can't be broken down into simpler grammatical forms. Terminals correspond to tokens from the lexer. Lowercase terms are non-terminals, which are abstract descriptions of grammatical forms. A non-terminal is a list of one or more rules called productions. This grammar has 1 non-terminal, 5 terminals, and 5 productions.
Band Name Grammar
In general, grammars have many non-terminals, terminals, and productions. Not all productions expand immediately to terminals. They may lead to further non-terminals, which must themselves be expanded. For example, consider this grammar for generating the name of a band:
name = THE number adjective noun
| number adjective noun
number = TWO | THREE | FOUR | FIVE
adjective = BEAUTIFUL | FRAGILE | EMPTY | BUSY | LONELY
noun = SIRS | THEORIES | MONTHS
name = THE number adjective noun
| number adjective noun
number = TWO | THREE | FOUR | FIVE
adjective = BEAUTIFUL | FRAGILE | EMPTY | BUSY | LONELY
noun = SIRS | THEORIES | MONTHSBy convention, expansion starts with the first non-terminal listed in the grammar. It is the entry point—much like a main function. In this case, any utterance that follows this grammar must be expanded from name.
Programming Grammar
Grammars for programming languages tend to be even more complex, with non-terminals for all the different program constructs like assignments, functions, loops, and expressions. Let's examine a small and imaginary language in which you can store, add, and print integers. Here's an example program written in this language:
a = 10
puts a
puts a + 1
a = a + 1
b = a
puts a + b
a = 10 puts a puts a + 1 a = a + 1 b = a puts a + b
Let's work backward from this program to build the language's grammar. The grammar describes the rules that any valid program written in this language must follow. A program is 0 or more statements right up to the end of the file. There's an expanded version of Backus-Naur Form that lets us express this with the quantifier *, but this form discards information we need to write a parser. So, we'll quantify with this recursive rule instead:
program = statement program EOF
| EOF
program = statement program EOF
| EOFA program is either empty, or it's a statement followed by a program.
A statement is either an assignment or a puts statement, which leads us to this compound rule:
statement = IDENTIFIER EQUALS expression
| PUTS expression
statement = IDENTIFIER EQUALS expression
| PUTS expressionAn expression is an integer literal, a variable name, or an addition, which leads us to this rule:
expression = INTEGER
| IDENTIFIER
| expression PLUS expression
expression = INTEGER
| IDENTIFIER
| expression PLUS expressionAltogether, there are 3 three non-terminals, 7 terminals, and 6 productions in this grammar. Contrast this simple grammar with Python's.