Lexing

Before our source code can be executed, the compiler or interpreter must first understand what we wrote in our program. This understanding is broken into two stages: lexing and parsing. The code that lexes is a lexer, and the code that parses is a parser.

The lexer's job is to read through the source code and collect the individual characters up into tokens. When we read text, we don't process individual characters. We chunk them into words. The lexer in a compiler or interpreter does the same thing with code. In fact, the name lexer comes from the Greek word lexis, which means word. A token is sometimes called a lexeme, which is a term linguists use to describe the basic unit of language.

Programs are made up of several different kinds of tokens. Keyword tokens are words that have special meaning in the language, like for, while, class, true, and return. Identifier tokens are names for variables, methods, classes, and labels. Literal tokens are raw data like 1929, 3.14159f, '?', and "seethe". Operator tokens are symbols like +, ==, !, and &. Common separators include , and ;. In many languages, four different kinds of enclosing brackets are found: [], (), <>, and {}. Languages diverge widely on their choice of comment tokens. Single-line comments start with the tokens //, #, --, or ;. Multi-line comments are surrounded by the token pairs /* and */, #| and |#, {- and -}, or =begin and =end.

Whitespace is often discarded. It is used to separate tokens but is not itself included in any tokens unless it's part of a string literal.

Lexing is a superficial task. If a lexer sees the letters ++i, it stops at producing the tokens ++ and i. It doesn't try to make any sense of the tokens. Identifying the semantic meaning of syntactic forms is the job of the parser.

Integer Literals

Lexing by hand is a fun exercise, but we need to get a computer to do it for us. To recognize a particular kind of token, like identifiers or string literals, we build a little recognizing machine. That machine will ultimately be written in code, but let's first think about how it works using a picture—in particular, a state diagram. Here's the state diagram of a machine for recognizing integer literals:

The diagram shows that this machine will be in one of three states. The arrows show how we move between states when we encounter a particular character. The unrooted arrow indicates that we start in the nothing yet state. At this point, we haven't read any of the characters in the source code. If we see a digit, we move to the saw digit state. If we see more digits, we stay in that state. Alternatively, suppose we see a hyphen first. Then we'd go into the saw hyphen state. If we then see a digit, we'd find ourselves in the saw digit state.

If we run out of characters or if we encounter a character for which there isn't a transition from our current state, we stop reading. If our current state is an accepting state, then we have found ourselves a legal token. Accepting states are marked with a double outline.

If we're not in an accepting state, then we either need to move to a different recognizing machine or we have bad source code and must issue an error. For example, if the source is -@, we read the -, land in saw hyphen, and cannot transition on @. We're not in an accepting state, so we reject this text as it's not an integer literal. On the other hand, if the source is 62 * 37, we read the digits 62, land in saw digits, and cannot transition on the space. We're in an accepting state, so we record the integer literal token and then try lexing another token from the remaining input * 37.

Lexing Integers

Now that we've explored the state diagram, let's consider how we might translate it to lexing code. Suppose our lexer has state for the source code, the token currently being lexed, and a list of tokens accepted so far. Its has and has_digit methods compare the current character in the source against some pattern. Its capture method appends the current character from the source to the current token. The emit_token method appends the current token to the list of accepted tokens. Then we might express our machine with code like this:

Ruby

if has('-')
  capture
end

if has_digit
  while has_digit
    capture
  end
  emit_token(:integer)
end

if has('-')
  capture
end

if has_digit
  while has_digit
    capture
  end
  emit_token(:integer)
end

This code allows but does not require an initial hyphen. It requires at least one digit but allows for more.

Identifiers

What would a machine for recognizing identifiers look like? In many languages, identifiers must start with a letter or an underscore. Subsequent symbols can be letters, numbers, or underscores. Here's a machine that recognizes identifiers that follow these rules:

Could we have drawn this machine with fewer states?

Single-letter identifiers like i will drop us off in saw alpha. These are valid identifiers, so we make this an accepting state. Having more than one accepting state is legal. Suppose we run this machine on ii. We are in an accepting state after the first i, so we could stop there. But if we read the second i, we also land in an accepting state. Recognizing ii seems like a better idea. This is the principle of maximal munch. Lexers consume as much of the source code that they can—but only if it leads to an accepting state.

The state diagram for identifiers translates into the following lexing code:

Ruby

if has_alphabetic
  capture
  while has_alphanumeric
    capture
  end
  emit_token(:identifier)
end

if has_alphabetic
  capture
  while has_alphanumeric
    capture
  end
  emit_token(:identifier)
end

In general, a transition to a next state in the diagram translates to a conditional statement in the code. A transition that takes us back to an earlier state translates to a loop.

String Literals

What would a machine for recognizing string literals look like? String literals begin and end with a double quotation mark. Nearly any character can appear between these marks—even another double quotation mark. How does the lexer distinguish between an internal quotation mark and a delimiter? Many programming languages allow programmers to cancel the special meaning of a character by escaping it—often by preceding the character with a backslash. We add an extra state to our machine to handle backslash-escaped characters:

This machine might look like this lexing code:

Ruby

if has('"')
  capture
  while has_otherwise('"')
    if has('\\')
      capture
    end
    capture
  end
  if has('"')
    capture
    emit_token(:string)
  else
    raise 'unclosed string literal'
  end
end

if has('"')
  capture
  while has_otherwise('"')
    if has('\\')
      capture
    end
    capture
  end
  if has('"')
    capture
    emit_token(:string)
  else
    raise 'unclosed string literal'
  end
end

There are a couple of new tricks used here. Function has_otherwise checks if there is a next character and that does not match the target. If there's no closing quotation mark, we raise an exception.

Multiple Tokens

These token-recognizing machines have so far been discussed in isolation. To build a complete lexer, the recognizers for all the different kinds of tokens must be glued together into a machine that recognizes multiple types of tokens. Explore multi-token lexers built by hand in the following exercises.

Not only must a lexer recognize multiple kinds of tokens, it must keep iterating through the source code to generate the entire sequence of tokens.

The list of tokens produced by the lexer is passed along to the next stage of translation: parsing.

← From Source to Execution Grammars →

Dear Computer

Lexing

Integer Literals

Lexing Integers

Identifiers

String Literals

Multiple Tokens