Syntax and Semantics
You are walking down the sidewalk and see a piece of paper sailing in the wind. The paper lands at your feet, and you find these marks printed on it:
Probably you will need to scroll horizontally to see all the marks.
You suspect this paper is a secret message and attempt to decode it. First you make the following observations:
- There are two kinds of black marks: 1-unit squares and rectangles that are three times as wide as the squares.
- Between many of the black marks, there are small 1-unit gaps.
- There are several medium 3-unit gaps.
- There is one large 7-unit gap.
Given what you know about how languages typically work, you conclude that each sequence of black marks separated by 1-unit gaps is a letter. The letters are separated from one another by 3-unit gaps. These form words, which are themselves separated by a 7-unit gap.
The smallest parts of this language are the two kinds of black marks, but we don't attach much significance to individual marks. In a human or programming language, the smallest parts are characters. These also don't have much significance until they are formed into words. In a program, the individual characters form tokens, like $rate, ~, <=, ++, class, !=, ;, while, ||, }, framesPerSecond, and TAU.
When you make conclusions about how the black marks assemble into longer forms, you describe the syntax of the language. A language's syntax is a set of rules dictating how its atoms may be arranged into larger molecules. The syntax of a human language describes how words and punctuation form sentences. The syntax of a programming language describes how tokens may be arranged into valid programs. Knowing the syntax of a language is important, but the rules by themselves do not help us understand what is being communicated. To understand a language, we must also know what the arrangement of atoms means. This meaning is a language's semantics.
The marks on the paper do not appear to follow any syntax or semantics that you know. However, you flip the paper over and find this handy tree showing how to translate the sequence of short and long marks into letters:
You translate each atom of the message into an English letter by descending left when you see a 1-unit square and descending right when you see a 3-unit rectangle. Altogether, the letters form an English phrase whose semantics you know.
The black mark language is better known as Morse code, as you may have guessed. Likely you don't have any preconceived notions about what dot-dash-dot-dot means. But once you translate the marks into letters, you can apply the semantics of English to interpret the message.
You likely do have preconceived notions about what -x means in source code. That makes you dangerous. Probably you think it means “negative x”. However, you can't be truly certain without knowing the semantics of your program. For example, consider this C++ program:
What is negative x if x is the string stressed? Predict the output of the program, and then run it. The semantics of code aren't always what we expect.
C++ allows the - operator to be overloaded for classes, so we can't determine the meaning of -x without first knowing the type of x. Variable x is declared as a string, and - has been defined to reverse the string using a special constructor that is passed the reverse iterators rbegin and rend.
As we continue our investigation of programming languages, you'll want to view syntactic form and semantic meaning as two distinct but interrelated dimensions of a language. People who think and write about programming languages organize their thoughts along these two dimensions. Developers who write software that interprets code break their algorithms into two corresponding stages. They first use the syntax of the language to chunk tokens into established grammatical forms. Then they translate the forms into semantic representations like assignment statements, arithmetic operations, and function calls that perform the intended computation.