Composite Non-Scalar Types

A composite non-scalar type describes data that has multiple values. Non-scalar types differ in various ways. They may be ordered or unordered. Individual elements may be named or indexed. They may be fixed in size or growable. Their elements may all have the same type, or they may be heterogeneous.

Many object-oriented languages provide classes for managing collections of multiple values. This discussion focuses only on the composite types that are acknowledged in the language's grammar. If an abstraction exists only as a class in the language's standard library but is not supported with special syntax, we do not count it as a core type.

Arrays

An array is an ordered sequence of data. The elements of an array are typically stored in contiguous memory, which makes iteration and indexing fast. When a cell is read, both it and its neighbors are pulled into cache memory, which is must faster to access than main memory. When the cell's neighbors are read, the values are quickly pulled from the cache.

In C, C++, and Java, the size of the array must be known at the time the array is created. This size might be explicitly stated, as in this Java program:

Java

int[] luckyNumbers = new int[13];

int[] luckyNumbers = new int[13];

This array has 13 cells. The Java Virtual Machine guarantees that these cells are initialized to 0. C and C++ make no such guarantee.

Or the size might be implicitly derived from an array literal:

Java

int[] snoozeOffsets = {5, 10, 15};

int[] snoozeOffsets = {5, 10, 15};

Arrays provide a subscripting or indexing operator to access individual elements. Most languages use the [] operator. Matlab and Basic use (). When used in an rvalue context, the subscript operator reads an element. When used in an lvalue context like an assignment statement, the subscript operator produces the element's location so the value at that location can be overwritten.

Many languages associate the leading element of the array with index 0. Starting with 0 rather than 1 has several advantages.

Treating the index as an offset eases address calculations. With 0-based indexing, element i lives at base + i * element_size. With 1-based indexing, element i lives at base + (i - 1) * element_size.
To repeatedly cycle through an array's elements, we can compactly advance the index with i = (i + 1) % n, which wraps back to 0 when we've just gone out of bounds. To use modular arithmetic with 1-based indexing, we have to convert to a 0-based index, apply the mod operation, and convert back to a 1-based index.

Dijkstra said 0 is better after contrasting the forms of ranges. A range has a lower bound $\text{lo}$ and an upper bound $\text{hi}$. Each bound may be expressed in an inclusive or exclusive manner. For example, these are the four possible range expressions for the numbers 9, 10, and 11:

	inclusive $\text{hi}$	exclusive $\text{hi}$
inclusive $\text{lo}$	$$9 \leq i \leq 11$$	$$9 \leq i \lt 12$$
exclusive $\text{lo}$	$$8 \lt i \leq 11$$	$$8 \lt i \lt 12$$

Often we need to count the number of elements in a range. The count calculation varies depending on the range's form:

fully inclusive: $\text{hi} - \text{lo} + 1$
fully exclusive: $\text{hi} - \text{lo} - 1$
mixed inclusivity: $\text{hi} - \text{lo}$

The count calculation for mixed inclusivity is clearest. But there are two range forms of mixed inclusivity to choose from. Dijkstra argued that an inclusive lower bound is better than an exclusive one because the form $-1 \lt i \leq 10$ contains only natural numbers, yet its lower bound is not natural. The superior range is therefore $\text{lo} \leq i \lt \text{hi}$.

With that settled, the next question is where to start counting: 0 or 1? Suppose we have a collection of size $\text{n}$. If we start at 1, its range of indices would be $1 \leq i \lt \text{n} + 1$. If we start at 0, its range of indices would be $0 \leq i \lt \text{n}$. Dijkstra thought the 0-based form was easier to read and harder to get wrong.

Nevertheless, 0-based indexing is not universal. In R, Fortran, Matlab, and other scientific and mathematically oriented languages, indexing starts at 1.

A robust array type also lets programmers query an array's size. C and C++ do not provide this operation. Programmers are responsible for knowing and communicating the size of an array.

Values of an array may be overwritten, but new cells cannot be added nor old cells removed. Array operations are expected to be constant-time operations, but structural modifications are not. Most programming languages provide abstractions of an array that do support adding and removing in ways that minimize the performance costs. Java has ArrayList; C++ has vector; JavaScript and Ruby have Array; and Python has list. These abstractions typically start with a backing array that has extra cells. If the extra cells fill up, the backing array is automatically copied over to a new and bigger array.

In Ruby, Python, and JavaScript, arrays are heterogeneous. For example, each element of this Ruby array has a different type:

Ruby

meltingPot = ['cornflower', [71], false, Date.new]

meltingPot = ['cornflower', [71], false, Date.new]

In these languages, the programmer is expected to treat an element according to its specific type. In Java, C, and C++, arrays are homogeneous. Each element must match the array's declared type. The compiler ensures that only operations supported by the declared type are applied to array elements.

Formatting an array of string literals is sometimes a syntactic burden. We must surround each string with quotation marks and separate each element with a comma. Ruby provides an alternative form with the %w syntax. Contrast this form with the traditional array literal:

Ruby

units = ["percent", "pixels", "inches", "centimeters"]
units = %w{percent pixels inches centimeters}

units = ["percent", "pixels", "inches", "centimeters"]
units = %w{percent pixels inches centimeters}

The %w is a mnemonic for words.

Strings

Strings are sequences of characters. Common operations include reading individual characters or substrings, searching for and replacing substrings, concatenating other strings, comparing to other strings or against patterns, removing whitespace, querying the string's length, and altering its case.

Some languages do not have a special string type. A string in C, for example, is just an array of characters. Since arrays in C do not track their length, C programmers have adopted the practice of ending a string with a special null terminator character that has decimal value 0 in the ASCII and Unicode standards. Poor management of null-terminated strings frequently leads to errors. If the null terminator gets overwritten or isn't allocated space, the program may run off the end of the string into memory it doesn't own. Hackers exploit such buffer overflows to take over systems.

The designers of Java chose to make strings immutable. James Gosling explained this choice in an interview:

From a strategic point of view, [immutable Strings] tend to more often be trouble free. And there are usually things you can do with immutables that you can't do with mutable things, such as cache the result. If you pass a string to a file open method, or if you pass a string to a constructor for a label in a user interface, in some APIs (like in lots of the Windows APIs) you pass in an array of characters. The receiver of that object really has to copy it, because they don't know anything about the storage lifetime of it. And they don't know what's happening to the object, whether it is being changed under their feet.

You end up getting almost forced to replicate the object because you don't know whether or not you get to own it. And one of the nice things about immutable objects is that the answer is, “Yeah, of course you do.” Because the question of ownership, who has the right to change it, doesn't exist.

One of the things that forced Strings to be immutable was security. You have a file open method. You pass a String to it. And then it's doing all kind of authentication checks before it gets around to doing the OS call. If you manage to do something that effectively mutated the String, after the security check and before the OS call, then boom, you're in. But Strings are immutable, so that kind of attack doesn't work. That precise example is what really demanded that Strings be immutable.

The delimiters that surround a string literal in source code influence how the string is built. In Ruby, literals may be surrounded by either single or double quotation marks. In a double-quoted literal, the standard escape characters are recognized: \n is a linefeed that brings the cursor to the next line, \r is a carriage return that brings the cursor back to the beginning of the current line, \t is a tab, \" is a non-delimiting double quotation mark, and \a makes the terminal alert the user with a sound or visual effect. In a single-quoted literal, most of these escape sequences are disabled. The backslashes are considered to be separate characters. Only \' has a special meaning.

A string literal with many embedded double-quotation marks is a pain to escape and read. Ruby therefore provides the %q literal form that obviates escaping double-quotation marks. Contrast the two forms:

Ruby

puts "Barfield: \"The poet's material, then, is memory.\""
puts %q{Barfield: "The poet's material, then, is memory."}

puts "Barfield: \"The poet's material, then, is memory.\""
puts %q{Barfield: "The poet's material, then, is memory."}

The %q is a mnemonic for quoted.

Strings are often built by joining literals and programmatic expressions. These may be combined using explicit concatenation, but some languages support interpolation, which is the injection of code into a string literal. The syntax of interpolation varies widely across languages. In Ruby, the interpreter replaces any #{} found in a string literal with the value to which the enclosed code evaluates. Contrast these two forms:

Ruby

proportion = 0.7
label = "proportion " + proportion + " is " + proportion * 100 + "%"
label = "proportion #{proportion} is #{proportion * 100}%"

proportion = 0.7
label = "proportion " + proportion + " is " + proportion * 100 + "%"
label = "proportion #{proportion} is #{proportion * 100}%"

Many languages do not allow normal string literals to span multiple lines. Linebreaks are inserted using escaped characters. However, some languages provide syntax for a heredoc, which is a document or file that is right here in the source code that may contain literal linebreaks. This Ruby program uses a heredoc to assemble an HTML document on the fly:

Ruby

html = <<HTML
<html>
  <head>
    <title>#{title}</title>
  </head>
  <body>
    #{content}
  </body>
</html>
HTML

html = <<HTML
<html>
  <head>
    <title>#{title}</title>
  </head>
  <body>
    #{content}
  </body>
</html>
HTML

The heredoc is delimited by <<HTML and HTML. The token HTML was chosen by the programmer. Any identifier is legal according to the Ruby standard, but the convention is to use capitalized words.

Java 15 introduces heredocs under the name text blocks. These are delimited with triple quotation marks:

Java

String html = """
  <html>
    <head>
      <title>%s</title>
    </head>
    <body>
      %s
    </body>
  </html>""".formatted(title, content);

String html = """
  <html>
    <head>
      <title>%s</title>
    </head>
    <body>
      %s
    </body>
  </html>""".formatted(title, content);

Text blocks do not currently support interpolation, but they may be treated like a format string whose %-placeholders are replaced by computed values.

Records

Like an array, a record type bundles together multiple values. Unlike an array, records give each of their values a name rather than a numeric index. C's record type is struct. This program demonstrates how a struct type is defined and an instance is initialized with an initializer list:

struct top_score_t {
  char initials[4];
  int score;
};

int main() {
  struct top_score_t entry = {"GWB", 9999};
  return 0;
}

struct top_score_t {
  char initials[4];
  int score;
};

int main() {
  struct top_score_t entry = {"GWB", 9999};
  return 0;
}

An initializer list is legal in a declaration but not a reassignment. Note that four characters are allocated for the three initials. The fourth character holds the null terminator.

The values of a record are sometimes called fields or properties. Each field looks like a variable declaration announcing the field's type and name.

Only a few builtin operations are available for a struct. We may query its address with &, we may ask its size with sizeof, and we may access individual fields using the member operator . and the field name:

printf("%s scored %d!", entry.initials, entry.score);

printf("%s scored %d!", entry.initials, entry.score);

Classes are record types with a few extra semantic powers: new operations may be added to these records in the form of methods, and one class may inherit the fields and operations of another. C++ provides both class and struct elements, but a struct in C++ is not like a struct in C. A C++ struct is a just class in which instance variables and operations are public by default.

In a Ruby class, the instance variables are not declared at the top-level like they are in C++ and Java. Instead, they are created on the fly with assignment statements in the constructor or methods:

Ruby

class TopScore
  def initialize(initials, score)
    @initials = initials
    @score = score
  end

  def when(time)
    @time = time
  end
end

class TopScore
  def initialize(initials, score)
    @initials = initials
    @score = score
  end

  def when(time)
    @time = time
  end
end

In Java, C, and C++, the types and names of the fields must be known statically. That is, the compiler must be able to completely determine the record's shape and operations as the code is compiled. In contrast, everything about a Ruby class is dynamic. It may gain instance variables over the course of its lifetime, and the types of these variables may change through a reassignment.

Several languages provide syntactic support for dynamic records. Ruby calls them hashes, Python calls them dictionaries, and JavaScript confusingly calls them objects. Despite their differences in name, they are all collections of key-value pairs. This program creates a hash in Ruby:

Ruby

entry = {initials: "GWB", score: 9999}

entry = {initials: "GWB", score: 9999}

These dynamic records can be thought of as ad-hoc classes. They don't inherit from other records, and they don't establish a reusable type like a class definition does. Nevertheless, dynamic records are often used in place of classes for their brevity and simplicity.

Tuples

Imagine how we might describe the type of an xy-coordinate pair. Perhaps we'd say something like, “It's a number describing the horizontal position followed by another number describing the vertical position.” A tuple (pronounced toople or tupple) is a generalization of this description. With a tuple we describe a collection of values. Their order matters, but not their name. Some languages allow us to index into them to get all the fields, but others expect us to unpack or destructure them. They do not change in size like an array. Their fields may have heterogeneous types.

Ruby, JavaScript, Java, C, and C++ do not have tuples. Python and Haskell do. This Python script creates a tuple with little fanfare using the () syntax:

Python

entry = ("GWB", 9999)

entry = ("GWB", 9999)

We will explore tuples in more detail when we discuss Haskell and Rust.

Ranges

A range is a sequence of numbers that is defined by its mininum and maximum values and a step or offset value. Ranges are often used to guide the iteration of a loop. For example, this Ruby program iterates through 0, 20, 40, 60, 80, and 100:

Ruby

for x in (0..100).step(20)
  puts x
end

for x in (0..100).step(20)
  puts x
end

A range literal is created using either the .. or ... operators. The first makes the upper-bound inclusive, the second exclusive. The step method is one of several operations supported by ranges.

How much memory do you suppose the range (0..10000000) consumes? Hardly any. Ranges store only their bounds and step value. They do not expand into an array. When we iterate through a range, an iterator is created, but it tracks only its current value.

← Composite Scalar Types Typechecking →

Dear Computer

Composite Non-Scalar Types

Arrays

Strings

Records

Tuples

Ranges