Composite Scalar Types

Dear Computer

Chapter 3: Types

Composite Scalar Types

A composite type is a new type built on top of other types. Some composite types add new operations to a single or scalar value. Others combine multiple values into a composite structure. We discuss composite scalar types first.

Pointers

A pointer is the address of a memory cell. If it points to a cell that holds an int, it is an int pointer. In C, its type is int *. If it points to a cell that holds a CircularQueue, it is a CircularQueue pointer and its type is CircularQueue *.

Pointers are primarily used to efficiently share data across function boundaries. Instead of copying large chunks of memory back and forth through parameters and return values, programmers allocate the data in a stable and accessible location and pass only the data's address back and forth by way of a pointer.

We often store pointers in variables. We could assign a value to a pointer directly, as shown in this C program:

C
int *p = 6225;
int *p = 6225;

This code makes p point at memory cell 6225. However, this code compiles with a warning: incompatible integer to pointer conversion. This is a reasonable warning. Rarely do we know enough about the memory space of a process to be able to assign a literal address. There are two more common ways to assign a pointer.

One way is to first create the pointee, the value at which the pointer points, and then extract the pointer from the pointee. We get the variable's address using the address-of operator &:

C
int i = 17;
int *p = &i;
int i = 17;
int *p = &i;

Pointers derived from variables will point to global or stack memory.

The other way to assign a pointer is to dynamically allocate space on the heap for the pointee using malloc in C or new in C++. These operations return the address of the allocated block:

C
int *p = (int *) malloc(sizeof(int) * 10);
int *p = (int *) malloc(sizeof(int) * 10);

Pointers provide just a few operations. We retrieve the pointee by applying the dereference operator * to the pointer:

C
int pointee = *p;
int pointee = *p;

We compare pointers to other pointers with == and !=. We may also perform arithmetic on them. This C program increments a pointer:

C
p += 7;
p += 7;

The unit of the 7 in this code is not bytes. Pointer arithmetic moves in steps determined by the size of the pointee. Here the compiler knows that p is an int pointer, so the increment advances the pointer forward 7 int cells, which will be an offset of 28 bytes on many computers.

We can also subscript into pointers, which reaches into a memory cell adjacent to the pointee and fetches out a value:

C
int successor = p[1];
int successor = p[1];

We call free in C or delete in C++ on a pointer to release the heap memory that it points to.

Pointers are difficult to type check because they are used to point at several different kinds of data that have incompatible semantics. For example, we should never call free or delete on a pointer to stack memory, but doing so doesn't trigger a type error because stack pointers are indistinguishable from heap pointers. Likewise we shouldn't use the subscript operator unless the pointee is an array of multiple values on the heap. But this too is unchecked.

Another kind of error happens after a pointer is freed. Freeing the pointer is a misnomer in C and C++ as these languages do not actually touch the pointer itself. They only release the memory in which the pointee is stored. After the free, the pointer still holds the pointee's address and may still be dereferenced even if the freed memory has been reallocated. Such a pointer is called a dangling pointer.

Imagine freeing a pointer and then freeing it again. Multiple frees are dangerous. When we dynamically allocate memory, some bookkeeping information is stored adjacent to the pointee. The freed block, along with the stale metadata, may be reserved by some other allocation. A second call to free may read metadata that has been overwritten or lead to double-allocations.

References

The term reference is used in two different ways in programming language discourse. In Java, Ruby, Python, and JavaScript, a reference is effectively a pointer to an object, but its address semantics are hidden. We don't use the dereference operators * or -> in these languages. We can't increment a reference to reach an adjacent memory cell like we can a pointer. We can't even query the address stored in a reference. If we pass this first kind of reference to a subprogram, only a copy of the underlying pointer is passed.

In C++ and Rust, on the other hand, a reference doesn't point at a memory cell. Rather, it is another name for a variable—an alias. This C++ code declares and initializes a reference:

C++
int i = 7;
int& j = i;
int i = 7;
int& j = i;

Either i or j may be used to refer to the memory cell. They are indistinguishable.

If we pass a C++ or Rust reference to a subprogram, we effectively pass the referenced variable itself, not a copy. Changes made to the variable will be visible to the caller.

We will investigate references in more detail later on when we discuss Rust.

Unions

A union is container that is built to hold exactly one value from amongst several different types. This union in C may hold one of five different types:

C
#include <stdio.h>

union any {
  int i;
  char c;
  float f;
  double d;
  char *p;
};

int main(int argc, char **argv) {
  printf("%ld\n", sizeof(union any));
  return 0;
}
#include <stdio.h>

union any {
  int i;
  char c;
  float f;
  double d;
  char *p;
};

int main(int argc, char **argv) {
  printf("%ld\n", sizeof(union any));
  return 0;
}

An instance of the any union may hold an integer, a character, a floating-point number, or a pointer. But it can only hold one of these at a time. If we first assign to field i and then assign to field f, the integer value will be overwritten.

On a 64-bit computer, the output of this program is likely to be 8 bytes. That's the size of the char pointer, which is the largest type of value that can be stored in the union. The union doesn't need any more space than is needed by this largest individual value. Any smaller type will fit within these same 8 bytes.

Unions are a strange beast, especially if your first programming experience was in an object-oriented programming language. C is not an object-oriented language. It doesn't let us create a generic or polymorphic collection like an ArrayList or HashMap that we can use for many different types. However, we can make a collection whose elements are union any. Here we create a new any container, set its value, append it to a list, and then access the appended union:

C
union any new_item;
new_item.c = 'z';
list->add(new_item);
printf("Character: %c\n", list->tail().c);  // prints 'z'
union any new_item;
new_item.c = 'z';
list->add(new_item);
printf("Character: %c\n", list->tail().c);  // prints 'z'

With unions, we can make make generic collections in C.

A second use case for unions is manipulating the bits of the primitive types in ways that normally wouldn't typecheck. For example, we can could write a floating-point number f but then access the value through the i field. This would grant us read and write access to the bits of the float.

Unions don't support many operations. We can get a union's address, query its size, and reach inside and grab a value.

← Primitive TypesComposite Non-scalar Types →