Primitive Types
The simplest types in a programming language are the primitives. These often include:
- integers
- floating-point numbers
- booleans
- characters
Some primitive types are native to the hardware. Developers who value speed will restrict themselves to such primitives since they can be operated on with fast machine instructions. Where direct hardware support is unavailable, the operations are implemented in software.
Integers
When we declare an int in C, we reserve several bytes in RAM, usually four. The integer is stored in these bytes, usually using a two's-complement encoding. The arithmetic-logic unit of the CPU operates directly on the electronic representations of integers to perform operations like addition, multiplication, and masking.
Storing small numbers in 4-byte integers wastes RAM and network capacity. Therefore, languages in which efficiency is a primary concern provide integer types of different sizes. The four common integer types are byte, short, int, and long. Java defines these as 1 byte, 2 bytes, 4 bytes, and 8 bytes, respectively. But their sizes are not guaranteed in all languages. The C standard, which has adapted to many hardware changes over the decades, specifies the number of bytes only in relative terms. We ask the compiler how many bytes are reserved for a value of a given type using the sizeof operator. The C standard imposes these criteria on compilers:
2 <= sizeof(short) &&
sizeof(short) <= sizeof(int) &&
sizeof(int) <= sizeof(long)
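There may exist a compliant compiler that makes short and int both 2 bytes. Or another that makes short, int, and long all 16 bytes. The standard also sets minimum ranges for each type, so with 8-bit bytes a long needs at least 4 bytes. C doesn't have a byte type; it uses char for both small numbers and symbols from the character set.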
The looseness of the C standard is problematic for developers whose software reads and writes binary files that are transferred between computers. If an int is stored as 4 bytes on one computer but read as 2 bytes on another, information will be lost. For this reason, Rust, Go, and C99 (the C standard ratified in 1999) define fixed-width types. In C99, these are named int8_t, int16_t, int32_t, and int64_t.
Rust, Go, and C provide both signed and unsigned versions of the integer types. Half of each signed type's range is negative, whereas the ranges of the unsigned types start at 0. For example, a signed byte is in the range [-128, 127] and an unsigned byte is in the range [0, 255]. The signed ranges lean negative because in two's complement half the bit patterns are negative and half are nonnegative. One of the nonnegative numbers is 0, leaving one fewer positive number than negative numbers.
Interestingly, Java doesn't have unsigned types. In an interview, Java's inventor James Gosling justified their omission:
For me as a language designer, which I don't really count myself as these days, what “simple” really ended up meaning was could I expect J. Random Developer to hold the spec in his head. That definition says that, for instance, Java isn't [simple]—and in fact a lot of these languages end up with a lot of corner cases, things that nobody really understands. Quiz any C developer about unsigned, and pretty soon you discover that almost no C developers actually understand what goes on with unsigned, what unsigned arithmetic is. Things like that made C complex. The language part of Java is, I think, pretty simple. The libraries you have to look up.
Nevertheless, Java 8 introduced methods like Long.divideUnsigned and Integer.compareUnsigned, which treat signed integers as if they were unsigned. It did not introduce any true unsigned types, just unsigned operations.
If we need to work with numbers that are too big for the fixed-precision integer types, we can use a big integer type, which supports an arbitrary number of digits. Java provides a BigInteger class. The OpenJDK implements BigInteger by breaking up the many digits into an array of int primitives. Since big integers do not have hardware support, working with them is slower.
Ruby and Python do not provide a collection of differently sized integer types like C does. Instead, they provide a single integer abstraction. If the integer is small enough, they store it in a native integer type. If the integer exceeds the maximum allowed by the hardware, they automatically convert the number to a big integer representation.
Floating-point Numbers
Floating-point numbers allow fractions. When we declare a double in C, we are likely getting eight bytes of RAM, which is twice the typical size of a float. The bits of the double are arranged according to the IEEE 754 floating-point standard. The numbers are “floating-point” because the radix point is floated to just after the first 1 in their binary scientific notation:
\[110.101_2 = 1.10101_2 \times 2^{10_2}\]
The numbers after the radix point (\(10101_2\)) form the mantissa. The exponent (\(10_2\)) controls the magnitude of the number. Just as multiplying a decimal number by 10 moves the decimal point to the right, multiplying a binary number by 2 moves the radix point to the right. Therefore the exponent of a float effectively slides the radix point to the left or right.
One bit is reserved to store the sign of the number. The standard orders the three fields from the most significant bit to the least significant: sign, exponent, mantissa.
In a 32-bit float, 1 bit is reserved for the sign, 8 bits for the exponent, and 23 bits for the mantissa. In a 64-bit float, 1 bit is reserved for the sign, 11 bits for the exponent, and 52 bits for the mantissa.
Because the sign bit is independent of the other two fields, the number 0 has two representations, one positive and one negative. The standard considers them equal, as this C program confirms:
#include <stdio.h>

int main() {
  float a = 0.0f;
  float b = -0.0f;
  printf("%f %s %f\n", a, a == b ? "==" : "!=", b);
}
Certain bit configurations represent infinity or not-a-number (NaN). Infinity may result when we divide a nonzero number by 0. NaN may result when we divide 0 by 0 or take the square root of a negative number.
Historically, integer arithmetic was faster than floating-point arithmetic. On most modern computers, however, floating-point operations are executed on a dedicated floating-point unit (FPU), which delivers performance similar to that of the integer hardware.
Any representation of real numbers in a fixed number of bits is prone to information loss. When we envision a number that can't be stored precisely, we probably imagine it having many fractional digits. In truth, any number that can't be floated to the form \(1.\text{mantissa}_2 \times 2^\text{exponent}\) within the available mantissa bits will lose precision, even whole numbers. The first whole number that can't be stored as a single-precision float is 16,777,217, which is \(2^{24} + 1\) and needs 25 significant bits.
Some languages like SQL support simpler fixed-point numbers, where the number of digits after the decimal point is chosen by the programmer. If we ask for 10 digits after the decimal point, we will be able to represent exactly every number in the type's range that has up to 10 fractional digits. If a computation yields a value with more fractional digits, the extra digits will be rounded or truncated away.
In theory, programming languages could dispense with integer types and get by with just a single numeric type that allowed fractional parts. Integers would have their fractions set to 0. However, fractions aren't always a semantic fit. Some numbers, like counts and array indices, should never permit fractions. Additionally, not every integer can be represented as a float or double. Thus most languages provide both types. JavaScript is an exception. Until BigInt was added in ECMAScript 2020, its standard specified only one numeric type, Number, which is defined to be a 64-bit floating-point number. Number is used for array indices, counts, proportions, and scientific measurements alike.
Ruby's and Java's BigDecimal classes provide more control over the precision and range of numbers with fractions.
Booleans
Some languages have a boolean type that admits only true and false values. Other languages, like C, allow the programmer to use integers to represent true and false values. 0 is interpreted as false, and anything else as true.
Though Ruby has classes for Integer and Float, it does not have a Boolean class. Inspect the classes of various primitive literals by running this script:
Ruby has TrueClass and FalseClass, of which true and false are the only instances. These two classes do not have a common Boolean superclass because they share no implementation.
The lack of a Boolean class hints at a distinctive quality of Ruby: it doesn't treat a value according to its type. Instead Ruby uses a system called duck typing that we'll discuss soon.
Characters
Though computers were invented as an appliance for mathematicians and physicists to process numbers, characters got their own types in the 1950s. There were various standards for mapping the arrangement of bits to alphabetic symbols. The 7-bit American Standard Code for Information Interchange (ASCII) emerged as the standard in the United States. ASCII is not friendly to languages that use other alphabets, as only \(2^7 = 128\) different symbols can be represented with 7 bits.
Unicode superseded ASCII in the 1990s. It allows for more than just 128 characters by growing the number of bytes used to represent a symbol. Many systems use UTF-8, an encoding of Unicode that is a superset of ASCII. In UTF-8, if the first bit of a byte is 0, then the remaining 7 bits are interpreted according to the original ASCII standard. If the first bits of a byte are 110, then the byte is combined with the byte that follows it, giving 11 bits to represent symbols from other alphabets. If the first bits of a byte are 1110, then three bytes are combined. If the first bits are 11110, then four bytes are combined. Each byte after the first in a combined sequence starts with 10 and contributes 6 bits. Through this prefixing scheme, all 1.1 million Unicode code points, including emoji, can be represented.