Strings

Dear Computer

Chapter 10: Everything Revisited

Rust has two string types: the str type for fixed-length character sequences and the String type for heap-allocated, growable character sequences. A str is like an array in that it's fixed in size. String, on the other hand, is like an ArrayList. It wraps around a buffer of character data and adds operations like resizing, inserting, and removing. Both abstractions are provided because sometimes we want the power of String but other times we don't want to pay its performance costs.

The type str is used for both string literals and speedy sharing of character data that's wrapped up in a String. In both these cases, the text is owned by some other entity, and we are merely borrowing it. When we borrow a value, we are given a pointer to the original memory. If the original value has type T, the borrowed value has type &T, which we read as “a reference to a T”. In practice, we almost never use a bare str because its size isn't known at compile time and the text it describes is owned elsewhere. We use &str, as in this declaration:

Rust
let country: &str = "Vanuatu";

The real str object is embedded in the compiled program, typically in its read-only data section. The variable country is a reference to that memory, a &str. The Rust community calls a reference to a sequence of values a slice. So &str is a string slice. Later we will see slices for other collections.
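
To make the relationship between the two types concrete, here is a minimal sketch of borrowing text from both a literal and a String. The function shout is a hypothetical helper invented for this illustration; it takes a &str so that it accepts either source.

```rust
// Hypothetical helper: accepts any borrowed text, whether it comes
// from a string literal or from a String.
fn shout(text: &str) -> String {
    format!("{}!", text)
}

fn main() {
    // A literal is already a &str.
    let literal: &str = "Vanuatu";

    // A String owns its text on the heap and can grow.
    let mut owned: String = String::from("Van");
    owned.push_str("uatu");

    // Borrow the String's text as a string slice without copying it.
    let borrowed: &str = owned.as_str();

    println!("{}", shout(literal));
    println!("{}", shout(borrowed));
    println!("{}", shout(&owned)); // a &String coerces to a &str
}
```

Writing parameters as &str rather than String is a common choice because the caller doesn't have to give up ownership or allocate.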

Only a limited set of operations is available on string slices. The contains, find, rfind, starts_with, and ends_with methods search for a substring. The lines method yields an iterator over the lines of the text, much like the lines function of Haskell. The polymorphic parse method behaves like Haskell's read function. The trim methods yield a new slice that doesn't include leading and trailing whitespace. The split methods yield an iterator over subslices separated by delimiters. This code splits a list of comma-separated weekdays and prints each day on its own line:

Rust
let series = "lunes,martes,miércoles";
let days = series.split(",");
for day in days {
    println!("{}", day);
}

The variable series has type &str, but here we are calling str methods on it. A &str is effectively a pointer, but a str is an actual text object. Rust automatically dereferences an &str into a str.
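
The searching and parsing methods listed earlier can be sketched the same way. The path and the helper parse_count below are made up for this example; the methods themselves are from the standard library.

```rust
// Hypothetical helper: parse a line of text as an integer,
// ignoring surrounding whitespace. Yields None on bad input.
fn parse_count(text: &str) -> Option<i32> {
    text.trim().parse::<i32>().ok()
}

fn main() {
    let path = "images/flute.png";

    println!("{}", path.contains("flute"));     // prints true
    println!("{}", path.starts_with("images")); // prints true
    println!("{}", path.ends_with(".jpg"));     // prints false
    println!("{:?}", path.find('/'));           // prints Some(6)
    println!("{:?}", path.rfind('.'));          // prints Some(12)

    println!("{:?}", parse_count("  42\n"));    // prints Some(42)
    println!("{:?}", parse_count("forty-two")); // prints None

    // Iterate over the lines of a multi-line slice.
    for line in "uno\ndos\ntres".lines() {
        println!("{}", line);
    }
}
```

The find and rfind methods report byte offsets wrapped in an Option, and parse reports failure through a Result, which the helper converts to an Option with ok.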

Where possible, str methods neither modify nor copy the character data. For example, the trim method returns a slice that refers to a portion of the exact same memory as the original slice. A similar method in C would have to either mutate the original string or dynamically allocate memory for the trimmed version because C strings must end in a null terminator. Rust slices don't use null terminators. Instead, they mark off a window of memory using a starting pointer and a length. The slice that trim returns is a pointer that points after the leading whitespace of the original slice and a length that ends the slice right before the trailing whitespace.
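
One way to see that trim shares memory is to compare pointers. This is only a sketch: the helper offset_within is hypothetical and assumes the second slice was carved out of the first.

```rust
// Hypothetical helper: how many bytes into `outer` does `inner` start?
// Assumes `inner` is a subslice of `outer`.
fn offset_within(outer: &str, inner: &str) -> usize {
    inner.as_ptr() as usize - outer.as_ptr() as usize
}

fn main() {
    let padded: &str = "   Vanuatu  ";
    let trimmed: &str = padded.trim();

    // The trimmed slice starts three bytes into the original memory
    // and records a shorter length. No text was copied.
    println!("{}", offset_within(padded, trimmed)); // prints 3
    println!("{}", trimmed.len());                  // prints 7
    println!("{}", padded.len());                   // prints 12
}
```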

When str operations do need to modify the character data, they typically return a new String, which causes a heap allocation. The to_uppercase, to_lowercase, and replace methods do this. This code lowercases an HTML element name:

Rust
let element: &str = "IMG";
let element: String = element.to_lowercase();
println!("{}", element);

The second element shadows the first and uses a different type. Explicit types are not necessary here, but they emphasize the type change.
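
The replace method follows the same pattern. As a small sketch, the hypothetical helper slugify below chains trim, to_lowercase, and replace; the borrowed original is left untouched, and the result is a freshly allocated String.

```rust
// Hypothetical helper: turn a title into a URL-friendly slug.
fn slugify(title: &str) -> String {
    title.trim().to_lowercase().replace(" ", "-")
}

fn main() {
    let title: &str = " Dear Computer ";
    let slug: String = slugify(title);

    println!("{}", slug);     // prints dear-computer
    println!("[{}]", title);  // the original text is unchanged
}
```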

Noticeably absent from str is a single-index subscript operator or a char_at method. The Rust designers left these out because they take the complexity of Unicode more seriously than most other languages you have used. When we write text.charAt(i) in Java, we get back a value that has been arrived at through fast but naive offset arithmetic. But human languages are not always encoded using a fixed number of bytes per symbol. Symbols in UTF-8, which is the encoding Rust uses for str and String, may be between one and four bytes wide. There's no way to determine what character i is without interpreting the bytes before it to determine each character's width. Rust makes it difficult to ignore the complexities of human language by forbidding strings from being viewed as arrays of characters that can be randomly accessed in constant time. The bytes are stored contiguously, but reaching character i requires a linear-time traversal from the start.
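
The variable widths are easy to observe. In this sketch, the helper widths is hypothetical, and the é is written as the escape \u{e9} so that the example doesn't depend on how an editor encodes the accent.

```rust
// Hypothetical helper: the UTF-8 byte width of each character.
fn widths(text: &str) -> Vec<usize> {
    text.chars().map(|c| c.len_utf8()).collect()
}

fn main() {
    let day = "mi\u{e9}rcoles"; // miércoles

    println!("{}", day.len());           // prints 10, the byte count
    println!("{}", day.chars().count()); // prints 9, the character count
    println!("{:?}", widths(day));       // prints [1, 1, 2, 1, 1, 1, 1, 1, 1]

    // char_indices pairs each character with its starting byte offset.
    for (offset, c) in day.char_indices() {
        println!("{} {}", offset, c);
    }
}
```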

That said, we may walk through the text one character at a time using the iterator returned by the chars method. If we are certain that we have only single-byte characters, the bytes method visits the same sequence as raw u8 values:

Rust
let theme = "FLUTE";

// Print characters one per line
for c in theme.chars() {
    println!("{}", c);
}

println!("{:?}", theme.chars().nth(3));  // prints Some('T')

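The last statement uses the nth method to retrieve a character at a particular index, emulating the behavior of charAt. Unlike charAt, it walks the iterator from the start, so it takes linear time. It yields an Option, which is None when the index is out of range.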
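
The difference between chars and bytes shows up as soon as the text contains a multi-byte character. The helper char_at below is a hypothetical wrapper around nth, not a standard method.

```rust
// Hypothetical helper emulating Java's charAt by traversing the text.
fn char_at(text: &str, index: usize) -> Option<char> {
    text.chars().nth(index)
}

fn main() {
    let day = "mi\u{e9}rcoles"; // miércoles

    println!("{:?}", char_at(day, 2));   // prints Some('é')
    println!("{:?}", char_at(day, 3));   // prints Some('r')
    println!("{:?}", char_at(day, 20));  // prints None

    // Byte 3 is the second half of the two-byte é, not the letter r.
    println!("{:?}", day.bytes().nth(3)); // prints Some(169)
}
```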

The String type provides operations that may alter the size of the character data. The pop method removes the last character. The push method appends a character, and push_str appends another string. The insert method adds a character and insert_str a string at an arbitrary index. The index is a usize that counts bytes, and it must fall on a character boundary. The clear method empties the string of all its characters. The replace_range method replaces a window of the string with some other string of arbitrary length. The String type also supports all methods of str.

Many of the methods of String require the receiver to be mutable. This code mutates the variable verb by inserting some extra characters:

Rust
let mut verb = String::from("mutate");
verb.insert_str(3, "il");
println!("{}", verb);                   // prints mutilate
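
The other size-changing methods can be sketched in the same way. The helper edit is hypothetical and exists only to chain several of them together; the comments track the text after each step for the input flute.

```rust
// Hypothetical helper that applies several size-changing methods in turn.
fn edit(start: &str) -> String {
    let mut word = String::from(start);  // flute
    word.push('s');                      // flutes
    word.push_str(" play");              // flutes play
    word.insert(0, '2');                 // 2flutes play
    word.insert_str(1, " ");             // 2 flutes play
    word.pop();                          // 2 flutes pla
    word.replace_range(0..1, "two");     // two flutes pla
    word
}

fn main() {
    let mut phrase = edit("flute");
    println!("{}", phrase);              // prints two flutes pla

    // clear empties the text but keeps the String usable.
    phrase.clear();
    println!("{}", phrase.is_empty());   // prints true
}
```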