From Source to Execution

Dear Computer

Chapter 1: Programming Language

We've just written a program, and we poured our heart into it. Now we are ready to hand our heart over to the computer so that it can be executed. There are several different ways this may happen, each having its own costs and benefits.

Compile to Machine Code

When code is compiled, a special piece of software called a compiler takes our source code and translates it into an executable machine code file. The executable is written using the instruction set of our computer's CPU. Common instruction sets include x86-64 and ARM. The compiler itself does not run any of our code. It merely builds an executable, which has a life independent of the compiler and the source code that produced it. We may run the executable as often as we like, but the compiler's translation happens only once. It is the executable, not the compiler, that consumes the user's input and produces the output.

Compiling a standalone executable is comparatively slow because many decisions about our program are made early, such as where the instruction pointer should jump when a function named f is called, or whether a variable has been assigned a value before a statement increments it. The primary benefit of compiling is that many of these checks will not occur during execution. A compiled executable will generally run faster than the alternatives.
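
To make these early decisions concrete, here is a minimal sketch in Python of a compiler for a made-up toy language. The toy language, the instruction names STORE and ADD, and the functions compile_program and run are all assumptions invented for illustration, and a list of tuples stands in for real machine code. The compile step checks for an increment of an unassigned variable and emits instructions without running any of them.

```python
# Hypothetical toy language: each line is "<name> = <int>" or "<name> += <int>".

def compile_program(lines):
    """Translate toy statements into instructions without running them."""
    assigned = set()
    instructions = []
    for number, line in enumerate(lines, start=1):
        name, op, value = line.split()
        if op == "=":
            assigned.add(name)
            instructions.append(("STORE", name, int(value)))
        elif op == "+=":
            # Early check: reject an increment of a never-assigned variable.
            if name not in assigned:
                raise SyntaxError(f"line {number}: {name} used before assignment")
            instructions.append(("ADD", name, int(value)))
        else:
            raise SyntaxError(f"line {number}: unknown operator {op}")
    return instructions


def run(instructions):
    """Stand-in for the CPU: executes instructions with no further checks."""
    memory = {}
    for opcode, name, value in instructions:
        if opcode == "STORE":
            memory[name] = value
        else:  # "ADD"
            memory[name] += value
    return memory


program = compile_program(["x = 1", "x += 2"])  # translation happens once
print(run(program))  # {'x': 3}
print(run(program))  # runs again without recompiling
```

Calling compile_program(["y += 1"]) raises an error before anything executes, which mirrors how a compiler reports such problems before the program ever runs.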

Interpret

An interpreter, on the other hand, does not produce a standalone executable. It takes in our source code and executes it as soon as it understands what we wrote. Decisions about the code's execution are made on the fly. The interpreter is responsible for receiving user input and handing it off to our program.

An interpreter's startup time is comparatively fast since no time is spent up front building a speedy executable. However, the execution itself is slower because the interpreter acts as an intermediary between our source code and the CPU. It checks our high-level code and carries out each operation on demand, repeating that work every time the code runs.
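
The on-the-fly behavior can be sketched in Python with a made-up toy language in which each line is either an assignment or an increment. The language and the interpret function are assumptions invented for illustration. Each line is checked and executed in the same step, so a mistake on the third line is discovered only after the first two lines have already run.

```python
# Hypothetical toy language: each line is "<name> = <int>" or "<name> += <int>".

def interpret(lines, memory):
    """Execute each statement as soon as it is understood."""
    for number, line in enumerate(lines, start=1):
        name, op, value = line.split()
        if op == "=":
            memory[name] = int(value)
        elif op == "+=":
            # This check happens during execution, every time the line runs.
            if name not in memory:
                raise NameError(f"line {number}: {name} used before assignment")
            memory[name] += int(value)
        else:
            raise SyntaxError(f"line {number}: unknown operator {op}")
    return memory


memory = {}
try:
    interpret(["x = 1", "x += 2", "y += 1"], memory)
except NameError as error:
    print(error)  # line 3: y used before assignment
print(memory)     # {'x': 3}
```

The final print shows that x was already updated before the problem with y was noticed.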

Compile to Bytecode

If the executable produced by a compiler is written in our CPU's instruction set, we can't just put it up on a website for others to download. Someone with a different instruction set will not be able to run our program. Instead we must compile an executable for each instruction set we want to support. This puts a lot of burden on developers who wish to share their software. Alternatively, we could release the source code and let each person compile their own executable. This puts a lot of burden on our users.

Or we can compile to bytecode instead of native machine code. Bytecode is an instruction set for a universal but fake CPU. These fake CPUs are more properly called virtual machines, since they are implemented in software rather than hardware. The developer of a virtual machine releases a version for each operating system and instruction set combination it supports. Users download our universal bytecode file and execute it on their copy of the virtual machine.

The Java implementations from Oracle and the OpenJDK project use a virtual machine. The Java Virtual Machine Specification defines the instruction set of a fake CPU called the Java Virtual Machine. When we compile a .java file, it is translated into one or more .class files full of bytecode instructions. Anyone with a Java Virtual Machine installed may run our .class files. The virtual machine will translate from bytecode to the instruction set of the real CPU on demand.

The advantages of bytecode compilation are pre-compiled portable executables and the ability to share software without giving away the source code. Bytecode execution is a middle ground between native compilation and interpretation. Many decisions are made early as our source code is translated to bytecode, but the bytecode itself must still be interpreted by the virtual machine.
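
Both halves of this middle ground can be sketched in Python. The three opcodes, the postfix toy language, and the functions compile_to_bytecode and run_bytecode below are assumptions invented for illustration; they are not the Java Virtual Machine's instruction set. The compiler produces a compact sequence of bytes once, and a small stack-based virtual machine then interprets those bytes.

```python
# Hypothetical opcodes for a made-up stack-based virtual machine.
PUSH, ADD, MUL = 0x01, 0x02, 0x03


def compile_to_bytecode(source):
    """Translate a postfix expression such as '2 3 + 4 *' into bytecode."""
    code = bytearray()
    for token in source.split():
        if token == "+":
            code.append(ADD)
        elif token == "*":
            code.append(MUL)
        else:
            value = int(token)  # malformed tokens are rejected early
            if not 0 <= value <= 255:
                raise ValueError(f"operand out of range: {token}")
            code += bytes([PUSH, value])  # opcode followed by its operand
    return bytes(code)


def run_bytecode(code):
    """The virtual machine: interpret each bytecode instruction in turn."""
    stack = []
    pc = 0  # program counter of the fake CPU
    while pc < len(code):
        opcode = code[pc]
        if opcode == PUSH:
            stack.append(code[pc + 1])
            pc += 2
        elif opcode in (ADD, MUL):
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right if opcode == ADD else left * right)
            pc += 1
        else:
            raise ValueError(f"unknown opcode {opcode}")
    return stack.pop()


bytecode = compile_to_bytecode("2 3 + 4 *")
print(list(bytecode))          # [1, 2, 1, 3, 2, 1, 4, 3]
print(run_bytecode(bytecode))  # 20
```

The bytes could be saved to a file and shared in place of the source text. Anyone with a copy of run_bytecode, on any real CPU, could then execute them.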

Tools, Not the Language

Sooner or later, you will hear someone say that C is a compiled language, JavaScript is an interpreted language, or Java is a bytecode language. That someone is misunderstanding how technologies are specified and implemented. Which of these three execution models is used is not technically a property of the language. C programs are typically compiled to an executable, but you could write a C interpreter, as others have done. JavaScript source files are typically described as interpreted, but modern browser engines compile them down to bytecode, and often further to machine code, in order to accelerate their execution. Java programs are typically compiled to bytecode, but you could build a real processor that used the Java Virtual Machine instruction set. The tooling around the language is what determines the execution model, not the language itself.
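
Python offers another illustration. It is often labeled an interpreted language, yet CPython, its reference implementation, first compiles source code to bytecode and then interprets that bytecode. The short sketch below uses the built-in compile and eval functions to separate the two steps. The expression and the variable names price and quantity are invented for illustration, and the co_code attribute is a detail of the implementation rather than of the language.

```python
source = "price * quantity"  # hypothetical expression for illustration

# Step 1: the tool compiles the source text into a code object.
code = compile(source, "<example>", "eval")
print(type(code.co_code))  # <class 'bytes'>

# Step 2: the tool's virtual machine executes the compiled code.
result = eval(code, {"price": 6, "quantity": 7})
print(result)  # 42
```

Nothing about the Python language demands this arrangement. Another implementation is free to execute the same source text in a different way.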
