On this page:
3.1 Overview
3.2 Abstract syntax for Abscond
3.3 Meaning of Abscond programs
3.4 Toward a Compiler for Abscond
3.5 An Example
3.6 A Compiler for Abscond
3.7 But is it Correct?
7.4

3 Abscond: a language of numbers

Let’s Make a Programming Language!

    3.1 Overview

    3.2 Abstract syntax for Abscond

    3.3 Meaning of Abscond programs

    3.4 Toward a Compiler for Abscond

    3.5 An Example

    3.6 A Compiler for Abscond

    3.7 But is it Correct?

3.1 Overview

A compiler is just one (optional!) component of a programming language. So if you want to make a compiler, you must first settle on a programming language to compile.

The specification of a programming language consists of two parts: the syntax, which specifies the form of programs, and semantics, which specifies the meaning of programs.

We will simplify matters of syntax by using the Lisp notation of s-expression for program phrases. The job of a parser is to construct an abstract syntax tree from the textual representation of a program. For us, we will rely on the read function to take care of parsing and essentially gloss over this (well understood, arguably boring) topic. We will focus on the abstract syntax by defining BNF grammars.

The heart of a programming language is its semantics and we will spend more time concerned this aspect.

There are a few common ways a language’s meaning is specified:

Each approach has its advantages and disadvantages. Examples are concise and unambiguous, but incomplete. Informal (prose) descriptions can be intuitive, but open to interpretation and ambiguity. Reference implementations provide precise, executable specifications, but may over specify language details by tying them to implementation artifacts. Formal definitions balance precision while allowing for under-specification, but require detailed definitions and training to understand.

We will use a combination of each.

To begin, let’s start with a dead simple programming language called Abscond. The only kind of expression in Abscond are integer literals. Running an abscond program just produces that integer. (Told you it was simple.)

3.2 Abstract syntax for Abscond

An Abscond program consists of a single expression, and the grammar of expressions is very simple:

image

So, 0, 120, -42, etc. are Abscond programs.

A datatype for representing expressions can be defined as:

; type Expr = Integer

3.3 Meaning of Abscond programs

The meaning of an Abscond program is simply the number itself. So 42 evaluates to 42.

We can write an “interpreter” that consumes an expression and produces it’s meaning:

abscond/interp.rkt

  #lang racket
  (provide (all-defined-out))
   
  ; Expr -> Integer
  ; Interpret given expression
  (define (interp e)
    e)
   

Examples:
> (interp 42)

42

> (interp -8)

-8

We can add a command line wrapper program for interpreting Abscond programs saved in files:

abscond/interp-file.rkt

  #lang racket
  (provide (all-defined-out))
  (require "interp.rkt")
   
  ;; String -> Void
  ;; Parse and interpret contents of given filename,
  ;; print result on stdout
  (define (main fn)
    (with-input-from-file fn
      (λ ()
        (let ((c (read-line)) ; ignore #lang racket line
              (p (read)))
          (unless (integer? p) (error "syntax error" p))
          (writeln (interp p))))))
   

The details here aren’t important (and you won’t be asked to write this kind of code), but this program reads the contents of a file given on the command line. If it’s an integer, i.e. a well-formed Abscond program, then it runs the intepreter and displays the result.

For example:

shell

> echo '#lang racket\n42' > 42.rkt
> racket -t interp-file.rkt -m 42.rkt
42

Even though the semantics is obvious, we can provide a formal definition of Abscond using operational semantics.

An operational semantics is a mathematical definition that characterizes the meaning of programs. We will defined the semantics of Abscond as a binary relation between programs and their meanings. So in the setting of Abscond, this binary relation will be a set of pairs of expressions and integers. This relation will be defined inductively using inference rules. For such a simple language, just a single inference rule suffices:

image

Here, we are defining a binary relation, called image, and saying every integer paired with itself is in the relation. So (2,2) is in image, (5,5) is in image, and so on.

The inference rules define the binary relation by defining the evidence for being in the relation. The rule makes use of meta-variables drawn from the non-terminals of the language grammar. A pair is in the relation if you can construct an instance of the rule (substituting some integer for i) in the rule.

(This part probably seems opaque at the moment, but it will become clearer as we work through more examples, so don’t worry.)

The operational semantics defines the meaning of Abscond programs. The intepreter computes that meaning. We can view the semantics as a specification, and the interpreter as an implementation.

Characterizing the correctness of the interpreter boils down to the following statement:

Interpreter Correctness: For all expressions e and integers i, if (e,i) in image, then (interp e) equals i.

We now have a complete (if overly simple) programming language with an operational semantics and an interpreter, which is (obviously) correct. Now let’s write a compiler.

3.4 Toward a Compiler for Abscond

A compiler, like an interpreter, is an implementation of a programming language. The key difference is that a compiler stages the work of interpreting a program into two phases. The first translates the original program (the source language) into a program in another programming language (the target language). The second runs this program. These phases, often called compile-time and run-time. The program that does the translation is the compiler. The program that does the running of the translated program is the run-time system.

So in general, the relationship between an interpreter and compiler is

(source-interp e) = (target-interp (source-compile e))

We can in principle choose any target language we’d like. For this class, we will choose the x86-64 instruction set architecture.

There are several reasons for this choice:

So our compiler will emit x86 assembly code. To make our lives a bit easier, we will write the run-time system in C. Let’s start with the Abscond runtime:

abscond/main.c

#include <stdio.h>
#include <inttypes.h>

int64_t entry();

int main(int argc, char** argv) {
  int64_t result = entry();
  printf("%" PRId64 "\n", result);
  return 0;
}

This C program provides the main entry point for running an Abscond program. It must be linked against an object file that provides the definition of enter_abscond; this is the code our compiler will emit.

The enter_abscond function computes the result of running the Abscond code, i.e. an integer. Here we are taking advantage of the x86-64 architecture by using 64-bit signed integers by using the int64_t C type.

The runtime system calls the function and prints the result.

We can compile the run-time system to get an object file. We’ll use gcc for compiling C:

shell

> gcc -m64 -c -o main.o main.c

This creates main.o; linking this file together with an object file that contains enter_abscond will produce an executable that, when run, will carry out the execution of an Abscond program.

3.5 An Example

Before trying to write the Abscond compiler, let’s first make an example of what we would like the compiler to produce for a particular example. Let’s say the Abscond program is 42. What should the assembly code for this program look like? Here we have to learn a bit about the x86-64 assembly language.

abscond/42.s

    global entry
    section .text
entry:
    mov rax, 42
    ret

Note: on macOS, labels must be prepended with _, while on Linux they are not; e.g. _entry vs entry.

Above is a x86-64 program, written in NASM syntax. We will be using nasm as our assembler in this class because it is widely used and available on most platforms.

To assemble this program into an object file, we can run the nasm assembler:

shell

> nasm -f elf64 -o 42.o 42.s

Note: on macOS, the format option -f should be macho64; on Linux it should be elf64.

This creates 42.o, an object file containing the instructions above (in binary format).

We can link this file with the run-time to produce an executable file:

shell

> gcc main.o 42.o -o 42.run

This creates the file 42.run, an exectuable program:

shell

> ./42.run
42

We now have a working example. The remaining work will be to design a compiler that takes an Abscond program and emits a file like 42.s, but with the appropriate integer literal.

3.6 A Compiler for Abscond

We will now write a compiler for Abscond. To heart of the compiler will be a function with the following signature:

; Expr -> Asm
(define (compile e) ...)

Where Asm is a data type for representing assembly programs, i.e. it will be the AST of x86-64 assembly. This datatype will evolve as we see more X86 instructions, but for now it is very simple:

; type Asm = [Listof Instruction]
 
; type Instruction =
; | ret
; | (mov ,Arg ,Arg)
; | Label
 
; type Label = Symbol
 
; type Arg =
; | Reg
; | Integer
 
; type Reg =
; | rax

So the AST representation of our example is:

'(entry
  (mov rax 42)
  ret)

Writing the compile function is easy:

abscond/compile.rkt

  #lang racket
  (provide (all-defined-out))
   
  ;; Expr -> Asm
  (define (compile e)
    `(entry
      (mov rax ,e)
      ret))
   

Examples:
> (compile 42)

'(entry (mov rax 42) ret)

> (compile 38)

'(entry (mov rax 38) ret)

To convert back to the concrete NASM syntax, we can write a (few) function(s), which we’ll place in its own module:

abscond/asm/printer.rkt

  #lang racket
  (provide (all-defined-out))
   
  ;; Asm -> String
  (define (asm->string a)
    (foldr (λ (i s) (string-append (instr->string i) s)) "" a))
   
  ;; Instruction -> String
  (define (instr->string i)
    (match i
      [`(mov ,a1 ,a2)
       (string-append "\tmov " (arg->string a1) ", " (arg->string a2) "\n")]
      [`ret "\tret\n"]
      [l (string-append (label->string l) ":\n")]))
   
  ;; Arg -> String
  (define (arg->string a)
    (match a
      [`rax "rax"]
      [n (number->string n)]))
   
  ;; Label -> String
  ;; prefix with _ for Mac
  (define label->string
    (match (system-type 'os)
      ['macosx
       (λ (s) (string-append "_" (symbol->string s)))]
      [_ symbol->string]))
   
  ;; Asm -> Void
  (define (asm-display a)
    ;; entry point will be first label
    (let ((g (findf symbol? a)))
      (display 
        (string-append "\tglobal " (label->string g) "\n"
              "\tsection .text\n"
                       (asm->string a)))))
   

Note: the printer takes care of the macOS vs Linux label convention by detecting the underlying system and printing appropriately.

Example:
> (asm-display (compile 42))

global entry

section .text

entry:

mov rax, 42

ret

Putting it all together, we can write a command line compiler much like the command line interpreter before, except now we emit assembly code:

abscond/compile-file.rkt

  #lang racket
  (provide (all-defined-out))
  (require "compile.rkt" "asm/printer.rkt")
   
  ;; String -> Void
  ;; Compile contents of given file name,
  ;; emit asm code on stdout
  (define (main fn)
    (with-input-from-file fn
      (λ ()
        (let ((c (read-line)) ; ignore #lang racket line
              (p (read)))
          (unless (integer? p) (error "syntax error" p))
          (asm-display (compile p))))))
   

Example:

shell

> racket -t compile-file.rkt -m 42.rkt
    global entry
    section .text
entry:
    mov rax, 42
    ret

Using a Makefile, we can capture the whole compilation dependencies as:

Note: the appropriate object file format is detected based on the operating system.

abscond/Makefile

UNAME := $(shell uname)
.PHONY: test

ifeq ($(UNAME), Darwin)
  format=macho64
else
  format=elf64
endif

%.run: %.o main.o
    gcc main.o $< -o $@

main.o: main.c
    gcc -c main.c -o main.o

%.o: %.s
    nasm -f $(format) -o $@ $<

%.s: %.rkt
    racket -t compile-file.rkt -m $< > $@

clean:
    rm *.o *.s *.run

test: 42.run
    @test "$(shell ./42.run)" = "42"

And now compiling Abscond programs is easy-peasy:

shell

> make 42.run
make[1]: Entering directory `/home/travis/build/cmsc430/www/www/notes/abscond'
racket -t compile-file.rkt -m 42.rkt > 42.s
nasm -f elf64 -o 42.o 42.s
gcc main.o 42.o -o 42.run
make[1]: Leaving directory `/home/travis/build/cmsc430/www/www/notes/abscond'
> ./42.run
42

It’s worth taking stock of what we have at this point, compared to the interpreter approach. To run the interpreter requires all of Racket in the run-time system.

When running a program using the interpreter, we have to parse the Abscond program, check the syntax of the program (making sure it’s an integer), then run the interpreter and print the result.

When running a program using the compiler, we still have to parse the Abscond program and check its syntax, but this work happens at compile-time. When we run the program this work will have already been done. While the compiler needs Racket to run, at run-time, Racket does not need to be available. All the run-time needs is our (very tiny) object file compiled from C. Racket doesn’t run at all – we could delete it from our computer or ship the executable to any compatible x86-64 machine and run it there. This adds up to much more efficient programs. Just to demonstrate, here’s a single data point measuring the difference between interpreting and compiling Abscond programs:

shell

> time -p racket -t interp-file.rkt -m 42.rkt
42
real 0.43
user 0.40
sys 0.03

Compiling:

shell

> time -p ./42.run
42
real 0.00
user 0.00
sys 0.00

Because Abscond is a subset of Racket, we can even compare results against interpreting the program directly in Racket:

shell

> touch 42.rkt # forces interpreter to be used
> time -p racket 42.rkt
42
real 0.35
user 0.29
sys 0.05

Moreover, we can compare our compiled code to code compiled by Racket:

shell

> raco make 42.rkt
> time -p racket 42.rkt
42
real 0.25
user 0.20
sys 0.04

3.7 But is it Correct?

At this point, we have a compiler for Abscond. But is it correct?

Here is a statement of compiler correctness:

Compiler Correctness: For all expressions e and integers i, if (e,i) in image, then (asm-interp (compile e)) equals i.

Ultimately, we want the compiler to capture the operational semantics of our language (the ground truth of what programs mean). However, from a practical stand-point, relating the compiler to the intepreter may be more straightforward. What’s nice about the interpreter is we can run it, so we can test the compiler against the interpreter. Moreover, since we claimed the interpreter is correct (w.r.t. to the semantics), testing the compiler against the interpreter is a way of testing it against the semantics, indirectly. If the compiler and interpreter agree on all possible inputs, then the compiler is correct with respect to the semantics since it is equivalent to the interpreter, and the interpreter is correct.

So, in this setting, means we have the following equivaluence:

(interp e) equals (asm-interp (compile e))

But we don’t actually have interpret-asm, a function that interprets the Asm code we generate. Instead we printed the code and had gcc assembly and link it into an executable, which the OS could run. But this is a minor distinction. We can write asm-interp to interact with the OS to do all of these steps.

Here’s such a definition. (Again: the details here are not important and we won’t ask you to write or understand this code, but roughly,q all it’s doing is emitting assembly (to a temporary file) and calling make to build the executable, then running it and parsing the result.)

abscond/asm/interp.rkt

  #lang racket
  (provide (all-defined-out))
  (require "printer.rkt" racket/runtime-path)
  (define-runtime-path dir "..")
   
  ;; Asm -> Integer
  ;; Interpret (by assemblying, linking, and exec'ing) x86-64 code
  ;; Assume: starts with entry point run-time expects
  (define (asm-interp a)
    (let* ((t.s (make-temporary-file "nasm~a.s"))
           (t.run (path-replace-extension t.s #".run")))
      (with-output-to-file t.s
        #:exists 'truncate
        (λ ()
          (asm-display a)))
      (system (format "(cd ~a && make -s ~a) 2>&1 >/dev/null" dir t.run))
      (delete-file t.s)
      (with-input-from-string
          (with-output-to-string
            (λ ()
              (system (path->string t.run))
              (delete-file t.run)))
        read)))
   

This is actually a handy tool to have for experimenting with compilation within Racket:

Examples:
> (asm-interp (compile 42))

42

> (asm-interp (compile 37))

37

> (asm-interp (compile -8))

-8

This of course agrees with what we will get from the interpreter:

Examples:
> (interp 42)

42

> (interp 37)

37

> (interp -8)

-8

We can turn this in a property-based test, i.e. a function that computes a test expressing a single instance of our compiler correctness claim:

Examples:
> (define (check-compiler e)
    (check-eqv? (interp e)
                (asm-interp (compile e))))
> (check-compiler 42)
> (check-compiler 37)
> (check-compiler -8)

This is a powerful testing technique when combined with random generation. Since our correctness claim should hold for all Abscond programs, we can randomly generate any Abscond program and check that it holds.

Examples:
> (check-compiler (random 100))
> (for ([i (in-range 10)])
    (check-compiler (random 10000)))

The last expression is taking 10 samples from the space of Abscond programs in [0,10000) and checking the compiler correctness claim on them. If the claim doesn’t hold for any of these samples, a test failure would be reported.

Finding an input to check-compiler that fails would refute the compiler correctness claim and mean that we have a bug. Such an input is called a counter-example.

On the other hand we gain more confidence with each passing test. While passing tests increase our confidence, we cannot test all possible inputs this way, so we can’t be sure our compiler is correct by testing alone. To really be sure, we’d need to write a proof, but that’s beyond the scope of this class.

At this point we have not found a counter-example to compiler correctness. It’s tempting to declare victory. But... can you think of a valid input (i.e. some integer) that might refute the correctness claim?

Think on it. In the meantime, let’s move on. }