18 Mug: symbols and interned string literals

8.1

← prev up next →

18 Mug: symbols and interned string literals

18.1 String Literals

18.2 Static Memory

18.3 Static Interning

18.4 Symbols

18.5 Dynamic Interning

18.6 Uninterned Symbols

18.7 Matching symbols and strings

18.8 Compiling Symbols and Strings

18.1 String Literals

As it currently stands in our language, string literals are dynamically allocated when they are evaluated.

This means, for example, that if we had a program like this:

(define (f) "fred")
(cons (f) (cons (f) (cons (f) '())))

This will allocate three distinct copies of the string "fred", one for each call to f. This is unfortunate since really just a single allocation of "fred" that is referenced three times could’ve worked just as well and allocated less memory.

A common approach programming language implementations take is to take every string literal that appears in a program and all allocate it once and replace occurrences of those literals with references to memory allocated for it.

This means, for example, that multiple occurrences of the same string literal evaluate to the same pointer:

Examples
> (eq? "x" "x")
#t

Note that this doesn’t mean that every string of the same characters is represented by a unique pointer. We can dynamically construct strings that will not be equal to a string literal of the same characters:

Examples
> (eq? "x" (string #\x))
#f

Let’s consider how strings were previously compiled. Here’s an assembly program that returns "Hello!":

Examples
> (require loot/compile)
> (seq (Label 'entry)
       (Mov 'rbx 'rdi)
       (compile-string "Hello!")
       (Ret))
(list
(Label 'entry)
(Mov 'rbx 'rdi)
(Mov 'rax 6)
(Mov (Offset 'rbx 0) 'rax)
(Mov 'rax 72)
(Mov (Offset 'rbx 8) 'eax)
(Mov 'rax 101)
(Mov (Offset 'rbx 12) 'eax)
(Mov 'rax 108)
(Mov (Offset 'rbx 16) 'eax)
(Mov 'rax 108)
(Mov (Offset 'rbx 20) 'eax)
(Mov 'rax 111)
(Mov (Offset 'rbx 24) 'eax)
(Mov 'rax 33)
(Mov (Offset 'rbx 28) 'eax)
(Mov 'rax 'rbx)
(Or 'rax 4)
(Add 'rbx 32)
(Ret))

We can run it just to make sure:

Examples
> (unload/free
   (asm-interp
    (seq (Label 'entry)
         (Mov 'rbx 'rdi)
         (compile-string "Hello!")
         (Ret))))
"Hello!"

Notice that this program dynamically allocates the string by executing instructions that write to memory pointed to by 'rbx and incrementing 'rbx.

But fundamentally, we shouldn’t need to do anything dynamically if we know statically that the string being return is "Hello!". We could statically allocate the memory for the string at compile-time and return a pointer to this data.

18.2 Static Memory

How can we statically allocate memory? The idea is to use memory in the program itself to store the data needed to represent the string literal. It turns out that in an a86 program you can have a section for the program text and another section with binary data. To switch between the program text and program data, we use the (Text) and (Data) directive. Once in (Data) mode we can write down data that will be placed in the program.

For example, here is a data section:

Examples
> (seq (Data)
       (Label 'hi)
       (Dq 6)
       (Dd (char->integer #\H))
       (Dd (char->integer #\e))
       (Dd (char->integer #\l))
       (Dd (char->integer #\l))
       (Dd (char->integer #\o))
       (Dd (char->integer #\!)))
(list
(Data)
(Label 'hi)
(Dq 6)
(Dd 72)
(Dd 101)
(Dd 108)
(Dd 108)
(Dd 111)
(Dd 33))

These psuedo-instructions will add to the data segment of our program 56-bytes of data. The first 8-bytes consist of the number 6. The next 4-bytes consist of the number 72, i.e. the codepoint for #\H. The next 4-bytes consist of the codepoint for #\e and so on. The names of these psuedo-instructions designate how much memory is used: Dq means 8-bytes (64-bits), while Dd means 4-bytes (32-bits).

The label 'hi is given to name this data’s location. We’ve previously seen how to load the address of a label using the Lea instruction in order to compute a place in the code to jump to. Similarly, if we load the address of 'hi, we have a pointer to the data at that location in the program.

So to write a similar program that returns "Hello!" but statically allocates the memory for the string, we could do the following:

Examples
> (unload/free
   (asm-interp
    (seq (Label 'entry)
         (Lea 'rax 'hi)
         (Or 'rax type-str)
         (Ret)
         (Data)
         (Label 'hi)
         (Dq 6)
         (Dd (char->integer #\H))
         (Dd (char->integer #\e))
         (Dd (char->integer #\l))
         (Dd (char->integer #\l))
         (Dd (char->integer #\o))
         (Dd (char->integer #\!)))))
"Hello!"

A couple things to note:

nothing is allocated in the heap memory set up by the run-time; indeed this program doesn’t use the 'rbx register at all.
Executing this program takes fewer steps than the previous version; when the 'entry label is called, it executes an Lea and Or instruction and returns.

This is pretty big improvement over the previous approach since the number of instructions to execute were proportional to the size of the string being compiled. Now we simply load the address of the static data in a small, constant number of instructions.

In fact, we can do one better. The Or instruction is there in order to tag the pointer to 'hi as a string. There’s really no reason to do this at run-time; we should be able to add the tag statically so that just a single load instruction suffices. The goal is to add the tag to the address of 'hi at compile time, but the location of the label is actually not fully known until link time. Our assembler has a way of resolving this by allowing us to write expressions involving labels and constants that will be computed at link time.

Here is a version of the same program that avoids the Or instruction, instead computing that type tagging at link time:

Examples
> (unload/free
   (asm-interp
    (seq (Label 'entry)
         (Lea 'rax (Plus 'hi type-str))
         (Ret)
         (Data)
         (Label 'hi)
         (Dq 6)
         (Dd (char->integer #\H))
         (Dd (char->integer #\e))
         (Dd (char->integer #\l))
         (Dd (char->integer #\l))
         (Dd (char->integer #\o))
         (Dd (char->integer #\!)))))
"Hello!"

So one idea is to use static data to represent string literals. This reduces the run-time memory that is allocated and makes is more efficient to evaluate string literals. We could replace the old compile-string function with the following:

Examples
> (define (compile-string s)
    (let ((l (gensym 'string)))
      (seq (Data)
           (Label l)
           (Dq (string-length s))
           (map Dd (map char->integer (string->list s)))
           (Text)
           (Lea 'rax (Plus l type-str)))))
> (compile-string "Hello!")
(list
(Data)
(Label 'string6874)
(Dq 6)
(Dd 72)
(Dd 101)
(Dd 108)
(Dd 108)
(Dd 111)
(Dd 33)
(Text)
(Lea 'rax (Plus 'string6874 4)))
> (unload/free
   (asm-interp
    (seq (Label 'entry)
         (compile-string "Hello!")
         (Ret))))
"Hello!"

Now, while this does allocate string literals statically, using memory within to the program to store the string, it doesn’t alone solve the problem with string literals being represented uniquely.

18.3 Static Interning

We’ve seen static memory, but we still need to make sure every string literal is allocated just once.

Here is the basic idea:

Collect all of the string literals in the program.
For each distinct string literal, compile it to static data as described above, labelling the data location.
For each string literal expression, compile it to a reference to the appropropiate label for that string.

For example, let’s say we want to compile this program:

(begin "Hello!"
"Hello!")

We’d like it to compile to something like this:

(seq (Mov 'rax (Add 'hi type-str))
     (Mov 'rax (Add 'hi type-str))
     (Ret)
     (Data)
     (Label 'hi)
     (Dq 6)
     (Dd (char->integer #\H))
     (Dd (char->integer #\e))
     (Dd (char->integer #\l))
     (Dd (char->integer #\l))
     (Dd (char->integer #\o))
     (Dd (char->integer #\!)))

Notice how the two occurrences of "Hello!" turn into the instruction (Mov 'rax (Add 'hi type-str)). The labelled location 'hi contains the data for the string and it is statically allocated just once.

In order to do this, we need to maintain an association between unique string literals and the labels our compiler will choose to label their static data.

We could do this by making a pass over the program to compute this association. Initially it would be empty and every time a string literal was encountered, we’d check to see if it’s already in the association. If it is, there’s nothing to be done. If isn’t, we’d generate a new label and add it to the association.

This association would have to be added as a parameter to each of our compile-e functions and string literals would consult the association to emit the (Mov 'rax (Add label type-str)) instruction.

We’d also take every label and string pair in the association and compile the string data to static data labelled with the associated label.

However, here’s a fun “trick” we can employ to avoid having to explicitly represent this association between strings and their labels.

Strings can be converted to symbols, and symbols can be used as labels. Symbols that consist of the same characters are guaranteed to be pointer-equal to each other, so by converting a string to a symbol, we can take advantage of our implementation language’s (Racket’s) facility for interning to help us implement interning in our compiler.

So here is our revised apporach will produce code like this for our example program:

(seq (Mov 'rax (Add (symbol->label (string->symbol "Hello!")) type-str))
     (Mov 'rax (Add (symbol->label (string->symbol "Hello!")) type-str))
     (Ret)
     (Data)
     (Label (symbol->label (string->symbol "Hello!")))
     (Dq 6)
     (Dd (char->integer #\H))
     (Dd (char->integer #\e))
     (Dd (char->integer #\l))
     (Dd (char->integer #\l))
     (Dd (char->integer #\o))
     (Dd (char->integer #\!)))

So now an occurrence of a string literal str can be compiled as (Mov 'rax (string->label (string->symbol str))); no association needs to be maintained explicity.

; String -> Asm
(define (compile-string s)
(seq (Lea 'rax (Plus (symbol->label (string->symbol s)) type-str))))

So here’s how an occurrence of "Hello!" is compiled:

Examples
> (compile-string "Hello!")
(list
(Lea
'rax
(Plus 'label_Hello__7bf299e401ab47a 4)))

We still need to compile the set of string literals that appear in the program into statically allocated data, so for this we will write a function:

; Prog -> [Listof Symbol]
(define (literals p) ...)

This will produce the set of strings that appear literally in the program text. Each string will be converted to its symbol representation. The string representation is easy to recover by using symbol->string.

This function is straightforwad, if a bit tedious, to write. It traverses the AST. Recursive results are collected with append; when a string node (Str s) is encountered, it produces (list (string->symbol s)). After all of the strings have been collected, a final call to remove-duplicates ensures a list of unique symbols is returned.

Examples
> (literals (parse '["Hello!"]))
'(Hello!)
> (literals (parse '[(begin "Hello!" "Hello!")]))
'(Hello!)
> (literals (parse '[(begin "Hello!" "Fren")]))
'(Hello! Fren)
> (literals (parse '[(define (f x) "Hello!")
(cons (f "Fren") (cons (f "Hello!") '()))]))
'(Hello! Fren)

Using literals, we can write a function that compiles all of the string literals into static data as follows:

Examples
; Prog -> Asm
> (define (compile-literals p)
    (append-map compile-literal (literals p)))
; [Listof Char] -> Asm
> (define (compile-string-chars cs)
    (match cs
      ['() (seq)]
      [(cons c cs)
       (seq (Dd (char->integer c))
            (compile-string-chars cs))]))
; Symbol -> Asm
> (define (compile-literal s)
    (let ((str (symbol->string s)))
      (seq (Label (symbol->label s))
           (Dq (string-length str))
           (compile-string-chars (string->list str))
           (if (odd? (string-length str))
               (seq (Dd 0))
               (seq)))))
> (seq (compile-string "Hello!")
       (compile-string "Hello!")
       (compile-literal 'Hello!))
(list
(Lea 'rax (Plus 'label_Hello__7bf299e401ab47a 4))
(Lea 'rax (Plus 'label_Hello__7bf299e401ab47a 4))
(Label 'label_Hello__7bf299e401ab47a)
(Dq 6)
(Dd 72)
(Dd 101)
(Dd 108)
(Dd 108)
(Dd 111)
(Dd 33))

We’ve seemingly reached our goal. However, there is a fairly nasty little bug with our approach. Can you spot it?

Here’s a hint: we are generating labels based on the content of string literals. What else do we generate labels based off of and is it possible to create a conflict?

The answer is yes. Consider this program:

(define (Hello! x) "Hello!")
42

It contains both a function called Hello! and a string literal "Hello!". Unfortunately, the label used both for the function and the string data will be (symbol->label 'Hello!). If the compiler emits two definitions of this label, the assembler will complain and fail to assemble the program.

The solution is simple, when generating labels for data, we will use a different symbol to label function, let’s call it symbol->data-label that is guaranteed to produce disjoint labels from symbol->label, which we will continue to use for code labels.

Using this function in all the places we used symbol->label will resolve the issue and our problematic program will now have two different labels defined in it:

Examples
> (symbol->label 'Hello!)
'label_Hello__7bf299e401ab47a
> (symbol->data-label 'Hello!)
'data_Hello__7bf299e401ab47a

So now we have accomplished our goal: string literals are statically allocated and different occurrence of the same string literal are considered eq? to each other:

Examples
> (seq (compile-string "Hello!")
(compile-string "Hello!")
(compile-literal 'Hello!))
(list
(Lea 'rax (Plus 'label_Hello__7bf299e401ab47a 4))
(Lea 'rax (Plus 'label_Hello__7bf299e401ab47a 4))
(Label 'label_Hello__7bf299e401ab47a)
(Dq 6)
(Dd 72)
(Dd 101)
(Dd 108)
(Dd 108)
(Dd 111)
(Dd 33))

We can try it out to confirm some examples.

Examples
> (define (run . p)
(unload/free (asm-interp (compile (parse p)))))
> (run "Hello!")
"Hello!"
> (run '(begin "Hello!" "Hello!"))
"Hello!"
> (run '(eq? "Hello!" "Hello!"))
#t
> (run '(eq? "Hello!" "Fren"))
#f
> (run '(define (Hello! x) "Hello!")
'(eq? (Hello! 42) "Hello!"))
#t

It’s still worth noting that only string literals are interned. Dynamically created strings are not pointer-equal to structurally equal string literals:

Examples
> (run '(eq? "fff" (make-string 3 #\f)))
#f

This is why we refer to this kind of interning as “static” interning.

Let us now turn to a new, but familar, data type that supports a stronger sense of interning: the symbol.

18.4 Symbols

One basic data type that we’ve used frequently in the writing of our compiler, but which is not currently accounted for in our language is that of symbols.

At first cut, a symbol is alot like a string: the name of a symbol consists of some textual data. We can represent a symbol much like we represent a string: using a tagged pointer to a sized array of characters that comprise the name of the symbol.

In fact, we made extensive use of this in our implementation of static interning for string literals. This section will now uncover how symbols do their (dynamic) interning.

From a syntax point of view, we add a new AST constructor for symbols and names of the new operations:

mug/ast.rkt
;; type Expr = ...
;;           | (Symb Symbol)
;; type Op1  = ...
;;           | 'symbol? | 'symbol->string
;;           | 'string->symbol | 'string->uninterned-symbol
(struct Symb (s) #:prefab)

The parser is updated to construct such AST nodes when it encounters a symbol:

Examples
> (parse-e ''foo)
'#s(Symb foo)

We can create a new pointer type tag:

mug/types.rkt
(define type-symb #b110)

The run-time system has to be updated to handle symbol results and the printer is updated to properly print symbols, but all of this follows the blueprint of strings. It’s simply a different tag and a slightly different printer, which uses and initial ’ delimiter instead of an initial " and subsequent " delimiter.

But one of the key differences between strings and symbols is that symbols that have the same name are considered the same, i.e. they should be represented by the same pointer.

This means that two symbols of the same name should be eq? to each other:

Examples
> (eq? 'x 'x)
#t

Having seen how string literals are handled, you can see that symbol literals are like string literals and we can take a similar approach to transform a program into one that statically allocates all of the symbols that appear in the program and replace their occurrences with references.

Again, we just follow the blueprint of strings.

The key additions are a function for compiling symbol occurrences:

; Symbol -> Asm
(define (compile-symbol s)
(seq (Lea 'rax (Plus (symbol->data-label s) type-symb))))

Which works as follows:

Examples
> (compile-symbol 'Hello!)
(list
(Lea 'rax (Plus 'data_Hello__7bf299e401ab47a 6)))

And the literals function should now include a case for (Symb sym) to return (list sym).

Examples
> (literals (parse '['Hello!]))
'(Hello!)

You might worry that programs that have similar strings and symbols may cause problem. Since literals on the following program only returns a single literal:

Examples
> (literals (parse '[(begin "Hello!" 'Hello!)]))
'(Hello!)

But actually this is just fine. What happens is that only a single chunk of memory is allocated to hold the character data H, e, l, l, o, !, but the symbol 'Hello is represented as a pointer to this data tagged as a symbol, while the string "Hello" is represent as the same pointer, but tagged as a string. So this program compiles to:

Examples
> (seq (compile-string "Hello!")
(compile-symbol 'Hello!)
(compile-literal 'Hello!))
(list
(Lea 'rax (Plus 'label_Hello__7bf299e401ab47a 4))
(Lea 'rax (Plus 'data_Hello__7bf299e401ab47a 6))
(Label 'label_Hello__7bf299e401ab47a)
(Dq 6)
(Dd 72)
(Dd 101)
(Dd 108)
(Dd 108)
(Dd 111)
(Dd 33))

We have now added a symbol data type and have implement static interning just as we did for strings.

However this strategy alone won’t fully solve the problem of symbol identity because it is possible to dynamically create symbols and even then it should be the case that symbols with the same name are “the same.” This in contrast to how strings work:

Examples
> (eq? 'x (string->symbol (string #\x)))
#t

Here we are creating a symbol dynamically, using the string "x" to specify the name of the symbol. Comparing it to a 'x that appears statically should still produce #t.

This was in fact a critical property we relied upon in implementing static string interning.

This latter example shows that we need to dynamically ensure symbols of the same name evaluate to unique pointers.

18.5 Dynamic Interning

Static interning requires identical static occurrences of data to have a unique representation. Dynamic interning requires identical data, regardless of when it’s created, to have a unique representation. Symbols are like strings that support dynamic interning.

This is going to require more support from our run-time system.

Essentially, the run-time systems needs to keep track of all of the symbols that have appeared so far during the running of the program. When a new symbol is dynamically created, e.g. through string->symbol, the run-time will check whether this symbol has been seen before (based on the characters of its name). If it has been seen before, the run-time can give us the pointer for the previous use of the symbol, thus preserving the pointer-equality between this symbol and any other occurrences.

On the other hand if the run-time has not see this symbol, it can allocate memory for it, return the pointer, and remember in the future that this symbol has been seen.

To accomplish this, we will implement a symbol table. It associates symbol names, i.e. the characters of a symbol, with pointers to symbols. When a program wishes to create a symbol, it confers with the table to either fetch an existing pointer for the symbol or create a new on, updating the table.

To implement this table, we’ll use a binary search tree of symbols, represented in C as. We have a globally defined pointer symbol_tbl is which is initially empty (NULL). The work of dynamically interning a symbol will be done by the intern_symbol function. It searches the BST, using symb_cmp to compare symbols for alphabetic ordering. If an entry is found, it returns the previously seen symbol, otherwise it adds the symbol to the table and returns it.

mug/symbol.c

#include <stdlib.h>
#include <inttypes.h>
#include "values.h"

int symb_cmp(const val_symb_t *, const val_symb_t *);

// binary tree node
struct Node {
  val_symb_t* elem;
  struct Node* left;
  struct Node* right;
};

static struct Node *symbol_tbl = NULL;

val_symb_t *intern_symbol(val_symb_t* symb)
{
  struct Node **curr = &symbol_tbl;

  while (*curr) {
    struct Node *t = *curr;
    int r = symb_cmp(symb, t->elem);
    if (r == 0) {
      // found it, so return saved pointer
      return t->elem;
    } else if (r < 0) {
      curr = &t->left;
    } else {
      curr = &t->right;
    }
  }

  // wasn't found, so insert it and return pointer
  *curr = calloc(1, sizeof(struct Node));
  (*curr)->elem = symb;
  return (*curr)->elem;
}

int symb_cmp(const val_symb_t *s1, const val_symb_t *s2)
{
  if (s1 == s2) return 0;

  int64_t len1 = s1->len;
  int64_t len2 = s2->len;

  int64_t len = len1 < len2 ? len1 : len2;
  int i;

  for (i = 0; i < len; i++) {
    if (s1->codepoints[i] != s2->codepoints[i])
      return s1->codepoints[i] - s2->codepoints[i];
  }

  return len1 - len2;
}

The idea will be that every time a symbol is constructed, we call intern_symbol to intern it.

So in addition to collecting all of the literals and compiling each to static data, we will need to collect all of the symbols and emit a call to intern_symbol at the start of the program.

To accomplish this, we’ll design a function:

; Prog -> Asm
; Initialize the symbol table with all the symbols that occur statically
(define (init-symbol-table p) ...)

Here’s what it will produce for some example programs:

Examples
> (init-symbol-table (parse '['Hello!]))
(list
(Sub 'rsp 8)
(Lea 'rdi 'data_Hello__7bf299e401ab47a)
(Call 'intern_symbol)
(Add 'rsp 8))
> (init-symbol-table (parse '[(begin 'Hello! 'Hello!)]))
(list
(Sub 'rsp 8)
(Lea 'rdi 'data_Hello__7bf299e401ab47a)
(Call 'intern_symbol)
(Add 'rsp 8))
> (init-symbol-table (parse '["Hello!"]))
'()
> (init-symbol-table (parse '[(define (Hello! x) 'Hello!)
(Hello! 'Fren)]))
(list
(Sub 'rsp 8)
(Lea 'rdi 'data_Hello__7bf299e401ab47a)
(Call 'intern_symbol)
(Lea 'rdi 'data_Fren_8706c9444e3575c)
(Call 'intern_symbol)
(Add 'rsp 8))

For each unique symbol in the program, it emits two instructions:

move the address of the symbol’s data into 'rdi, the register used for the first argument in the System V ABI,
call intern_symbol.

We know that initially the table is empty, so each of these calls will insert the given symbols into the table ensure that if any subsequent symbol is interned that has the same character data, call intern_symbol will produce the original pointer to static data for that symbol.

Now we can implement the two operations string->symbol and symbol->string. Here’s what we do for string->symbol:

; Op1 -> Asm
(define (compile-op1 p)
  (match p
    ; ...
    ['string->symbol
     (seq (assert-string rax)
          (Xor rax type-str)
          (Mov rdi rax)
          pad-stack
          (Call 'intern_symbol)
          unpad-stack
          (Or rax type-symb))]))

This first does some type-tag checking to make sure the argument is a string, then it untags the pointer and moves it to the 'rdi register in order to call intern_symbol. The address of the interned symbol is returned in 'rax, which is then tagged as being a symbol.

We can now confirm that dynamically created symbols are still pointer-equal to symbols that statically appear in the program:

Examples
> (run '(eq? 'fff (string->symbol (make-string 3 #\f))))
#t

Even creating two symbols dynamically will result in the same pointer so long as they are spelled the same:

Examples
> (run '(eq? (string->symbol (make-string 3 #\a))
(string->symbol (make-string 3 #\a))))
#t

Going the other direction from symbols to strings is easy: we copy the string data and tag the pointer as a string. Note that we could get away will simply retagging the pointer and not actually copying the string, but we make a copy to mimic Racket’s behavior and to be safe should we add string mutation operations.

; Op1 -> Asm
(define (compile-op1 p)
  (match p
    ; ...
    ['symbol->string
     (seq (assert-symbol rax)
          (Xor rax type-symb)
          char-array-copy
          (Or rax type-str))]))

; Asm
; Copy sized array of characters pointed to by rax
(define char-array-copy
  (seq (Mov rdi rbx)            ; dst
       (Mov rsi rax)            ; src
       (Mov rdx (Offset rax 0)) ; len
       (Add rdx 1)              ; #words = 1 + (len+1)/2
       (Sar rdx 1)
       (Add rdx 1)
       (Sal rdx 3)              ; #bytes = 8*#words
       pad-stack
       (Call 'memcpy)
       unpad-stack
       (Mov rax rbx)
       (Add rbx rdx)))

The char-array-copy sequence of instructions sets up a call to C’s memcpy function giving the address of the string data as the source, the current heap pointer as the destination, and the number of bytes which will be copied. After the call returns, the heap pointer is incremented by that number of copied bytes.

We can see that this works:

Examples
> (run '(symbol->string 'foo))
"foo"

To observe the copying behavior, notice:

Examples
> (run '(eq? (symbol->string 'foo) (symbol->string 'foo)))
#f

18.6 Uninterned Symbols

Sometimes it is useful to create a symbol that is distinct from all other symbols. We’ve relied on the ability to create a symbol with this property whenever we used the gensym operation. What gensym produces is an uninterned symbol. Even if you constructed a symbol with the same letters, it would be a different pointer from the symbol created by a call to gensym.

To add this ability, we will add an precursor to gensym called string->uninterned-symbol. It consumes a string and produces a symbol with the same letters, but distinct from all other symbols, even those that are spelled the same.

Examples
> (eq? 'Hello! (string->uninterned-symbol "Hello!"))
#f

Calling string->uninterned-symbol twice with the same string will produce two different symbols:

Examples
> (eq? (string->uninterned-symbol "Hello!")
(string->uninterned-symbol "Hello!"))
#f

Implementing string->uninterned-symbol is fairly simple: we allocate a new symbol, thereby ensuring it is unique and then simple avoid calling intern_symbol:

; Op1 -> Asm
(define (compile-op1 p)
  (match p
    ; ...
    ['string->uninterned-symbol
     (seq (assert-string rax)
          (Xor rax type-str)
          char-array-copy
          (Or rax type-symb))]))

We can confirm this works as expected:

Examples
> (run '(string->uninterned-symbol "foo"))
'foo
> (run '(eq? 'foo (string->uninterned-symbol "foo")))
#f
> (run '(eq? (string->uninterned-symbol "foo")
(string->uninterned-symbol "foo")))
#f

With that, we have completed the implementation of symbols and strings with the proper interning behavior.

18.7 Matching symbols and strings

Since we have match in our language, we should probably add the ability to match against strings and symbols.

We can extend the AST definition for patterns:

mug/ast.rkt
;; type Pat = ...
;;          | (PSymb Symbol)
;;          | (PStr String)
(struct PSymb (s) #:prefab)
(struct PStr (s)  #:prefab)

Extending the interpreter is straightforward:

mug/interp.rkt
;; Pat Value Env -> [Maybe Env]
(define (interp-match-pat p v r)
  (match p
    ; ...
    [(PSymb s) (and (eq? s v) r)]
    [(PStr s)  (and (string? v) (string=? s v) r)]))

Extending the compiler is more involved, but essentially boils down to doing exactly what the interpreter is doing above:

mug/compile-expr.rkt
;; Pat CEnv Symbol -> (list Asm Asm CEnv)
(define (compile-pattern p cm next)
  (match p
    ; ...
    [(PStr s)
     (let ((fail (gensym)))
       (list (seq (Lea rdi (symbol->data-label (string->symbol s)))
                  (Mov r8 rax)
                  (And r8 ptr-mask)
                  (Cmp r8 type-str)
                  (Jne fail)
                  (Xor rax type-str)
                  (Mov rsi rax)
                  pad-stack
                  (Call 'symb_cmp)
                  unpad-stack
                  (Cmp rax 0)
                  (Jne fail))
             (seq (Label fail)
                  (Add rsp (* 8 (length cm)))
                  (Jmp next))
             cm))]
    [(PSymb s)
     (let ((fail (gensym)))
       (list (seq (Lea r9 (Plus (symbol->data-label s) type-symb))
                  (Cmp rax r9)
                  (Jne fail))
             (seq (Label fail)
                  (Add rsp (* 8 (length cm)))
                  (Jmp next))
             cm))]))

The implementation of string matching uses the symb_cmp function from the run-time system, checking whether it returns 0 to indicate the strings are the same. (Although the function is concerned with comparing symbols, symbols and strings are represented the same, so it works just as well to compare strings.)

We can confirm some examples:

Examples
> (run '(match 'foo
          ['foo  1]
          ["foo" 2]))
1
> (run '(match "foo"
          ['foo  1]
          ["foo" 2]))
2
> (run '(match (cons '+ (cons 1 (cons 2 '())))
          [(cons '+ (cons x (cons y '())))
           (+ x y)]))
3

18.8 Compiling Symbols and Strings

We can now put the pieces together for the complete compiler.

We do a bit of housekeeping and move the code for compiling expressions to its own module: mug/compile-expr.rkt.

The top-level compiler is now:

mug/compile.rkt
#lang racket
(provide (all-defined-out))
(require "ast.rkt"
         "types.rkt"
         "lambdas.rkt"
         "fv.rkt"
         "utils.rkt"
         "compile-define.rkt"
         "compile-expr.rkt"
         "compile-literals.rkt"
         a86/ast)

;; Registers used
(define rbx 'rbx) ; heap
(define rsp 'rsp) ; stack
(define rdi 'rdi) ; arg

;; type CEnv = [Listof Id]

;; Prog -> Asm
(define (compile p)
  (match p
    [(Prog ds e)
     (prog (externs)
           (Global 'entry)
           (Label 'entry)
           (Mov rbx rdi) ; recv heap pointer
           (init-symbol-table p)
           (compile-defines-values ds)
           (compile-e e (reverse (define-ids ds)) #t)
           (Add rsp (* 8 (length ds))) ;; pop function definitions
           (Ret)
           (compile-defines ds)
           (compile-lambda-defines (lambdas p))
           (Label 'raise_error_align)
           pad-stack
           (Call 'raise_error)
           (Data)
           (compile-literals p))]))

(define (externs)
  (seq (Extern 'peek_byte)
       (Extern 'read_byte)
       (Extern 'write_byte)
       (Extern 'raise_error)
       (Extern 'intern_symbol)
       (Extern 'symb_cmp)
       (Extern 'memcpy)))

The work of compiling literals and emitting calls to initialize the symbol table is contained in its own module:

mug/compile-literals.rkt
#lang racket
(provide compile-literals init-symbol-table literals compile-string-chars)
(require "ast.rkt"
         "utils.rkt"
         a86/ast)

(define rdi 'rdi)

;; Prog -> Asm
(define (compile-literals p)
  (append-map compile-literal (literals p)))

;; Symbol -> Asm
(define (compile-literal s)
  (let ((str (symbol->string s)))
    (seq (Label (symbol->data-label s))
         (Dq (string-length str))
         (compile-string-chars (string->list str))
         (if (odd? (string-length str))
             (seq (Dd 0))
             (seq)))))

;; Prog -> Asm
;; Call intern_symbol on every symbol in the program
(define (init-symbol-table p)
  (match (symbols p)
    ['() (seq)]
    [ss  (seq (Sub 'rsp 8)
              (append-map init-symbol ss)
              (Add 'rsp 8))]))

;; Symbol -> Asm
(define (init-symbol s)
  (seq (Lea rdi (symbol->data-label s))
       (Call 'intern_symbol)))

;; Prog -> [Listof Symbol]
(define (literals p)
  (remove-duplicates
   (map to-symbol (literals* p))))

;; Prog -> [Listof Symbol]
(define (symbols p)
  (remove-duplicates (filter symbol? (literals* p))))

;; (U String Symbol) -> Symbol
(define (to-symbol s)
  (if (string? s)
      (string->symbol s)
      s))

;; Prog -> [Listof (U Symbol String)]
(define (literals* p)
  (match p
    [(Prog ds e)
     (append (append-map literals-d ds) (literals-e e))]))

;; Defn -> [Listof (U Symbol String)]
(define (literals-d d)
  (match d
    [(Defn f xs e)
     (literals-e e)]))

;; Expr -> [Listof (U Symbol String)]
(define (literals-e e)
  (match e
    [(Str s)  (list s)]
    [(Symb s) (list s)]
    [(Prim1 p e)
     (literals-e e)]
    [(Prim2 p e1 e2)
     (append (literals-e e1) (literals-e e2))]
    [(Prim3 p e1 e2 e3)
     (append (literals-e e1) (literals-e e2) (literals-e e3))]
    [(If e1 e2 e3)
     (append (literals-e e1) (literals-e e2) (literals-e e3))]
    [(Begin e1 e2)
     (append (literals-e e1) (literals-e e2))]
    [(Let x e1 e2)
     (append (literals-e e1) (literals-e e2))]
    [(App e1 es)
     (append (literals-e e1) (append-map literals-e es))]
    [(Lam f xs e)
     (literals-e e)]
    [(Match e ps es)
     (append (literals-e e) (append-map literals-match-clause ps es))]
    [_ '()]))

;; Pat Expr -> [Listof (U Symbol String)]
(define (literals-match-clause p e)
  (append (literals-pat p) (literals-e e)))

;; Pat -> [Listof (U Symbol String)]
(define (literals-pat p)
  (match p
    [(PSymb s) (list s)]
    [(PStr s)  (list s)]
    [(PBox p) (literals-pat p)]
    [(PCons p1 p2) (append (literals-pat p1) (literals-pat p2))]
    [(PAnd p1 p2) (append (literals-pat p1) (literals-pat p2))]
    [_ '()]))

;; [Listof Char] -> Asm
(define (compile-string-chars cs)
  (match cs
    ['() (seq)]
    [(cons c cs)
     (seq (Dd (char->integer c))
          (compile-string-chars cs))]))

← prev up next →

	Syllabus
	Texts
	Schedule
	Notes
	Assignments
	Midterms
	Project
	Software

1	What is a Compiler?
2	From OCaml to Racket
3	a86: a Little Assembly Language
4	Abscond: a language of numbers
5	Blackmail: incrementing and decrementing
6	Con: branching with conditionals
7	Dupe: a duplicity of types
8	Dodger: addressing a lack of character
9	Evildoer: change the world a couple nibbles at a time
10	Extort: when errors exist
11	Fraud: local binding, variables, and binary operations
12	Hustle: heaps and lists
13	Hoax: vectors and strings
14	Iniquity: function definitions and calls
15	Jig: jumping to tail calls
16	Knock: pattern matching
17	Loot: lambda the ultimate
18	Mug: symbols and interned string literals
19	Mountebank: quote and compound static data
20	Neerdowell: structures

18.1	String Literals
18.2	Static Memory
18.3	Static Interning
18.4	Symbols
18.5	Dynamic Interning
18.6	Uninterned Symbols
18.7	Matching symbols and strings
18.8	Compiling Symbols and Strings