On this page:
8.1 Characters
8.2 Characters in Racket
8.3 Meaning of Dodger programs
8.4 Ex uno plures iterum:   Out of One, Many... Again
8.5 A Compiler for Dodger
8.6 A Run-Time for Dodger
8.1

8 Dodger: addressing a lack of character

There are 11 types of values...

    8.1 Characters

    8.2 Characters in Racket

    8.3 Meaning of Dodger programs

    8.4 Ex uno plures iterum: Out of One, Many... Again

    8.5 A Compiler for Dodger

    8.6 A Run-Time for Dodger

8.1 Characters

In Dupe: a duplicity of types, we saw how to accomodate disjoint datatypes, namely integers and booleans. Let’s add yet another: the character type. Conceptually, there’s not much new here (hence we stay in the "D" family of languages); we’re simply adding a third type, which we distinguish from the other two by using more bits to tag the type.

We’ll call it Dodger.

To the syntax of expressions, we add character literals.

We will also add the following operations:

Abstract syntax is modelled with the following datatype definition:

dodger/ast.rkt

  #lang racket
  (provide Int Bool Char Prim1 If)
   
  ;; type Expr =
  ;; | (Int Integer)
  ;; | (Bool Boolean)
  ;; | (Char Character)
  ;; | (Prim1 Op Expr)
  ;; | (If Expr Expr Expr)
  ;; type Op = 'add1 | 'sub1 | 'zero?
  ;;         | 'char? | 'integer->char | 'char->integer
  (struct Int (i)       #:prefab)
  (struct Bool (b)      #:prefab)
  (struct Char (c)      #:prefab)
  (struct Prim1 (p e)   #:prefab)
  (struct If (e1 e2 e3) #:prefab)
   

The s-expression parser is defined as follows:

dodger/parse.rkt

  #lang racket
  (provide parse)
  (require "ast.rkt")
   
  ;; S-Expr -> Expr
  (define (parse s)
    (match s
      [(? integer?) (Int s)]
      [(? boolean?) (Bool s)]
      [(? char?)    (Char s)]
      [(list (? op1? o) e)  (Prim1 o (parse e))]
      [(list 'if e1 e2 e3)
       (If (parse e1) (parse e2) (parse e3))]
      [_ (error "Parse error")]))
   
  ;; Any -> Boolean
  (define (op1? x)
    (memq x '(add1 sub1 zero? char? integer->char char->integer)))
   
8.2 Characters in Racket

Racket has a Character data type for representing single letters. A Racket character can represent any of the 1,114,112 Unicode code points.

The way a character is most often written is an octothorp, followed by a backslash, followed by the character itself. So for example the character a is written #\a. The character λ is written #\λ. The character is written #\文.

A character can be converted to an integer and vice versa:

Examples

> (char->integer #\a)

97

> (char->integer #\λ)

955

> (char->integer #\文)

25991

> (integer->char 97)

#\a

> (integer->char 955)

#\λ

> (integer->char 25991)

#\文

However, integers in the range of valid code points are acceptable to integer->char and using any other integer will produce an error:

Examples

> (integer->char -1)

integer->char: contract violation

  expected: valid-unicode-scalar-value?

  given: -1

> (integer->char 55296)

integer->char: contract violation

  expected: valid-unicode-scalar-value?

  given: 55296

There are a few other ways to write characters (see the Racket Reference for the details), but you don’t have to worry much about this since read takes care of reading characters in all their different forms. The run-time system, described below, takes care of printing them.

8.3 Meaning of Dodger programs

The semantics are omitted for now (there’s really nothing new that’s interesting).

The interpeter is much like that of Dupe, except we have a new base case:

dodger/interp.rkt

  #lang racket
  (provide interp)
  (require "ast.rkt" "interp-prim.rkt")
   
  ;; type Value =
  ;; | Integer
  ;; | Boolean
  ;; | Character
   
  ;; Expr -> Value
  (define (interp e)
    (match e
      [(Int i) i]
      [(Bool b) b]
      [(Char c) c]
      [(Prim1 p e)
       (interp-prim1 p (interp e))]    
      [(If e1 e2 e3)
       (if (interp e1)
           (interp e2)
           (interp e3))]))
   

And the interpretation of primitives is extended:

dodger/interp-prim.rkt

  #lang racket
  (provide interp-prim1)
   
  ;; Op1 Value -> Value
  (define (interp-prim1 op v)
    (match op
      ['add1          (add1 v)]
      ['sub1          (sub1 v)]
      ['zero?         (zero? v)]
      ['char?         (char? v)]    
      ['integer->char (integer->char v)]
      ['char->integer (char->integer v)]))
   

The meaning of characters and their operations are just lifted from Racket.

We can try out some examples:

Examples

> (interp (Char #\a))

#\a

> (interp (Char #\b))

#\b

> (interp (Prim1 'char? (Char #\a)))

#t

> (interp (Prim1 'char? (Bool #t)))

#f

> (interp (Prim1 'char->integer (Char #\a)))

97

> (interp (Prim1 'integer->char (Prim1 'char->integer (Char #\a))))

#\a

Just as in Dupe, type errors result in the interpreter crashing:

Examples

> (interp (Prim1 'char->integer (Bool #f)))

char->integer: contract violation

  expected: char?

  given: #f

Also, not every integer corresponds to a character, so when integer->char is given an invalid input, it crashes (more on this in a minute):

Examples

> (interp (Prim1 'integer->char (Int -1)))

integer->char: contract violation

  expected: valid-unicode-scalar-value?

  given: -1

8.4 Ex uno plures iterum: Out of One, Many... Again

We have exactly the same problem as in Dupe: we need to represent different kinds of values within our one primordial datatype: the 64-bit integer.

We can use the following encoding scheme:

Notice that each kind of value is disjoint.

We can write an interpreter that operates at the level of bits just as we did for Dupe; notice that it only ever constructs characters at the very end when converting from bits to values. Let’s first define our bit encodings:

dodger/types.rkt

  #lang racket
  (provide (all-defined-out))
   
  (define int-shift     1)
  (define char-shift    2)
  (define type-int    #b0)
  (define type-char  #b01)
  (define mask-char  #b11)
  (define val-true  #b011)
  (define val-false #b111)
   
  (define (bits->value b)
    (cond [(= type-int (bitwise-and b #b1))         
           (arithmetic-shift b (- int-shift))]
          [(= type-char (bitwise-and b #b11))
           (integer->char (arithmetic-shift b (- char-shift)))]
          [(= b val-true)  #t]
          [(= b val-false) #f]
          [else (error "invalid bits")]))
   
  (define (value->bits v)
    (cond [(integer? v) (arithmetic-shift v int-shift)]
          [(char? v)
           (bitwise-ior type-char
                        (arithmetic-shift (char->integer v) char-shift))]
          [(eq? v #t) val-true]
          [(eq? v #f) val-false]))
   
   

And now the interpreter:

dodger/interp-bits.rkt

  #lang racket
  (provide interp interp-bits)
  (require "ast.rkt" "types.rkt")
   
  ;; type Value =
  ;; | Integer
  ;; | Boolean
  ;; | Character
   
  ;; type Bits = Integer
   
  ;; Expr -> Value
  (define (interp e)
    (bits->value (interp-bits e)))
   
  ;; Expr -> Bits
  (define (interp-bits e)
    (match e
      [(Int i)  (value->bits i)]
      [(Char c) (value->bits c)]
      [(Bool b) (value->bits b)]
      [(Prim1 'add1 e0)
       (+ (interp-bits e0) (value->bits 1))]
      [(Prim1 'sub1 e0)
       (- (interp-bits e0) (value->bits 1))]
      [(Prim1 'zero? e)
       (if (zero? (interp-bits e))
           val-true
           val-false)]
      [(Prim1 'char? e0)
       (if (= type-char (bitwise-and (interp-bits e0) #b11))
           val-true
           val-false)]
      [(Prim1 'char->integer e0)
       (arithmetic-shift
        (arithmetic-shift (interp-bits e0) (- char-shift))
        int-shift)]
      [(Prim1 'integer->char e0)
       (bitwise-ior
        (arithmetic-shift
         (arithmetic-shift (interp-bits e0) (- int-shift))
         char-shift)
        type-char)]
      [(If e1 e2 e3)
       (if (= (interp-bits e1) val-false)
           (interp-bits e3)
           (interp-bits e2))]))
   
8.5 A Compiler for Dodger

Compilation is pretty easy, particularly since we took the time to develop the bit-level interpreter. The compiler uses the same bit-level representation of values and uses logical operations to implement the same bit manipulating operations. Most of the work happens in the compilation of primitives:

dodger/compile-ops.rkt

  #lang racket
  (provide (all-defined-out))
  (require "ast.rkt" "types.rkt" a86/ast)
   
  (define rax 'rax)
   
  ;; Op1 -> Asm
  (define (compile-op1 p)
    (match p
      ['add1 (Add rax (value->bits 1))]
      ['sub1 (Sub rax (value->bits 1))]
      ['zero?
       (let ((l1 (gensym)))
         (seq (Cmp rax 0)
              (Mov rax val-true)
              (Je l1)
              (Mov rax val-false)
              (Label l1)))]
      ['char?
       (let ((l1 (gensym)))
         (seq (And rax mask-char)
              (Xor rax type-char)
              (Cmp rax 0)
              (Mov rax val-true)
              (Je l1)
              (Mov rax val-false)
              (Label l1)))]
      ['char->integer
       (seq (Sar rax char-shift)
            (Sal rax int-shift))]
      ['integer->char
       (seq (Sar rax int-shift)
            (Sal rax char-shift)
            (Xor rax type-char))]))
   

The top-level compiler for expressions now has a case for character literals, which are compiled like other kinds of values:

dodger/compile.rkt

  #lang racket
  (provide (all-defined-out))
  (require "ast.rkt" "types.rkt" "compile-ops.rkt" a86/ast)
   
  (define rax 'rax)
   
  ;; Expr -> Asm
  (define (compile e)
    (prog (Global 'entry)
          (Label 'entry)
          (compile-e e)
          (Ret)))
   
  ;; Expr -> Asm
  (define (compile-e e)
    (match e
      [(Int i)       (compile-value i)]
      [(Bool b)      (compile-value b)]
      [(Char c)      (compile-value c)]
      [(Prim1 p e)   (compile-prim1 p e)]
      [(If e1 e2 e3) (compile-if e1 e2 e3)]))
   
  ;; Value -> Asm
  (define (compile-value v)
    (seq (Mov rax (value->bits v))))
   
  ;; Op1 Expr -> Asm
  (define (compile-prim1 p e)
    (seq (compile-e e)
         (compile-op1 p)))
   
  ;; Expr Expr Expr -> Asm
  (define (compile-if e1 e2 e3)
    (let ((l1 (gensym 'if))
          (l2 (gensym 'if)))
      (seq (compile-e e1)
           (Cmp rax val-false)
           (Je l1)
           (compile-e e2)
           (Jmp l2)
           (Label l1)
           (compile-e e3)
           (Label l2))))
   

We can take a look at a few examples:

Examples

> (define (show e)
    (displayln (asm-string (compile-e (parse e)))))
> (show '#\a)

        mov rax, 389

> (show '#\λ)

        mov rax, 3821

> (show '(char->integer #\λ))

        mov rax, 3821

        sar rax, 2

        sal rax, 1

> (show '(integer->char 97))

        mov rax, 194

        sar rax, 1

        sal rax, 2

        xor rax, 1

8.6 A Run-Time for Dodger

The only interesting aspect of Dodger, really, is that we need to add run-time support for printing character literals.

We extend the bit-encoding of values following the pattern we’ve already seen:

dodger/types.h

#ifndef TYPES_H
#define TYPES_H

/*
  Bit layout of values

  Values are either:
  - Integers:   end in  #b0
  - Characters: end in #b01
  - True:              #b11
  - False:            #b111
*/
#define int_shift        1
#define int_type_mask    ((1 << int_shift) - 1)
#define int_type_tag     (0 << (int_shift - 1))
#define nonint_type_tag  (1 << (int_shift - 1))
#define char_shift       (int_shift + 1)
#define char_type_mask   ((1 << char_shift) - 1)
#define char_type_tag    ((0 << (char_shift - 1)) | nonint_type_tag)
#define nonchar_type_tag ((1 << (char_shift - 1)) | nonint_type_tag)
#define val_true  ((0 << char_shift) | nonchar_type_tag)
#define val_false ((1 << char_shift) | nonchar_type_tag)

#endif

And update the interface for values in the runtime system:

dodger/values.h

#ifndef VALUES_H
#define VALUES_H

#include <stdint.h>

/* any abstract value */
typedef int64_t val_t;

typedef enum type_t {
  T_INVALID = -1,
  /* immediates */
  T_INT,
  T_BOOL,
  T_CHAR,
} type_t;

typedef uint32_t val_char_t;

/* return the type of x */
type_t val_typeof(val_t x);

/**
 * Wrap/unwrap values
 *
 * The behavior of unwrap functions are undefined on type mismatch.
 */
int64_t val_unwrap_int(val_t x);
val_t val_wrap_int(int64_t i);

int val_unwrap_bool(val_t x);
val_t val_wrap_bool(int b);

val_char_t val_unwrap_char(val_t x);
val_t val_wrap_char(val_char_t b);

#endif

dodger/values.c

#include "types.h"
#include "values.h"

type_t val_typeof(val_t x)
{
  if ((int_type_mask & x) == int_type_tag)
    return T_INT;
  if ((char_type_mask & x) == char_type_tag)
    return T_CHAR;
  
  switch (x) {
  case val_true:
  case val_false:
    return T_BOOL;
  }
  
  return T_INVALID;
}

int64_t val_unwrap_int(val_t x)
{
  return x >> int_shift;
}
val_t val_wrap_int(int64_t i)
{
  return (i << int_shift) | int_type_tag;
}

int val_unwrap_bool(val_t x)
{
  return x == val_true;
}
val_t val_wrap_bool(int b)
{
  return b ? val_true : val_false;
}

val_char_t val_unwrap_char(val_t x)
{
  return (val_char_t)(x >> char_shift);
}
val_t val_wrap_char(val_char_t c)
{
  return (((val_t)c) << char_shift) | char_type_tag;
}

The only other change is that print_result is updated to handle the case of printing characters:

dodger/print.c

#include <stdio.h>
#include <inttypes.h>
#include "values.h"

void print_char(val_char_t);
void print_codepoint(val_char_t);
int utf8_encode_char(val_char_t, char *);

void print_result(val_t x)
{
  switch (val_typeof(x)) {
  case T_INT:
    printf("%" PRId64, val_unwrap_int(x));
    break;
  case T_BOOL:    
    printf(val_unwrap_bool(x) ? "#t" : "#f");
    break;
  case T_CHAR:
    print_char(val_unwrap_char(x));    
    break;
  case T_EOF:
    printf("#<eof>");
    break;
  case T_VOID:
    break;    
  case T_INVALID:
    printf("internal error");
  }
}

void print_char(val_char_t c)
{
  printf("#\\");
  switch (c) {
  case 0:
    printf("nul"); break;
  case 8:
    printf("backspace"); break;
  case 9:
    printf("tab"); break;
  case 10:
    printf("newline"); break;
  case 11:
    printf("vtab"); break;
  case 12:
    printf("page"); break;
  case 13:
    printf("return"); break;
  case 32:
    printf("space"); break;
  case 127:
    printf("rubout"); break;
  default:
    print_codepoint(c);
  }
}

void print_codepoint(val_char_t c)
{
  static char buffer[5] = {0};
  utf8_encode_char(c, buffer);
  printf("%s", buffer);
}

int utf8_encode_char(val_char_t c, char *buffer)
{
  // Output to buffer using UTF-8 encoding of codepoint
  // https://en.wikipedia.org/wiki/UTF-8
  if (c < 128) {
    buffer[0] = (char) c;
    return 1;
  } else if (c < 2048) {
    buffer[0] =  (char)(c >> 6)       | 192;
    buffer[1] = ((char)       c & 63) | 128;
    return 2;
  } else if (c < 65536) {
    buffer[0] =  (char)(c >> 12)      | 224;
    buffer[1] = ((char)(c >> 6) & 63) | 128;
    buffer[2] = ((char)       c & 63) | 128;
    return 3;
  } else {
    buffer[0] =  (char)(c >> 18)       | 240;
    buffer[1] = ((char)(c >> 12) & 63) | 128;
    buffer[2] = ((char)(c >>  6) & 63) | 128;
    buffer[3] = ((char)        c & 63) | 128;
    return 4;
  }
}

Will these pieces in place, we can try out some examples:

Examples

> (define (run e)
    (bits->value (asm-interp (compile (parse e)))))
> (run '#\a)

#\a

> (run '(integer->char (add1 (char->integer #\a))))

#\b

> (run '(integer->char 955))

#\λ