CMSC330

Parsers

Lexing
Parsing
Interpreting

Lexing

Lexing (Tokenizing): Converting a string to a list of tokens

Token: A meaningful string

Typically: keywords, identifiers, numbers

"The short Wizard" \(\Rightarrow\) [Det; Adj; Noun]


type token = Int of int | Add | Sub | LParen | RParen;;

tokenize "2 + ( 4 - 5)";;
=> [Int 2; Add; LParen; Int 4; Sub; Int 5; RParen]
          

How to Tokenize?

One way: regular expressions and boring repetition


(* Requires the Str library and the token type defined above. *)
let re_num = Str.regexp "[0-9]+" in
let re_add = Str.regexp "\\+" in (* + is a regexp operator, so escape it *)
let re_sub = Str.regexp "-" in
let rec mklst text =
  (* rest n: text with its first n characters dropped *)
  let rest n = String.sub text n (String.length text - n) in
  if text = "" then [] else
  if Str.string_match re_num text 0 then
    let matched = Str.matched_string text in
    Int (int_of_string matched) :: mklst (rest (String.length matched))
  else if Str.string_match re_add text 0 then
    Add :: mklst (rest 1)
  else if Str.string_match re_sub text 0 then
    Sub :: mklst (rest 1)
  else mklst (rest 1) (* skip whitespace and anything unrecognized *)
in
mklst "2 + 3";;
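The tokenizer above covers only numbers, `+` and `-`, while the earlier `tokenize` example also produces `LParen` and `RParen`. Below is a minimal sketch of another way to tokenize that scans characters by index instead of using `Str`. The name `tokenize_chars` and the choice to fail on unrecognized characters (rather than skip them) are assumptions made for illustration, not part of the slides.

```ocaml
type token = Int of int | Add | Sub | LParen | RParen

(* tokenize_chars is a hypothetical name; it walks the string by index. *)
let tokenize_chars (text : string) : token list =
  let n = String.length text in
  let is_digit c = '0' <= c && c <= '9' in
  let rec go i =
    if i >= n then []
    else
      match text.[i] with
      | ' ' | '\t' | '\n' -> go (i + 1) (* whitespace separates tokens *)
      | '+' -> Add :: go (i + 1)
      | '-' -> Sub :: go (i + 1)
      | '(' -> LParen :: go (i + 1)
      | ')' -> RParen :: go (i + 1)
      | c when is_digit c ->
          (* find the index just past the run of digits starting at i *)
          let rec stop j =
            if j < n && is_digit text.[j] then stop (j + 1) else j in
          let j = stop i in
          Int (int_of_string (String.sub text i (j - i))) :: go j
      | c -> failwith (Printf.sprintf "unexpected character %c" c)
  in
  go 0
```

With these definitions, `tokenize_chars "2 + ( 4 - 5)"` evaluates to `[Int 2; Add; LParen; Int 4; Sub; Int 5; RParen]`.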
          

Parsing

Parsing: token list to AST

Can check whether the text is grammatically correct

Many types of parsers: we will use recursive descent

Recursive descent parsing (RDP) is top-down; the grammar slides showed bottom-up

Consider the basic grammar for Polish notation

\(E \rightarrow A\vert + A\ E \vert - A\ E\)
\(A \rightarrow 0\vert 1\vert \dots\vert 9\)

  • Which Branch am I in/looking for?

  • Which Token are we looking for?


let parse_toks tokens =
  (* A -> 0 | 1 | ... | 9: consume one digit token, return the rest *)
  let parse_num tokens =
    match tokens with
    | Int n :: t when 0 <= n && n <= 9 -> t
    | _ -> failwith "error" in
  (* E -> A | + A E | - A E: the first token selects the branch *)
  let rec parse_expr tokens =
    match tokens with
    | Add :: t -> parse_expr (parse_num t)
    | Sub :: t -> parse_expr (parse_num t)
    | _ -> parse_num tokens in
  (* accept only if every token was consumed *)
  parse_expr tokens = [];;
          

Important: knowing which branch you are looking for

Backtracking vs Predictive

Predictive: what's the next symbol?

First(nt): the set of terminals that can begin a string derived from the nonterminal nt

Only goes so far: productions can have conflicting (overlapping) First sets

  • Can rewrite grammar
  • Can rewrite parser
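
As a worked example for the Polish notation grammar above, here are the First sets of the three productions of \(E\), using the standard definition of First:

\(\mathrm{First}(A) = \{0, 1, \dots, 9\}\)
\(\mathrm{First}(+\ A\ E) = \{+\}\)
\(\mathrm{First}(-\ A\ E) = \{-\}\)

The three sets are disjoint, so one token of lookahead is enough to pick the branch. By contrast, in a hypothetical grammar \(S \rightarrow a\ b \vert a\ c\) (not from these slides) both productions have First set \(\{a\}\), which is a conflict. Rewriting the grammar as \(S \rightarrow a\ S'\), \(S' \rightarrow b \vert c\) (left factoring) removes it.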

Converting to AST

Recall a Tree in OCaml


type tree = Leaf | Node of int * tree * tree;;
Node(2,Node(0,Leaf,Leaf),Leaf);;
          

Modify for Tokens


type expr = Num of int|Plus of expr * expr|Minus of expr * expr;;
(Plus(Num 1, Num 2));;
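
To connect this type to the recursive descent parser for the Polish notation grammar, here is a minimal sketch in which each parse function returns the AST it built together with the unconsumed tokens. The names `parse_num`, `parse_expr`, and `parse`, the error messages, and the reading of `+ A E` as "A plus E" are assumptions for illustration.

```ocaml
type token = Int of int | Add | Sub | LParen | RParen
type expr = Num of int | Plus of expr * expr | Minus of expr * expr

(* A -> 0 | 1 | ... | 9 *)
let parse_num tokens =
  match tokens with
  | Int n :: rest when 0 <= n && n <= 9 -> (Num n, rest)
  | _ -> failwith "expected a digit"

(* E -> A | + A E | - A E; the first token selects the production *)
let rec parse_expr tokens =
  match tokens with
  | Add :: rest ->
      let (left, rest) = parse_num rest in
      let (right, rest) = parse_expr rest in
      (Plus (left, right), rest)
  | Sub :: rest ->
      let (left, rest) = parse_num rest in
      let (right, rest) = parse_expr rest in
      (Minus (left, right), rest)
  | _ -> parse_num tokens

(* Succeed only if the whole token list was consumed. *)
let parse tokens =
  match parse_expr tokens with
  | (ast, []) -> ast
  | _ -> failwith "unconsumed tokens"
```

For example, `parse [Add; Int 1; Int 2]` evaluates to `Plus (Num 1, Num 2)`.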
          

Interpreting/Compiling

Interpreting/Compiling: Take AST and return either code or a value (which is code)


compile (Plus(Num 1, Num 2));;
=> "mov eax,1
    mov ebx,2
    add eax,ebx"

interpret (Plus(Num 1, Num 2));;
=> 3
          

Typically some sort of recursive traversal of the AST


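let rec compile ast =
  match ast with
  | Plus (y, z) ->
      (* Only meaningful when both operands are Num: the compiled
         operand is spliced directly into the mov instruction. *)
      let num1 = compile y in
      let num2 = compile z in
      let cn1 = "mov eax," ^ num1 ^ "\n" in
      let cn2 = "mov ebx," ^ num2 ^ "\n" in
      let add = "add eax,ebx" in
      cn1 ^ cn2 ^ add
  | Num x -> string_of_int x
  | _ -> failwith "error"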
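
The slides show `interpret` being called but not how it is written. Here is a minimal sketch of it as a recursive traversal of the AST, assuming the `expr` type with `Num`, `Plus`, and `Minus` and ordinary integer arithmetic.

```ocaml
type expr = Num of int | Plus of expr * expr | Minus of expr * expr

(* Evaluate each subtree, then combine the results at the node. *)
let rec interpret ast =
  match ast with
  | Num n -> n
  | Plus (l, r) -> interpret l + interpret r
  | Minus (l, r) -> interpret l - interpret r
```

With this definition, `interpret (Plus (Num 1, Num 2))` evaluates to `3`. Unlike the `compile` sketch, this handles nested expressions because each recursive call returns a value rather than a piece of text.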