So far this summer, we’ve been discussing how to implementing the semantics of programming languages by writing interpreters for their terms. The terms we compute with are OCaml datatypes representing programs, rather than the literal string "let x = 23 + 12 in x + y". Parsing is the process of transforming raw (string based) source code into the internal representation.

Context Free Grammars

Earlier in the semester, we covered regular expressions. For reasons we didn’t fully explain, I asserted that regular expressions were less powerful than we might want for more general purposes. As an example, consider the following set of strings:

\[
S = \{^n \}^n
\]

is the set of strings with balanced braces. It contains things like , , , etc… If you think hard about how to write a regular expression for this language, you’ll eventually be stumped: it’s impossible! If you take a course on automata theory, you’ll learn a cookbook technique for showing why (the pumping lemma will help you).

The basic upshot is this: finite automata don’t have unbounded memory. Finite automata consist of a fixed number of states, and recognizing a language like above requires us to have some memory of how many {s we’ve seen. This is one reason why regular languages are less powerful than other models of computations. Ask yourself: what makes a finite automaton different than a turing machine?

To recognize the above language, we could write a grammar such as the following:

\[
S \rightarrow \{ S \} \mid \epsilon
\]

is called a production of the grammar, and tells you that an is allowed to consume an opening brace, another recursively, and then a closing brace. For the base case, we also allow it to capture nothing at all. We’ll see later that in general you can have more than one production, and the productions can be mutually recursive. This mechanism is called a context free grammar, or a CFG.

CFGs are more powerful than regular expressions

It’s worth noting that CFGs subsume (are more powerful than) regular expressions. As an example, the following grammar recognizes the regular expression (0|1)*.

\[
S \rightarrow 0S \mid 1S \mid \epsilon
\]

However, you get this at a cost: recognizing CFGs is in general more expensive (computationally, and usually notationally) than recognizing regular languages. Because of this we usually write regular languages using regular expressions.

Formal definitions

Definition: Context Free Grammar (CFG)

A CFG is a 4-tuple of

  • , the alphabet, which is a finite set of symbols or terminals (these are the basic characters that show up in the text). As an example, , or .

  • , a finite set of nonterminal symbols, which are disjoint from (these are the meta characters that show up when we recognize the text). E.g., for our above examples.

  • , a set of productions of the form . E.g., . (Note the is just syntactic sugar for multiple productions.)

    • Informally: each nonterminal can be replaced by the string of zero or more terminals to the right of the . You can think of them as rewriting rules.
  • , the start symbol

Backus-Naur Form

One common form of CFGs is called Backus-Naur form. It was an influential mechanism of describing CFGs in now anxient texts on programming languages, but productions like is written in BNF as <A> ::= <B> c <D>. Notice that the nonterminals are put in brackets and the is replaced with ::=, because it was made in a time where computers couldn’t actually render fancy characters like . You’ll still see it used some places, as an example, in the back of the K&R.

Generating / Accepting strings

Just like DFAs, NFAs, and regular expressions, grammars accept or reject strings. We can view the productions as serving the role of rewriting rules. As an example, consider the following grammar: . If we wanted to recognize the string :

\[
S \Rightarrow 0S \Rightarrow 01S \Rightarrow 010S \Rightarrow 010
\]

You play around with this to check whether an arbitrary grammar is accepting. As an example of a string that isn’t accepting, let’s think about the grammar . Let’s see what happens on :

\[
S \Rightarrow \{S\} \Rightarrow \text{???}
\]

It turns out, we get stuck: there’s no we can substitute to make it work out. For this reason, the grammar rejects . The language of a grammar is the set of strings that it accepts by churning on these rewriting rules.

Notice that the difference between regular expressions and context free grammars is recursion: they can mention themselves. This is also a helpful reminder for how their corresponding machines work: recognizing CFGs requires (in general) a finite state machine with a stack. This induces a pushdown automaton that recognizes context free languages, versus a DFA for regular expressions.

Grammars for programming languages

Now that we have this definition of CFGs, we can use them to write grammars for programming languages. As an example, here’s how I would write a grammar to recognize arithmetic expressions over the numbers 0 and 1:

\[
E \rightarrow n \mid E + E \mid E * E \mid ( E )
\]

We’ll see in a few minutes how to handle arbitrary numbers (e.g., 123 or 32 rather than just 0 and 1) as terminals. But the basic idea here is that when we look at a program:

0 + 2 * 1 

We get feed it into our grammar, and then pull out individual matches to extract an abstract syntax tree to feed into our interpreter. This is similar to the way we extracted match patterns from regular expressions.

Parse ambiguities, precedence, and maximal munch

Unfortunately, if we use the above grammar, we will find that we could potentially get incorrect answers to the following computation

0 * 1 + 2

How should we parse this? Should it be 0 * (1 + 2) or should it be (0 * 1) + 2? Our elementary school intuition tells us it should be the first. But the way we’ve written the grammar, it could be either.

We can rewrite our grammar to account for this, however:

\[
\begin{align}
E \rightarrow M \mid E + E \\
M \rightarrow n \mid M * M \\
\end{align}
\]

Notice that we’ve stratified the grammar into two cases: addition and multiplication. Now, we can parse arithmetic expressions with the right precedence:

\[
\begin{align}
E & \Rightarrow E + E \Rightarrow M + E \Rightarrow M + M \Rightarrow 0 + M \\
& \Rightarrow 0 + M * M \Rightarrow 0 + 2 * M \Rightarrow 0 + 2 * 1
\end{align}
\]

To do this, we had to reorganize our rules to account for the right predence. There’s a whole set of ways to munge on grammars to make them accommodate different desired properties. Here is a good explanation of the topic, albeit in a very archaic web 0.5 style. Some more recent ones

Putting them together

To use all of this theory, we can write parsers for programming languages. In lecture I showed an example of a toplevel for our CEKaml language. If you’d like to learn more about it, I highly suggest reading the relevant chapter of Real world OCaml.

For your interest, you may also want to look at various things from today:

If you actually want to try running it, I just have to say a few things:

  • You’ll need menhir installed, you should be able to opam install menhir, if it’s not already installed.

  • You need to rename your solution to cekamlsol.ml. I am hesitant to post solutions to the project before the deadline out of fear people might actually look at them and use them to their advantage.

  • The implementation of the parser is a little incomplete, if you’d like to make it complete I’d love to have your help.

Other resources