computer science II
 c m s c 214   s p r i n g   2 0 0 2

EBNF

Background

In CMSC 114, you studied BNF. Even if you placed into CMSC 214, you should know about BNF (since this was covered in CMSC 114). BNF stands for Backus-Naur Form. BNF has primarily been used to describe, in a formal way, the syntax of programming languages. It was first used to describe languages such as Algol68.

Prior to BNF, the grammar of programming languages were often expressed in English or some non-mathematical language. This lead to ambiguities.

Formalizing the grammar had two effects. First, it made the valid programs of a language more precise. Second, because it could be stated in a mathematical form, it made it easier to develop compilers for such languages.

BNF

In BNF, you have two kinds of entities:
• non-terminals, which appear with angle brackets, e.g. <white-space>.
• terminals, which I will write with double quotes around it (e.g. "a").
To understand BNF, pretend you start off with two kinds of cards: green cards (which represent non-terminals) and red cards (which represent terminals).

The cards are laid in front of you left to right.

Initially, you only have one green card. At each step, you are allowed to replace one green card with one or more cards based on something called production rules.

Suppose you have 3 cards (call them X, Y, and Z) which are laid left to right (X is in position 1, Y is in position 2, and Z is in position 3). You wish to turn in Y. You will get certain cards back. You must place these cards in the position that Y occupied. For example, if you get 2 cards back (call them A and B), then you will have four cards: X, A, B, Z. Notice A and B are in the position that Y used to be.

You stop when all you have left are red cards (terminals).

The goal of the game is to produce a sequence of red cards which represent a desired string.

Production Rules

Production rules tell you how to replace a single non-terminal with one or more non-terminals or terminals. For example, this is a production rule.
```  <id> ::= <letter> "1" | <letter><letter>
```
A production rule looks a lot like an assignment statement. However, in an assignment statement, you evaluate expressions on the right hand side of =, and take that value and place it into the memory location of the variable (more properly, an lvalue) on the left hand side.

By contrast, a production rule is often thought of from left to right. The rule above says that you can take a "red card" with <id> written on it, and replace it with the right hand side.

In the example above, there are two things on the right hand side.

• <letter> "1"
• <letter> <letter>
There are separated by a "|". The "|" is a selection (i.e., an OR) operator. In English, this means you can replace <id> with either <letter> "1" OR you can replace it with <letter> <letter> . In the above example, "1" is a terminal.

When you have two things adjacent to each other, as in <letter> <letter> , this means the two things are concatenated. Realize that every non-terminal will eventually be replaced by a terminal, and the result will be a string.

EBNF

The main problem with BNF is that it's a pain to describe repetition. For example, when you want to say a word contains one or more letters, you must write a production that looks like:
```  <word> ::= <letter> | <letter><word>
```
This is awkward to read. It would be nice to have an easier way of showing repetition. And there is. We will add three more operators to BNF to make it easier to read.

We will call the resulting enhancement of the BNF as EBNF, which you can call "extended BNF" or perhaps "enhanced BNF".

Two of these operators are often seen in languages that support regular expressions (e.g., Perl).

• Name: Kleene star
Looks like: [ ..some bnf.. ]*
The * means "0 or more repetitions".

• Name: no special name
Looks like: [ ..some bnf.. ]+
The + means "1 or more repetitions".

• Name: no special name
Looks like: [ ..some bnf.. ] {1,3}
The {1,3} means between 1 and 3 repetitions, inclusive. {-,3} means between 0 and 3 repetitions, inclusive. {3,-} means 3 or more repetitions. {3,3} means exactly 3 repetitions.
So, if we wanted to say that a word consisted of one or more letters, we would write:
```  <word> ::= [<letter>]+
```
If we wanted to say that a word consists of zero or more letters (thus, the empty string would be considered a valid "word"), the EBNF would be:
```  <word> ::= [<letter>]*
```
Notice the Kleene star is used, instead of +.

That's all there is to EBNF, at least, as much as I need to use for the course.

Interestingly enough, EBNF is not really any more powerful than BNF. Anything you can write in EBNF, you can translate to something equivalent in BNF. EBNF is, what is often called, syntactic sugar. This means that it gives you convenience, without additional "power".

What we use BNF for

Maryland is one of the few universities that use BNF to describe the format of the input and output. BNF has some advantages and some disadvantages for this purpose.

You might think that one advantage is precision. While that's true, BNF isn't always the easiest way to describe things. Sometimes using English, while more vague, is easier. Using BNF to describe input and output is like using a program to describe how to solve a problem. It often has many more details than needed and requires a lot of work to get it precisely right. It may be easier to write pseudo-code than real code, just as it may be easier to write English rather than BNF.

In fact, the advantages of using BNF is simply your ability to read it. As mentioned earlier, many languages have their syntax written in BNF. It's very possibly you have to read the syntax. Or you may have to develop your own. Either way, a working knowledge of BNF can help.

Useful BNF

This uses extended BNF.
```   <lower> ::== "a" | "b" | "c" | "d" | "e" | "f" |
"g" | "h" | "i" | "j" | "k" |
"l" | "m" | "n" | "o" | "p" |
"q" | "r" | "s" | "t" | "u" |
"v" | "w" | "x" | "y" | "z"
<upper> ::== "A" | "B" | "C" | "D" | "E" | "F" |
"G" | "H" | "I" | "J" | "K" |
"L" | "M" | "N" | "O" | "P" |
"Q" | "R" | "S" | "T" | "U" |
"V" | "W" | "X" | "Y" | "Z"
<alpha> ::== <lower> | <upper>
<digit> ::== "0" | "1" | "2" | "3" | "4" |
"5" | "6" | "7" | "8" | "9"
<digits> ::== [ "+" | "-" ]? [ <digit> ]+
<udigits> ::== [ <digit> ]+
<empty> ::==
<endl>  ::== "\n"
<left>  ::== "<"
<right> ::== ">"
<lparen>  ::== "("
<rparen> ::== ")"
<tag>   ::== <left> <word> <right>
<under> ::== "_"
<blank>  ::== " "
<blanks>  ::== [ <blank> ]+
<mblanks>  ::== [ <blank> ]+
<sep>   ::== <blank> | <endl>
<seps>   ::== [ <sep> ]+
<mseps>   ::== [ <sep> ]*
<var>  ::== <alpha> [ <alpha> | <digit> | <under> ]*
<word>  ::== [ <alpha> | <digit> ]+
<bword>  ::== [ <alpha> | <digit> | <blank> ]+
<dquote>  ::== "\""
<squote>  ::== "\'"
<dqstring>  ::== <dquote> <bword> <dquote>
<sqstring>  ::== <squote> <bword> <squote>
```