computer organization
c m s c 311  
s p r i n g   2 0 0 2  

Project #1

Due Sunday, May 5, 11:59 PM (just before midnight)
Late Due Date Sunday, May 12, 11:59 PM (just before midnight)

Posted: April 5, 2002



The goal of this project is to read in a basic assembly language program written in MIPS. This is an ongoing project to write an assembly language interpreter. This project should be fairly easy, but will require a little time to understand what's needed.

Academic Integrity Statement

Please note that *all* programming projects in this course (including this one) are to be done independently or with the assistance of the instructional staff of this course only, unless otherwise specified IN PRINT by official webpages.

Please review the policies outlined on the class syllabus concerning the use of class computer accounts and concerning the University's Code of Academic Integrity. The instructors of this course will review the programs submitted by students for potential violations of the Code of Academic Integrity and if it is believed that a violation has occurred it will be referred to the Office of Judicial Programs and the Student Honor Council.

Hardcoding is considered a violation of academic integrity

Assembly Language Programs

Assembly language are human-readable (somewhat) versions of machine language. Machine language is written in 0's and 1's, and therefore is hard to write, because it's easy to make mistakes. Usually, assembly language programs are written at a somewhat higher level of abstraction than machine language. However, it's still often the case that one assembly language instruction is equivalent to one machine language instruction. Occasionally, one assembly language instruction is equivalent to several machine language instructions.

Think of an instruction as specific C or C++ function, which takes arguments.

Assembly language does not support anything complicated. There are no loops, no functions, no nesting, etc. Yet all of these can be simulated in assembly language, with great effort.


One of the most basic, yet most important observations made about Von Neumann computers (the kinds used today) is that data and instructions can all be written using 0's and 1's. This lead to the idea of programs being "soft" as in software, meaning that programs can be manipulated using the same operations as data (in fact, there are so-called self-modifying programs, which modify the code, by calling instructions on the code itself).

Unfortunately, assembly language can differ, based on the assembler, just like C compilers often varied just a little, before ANSI standardized the language. Some features would exist in one flavor of C, but not in another.

In our version of MIPS, the code will be divided into two regions: data and text. The data region will start with .data on its own line and the text region will start with .text on its own line.

Here's an example:


        DATA HERE


        CODE HERE
Usually, the .data string (and .text)is preceded by a TAB, but we'll consider it optional. You can assume that the data region of the program always preceded the text region. This will make it easier to parse.

Data Region

The data region consists of a label, a data type, and possibly data.

These are the data types you should support:

Here are some examples:

num:      word   10      # num initialized to 10
num2:     word   -12     # num2 initialized to -12
num3:     word           # num3 uninitialized

arr:      words  [10] 0*4 1*6   # arr holds 10 elements, four 0's and six 1's

# arr2 holds 10 elements, four 0's and six 1's
arr2:     words  [10] 0 0 0 0 1 1 1 1 1 1   

# arr3 holds 10 elements, four 0's and six 1's
arr2:     words  [10] 0*4 1**

newline:  byte '\n'      # newline stores '\n'
char:     byte 'I'       # char stores 'I'
char2:    byte \101      # char2 stores 'A', which is 65 in octal

# arr3 holds 10 elements, four '\0' and six 'a'
arr3:      words  [10] '\0'*4 'a'*5   

greetings: string "Hello, there!"

The first part of the project is to handle the data region. You may assume it's the first part of the input file. The input file will be an assembly language program written in MIPS.

The user will use command line options. This will be the typical way that the user will enter the information:

   assem -d 1000 -t 2000 < input
where -d and -t are switches.

Option Purpose
-d The starting address of the data region
-t The starting address of the text region

To read command line options (in C/C++), you will need to expand main() to take parameters. Here's how it's typically done.

int main( int argc, char *argv[] )
where argc is the number of command line arguments. For example, in
   assem -d 1000 -t 2000
argc would be 5. Each command line argument is separated by spaces. Each argument is stored as a C-style string in argv. It's very important to realize that they are stored as C-style strings (not int values, etc.).

In this example, argv[ 0 ] = "assem", argv[ 1 ] = "-d", argv[ 2 ] = "1000", argv[ 3 ] = "-t", and argv[ 0 ] = "2000".

So, argv[ 2 ] is a string, NOT the number 1000.

You may assume that -d and -t appear in the command line argument, and that each switch is followed by a single string which is a non-negative number, which represents an address in memory.

You MAY NOT assume that -d is always first and -t is always second. It may be written in either order (that way, the user doesn't have to remember which is which).

For a small amount of extra credit, print an error message if there aren't enough command line arguments.

Incorrect number of arguments.
Syntax: assem -d <addr> -t <addr>


The data region starts with .data on its own line. It can have spaces before or after.

You may assume the data region is first in an assembly language program.

Each line with either be blank (containing 0 or more spaces, followed by a newline), or contain a label.

A label is a string that ends in a colon. The name of label itself does not contain the colon. Thus, you see:

num is the name of the label. The colon is how you know it's a label. You may assume the colon always appears at the end of the name of the label.

The syntax of a label is the same as a C/C++ identifier. It begins with an alphabetic letter or an underscores, followed by zero or more alphabetic letters, underscores, or digits.

You may assume that no two labels have the same name.

In the data region, the label is followed by the "type". The following are the supported types in our assembly language.

Type Description
word A 32 bit integer, written in base 10
words An "array" of 32 bit integer, written in base 10
byte A 8 bit "character", written as a character in single quotes or octal
bytes An "array" of 8 bit "character", written as a character in single quotes or octal
string Written as a double quoted string. The amount of storage is the length of the string plus 1 (for the null character)

Each label defines a region of memory where the data is stored. It's your job to determine what addresses each label occupies. You start the address given by -d. For example, if the command line option says -d 1000, you should attempt to set the addresses of the label beginning at address 1000.


Normally, you would place the elements contiguously in memory. However, we wil observe the following restrictions:

You may assume that memory is byte-addressable. Each address in memory refers to a byte.

Back to Syntax

After a label, there may be spaces and/or newlines. You will know when the information for a label ends in one of two ways. Either you see a new label, or you see .text (or you are at EOF). When you test your code, you may wish to put a .text at the end, so your code can look for this as a stopping condition.

Let's consider a word.

num: word 10
Here's the BNF.
<line>  := <label> <colon> <seps> 
            "word" <digits> <seps> <endl> 
<label> := <identifier>
<identifier> := [ "_" | <alpha> ]
                [ "_" | <alpha> | <digit> ]*
<alpha> := <lower> | <upper>
<lower> := "a" | "b" | "c" | "d" | "e" | 
           "f" | "g" | "h" | "i" | "j" | 
           "k" | "l" | "m" | "n" | "o" | 
           "p" | "q" | "r" | "s" | "t" | 
           "u" | "v" | "w" | "x" | "y" | 
<upper> := "A" | "B" | "C" | "D" | "E" | 
           "F" | "G" | "H" | "I" | "J" | 
           "K" | "L" | "M" | "N" | "O" | 
           "P" | "Q" | "R" | "S" | "T" | 
           "U" | "V" | "W" | "X" | "Y" | 
<digit> := "0" | "1" | "2" | "3" | "4" | 
           "5" | "6" | "7" | "8" | "9" | 
<digits> := [-]? [ <digit> ]+
<seps> := [ " " | "\n" ]+
<endl> := "\n"

The BNF is "enhanced". It uses * to stand for Kleene closure (0 or more repetitions), + to stand for 1 or more repetitions, and ? stands for 0 or 1 repetitions.

Even though some BNF appears over two or more lines, do not assume that there is a line break. They appear on two or more lines because there's not enough space to put it on one line. To indicate a line break, you will see either a "\n" or a <endl>.

The syntax says that you will see "word" followed by digits. This may have an optional minus sign in front (if it represents a negative number). You may assume the number is in base 10 and can be stored in 32 bits in two's complement.

For a little extra credit, you may wish to support numbers that are hexadecimal or octal. Values that are hexadecimal begin with 0x. The alphabetic parts are either uppercase or lowercase. Here's the BNF.

<hex> := [-]? "0x" [ <digit> | "a" | "A" |
                 | "b" | "B" | "d" | "D" |
                 | "e" | "E" | "f" | "F" ]*

<octal> := [-]? "0" [ "0" | "1" | "2" | "3" | "4" | 
                 "5" | "6" | "7" | "8" | "9" ]*
An octal value starts with a 0. This is extra credit, so if don't want to handle this, you don't have to.

For the output, you shold print the following:

(<label>, word)
<addr>:   10
where <label> is the label, "word" is the type. On the second line, <addr> is the starting address of the label. which is followed by a colon, followed by at least one blank space and the value stored at the address in base 10.


Sometimes you want to reserve more than a single word in memory. This is especially useful if you want to declare space for an array. You can do this with the words.
<line2>  := <label> <colon> <seps> 
            "words" <seps> "[" <digits> "]"
            [ <seps> [ <digits> | <digits> "*" [ <digits> | "*" ] ] ]*
What does this mean?

Here's an example:

num2: words [10] 0*7 1 2 3
num2 is a label. words indicates it will be an array of 32 bit values. [10] indicates the number of 32 bit values, in this case, 10 values stored.

Each value afterwards can appear either as a base 10 number, or a base 10 number, followed by an asterisk, followed by either another base 10 number or a second asterisk.

If it's a number, then it occupies 32 bits of memory. If it's a number followed by an asterisk followed by another number, then the first number is the value, and the second number is the amount of times it's repeated. Thus, -10*4 means -10 is repeated 4 times, and therefore occupies 4 * 32 bits of memory.

If it's a number, followed by two asterisks, this means that the value is copied until the end of the array. Thus,

num2: words [10] 0*7 2**
means that there are 7 zeroes, followed by 3 two's (since the array contains 10 values).

If the number of values is less than 10, the assume the values are filled with "garbage". If there are more than 10 values, then any value beyond the tenth is ignored. You can print an error message, if you wish.

The output will be:

(<label>, words, <size>)
<addr>:   0 0 0 0 0 0 1 1 1 1
<addr>:   0 0 0 0 0 0 1 1 1 1
where <label> is the label, "words" is the type, and <size> is the number of words.

Each line afterward will contain an address and up to ten values per line. Each line except the last will have 10 values. The address should be the address of the location of the first of the 10 values. For example,

(num, words, 18)
1000:   0 0 0 0 0 0 1 1 1 1
1040:   3 0 0 0 0 0 ? ?
Print a ? if it contains garbage value. 1000 refers to address 1000, which is the address of the first 0. 1040 is the address of 3. There are 18 values (which is how many is printed).

Remember that the addresses should be word-aligned (divisible by 4).


A single byte can be specified in one of two ways. You can use a single quote, as in 'a', as done in C/C++, or you can use a backslash, followed by a 3 octal value. For the time being, assume that all characters are "printable". There will be two exceptions: '\n' and '\0', i.e., the newline character and the null character.

The syntax is similar to word.

char: byte 'b'
newline: byte '\n'
char2: byte \101
Each byte requires one byte of memory. This can be stored at any address.

When printing the output, print using single characters in single quotes.

(<label>, byte)
<addr>:   '<char>'
For example,
(char, byte)
923:   'b'
Be careful when printing '\n' and '\0'. They should be printed as shown (a single quote, a backslash, the character, and another single quote).


Again, used for arrays of bytes. The syntax is similar to words.

Here's an example:

arr: bytes [12] 'b'*7 '\0'**
This is an array of 12 bytes, which is 7 'b', followed by 5 '\0's.

The * and value will appear concatenated to the string, so you will need to figure out how to seperate the parts.

This is how you would print the output:

(arr, bytes, 12)
 993: 'b' 'b' 'b' 'b' 'b' 'b' 'b' '\0' '\0' '\0'
1003: '\0' '\0'


The syntax for a string is fairly simple.
mesg: string "Hi, there!"
It's a label, followed by string, followed by a double quoted string, separated by any combination of spaces and newlines. You may assume the double quoted string does not contain "\n", although for extra credit, you can attempt to handle this.

You should print:

(mesg, string, 11)
1002: "Hi, there!"
You print out the label, then "string", then the number of characters needed (in this case, 10, including the null character at the end), followed by the start address, and the string in double quotes.


There are two ways to approach parsing the input. You can read one whole line at a time, then parse that line (which isn't such a bad idea considering how line-oriented assembly language can be), or you can try to read it word by word (which has drawbacks, esp. with double quoted strings). You can even process character by character, if you wish.

Longer Example

Text (Code) Region

The code region is "simple". It consists of commands. Here are the instructions that it supports. where i, j, and k are int values between 0 and 31, inclusive, and val is a positive or negative value in base 10 (you can also try to "handle" octal, which will have a leading 0, or hexadecimal with a leading "0x").

label is a label which appears in the text section (we'll deal with labels in the data section in Project 2). Assume valid label names.

Each line that appears after .text will either contain an instruction, a label, a label and an instruction, or a blank line.

Here's an example of each:

This is an instruction.

   add $r1, $r1, $r2
A label, by itself,
A label with an instruction
BRANCH: add $r1, $r1, $r2
or a blank line (which should be ignored).

If the input looks like (for "extra credit", ignore comments):

       bne $r1, $r2, ELSE    # if ( i == j )
       addi $r1, $r1, 1      # i++
ELSE:  j LAST                # skip over else
       subi $r2, $r2, 1      # j--
       add $r1, $r1, $r2     # j += i

If the code segment begins at address 999, then the output should look like:
1000:  bne $r1, $r2, ELSE    # jump +1
1004:  addi $r1, $r1, 1      
1008:  j LAST                # jump +1
1012:  subi $r2, $r2, 1      
1016:  add $r1, $r1, $r2     

ELSE: 1008
LAST: 1016
The output consists of addresses, followed by the instruction. You should add a comment if the instruction is beq, bne, j, or jal.

Print jump +k if the branch/jump instruction should jump k instructions forward starting at the next instruction. For example, ELSE is 2 instructions forward from line 1000, but only one instruction forward from 1004 (which is the next instruction). It's odd, but there's a reason for this (ask later about it).

Print jump -k if the branch/jump instruction should jump k instructions backward. Again, count from the next instruction.

Print a blank line, and list out the labels, and the addresses of the labels. Notice that a line that contains a label by itself (such as LAST), should be "combined" with the next instruction. Thus, LAST: is really the same line as "add $r1, $r1, $r2".

You may assume the following:

Obviously, it would be nicer to handle more cases of spacing, but for now, you can assume this.

The output of the code part should appear just after the output of the data part.

See the class syllabus for policies concerning email
Last Modified: Sun Mar 3 22:54:00 EST 2002
left up down right home