c m s c 311
s p r i n g 2 0 0 2
Due Sunday, May 5, 11:59 PM (just before midnight)
Late Due Date Sunday, May 12, 11:59 PM (just before midnight)
Posted: April 5, 2002
Please note that *all* programming projects in this course (including this one) are to be done independently or with the assistance of the instructional staff of this course only, unless otherwise specified IN PRINT by official webpages.
Please review the policies outlined on the class syllabus concerning the use of class computer accounts and concerning the University's Code of Academic Integrity. The instructors of this course will review the programs submitted by students for potential violations of the Code of Academic Integrity and if it is believed that a violation has occurred it will be referred to the Office of Judicial Programs and the Student Honor Council.
Hardcoding is considered a violation of academic integrity
Think of an instruction as specific C or C++ function, which takes arguments.
Assembly language does not support anything complicated. There are no loops, no functions, no nesting, etc. Yet all of these can be simulated in assembly language, with great effort.
Unfortunately, assembly language can differ, based on the assembler, just like C compilers often varied just a little, before ANSI standardized the language. Some features would exist in one flavor of C, but not in another.
In our version of MIPS, the code will be divided into two regions: data and text. The data region will start with .data on its own line and the text region will start with .text on its own line.
Here's an example:
Usually, the .data string (and .text)is preceded by
a TAB, but we'll consider it optional. You can assume that
the data region of the program always preceded the text region.
This will make it easier to parse.
These are the data types you should support:
Here are some examples:
num: word 10 # num initialized to 10
num2: word -12 # num2 initialized to -12
num3: word # num3 uninitialized
arr: words  0*4 1*6 # arr holds 10 elements, four 0's and six 1's
# arr2 holds 10 elements, four 0's and six 1's
arr2: words  0 0 0 0 1 1 1 1 1 1
# arr3 holds 10 elements, four 0's and six 1's
arr2: words  0*4 1**
newline: byte '\n' # newline stores '\n'
char: byte 'I' # char stores 'I'
char2: byte \101 # char2 stores 'A', which is 65 in octal
# arr3 holds 10 elements, four '\0' and six 'a'
arr3: words  '\0'*4 'a'*5
greetings: string "Hello, there!"
The first part of the project is to handle the data region.
You may assume it's the first part of the input file. The
input file will be an assembly language program written
The user will use command line options. This will
be the typical way that the user will enter the information:
assem -d 1000 -t 2000 < input
where -d and -t are switches.
|-d||The starting address of the data region|
|-t||The starting address of the text region|
To read command line options (in C/C++), you will need to
expand main() to take parameters. Here's how it's
int main( int argc, char *argv )
where argc is the number of command line arguments.
For example, in
assem -d 1000 -t 2000
argc would be 5. Each command line argument is
separated by spaces. Each argument is stored as a C-style
string in argv. It's very important to realize that
they are stored as C-style strings (not int values, etc.).
In this example, argv[ 0 ] = "assem", argv[ 1 ] = "-d", argv[ 2 ] = "1000", argv[ 3 ] = "-t", and argv[ 0 ] = "2000".
So, argv[ 2 ] is a string, NOT the number 1000.
You may assume that -d and -t appear in the command line argument, and that each switch is followed by a single string which is a non-negative number, which represents an address in memory.
You MAY NOT assume that -d is always first and -t is always second. It may be written in either order (that way, the user doesn't have to remember which is which).
For a small amount of extra credit, print an error message if there
aren't enough command line arguments.
Incorrect number of arguments.
Syntax: assem -d <addr> -t <addr>
You may assume the data region is first in an assembly language program.
Each line with either be blank (containing 0 or more spaces, followed by a newline), or contain a label.
A label is a string that ends in a colon. The
name of label itself does not contain the colon. Thus, you
num is the name of the label. The colon is how you
know it's a label. You may assume the colon always appears
at the end of the name of the label.
The syntax of a label is the same as a C/C++ identifier. It begins with an alphabetic letter or an underscores, followed by zero or more alphabetic letters, underscores, or digits.
You may assume that no two labels have the same name.
In the data region, the label is followed by the "type". The following are the supported types in our assembly language.
|word||A 32 bit integer, written in base 10|
|words||An "array" of 32 bit integer, written in base 10|
|byte||A 8 bit "character", written as a character in single quotes or octal|
|bytes||An "array" of 8 bit "character", written as a character in single quotes or octal|
|string||Written as a double quoted string. The amount of storage is the length of the string plus 1 (for the null character)|
Each label defines a region of memory where the data is stored. It's your job to determine what addresses each label occupies. You start the address given by -d. For example, if the command line option says -d 1000, you should attempt to set the addresses of the label beginning at address 1000.
You may assume that memory is byte-addressable. Each address in memory refers to a byte.
Let's consider a word.
num: word 10
Here's the BNF.
<line> := <label> <colon> <seps>
"word" <digits> <seps> <endl>
<label> := <identifier>
<identifier> := [ "_" | <alpha> ]
[ "_" | <alpha> | <digit> ]*
<alpha> := <lower> | <upper>
<lower> := "a" | "b" | "c" | "d" | "e" |
"f" | "g" | "h" | "i" | "j" |
"k" | "l" | "m" | "n" | "o" |
"p" | "q" | "r" | "s" | "t" |
"u" | "v" | "w" | "x" | "y" |
<upper> := "A" | "B" | "C" | "D" | "E" |
"F" | "G" | "H" | "I" | "J" |
"K" | "L" | "M" | "N" | "O" |
"P" | "Q" | "R" | "S" | "T" |
"U" | "V" | "W" | "X" | "Y" |
<digit> := "0" | "1" | "2" | "3" | "4" |
"5" | "6" | "7" | "8" | "9" |
<digits> := [-]? [ <digit> ]+
<seps> := [ " " | "\n" ]+
<endl> := "\n"
The BNF is "enhanced". It uses * to stand for Kleene closure
(0 or more repetitions), + to stand for 1 or more repetitions,
and ? stands for 0 or 1 repetitions.
Even though some BNF appears over two or more lines, do not assume that there is a line break. They appear on two or more lines because there's not enough space to put it on one line. To indicate a line break, you will see either a "\n" or a <endl>.
The syntax says that you will see "word" followed by digits. This may have an optional minus sign in front (if it represents a negative number). You may assume the number is in base 10 and can be stored in 32 bits in two's complement.
For a little extra credit, you may wish to support numbers
that are hexadecimal or octal. Values that are hexadecimal
begin with 0x. The alphabetic parts are either uppercase
or lowercase. Here's the BNF.
<hex> := [-]? "0x" [ <digit> | "a" | "A" |
| "b" | "B" | "d" | "D" |
| "e" | "E" | "f" | "F" ]*
<octal> := [-]? "0" [ "0" | "1" | "2" | "3" | "4" |
"5" | "6" | "7" | "8" | "9" ]*
An octal value starts with a 0. This is extra credit, so if
don't want to handle this, you don't have to.
For the output, you shold print the following:
where <label> is the label, "word" is the type. On
the second line, <addr> is the starting address of the
label. which is followed by a colon, followed by at least one blank
space and the value stored at the address in base 10.
<line2> := <label> <colon> <seps> "words" <seps> "[" <digits> "]" [ <seps> [ <digits> | <digits> "*" [ <digits> | "*" ] ] ]*What does this mean?
Here's an example:
num2: words  0*7 1 2 3
num2 is a label. words indicates it will be
an array of 32 bit values.  indicates the number of
32 bit values, in this case, 10 values stored.
Each value afterwards can appear either as a base 10 number, or a base 10 number, followed by an asterisk, followed by either another base 10 number or a second asterisk.
If it's a number, then it occupies 32 bits of memory. If it's a number followed by an asterisk followed by another number, then the first number is the value, and the second number is the amount of times it's repeated. Thus, -10*4 means -10 is repeated 4 times, and therefore occupies 4 * 32 bits of memory.
If it's a number, followed by two asterisks, this means that
the value is copied until the end of the array. Thus,
num2: words  0*7 2**
means that there are 7 zeroes, followed by 3 two's (since the
array contains 10 values).
If the number of values is less than 10, the assume the values are filled with "garbage". If there are more than 10 values, then any value beyond the tenth is ignored. You can print an error message, if you wish.
The output will be:
(<label>, words, <size>)
<addr>: 0 0 0 0 0 0 1 1 1 1
<addr>: 0 0 0 0 0 0 1 1 1 1
where <label> is the label, "words" is the type,
and <size> is the number of words.
Each line afterward will contain an address and up to ten
values per line. Each line except the last will have 10 values.
The address should be the address of the location of the first
of the 10 values. For example,
(num, words, 18)
1000: 0 0 0 0 0 0 1 1 1 1
1040: 3 0 0 0 0 0 ? ?
Print a ? if it contains garbage value. 1000 refers to
address 1000, which is the address of the first 0. 1040 is the
address of 3. There are 18 values (which is how many is printed).
Remember that the addresses should be word-aligned (divisible by 4).
The syntax is similar to word.
char: byte 'b'
newline: byte '\n'
char2: byte \101
Each byte requires one byte of memory. This can be stored
at any address.
When printing the output, print using single characters in
Be careful when printing '\n' and '\0'. They should be printed
as shown (a single quote, a backslash, the character, and another
Here's an example:
arr: bytes  'b'*7 '\0'**
This is an array of 12 bytes, which is 7 'b', followed by 5
The * and value will appear concatenated to the string, so you will need to figure out how to seperate the parts.
This is how you would print the output:
(arr, bytes, 12)
993: 'b' 'b' 'b' 'b' 'b' 'b' 'b' '\0' '\0' '\0'
1003: '\0' '\0'
mesg: string "Hi, there!"It's a label, followed by string, followed by a double quoted string, separated by any combination of spaces and newlines. You may assume the double quoted string does not contain "\n", although for extra credit, you can attempt to handle this.
You should print:
(mesg, string, 11)
1002: "Hi, there!"
You print out the label, then "string", then the number of characters
needed (in this case, 10, including the null character at the end),
followed by the start address, and the string in double quotes.
label is a label which appears in the text section (we'll deal with labels in the data section in Project 2). Assume valid label names.
Each line that appears after .text will either contain an instruction, a label, a label and an instruction, or a blank line.
Here's an example of each:
This is an instruction.
add $r1, $r1, $r2
A label, by itself,
A label with an instruction
BRANCH: add $r1, $r1, $r2
If the input looks like (for "extra credit", ignore comments):
bne $r1, $r2, ELSE # if ( i == j )
addi $r1, $r1, 1 # i++
ELSE: j LAST # skip over else
subi $r2, $r2, 1 # j--
add $r1, $r1, $r2 # j += i
If the code segment begins at address 999, then the
output should look like:
1000: bne $r1, $r2, ELSE # jump +1
1004: addi $r1, $r1, 1
1008: j LAST # jump +1
1012: subi $r2, $r2, 1
1016: add $r1, $r1, $r2
The output consists of addresses, followed by the instruction.
You should add a comment if the instruction is beq,
bne, j, or jal.
Print jump +k if the branch/jump instruction should jump k instructions forward starting at the next instruction. For example, ELSE is 2 instructions forward from line 1000, but only one instruction forward from 1004 (which is the next instruction). It's odd, but there's a reason for this (ask later about it).
Print jump -k if the branch/jump instruction should jump k instructions backward. Again, count from the next instruction.
Print a blank line, and list out the labels, and the addresses of the labels. Notice that a line that contains a label by itself (such as LAST), should be "combined" with the next instruction. Thus, LAST: is really the same line as "add $r1, $r1, $r2".
You may assume the following:
The output of the code part should appear just after the output of the data part.
See the class syllabus for policies concerning email
Last Modified: Sun Mar 3 22:54:00 EST 2002