Addition is simple. Suppose you want to add two floating point numbers, X and Y.
For the sake of argument, assume the exponent of Y is less than or equal to the exponent of X. Let the exponent of Y be y and the exponent of X be x.
Here's how to add floating point numbers.
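Add x - y to Y's exponent. Shift the radix point of Y's mantissa (significand) left by x - y to compensate for the change in exponent.
Add the two mantissas. The exponent of the sum is the exponent of X.
Renormalize the sum if needed, and truncate the fraction if it has more bits than can be represented.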
Variable | sign | exponent | fraction |
X | 0 | 1001 | 110 |
Y | 0 | 0111 | 000 |
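To check the values in the table, here is a minimal sketch that decodes the three fields into an exact value. It assumes a bias of 7 for the 4-bit exponent (inferred from the example, where 1001 stands for an exponent of 2) and a hidden leading 1 in front of the 3-bit fraction. The names `decode`, `BIAS` and `FRAC_BITS` are made up for this illustration.

```python
from fractions import Fraction

BIAS = 7        # assumption: inferred from the example, 1001 (nine) means exponent 2
FRAC_BITS = 3   # width of the fraction field in the tables

def decode(sign, exponent, fraction):
    """Return the exact value of a normalized number from its bit-string fields."""
    e = int(exponent, 2) - BIAS
    # Hidden 1 plus the fraction bits, e.g. "110" -> 1.110 in binary = 1 + 6/8
    significand = 1 + Fraction(int(fraction, 2), 2 ** FRAC_BITS)
    value = significand * Fraction(2) ** e
    return -value if sign == "1" else value

x = decode("0", "1001", "110")   # 1.110 x 2^2
y = decode("0", "0111", "000")   # 1.000 x 2^0
print(x, y)                      # prints: 7 1
```

So X is 7 and Y is 1, and we should expect the sum to come out as 8.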
Here are the steps again:
In normalized scientific notation, X is 1.110 x 2^{2}, and Y is 1.000 x 2^{0}.
Add x - y to Y's exponent. Shift the radix point of Y's mantissa (significand) left by x - y to compensate for the change in exponent.
The difference of the exponents is 2. So, add 2 to Y's exponent, and shift the radix point left by 2. This results in 0.0100 x 2^{2}. This is still equivalent to the old value of Y. Call this readjusted value Y'.
We add 1.110_{two} to 0.01_{two}. The sum is: 10.0_{two}. The exponent is still the exponent of X, which is 2.
In this case, the sum, 10.0_{two}, has two bits left of the radix point. We need to move the radix point left by 1, and increase the exponent by 1 to compensate.
This results in: 1.000 x 2^{3}.
Sum | sign | exponent | fraction |
X + Y | 0 | 1010 | 000 |
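The steps above can be sketched in a few lines of code. This is only an illustration for positive, normalized numbers in the 8-bit format of the tables, with the fields passed as integers. The function name `fp_add` is made up here. Instead of shifting Y's radix point left, the sketch scales X up by the same amount, which is equivalent but keeps every bit in an integer until the final truncation.

```python
FRAC_BITS = 3   # width of the fraction field in the tables

def fp_add(x_exp, x_frac, y_exp, y_frac):
    """Add two positive normalized numbers, assuming x_exp >= y_exp.

    Exponents are the stored (biased) field values, fractions are the
    3-bit field values. Returns (exponent field, fraction field).
    Overflow of the exponent field is not checked in this sketch.
    """
    # Put the hidden 1 in front of each fraction: 110 -> 1110 (1.110)
    x_sig = (1 << FRAC_BITS) | x_frac
    y_sig = (1 << FRAC_BITS) | y_frac

    # Align the exponents. Both values now share X's exponent and have
    # FRAC_BITS + shift bits to the right of the radix point.
    shift = x_exp - y_exp
    total = (x_sig << shift) + y_sig
    exp = x_exp

    # Renormalize: a carry out means two bits left of the radix point.
    if total >> (FRAC_BITS + shift + 1):
        exp += 1
        shift += 1

    # Truncate to FRAC_BITS fraction bits and drop the hidden 1.
    frac = (total >> shift) & ((1 << FRAC_BITS) - 1)
    return exp, frac

exp, frac = fp_add(0b1001, 0b110, 0b0111, 0b000)
print(format(exp, "04b"), format(frac, "03b"))   # prints: 1010 000
```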
Variable | sign | exponent | fraction |
X | 0 | 1001 | 110 |
Y | 0 | 0110 | 110 |
Here are the steps again:
In normalized scientific notation, X is 1.110 x 2^{2}, and Y is 1.110 x 2^{-1}.
Add x - y to Y's exponent. Shift the radix point of Y's mantissa (significand) left by x - y to compensate for the change in exponent.
The difference of the exponents is 3. So, add 3 to Y's exponent, and shift the radix point of Y left by 3. This results in 0.00111 x 2^{2}. This is still equivalent to the old value of Y. Call this readjusted value Y'.
We add 1.110_{two} to 0.00111_{two}. The sum is: 1.11111_{two}. The exponent is still the exponent of X, which is 2.
In this case, the sum, 1.11111_{two}, has a single 1 left of the radix point. So, the sum is normalized. We do not need to adjust anything yet.
So the result is the same as before: 1.11111 x 2^{2}.
We only have 3 bits to represent the fraction. However, there are 5 fraction bits in our answer. Strictly, we should round, and real floating point hardware would do so.
However, for simplicity, we're going to truncate the additional two bits. After truncating, we get 1.111 x 2^{2}. We convert this back to floating point.
Sum | sign | exponent | fraction |
X + Y | 0 | 1001 | 111 |
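To see what truncation costs in this example, the following sketch compares the exact sum with the truncated result, and with what rounding to the nearest representable value would give. Round-to-nearest is used here only as an illustration of one possible rounding rule. The variable names are made up for this example.

```python
from fractions import Fraction
import math

FRAC_BITS = 3

exact = Fraction(7) + Fraction(7, 8)   # X + Y = 1.11111 x 2^2
scale = Fraction(2) ** 2               # the sum's exponent is 2
steps = (exact / scale) * 2 ** FRAC_BITS   # significand in units of 2^-3: 15.75

# Truncation drops the extra bits; rounding picks the nearest multiple.
truncated = Fraction(math.floor(steps), 2 ** FRAC_BITS) * scale
rounded = Fraction(round(steps), 2 ** FRAC_BITS) * scale

print(exact, truncated, rounded)   # prints: 63/8 15/2 8
```

Truncation gives 7.5, which is off by 0.375. Rounding to nearest would give 10.000 x 2^{2}, which needs renormalizing to 1.000 x 2^{3} = 8 and is off by only 0.125.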
If you're doing it on paper, then you proceed with the sum as usual. Just do normal addition or subtraction.
If it's in hardware, you would probably convert the mantissas to two's complement and perform the addition, while keeping track of the radix point (read about fixed point representation).
Once the addition is done, we may have to renormalize, and truncate bits if there are too many bits to be represented.
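Here is a rough sketch of that idea for operands of either sign. It is not a description of any particular hardware: Python's signed integers stand in for fixed-width two's complement, and the function name `signed_add` is made up. Note that when the signs differ, the leading bits can cancel, so renormalizing may mean moving the radix point right instead of left.

```python
FRAC_BITS = 3

def signed_add(x_sign, x_exp, x_frac, y_sign, y_exp, y_frac):
    """Add two normalized numbers of either sign, assuming x_exp >= y_exp.

    Returns (sign, exponent field, fraction field), or None if the sum is
    zero. Exponent range is not checked in this sketch.
    """
    shift = x_exp - y_exp
    # Aligned mantissas as integers with FRAC_BITS + shift fraction bits.
    x_val = ((1 << FRAC_BITS) | x_frac) << shift
    y_val = (1 << FRAC_BITS) | y_frac
    # Convert sign-magnitude to signed values, then do a plain addition.
    if x_sign:
        x_val = -x_val
    if y_sign:
        y_val = -y_val
    total = x_val + y_val
    if total == 0:
        return None   # zero has no normalized form
    sign = 1 if total < 0 else 0
    mag = abs(total)

    # Renormalize: find the leading 1 and adjust the exponent to match.
    top = mag.bit_length() - 1
    exp = x_exp + top - (FRAC_BITS + shift)

    # Keep FRAC_BITS bits after the leading 1, truncating any extra bits.
    if top >= FRAC_BITS:
        sig = mag >> (top - FRAC_BITS)
    else:
        sig = mag << (FRAC_BITS - top)
    return sign, exp, sig & ((1 << FRAC_BITS) - 1)

# 1.110 x 2^2 plus -1.000 x 2^0, that is 7 + (-1) = 6 = 1.100 x 2^2
print(signed_add(0, 0b1001, 0b110, 1, 0b0111, 0b000))   # prints: (0, 9, 4)
```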
If the difference between the exponents is too great, then adding X + Y effectively results in X, because every bit of Y is shifted past the bits that can be represented.
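A small sketch shows where this happens in the 3-bit-fraction format used above. The helper `aligned_y` is made up for this illustration. It returns what is left of Y's mantissa after alignment once only 3 fraction bits are kept. The last line checks the same effect on Python's built-in floats, assuming they are IEEE 754 doubles with a 52-bit fraction.

```python
FRAC_BITS = 3

def aligned_y(x_exp, y_exp, y_frac):
    """Y's mantissa after alignment to X's exponent, truncated to
    FRAC_BITS fraction bits (as an integer in units of 2^-FRAC_BITS)."""
    y_sig = (1 << FRAC_BITS) | y_frac
    return y_sig >> (x_exp - y_exp)

# Y = 1.110; watch its bits fall off as the exponent difference grows.
for diff in range(6):
    print(diff, format(aligned_y(diff, 0, 0b110), "04b"))

# The same effect with doubles: 1.0 is too small to change 2^53.
print(2.0 ** 53 + 1.0 == 2.0 ** 53)   # prints: True
```

With a difference of 4 or more, nothing of Y survives, so the truncated sum is just X.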
Real floating point hardware uses more sophisticated means to round the summed result. As a simplification, we truncate bits if there are more bits than can be represented.