Reals : Floating Point Representation

In denary, the integer 25000 can be written as

  • 2.5 x 10^4

This number is in floating point format :

Mantissa x 10^Exponent

In binary, though, we express the exponent as a power of 2, so every binary floating point number is of the form :

Mantissa x 2^Exponent

The mantissa is a fixed point fraction and the exponent is an integer.

For example, the integer 40 can be written as
= 20 x 2^1
= 10 x 2^2
=  5 x 2^3
=  2.5 x 2^4
= 1.25 x 2^5
= 0.625 x 2^6

Which one do we use? We use the last one, where the mantissa is at least 0.5 but less than 1. This is said to be in normalised form. It makes the fullest use of the mantissa bits, so it is the most accurate.
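As a quick check of the normalised form, the sketch below uses Python's standard math.frexp function, which splits a positive number into a mantissa of at least 0.5 and less than 1, and a power-of-2 exponent. The helper name normalise is made up for this illustration.

```python
import math

def normalise(value: float) -> tuple[float, int]:
    """Split value into (mantissa, exponent) so that value == mantissa * 2**exponent."""
    mantissa, exponent = math.frexp(value)
    return mantissa, exponent

mantissa, exponent = normalise(40)
print(mantissa, exponent)  # 0.625 6
```

This matches the last line of the list above: 40 = 0.625 x 2^6.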

Example

A floating point number system uses 16-bit numbers : 8 bits for the mantissa and 8 bits for the exponent.

Convert the following binary number to denary.

01010000 00001001

Step 1 : The Mantissa 01010000

This is a positive fixed point fraction (binary point after the sign bit):

Sign  0.5  0.25  0.125  0.0625  0.03125  0.015625  0.0078125
0     1    0     1      0       0        0         0

The mantissa value is 0.5 + 0.125 = 0.625

Step 2 : The Exponent 00001001

This is a positive integer :

Sign  64  32  16  8  4  2  1
0     0   0   0   1  0  0  1

The exponent value is 8 + 1 = 9

Step 3 : Put together what we have...

The final answer is 0.625 x 2^9 = 0.625 x 512 = 320
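The three steps can be sketched in Python as below. This is an illustration only: the function names are invented here, and the text does not say how negative values are stored, so the sketch assumes two's complement for both the mantissa and the exponent. The worked example uses positive values only, where that assumption makes no difference.

```python
def signed_value(bits: str) -> int:
    """Read a bit string as a signed integer (assumption: two's complement)."""
    raw = int(bits, 2)
    if bits[0] == "1":              # sign bit set, so the value is negative
        raw -= 1 << len(bits)
    return raw

def mantissa_value(bits: str) -> float:
    """Fixed point fraction with the binary point just after the sign bit."""
    return signed_value(bits) / (1 << (len(bits) - 1))

def decode(word: str) -> float:
    """Decode 'mantissa exponent' (8 bits each, separated by a space) to denary."""
    mantissa_bits, exponent_bits = word.split()
    return mantissa_value(mantissa_bits) * 2 ** signed_value(exponent_bits)

print(mantissa_value("01010000"))   # 0.625
print(signed_value("00001001"))     # 9
print(decode("01010000 00001001"))  # 320.0
```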

Exercise

These exercises use the same system as the example above...

1 What denary number is represented by the binary number 
00101010 00000011 ?
2 What denary number is represented by the binary number
01101000 00010001 ?

 

 

The important bit...

Range and precision

To increase the range of numbers which can be represented, use more bits for the exponent.

To increase the precision, use more bits for the mantissa.

No system with a fixed number of bits can represent every number exactly, so approximations must be used. To increase the accuracy of the arithmetic, more bits must be used to represent numbers. (Languages aimed at scientific work would typically use more bits for numbers than languages aimed at commercial work.)
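To see the effect of mantissa size on precision, the sketch below approximates 0.1 using a mantissa with 7 fraction bits (as in the 8-bit mantissa of the example, which has a sign bit plus 7 fraction bits) and then with 15 fraction bits. The function name round_to_mantissa_bits is hypothetical, and the sketch ignores the edge case where rounding pushes the mantissa up to 1.

```python
import math

def round_to_mantissa_bits(value: float, fraction_bits: int) -> float:
    """Approximate value using a mantissa with the given number of fraction bits."""
    mantissa, exponent = math.frexp(value)      # value == mantissa * 2**exponent
    scale = 1 << fraction_bits
    stored = round(mantissa * scale) / scale    # keep only fraction_bits binary places
    return stored * 2.0 ** exponent

for bits in (7, 15):
    approx = round_to_mantissa_bits(0.1, bits)
    print(bits, approx, abs(approx - 0.1))      # the error shrinks as bits grows
```

Neither version stores 0.1 exactly, but the longer mantissa gets closer.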

 

 

Overflow

If the result of an arithmetic operation is too large to fit in the number of bits set aside for it, then overflow has occurred.

Trivial example using 8-bit integers:
50 = 00110010
40 = 00101000

but 40 x 50 = 2000 = 11111010000, which needs 11 bits and will not fit into 8 bits!
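The same check can be sketched in Python, whose integers grow as needed, so the 8-bit limit has to be simulated. The helper fits is made up for this illustration and treats the values as unsigned. Keeping only the low 8 bits is one possible outcome of overflow, shown here as an assumption. What a real machine does depends on the hardware and language.

```python
BITS = 8

def fits(value: int, bits: int = BITS) -> bool:
    """True if a non-negative integer can be stored in the given number of bits."""
    return value.bit_length() <= bits

product = 40 * 50
print(format(product, "b"), product.bit_length())  # 11111010000 11
print(fits(40), fits(50), fits(product))            # True True False
print(product & 0xFF)                               # 208, the value left if only the low 8 bits are kept
```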

Underflow

If the result of an arithmetic operation is too small in magnitude (too close to zero) to be represented, underflow has occurred.
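Underflow can be seen with Python's built-in floats. Note the assumption: these are normally 64-bit numbers, not the 16-bit system used in the example above, so the limits are very different, but the effect is the same. Multiplying two very small numbers gives a result too close to zero to store, and it becomes 0.

```python
import sys

tiny = 1e-200
product = tiny * tiny          # the true result, 1e-400, is too close to zero to store

print(sys.float_info.min)      # smallest positive normalised float on this machine
print(product)                 # 0.0
print(product == 0.0)          # True
```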