Reals : Floating point representation



In denary the integer 25000 can be written as

2.5 x 10⁴

This number is in floating point format :

Mantissa x 10^Exponent

In binary, though, we express the exponent as a power of 2, so every binary floating point number is of the form :

Mantissa x 2^Exponent

The mantissa is a fixed point fraction and the exponent is an integer.

For example, the integer 40 can be written as
= 20 x 2¹
= 10 x 2²
= 5 x 2³
= 2.5 x 2⁴
= 1.25 x 2⁵
= 0.625 x 2⁶

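Which one do we use? We use the last one, where the mantissa lies between 0.5 and 1 (0.5 ≤ mantissa < 1 for a positive number). This is said to be in normalised form. It gives each number a single representation and makes the fullest use of the mantissa bits, so it is the most precise.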
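As a rough illustration of normalisation, the sketch below uses Python's standard `math.frexp`, which happens to return a mantissa in the same 0.5 to 1 range used here. The helper name `normalise` is made up for this example.

```python
import math

def normalise(value):
    """Split value into (mantissa, exponent) so that
    value == mantissa * 2**exponent and 0.5 <= |mantissa| < 1
    (for any non-zero value)."""
    return math.frexp(value)

mantissa, exponent = normalise(40)
print(mantissa, exponent)             # 0.625 6

# Rebuilding the number from its two parts gives 40 again.
print(math.ldexp(mantissa, exponent))  # 40.0
```

Every other line of the list for 40 (20 x 2¹, 10 x 2², and so on) describes the same value; normalisation simply picks one of them.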

Example

A floating point number system uses 16-bit numbers: 8 bits for the mantissa and 8 bits for the exponent.

Convert the following binary number to denary.

01010000 00001001

Step 1 : The Mantissa 01010000

This is a positive fixed point fraction (binary point after the sign bit):

Sign	0.5	0.25	0.125	0.0625	0.03125	0.015625	0.0078125
0	1	0	1	0	0	0	0

The mantissa value is 0.5 + 0.125 = 0.625

Step 2 : The Exponent 00001001

This is a positive integer :

Sign	64	32	16	8	4	2	1
0	0	0	0	1	0	0	1

The exponent value is 8 + 1 = 9

Step 3 : Put together what we have...

The final answer is 0.625 x 2⁹ = 320
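The three steps can be sketched in Python. This is only an illustration: the function names `twos_complement` and `decode` are invented here, and the sketch assumes that both the mantissa and the exponent are stored in two's complement, which the worked example (positive values only) does not confirm.

```python
def twos_complement(bits):
    """Interpret a string of 0s and 1s as a two's complement integer
    (assumption: negative values use two's complement)."""
    value = int(bits, 2)
    if bits[0] == "1":            # sign bit set, so the value is negative
        value -= 1 << len(bits)
    return value

def decode(mantissa_bits, exponent_bits):
    """Convert a mantissa/exponent pair of bit strings to a number."""
    # Step 1: the binary point sits just after the sign bit,
    # so scale the integer value by 2**(number of bits - 1).
    mantissa = twos_complement(mantissa_bits) / (1 << (len(mantissa_bits) - 1))
    # Step 2: the exponent is a plain signed integer.
    exponent = twos_complement(exponent_bits)
    # Step 3: put the two together.
    return mantissa * 2.0 ** exponent

print(decode("01010000", "00001001"))  # 320.0
```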

The important bit...

Range and precision

To increase the range of numbers which can be represented, use more bits for the exponent.

To increase the precision, use more bits for the mantissa.

No system with a fixed number of bits can represent every number exactly, so approximations must be used. To increase the accuracy of the arithmetic, more bits must be used to represent each number. (Scientific high-level languages would typically use more bits for numbers than commercial languages.)
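The trade-off between range and precision can be explored with a short sketch. The helper names `largest_value` and `mantissa_step` are hypothetical, and the formulas assume the layout of the example above: a sign bit followed by fraction bits in the mantissa, and a two's complement exponent.

```python
def largest_value(mantissa_bits, exponent_bits):
    """Largest positive number for a given split of bits
    (assumes sign bit + fraction mantissa, two's complement exponent)."""
    largest_mantissa = 1 - 2.0 ** -(mantissa_bits - 1)   # e.g. 0.1111111 in binary
    largest_exponent = 2 ** (exponent_bits - 1) - 1      # e.g. 01111111 = 127
    return largest_mantissa * 2.0 ** largest_exponent

def mantissa_step(mantissa_bits):
    """Gap between neighbouring mantissa values (smaller means more precise)."""
    return 2.0 ** -(mantissa_bits - 1)

# Three ways of splitting the same 16 bits.
for m_bits, e_bits in [(6, 10), (8, 8), (10, 6)]:
    print(m_bits, e_bits, largest_value(m_bits, e_bits), mantissa_step(m_bits))
```

Moving bits from the mantissa to the exponent makes the largest value grow while the step between mantissa values gets coarser, and vice versa.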

Overflow

If the result of an arithmetic operation is too 'big' to fit in a pre-defined space, then overflow has occurred.

Trivial example using 8-bit integers:
50 = 00110010
40 = 00101000

but 40 x 50 = 2000 = 11111010000, which needs 11 bits and will not fit into 8 bits!
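A quick check of this example in Python is sketched below. Python integers never overflow, so the 8-bit limit is imitated with a mask; keeping only the low 8 bits is an assumption about how the hardware behaves, not something stated above.

```python
BITS = 8

product = 40 * 50
print(product)                 # 2000
print(format(product, "b"))    # 11111010000
print(product.bit_length())    # 11

# Overflow test: does the result need more bits than we have?
overflowed = product.bit_length() > BITS
print(overflowed)              # True

# Assumption: the hardware keeps only the low 8 bits of the result.
stored = product & ((1 << BITS) - 1)
print(format(stored, "08b"), stored)  # 11010000 208
```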

Underflow

If the result of an arithmetic operation is too small in magnitude (too close to zero) to be represented, underflow has occurred.
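Underflow can be seen with a small sketch. Note the assumption: this uses Python's built-in 64-bit floats rather than the 16-bit system of the earlier example, but the effect is the same in kind.

```python
import sys

tiny = 1e-200
product = tiny * tiny      # true answer is 1e-400, far too close to zero

print(product)             # 0.0
print(product == 0.0)      # True

# Smallest positive normalised value a Python float can hold.
print(sys.float_info.min)
```

The true result is not zero, but it is smaller than anything the format can hold, so it is replaced by zero.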