Floating Point Representation of Real Numbers

In denary, the integer 25000 can be written as

  • 2.5 x 10^4

This number is in floating point format :

Mantissa x 10^Exponent

In binary, though, we express the exponent as a power of 2, so every binary floating point number is of the form :

Mantissa x 2^Exponent

The mantissa would be a fixed point fraction and the exponent would be an integer.

 

There can be many floating point representations of the same number...

For example, the integer 40 can be written as
= 20 x 2^1
= 10 x 2^2
=  5 x 2^3
=  2.5 x 2^4
= 1.25 x 2^5
= 0.625 x 2^6

Which one do we use? We use the last one, where the mantissa lies between 0.5 and 1 (0.5 <= mantissa < 1 for a positive number). This is said to be in normalised form. It is the most accurate because no mantissa bits are wasted on leading zeros.
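The normalisation rule can be sketched in a few lines of Python. This is only an illustration for positive numbers: the helper name normalise is hypothetical, and the standard library function math.frexp is used as a cross-check because it returns a mantissa in the same 0.5 to 1 range.

```python
import math

def normalise(mantissa, exponent):
    """Rescale a positive (mantissa, exponent) pair so that 0.5 <= mantissa < 1.

    The value mantissa x 2^exponent is unchanged. Hypothetical helper,
    positive numbers only.
    """
    if mantissa <= 0:
        raise ValueError("this sketch only handles positive numbers")
    while mantissa >= 1:       # too big: halve the mantissa, add 1 to the exponent
        mantissa /= 2
        exponent += 1
    while mantissa < 0.5:      # too small: double the mantissa, subtract 1
        mantissa *= 2
        exponent -= 1
    return mantissa, exponent

print(normalise(20, 1))   # (0.625, 6)
print(normalise(5, 3))    # (0.625, 6)
print(math.frexp(40))     # (0.625, 6)
```

Every representation of 40 in the list above normalises to the same pair, 0.625 x 2^6.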

Example

A floating point number system uses 16-bit numbers: 8 bits for the (signed) mantissa and 8 bits for the (signed) exponent.

Convert the following binary number to denary.

01010001 00000101

Step 1 : The Exponent 00000101

This is a positive integer :

Sign 64 32 16 8 4 2 1
0 0 0 0 0 1 0 1

The exponent value is 4 + 1 = 5

Step 2 : The Mantissa 0.1010001

This is a positive fraction (binary point after the sign bit):

The exponent is 5, so perform an arithmetic left shift 5 times (move the binary point 5 places to the right):

0.1010001
01.010001 (once)
010.10001 (twice)
0101.0001 (3 times)
01010.001 (4 times)
010100.01 (5 times)

Sign 16 8 4 2 1 . 0.5 0.25
0 1 0 1 0 0 . 0 1

The final answer : 16 + 4 + 0.25 = 20.25
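The same conversion can be checked with a short Python sketch. It assumes the signed mantissa and exponent are stored in two's complement, which the example does not state (both values here are positive, so the result is the same either way). The function names twos_complement and decode are made up for illustration.

```python
def twos_complement(bits):
    """Interpret a string of '0'/'1' characters as a two's complement integer."""
    value = int(bits, 2)
    if bits[0] == "1":              # sign bit set: subtract 2^width
        value -= 1 << len(bits)
    return value

def decode(mantissa_bits, exponent_bits):
    """Return the denary value of a floating point number given as two bit strings."""
    # The binary point sits directly after the sign bit, so the mantissa
    # is the integer value divided by 2^(width - 1).
    mantissa = twos_complement(mantissa_bits) / (1 << (len(mantissa_bits) - 1))
    exponent = twos_complement(exponent_bits)
    return mantissa * 2 ** exponent

print(decode("01010001", "00000101"))   # 20.25
```

Here the mantissa 0.1010001 is 81/128 = 0.6328125, and 0.6328125 x 2^5 = 20.25.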

 

 
Converting a real number into floating point form.

(This example uses a 12-bit signed mantissa and a 4-bit signed exponent - make sure you read any exam questions carefully for the storage setup)

Example : Convert 14.625 into floating point form.

Step 1 : Convert the integer and fraction parts of the number into binary:

Sign ... 32 16 8 4 2 1
0 ... 0 0 1 1 1 0

14 = 01110

 
Sign . 0.5 0.25 0.125 0.0625 0.03125 0.015625 ...
0 . 1 0 1 0 0 0 ...

.625 = 0.101

So 14.625 = 01110.1010000 (adding 0s to make it up to 12 bits)

Step 2 : Perform a number of arithmetic right shifts (divide mantissa by 2) until the binary point is in the correct position (after the sign bit). For each shift, add 1 to the exponent...

(Think of this as moving the binary point to the left a number of places)

Mantissa Exponent
01110.1010000 0
0111.01010000 1
011.101010000 2
01.1101010000 3
0.11101010000 4

Step 3 : Convert the exponent into a binary integer.

Exponent = 4 = 0100

So final answer :

14.625 = 011101010000 0100
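The three steps can be sketched in Python as follows. This is a minimal illustration rather than a complete converter: it handles positive numbers only, assumes the exponent is stored in two's complement, and simply drops any mantissa bits that do not fit. The names encode and to_twos_complement are hypothetical.

```python
def to_twos_complement(value, width):
    """Return value as a two's complement bit string of the given width."""
    return format(value & ((1 << width) - 1), f"0{width}b")

def encode(x, mantissa_bits=12, exponent_bits=4):
    """Encode a positive number as (mantissa string, exponent string)."""
    if x <= 0:
        raise ValueError("this sketch only handles positive numbers")
    mantissa, exponent = x, 0
    while mantissa >= 1:            # arithmetic right shift: halve, add 1 to exponent
        mantissa /= 2
        exponent += 1
    while mantissa < 0.5:           # numbers below 0.5 need a negative exponent
        mantissa *= 2
        exponent -= 1
    limit = 1 << (exponent_bits - 1)
    if not -limit <= exponent < limit:
        raise OverflowError("exponent does not fit in the exponent field")
    # One bit is the sign, the rest hold the fraction; int() drops extra bits.
    scaled = int(mantissa * (1 << (mantissa_bits - 1)))
    return format(scaled, f"0{mantissa_bits}b"), to_twos_complement(exponent, exponent_bits)

print(encode(14.625))        # ('011101010000', '0100')
print(encode(20.25, 8, 8))   # ('01010001', '00000101')
```

The second call uses the 8-bit mantissa and 8-bit exponent layout from the earlier example and reproduces its bit pattern.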

 

 
Advantages and Disadvantages

The advantage of using floating point form for numbers is that a greater range of numbers is representable.

The disadvantages are:

  • more storage space needed
  • slower processing times
  • lack of precision - some real numbers can only be represented approximately
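The lack of precision can be seen with a small Python sketch. It assumes a signed mantissa of a given width, truncates any bits that do not fit, and ignores the size of the exponent field. The function name stored_value is hypothetical.

```python
import math

def stored_value(x, mantissa_bits):
    """Value actually held when positive x is stored with a signed mantissa
    of the given width (extra bits are truncated, exponent range is ignored)."""
    mantissa, exponent = math.frexp(x)          # 0.5 <= mantissa < 1
    scale = 1 << (mantissa_bits - 1)            # one bit is used for the sign
    mantissa = int(mantissa * scale) / scale    # drop the bits that do not fit
    return mantissa * 2.0 ** exponent

print(stored_value(14.625, 12))   # 14.625 fits exactly
print(stored_value(0.1, 12))      # slightly less than 0.1
print(stored_value(0.1, 24))      # closer to 0.1, but still not exact
```

14.625 needs only a few binary places, so it is stored exactly. 0.1 has a recurring binary fraction, so any fixed-width mantissa holds only an approximation, and a wider mantissa just reduces the error.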