Analysis of Floating Point Number Representation and Standards

Added on 2021/04/21

AI Summary

This report provides a detailed explanation of floating point number representation, focusing on the IEEE-754 standard. It covers both 32-bit single-precision and 64-bit double-precision formats, outlining their components: sign bit, exponent, and significand. The report includes examples illustrating how decimal numbers are converted into the IEEE-754 format. Furthermore, it compares and contrasts floating-point representation with fixed-point representation, highlighting the key differences in their structure and the range of numbers they can represent. The report references several sources, including research papers, to support its analysis and provide a comprehensive overview of the topic.

Running head: FLOATING POINT NUMBER REPRESENTATION
Floating Point Number Representation
Name of the Student:
Name of the University:
Author note

Table of Contents
Specification of Floating point number
IEEE-754 32-bit Single-Precision Floating-Point Numbers
IEEE-754 64-bit Double-Precision Floating-Point Numbers
Difference between Floating point number and fixed point representation
Reference

Specification of Floating point number
IEEE-754 32-bit Single-Precision Floating-Point Numbers
IEEE-754 single precision is a floating-point number format that occupies 4 bytes of
memory. It is also referred to as binary32 because each value is stored in exactly 32 bits.
The layout of the IEEE-754 32-bit single-precision format is shown below:
Figure: Single precision format
Source: Kumar & Basha, 2016
The IEEE-754 32-bit single-precision format consists of three components:
• Sign: 1 bit
• Exponent: 8 bits
• Significand precision: 24 bits, of which 23 are explicitly stored.
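The sign bit gives the sign of the number: 0 for positive and 1 for negative values. The
8 exponent bits are stored as an unsigned integer from 0 to 255; subtracting the bias of 127
gives the true exponent, from -127 to 128 (Hou et al., 2017). The stored values 0 and 255 are
reserved for zero, subnormal numbers, infinities and NaN, so normal numbers have true
exponents from -126 to 127. The 23 fraction bits follow the exponent field and hold the
significand without its leading 1, which is implicit for normal numbers.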
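The field layout described above can be checked with a minimal Python sketch. The helper name decode_binary32 is hypothetical and chosen only for illustration; the sketch assumes the standard struct module is used to obtain the 32-bit pattern.

```python
import struct

def decode_binary32(x):
    """Split a Python float, rounded to binary32, into its three fields."""
    # Pack as a big-endian float32, then reinterpret the 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # 1 sign bit
    exponent = (bits >> 23) & 0xFF   # 8 exponent bits, stored with a bias of 127
    fraction = bits & 0x7FFFFF       # 23 explicitly stored fraction bits
    return sign, exponent, fraction

# -6.5 is -1.101 (binary) times 2^2
sign, exponent, fraction = decode_binary32(-6.5)
print(sign, exponent - 127, format(fraction, "023b"))
```

Subtracting 127 from the stored exponent recovers the true exponent, which is the biased scheme the paragraph describes.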
An example of the IEEE 754 32 bit single precision format:
Consider the decimal value 0.25. In normalized binary form it can be written as:
$(0.25)_{10} = (1.0)_2 \times 2^{-2}$
The true exponent is therefore -2, and the biased exponent stored in the format is
127 - 2 = 125, which is 0111 1101 in binary.
The fraction is 0 because all digits to the right of the binary point in 1.0 are zeros.
Thus, the 23-bit fraction field is 00000000000000000000000.
Thus, the complete representation of the number 0.25 in the 32 bit single precision
format is as follows:
0 01111101 00000000000000000000000
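As a cross-check of the worked example, the following sketch assembles the three fields by hand and compares them with the bit pattern that Python's struct module produces for 0.25. The function names encode_fields and bits_of are illustrative assumptions, and encode_fields only handles normal numbers.

```python
import struct

BIAS = 127  # exponent bias for binary32

def encode_fields(sign, true_exponent, fraction):
    """Format sign, biased exponent and fraction as 'S EEEEEEEE F...F' (normal numbers only)."""
    return f"{sign:01b} {true_exponent + BIAS:08b} {fraction:023b}"

def bits_of(x):
    """Bit pattern that struct produces for x stored as a big-endian binary32."""
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    s = f"{raw:032b}"
    return f"{s[0]} {s[1:9]} {s[9:]}"

by_hand = encode_fields(0, -2, 0)   # sign 0, true exponent -2, fraction 0
print(by_hand)
print(by_hand == bits_of(0.25))
```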
IEEE-754 64-bit Double-Precision Floating-Point Numbers
The IEEE-754 double-precision format, also called binary64, stores each number in 64
bits. It occupies 8 bytes, that is, two adjacent 32-bit storage locations in the computer's
memory. It is the most commonly used format on PCs because it offers a wider range and higher
precision than single precision. A single-precision value carries only 24 significant bits, so
it cannot hold every 32-bit integer exactly; double precision is therefore preferred where
that precision matters.
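The precision gap between the two formats can be illustrated with a short sketch. It assumes that round-tripping a value through struct's "f" format rounds it to binary32; the helper name round_to_binary32 is hypothetical. Python's own float type is a binary64 value.

```python
import struct

def round_to_binary32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

n = 2**24 + 1  # 16777217 needs 25 significant bits

# binary32 has a 24-bit significand, so n cannot be stored exactly
single_is_exact = round_to_binary32(float(n)) == n
# binary64 has a 53-bit significand, so n is stored exactly
double_is_exact = float(n) == n

print(single_is_exact, double_is_exact)
```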
The IEEE-754 64-bit double precision format is represented below:
Figure: Representation of Double precision floating point number
Source: Fulzele & Ghodke, 2015
According to the figure, the format consists of the following three components:
• Sign: 1 bit
• Exponent: 11 bits
• Significand precision: 53 bits, of which 52 are explicitly stored.
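Example:
The exact value of a normal 64-bit double-precision number is given by
$(-1)^{\text{sign}} \times (1.b_{51}b_{50} \ldots b_0)_2 \times 2^{e-1023}$
where sign is the sign bit, e is the 11-bit stored (biased) exponent, and b51 ... b0 are
the 52 fraction bits. The number 1 has sign 0, stored exponent 1023 and an all-zero fraction,
so it is represented as:
0 01111111111 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000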
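The value formula can be evaluated directly as a sanity check. The sketch below is one possible way to do this and assumes normal numbers only; decode_binary64 and value_from_fields are hypothetical names, and exact rational arithmetic from the fractions module is used to avoid rounding in the check itself.

```python
import struct
from fractions import Fraction

def decode_binary64(x):
    """Split a Python float (binary64) into sign, stored exponent and fraction."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                   # 1 sign bit
    e = (bits >> 52) & 0x7FF            # 11 exponent bits, bias 1023
    fraction = bits & ((1 << 52) - 1)   # 52 explicitly stored fraction bits
    return sign, e, fraction

def value_from_fields(sign, e, fraction):
    """Evaluate (-1)^sign * (1.b51...b0)_2 * 2^(e-1023); valid for normal numbers only."""
    significand = 1 + Fraction(fraction, 2**52)   # implicit leading 1 plus fraction
    return (-1) ** sign * significand * Fraction(2) ** (e - 1023)

print(decode_binary64(1.0))
print(value_from_fields(*decode_binary64(-0.15625)))
```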
Difference between Floating point number and fixed point representation
A fixed-point representation of a number has three fields: the sign bit, the integer
field and the fractional field. In a 32-bit layout the sign is 1 bit, the integer field is
15 bits and the fractional field is 16 bits, so the position of the binary point never
changes. In the floating-point representation the integer field is replaced by an exponent
field of either 8 bits (single precision) or 11 bits (double precision), and the remaining
bits hold the fractional part in both formats (Lindstrom, Lloyd & Hittinger, 2018). As a
result, the fixed-point representation covers a narrower range of numbers with uniform
spacing, whereas the floating-point representation covers a much wider range.
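The difference in range can be made concrete with a small sketch. It assumes a sign-magnitude fixed-point layout with 15 integer bits and 16 fraction bits, as described above; the names to_fixed, from_fixed, FIXED_MAX and FLOAT32_MAX are illustrative and not taken from any library.

```python
FRACTION_BITS = 16
SCALE = 1 << FRACTION_BITS      # 2^16 steps per unit
MAX_RAW = (1 << 31) - 1         # 15 integer bits + 16 fraction bits, all ones

def to_fixed(x):
    """Encode x as a scaled integer; the sign is carried by the integer's sign."""
    raw = round(x * SCALE)
    if abs(raw) > MAX_RAW:
        raise OverflowError("value does not fit in 15 integer bits")
    return raw

def from_fixed(raw):
    """Decode a scaled integer back to a Python float."""
    return raw / SCALE

FIXED_MAX = from_fixed(MAX_RAW)             # just under 32768
FIXED_STEP = 1 / SCALE                      # uniform spacing between values
FLOAT32_MAX = (2 - 2**-23) * 2.0**127       # largest finite binary32 value

print(FIXED_MAX, FIXED_STEP, FLOAT32_MAX)
```

With the same 32 bits, the fixed-point layout tops out just below 32768 with a constant step of 2^-16, while binary32 reaches roughly 3.4 × 10^38 at the cost of non-uniform spacing.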

Reference
Fulzele, S., & Ghodke, V. (2015). Novel Technique for Parallel Pipeline Double Precision
IEEE-754 Floating Point Adder. International Journal Of Engineering And Computer
Science, 4(06).
Hou, J., Zhu, Y., Shen, Y., Li, M., Wu, H., & Song, H. (2017, December). Tackling Gaps in
Floating-Point Arithmetic: Unum Arithmetic Implementation on FPGA. In High
Performance Computing and Communications; IEEE 15th International Conference
on Smart City; IEEE 3rd International Conference on Data Science and Systems
(HPCC/SmartCity/DSS), 2017 IEEE 19th International Conference on (pp. 615-616).
IEEE.
Kumar, B. V. V., & Basha, S. M. (2016). Design and Simulation of Single-Precision Inexact
Floating-Point Adder/Subtractor. i-Manager's Journal on Electronics
Engineering, 6(4), 7.
Lindstrom, P., Lloyd, S., & Hittinger, J. (2018). Universal Coding of the Reals: Alternatives
to IEEE Floating Point.