In this post, we’ll see how to convert a `float` to an `int` in C. We’ll assume that both the `int` and `float` datatypes are 32 bits long, and that `int` uses two’s complement representation for negative values. The conversion will be performed using the bit-level representation of the `float` data type. Only addition, subtraction, and bitwise operations will be used.

To convert from `int` to `float`, you can see this post.

This is also a solution for exercise 2.96 of the Computer Systems: A Programmer’s Perspective book.

## Rounding to whole numbers

Before we begin, let’s study how C handles rounding from `float` to `int`. The rounding mode used is *round towards zero*.

For positive values, we round down: for instance, `2.7` becomes `2`.

Negative values are rounded up: for instance, `-2.7` becomes `-2`.

In practical terms, the fractional part is *truncated*.
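As a quick illustration of round towards zero, here is a minimal sketch. The helper names `to_int_truncating` and `check_truncation` are made up for this example and are not part of the converter developed in this post.

```c
#include <assert.h>

/* Hypothetical helper: the cast itself performs round-towards-zero. */
int to_int_truncating(float f) {
    return (int) f;
}

/* A few in-range values and the results the cast should give. */
void check_truncation(void) {
    assert(to_int_truncating(2.7f) == 2);    /* positive: rounded down */
    assert(to_int_truncating(-2.7f) == -2);  /* negative: rounded up   */
    assert(to_int_truncating(0.9f) == 0);
    assert(to_int_truncating(-0.9f) == 0);
}
```

Both `0.9` and `-0.9` end up as `0`, which is why every value strictly between -1 and 1 can be handled by a single case later on.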

## Getting the floating-point fields

The floating-point single-precision data representation (`float`) consists of three different fields.

These fields are used to encode a number in the form:

$$V = (-1)^s \times M \times 2^E$$

- One bit encodes the sign: `s`.
- 8 bits encode the exponent, *E*. We’ll call these bits `exp` (in green on the image above).
- 23 bits encode the significand, *M*. We’ll call them `frac` bits (in red on the image above).

For our converter, we’ll start by getting these different fields.

```c
unsigned s = f >> 31;
unsigned exp = f >> 23 & 0xFF;
unsigned frac = f & 0x7FFFFF;
```

They are obtained by shift operations combined with masks that select the necessary bits. In the code snippet above, `f` contains the 32-bit `float` representation of the value.
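To make the field extraction concrete, the sketch below applies the same shifts and masks to the value `6.5`. The helpers `bits_of` and `check_fields` are hypothetical names introduced for this example, and the code assumes that `float` and `unsigned` are both 32 bits wide and that `float` follows the IEEE 754 single-precision layout.

```c
#include <assert.h>
#include <string.h>

typedef unsigned float_bits;

/* Hypothetical helper: copy the bytes of a float into an unsigned
 * (assumes both types are 32 bits wide). */
float_bits bits_of(float value) {
    float_bits f;
    memcpy(&f, &value, sizeof f);
    return f;
}

void check_fields(void) {
    /* 6.5 = 1.101 (binary) x 2^2, so exp = 2 + 127 = 129 */
    float_bits f = bits_of(6.5f);
    unsigned s = f >> 31;
    unsigned exp = f >> 23 & 0xFF;
    unsigned frac = f & 0x7FFFFF;

    assert(f == 0x40D00000u);
    assert(s == 0);
    assert(exp == 129);
    assert(frac == 0x500000u);  /* binary 101 followed by 20 zeros */
}
```

The `frac` field holds only the digits after the binary point (`101...`); the leading one of `1.101` is implicit and is not stored.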

## Special cases: NaN and Infinity

Floating-point encodes special values such as `NaN` (not a number, obtained by expressions such as `sqrt(-1)` or `0.0 / 0.0`) and infinity (for values that overflow) with the `exp` field all set to ones. For these special cases, we’ll return the special value `0x80000000` (which corresponds to `INT_MIN`).

```c
if (exp == 0xFF)
    return 0x80000000;
```

A value preceded by `0x`, such as `0xFF`, denotes a hexadecimal number.
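The check above can be tried on a few well-known bit patterns. This is only a sketch: `is_nan_or_inf` and `check_special` are hypothetical names, and the patterns assume the IEEE 754 single-precision layout.

```c
#include <assert.h>

typedef unsigned float_bits;

/* Nonzero when the exp field (bits 23..30) is all ones: NaN or infinity. */
int is_nan_or_inf(float_bits f) {
    unsigned exp = f >> 23 & 0xFF;
    return exp == 0xFF;
}

void check_special(void) {
    assert(is_nan_or_inf(0x7F800000u));   /* +infinity: exp all ones, frac zero */
    assert(is_nan_or_inf(0xFF800000u));   /* -infinity */
    assert(is_nan_or_inf(0x7FC00000u));   /* a NaN: exp all ones, frac nonzero */
    assert(!is_nan_or_inf(0x3F800000u));  /* 1.0 is an ordinary normalized value */
}
```

The `frac` field distinguishes infinity (all zeros) from `NaN` (anything else), but the converter does not need to tell them apart because both return the same value.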

## Denormalized values

Denormalized values are those in which the `exp` field is all zeros. In floating-point representation, they are used to encode values that are very close to zero (both positive and negative). All of these values will be rounded to zero.

```c
if (exp == 0)
    return 0;
```
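As a small sanity check, the sketch below builds the largest denormalized value from its bit pattern and confirms that the cast gives zero. `float_of` and `check_denormals` are hypothetical names, and the code assumes 32-bit IEEE 754 `float` values.

```c
#include <assert.h>
#include <string.h>

typedef unsigned float_bits;

/* Hypothetical helper: reinterpret a 32-bit pattern as a float. */
float float_of(float_bits f) {
    float value;
    memcpy(&value, &f, sizeof value);
    return value;
}

void check_denormals(void) {
    float_bits largest = 0x007FFFFFu;   /* exp = 0, frac all ones */
    float_bits negative = 0x807FFFFFu;  /* same magnitude, sign bit set */

    assert((largest >> 23 & 0xFF) == 0);
    assert(float_of(largest) < 1.0f);
    assert((int) float_of(largest) == 0);
    assert((int) float_of(negative) == 0);
}
```

Even the largest denormalized value is many orders of magnitude below one, so returning `0` for the whole group is safe.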

## Normalized values

Finally, we can focus on normalized values. In this case, the `exp` field is neither all ones nor all zeros. For this group of values, the exponent `E` is encoded as:

$$E = \mathit{exp} - \mathit{Bias}$$

Where *exp* is the unsigned 8-bit number represented by the `exp` field and *Bias = 127*. We can get the exponent *E* as follows:

```c
int E = exp - BIAS;
```
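Here is a short sketch of the exponent computation on three sample bit patterns. The names `unbiased_exponent` and `check_exponent` are hypothetical, and the patterns assume the IEEE 754 single-precision layout.

```c
#include <assert.h>

#define BIAS 127

typedef unsigned float_bits;

/* Exponent E of a normalized value, given its bit pattern. */
int unbiased_exponent(float_bits f) {
    unsigned exp = f >> 23 & 0xFF;
    /* Convert to int before subtracting so that small exp values
     * give a negative E instead of wrapping around as unsigned. */
    return (int) exp - BIAS;
}

void check_exponent(void) {
    assert(unbiased_exponent(0x3F800000u) == 0);   /* 1.0  = 1.0   x 2^0  */
    assert(unbiased_exponent(0x40D00000u) == 2);   /* 6.5  = 1.101 x 2^2  */
    assert(unbiased_exponent(0x3F400000u) == -1);  /* 0.75 = 1.1   x 2^-1 */
}
```

The explicit cast is a design choice of this sketch. Subtracting directly from the `unsigned` value also works on common compilers, but it relies on an implementation-defined conversion when the result is negative.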

### Values less than one

First, let’s consider values that are in the range -1 to 1 (not including -1 and 1). All of these values will be rounded to zero. Positive values will be rounded down, and negative values will be rounded up. In both cases, the result will be 0. These cases will take place when the exponent *E* is less than 0:

```c
if (E < 0)
    return 0;
```

For instance, if `E = -1`, the encoded float will be a binary fraction in the form:

$$1.XXX \times 2^{-1} = 0.1XXX$$

Where `XXX` represents the fractional part of the significand, encoded by the `frac` field. `0.1XXX` is a binary fraction less than one.

### Overflowing

The range of `int` values is between `INT_MIN = -2^31 = -2147483648` and `INT_MAX = 2^31 - 1 = 2147483647`. For `float` values beyond that range, we’ll return `0x80000000` as a way of encoding *overflow*.

The `int` datatype represents the magnitude of a number using 31 bits (the remaining bit is used for the sign). This means that the binary representation of our value can be at most 31 bits long. This limit case will take place when `E = 30`:

$$1.X \ldots X \times 2^{30} = 1X \ldots X0 \ldots 0$$

`X...X` are the 23 bits of the `frac` field. The value `1X...X0...0` will be 31 bits long: a leading one + 23 `frac` bits + 7 zeros.

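The only value with `E > 30` that still fits is -2^31 (`E = 31`, `frac = 0`), which equals `INT_MIN`. Since `INT_MIN` has the bit pattern `0x80000000`, returning the overflow value still gives the correct result. We can therefore treat every `float` with `E > 30` as an overflow: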

```c
if (E > 30)
    return 0x80000000;
```
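The boundary can be checked on the bit patterns of 2^30 and 2^31. This is a sketch with hypothetical names (`exceeds_int_range`, `check_overflow`), assuming the IEEE 754 single-precision layout.

```c
#include <assert.h>

#define BIAS 127

typedef unsigned float_bits;

/* Nonzero when a normalized value has E > 30. */
int exceeds_int_range(float_bits f) {
    int E = (int) (f >> 23 & 0xFF) - BIAS;
    return E > 30;
}

void check_overflow(void) {
    assert(!exceeds_int_range(0x4E800000u));  /* 2^30: E = 30, still fits */
    assert(exceeds_int_range(0x4F000000u));   /* 2^31: E = 31, above INT_MAX */
    assert(exceeds_int_range(0xCF000000u));   /* -2^31: E = 31, equals INT_MIN */
}
```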

## Normalized values in the range of `int`

Finally, we’re left with the `float` values that can be rounded to an `int` other than zero and that won’t overflow.

The `frac` field is 23 bits long. Remember that the exponent *E* and `frac` encode a value in the form:

$$1.XXX \ldots X \times 2^E$$

Where each `X` is one of the 23 binary digits of the `frac` field. They represent the digits that come after the binary point (fractional part).

A positive value of *E* will shift the binary point *E* places to the right. For instance, if `E = 2`:

$$1.XXY \ldots Y \times 2^2 = 1XX.Y \ldots Y$$

Here we’re using the symbol `Y` for the 21 `frac` binary digits that come after the binary point.

As mentioned above, when we cast a `float` to `int`, C will *truncate* the fractional part.

That means that the rightmost `23 - 2 = 21` bits of the `frac` field are discarded.

On the other hand, if *E* is larger than 23, the binary point will be shifted beyond the bits of the `frac` field and we’ll have extra zeros to the right:

$$1.X \ldots X \times 2^E = 1X \ldots X0 \ldots 0$$

We will have `E - 23` trailing zeros.

Both cases can be handled with the appropriate shift operations. We shift `frac` to the right (`>>`) to discard the least significant bits when `E < 23`, and we shift it to the left (`<<`) to add trailing zeros otherwise.

```c
int x = 1 << E;
if (E < 23)
    x |= frac >> (23 - E);
else
    x |= frac << (E - 23);
```

Here `x` is the resulting bit-level `int` representation. First, we shift the leading one `E` places to the left. Then, we get the integer part of the `float` value with the shift operations described above. Note that we use the bitwise OR operator (`|`) as a way of “adding” the leading one and the lower order bits of the `frac` field (e.g. `100 | 001 = 101`).
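To see both branches at work, the sketch below wraps the shifting logic in a hypothetical function, `integer_magnitude`, and checks it on hand-computed values. It assumes `0 <= E <= 30` and a 23-bit `frac`, as established earlier in the post.

```c
#include <assert.h>

/* Integer part of 1.frac x 2^E for 0 <= E <= 30 (sign handled separately). */
int integer_magnitude(int E, unsigned frac) {
    int x = 1 << E;                 /* the implicit leading one */
    if (E < 23)
        x |= frac >> (23 - E);      /* drop the bits after the binary point */
    else
        x |= frac << (E - 23);      /* append E - 23 trailing zeros */
    return x;
}

void check_magnitude(void) {
    /* 6.5 = 1.101 x 2^2: E = 2, frac = 0x500000, integer part 110 (binary) = 6 */
    assert(integer_magnitude(2, 0x500000u) == 6);
    /* 1.1 (binary) x 2^24 = 3 x 2^23: E = 24, frac = 0x400000 */
    assert(integer_magnitude(24, 0x400000u) == 25165824);
    /* 1.0 x 2^0 = 1 */
    assert(integer_magnitude(0, 0) == 1);
}
```

In the first case the right shift keeps only the top two `frac` bits (`10`), which land just below the leading one to give `110`.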

### Negative values

Finally, we modify the bit encoding for negative values.

```c
if (s == 1)
    x = ~x + 1;
```

The expression `~x + 1` is a bitwise way of computing `-x` in two’s complement representation: it inverts every bit and then adds one.
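A minimal sketch of the negation trick, using the hypothetical names `negate` and `check_negate` and assuming a two’s complement `int`:

```c
#include <assert.h>

/* Two's complement negation using only bitwise NOT and addition. */
int negate(int x) {
    return ~x + 1;
}

void check_negate(void) {
    assert(negate(6) == -6);   /* ~6 is -7, and -7 + 1 is -6 */
    assert(negate(1) == -1);
    assert(negate(0) == 0);    /* ~0 is -1, and -1 + 1 is 0 */
    assert(negate(-6) == 6);
}
```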

## Complete program and testing

Putting all of the pieces together in the `float_f2i` function, we get:

```c
#define NAN 0x80000000
#define BIAS 127
#define K 8
#define N 23

typedef unsigned float_bits;

/* Compute (int) f.
 * If conversion causes overflow or f is NaN, return 0x80000000
 */
int float_f2i(float_bits f) {
    unsigned s = f >> (K + N);
    unsigned exp = f >> N & 0xFF;
    unsigned frac = f & 0x7FFFFF;

    /* Denormalized values round to 0 */
    if (exp == 0)
        return 0;

    /* f is NaN or infinity */
    if (exp == 0xFF)
        return NAN;

    /* Normalized values */
    int x;
    int E = exp - BIAS;

    /* Magnitude less than 1, return 0 */
    if (E < 0)
        return 0;

    /* Overflow condition */
    if (E > 30)
        return NAN;

    x = 1 << E;
    if (E < N)
        x |= frac >> (N - E);
    else
        x |= frac << (E - N);

    /* Negative values */
    if (s == 1)
        x = ~x + 1;

    return x;
}
```

In order to handle the bit-level representation of float values with bitwise operators, we use the `unsigned` datatype, which we call `float_bits`. Additionally, we wrote some of the numerical constants as `K = 8` for the length of the `exp` field and `N = 23` for the length of the `frac` field.

### Testing

The program was tested as follows:

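```c
#include <assert.h>
#include <string.h>

typedef unsigned float_bits;

int float_f2i(float_bits f);

int main(void) {
    float_bits bits = 0;
    float f;

    /* Walk through every 32-bit pattern, from 0 to 0xFFFFFFFF */
    do {
        /* Reinterpret the bit pattern as a float */
        memcpy(&f, &bits, sizeof f);
        assert((int) f == float_f2i(bits));
    } while (++bits != 0);

    return 0;
}
```

We used the variable `bits` to generate all the possible bit-level combinations. Each pattern was copied into a `float` with `memcpy`, which reinterprets the bits without pointer casts between incompatible types, and was also passed directly to `float_f2i` as a `float_bits` value. Finally, we verified that the result of casting `float` to `int` was the same as the return value of the `float_f2i` function. Note that the C standard leaves the cast undefined for `NaN`, infinity, and out-of-range values, so this comparison relies on platforms such as x86-64, where the cast typically yields `0x80000000` in those cases.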

When compiled and run, the program produced no assertion failures, which means that the converter worked for all of the tested values.
