
Converting float to int in C

In this post, we’ll see how to convert a float to an int in C. We’ll assume that both the int and float datatypes are 32 bits long and that int uses two’s complement representation for negative values. The conversion will be performed using the bit-level representation of the float datatype. Only addition, subtraction, and bitwise operations will be used.

To convert from int to float, you can see this post.

This is also a solution for exercise 2.96 of the Computer Systems: A Programmer’s Perspective book.

Rounding to whole numbers

Before we begin, let’s study how C will handle rounding from float to int. The rounding mode used is round towards zero.

For positive values, we round down; for example, (int) 2.7 yields 2.

Negative values are rounded up; for example, (int) -2.7 yields -2.

In practical terms, the fractional part is truncated.
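As a quick check of this rule, here is a minimal sketch. The helper name truncate_to_int is ours and is not part of the C library; it simply wraps the cast.

```c
/* Hypothetical helper: relies on the cast, which rounds toward zero */
int truncate_to_int(float value) {
  return (int) value;
}
```

With this helper, truncate_to_int(2.7f) yields 2 and truncate_to_int(-2.7f) yields -2.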

Getting the floating-point fields

The single-precision floating-point representation (float) consists of three different fields.

These fields are used to encode a number in the form:

V = (-1)^s × M × 2^E

  • One bit encodes the sign: s.
  • 8 bits encode the exponent, E. We’ll call these bits exp (in green on the image above).
  • 23 bits encode the significand, M. We’ll call them frac bits (in red on the image above).

For our converter, we’ll start by getting these different fields.

unsigned s = f >> 31;
unsigned exp = f >> 23 & 0xFF;
unsigned frac = f & 0x7FFFFF;

They are obtained by shifting operations combined with masks to get the necessary bits. In the code snippet above, f contains the 32-bit float representation of the value.
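The snippet above takes the raw bits in f as given. One way to obtain them from a float variable is memcpy. The sketch below is only an illustration: the names float_to_bits, float_fields, and split_fields are hypothetical, and it assumes that float and unsigned are both 32 bits long.

```c
#include <string.h>

typedef unsigned float_bits;

/* Hypothetical container for the three fields */
struct float_fields {
  unsigned s;    /* sign bit */
  unsigned exp;  /* 8-bit biased exponent */
  unsigned frac; /* 23-bit fraction */
};

/* Copy the bits of a float into an unsigned (assumes both are 32 bits) */
float_bits float_to_bits(float value) {
  float_bits bits;
  memcpy(&bits, &value, sizeof bits);
  return bits;
}

/* Split the bit pattern with the same shifts and masks as above */
struct float_fields split_fields(float_bits f) {
  struct float_fields fields;
  fields.s = f >> 31;
  fields.exp = f >> 23 & 0xFF;
  fields.frac = f & 0x7FFFFF;
  return fields;
}
```

For example, -6.5f has the bit pattern 0xC0D00000, which splits into s = 1, exp = 129, and frac = 0x500000.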

Special cases: NaN and Infinity

Floating-point encodes special values such as NaN (not a number, obtained by expressions such as sqrt(-1) or 0.0 / 0.0) and infinity (for values that overflow) with the exp field set to all ones. For these special cases, we’ll return the special value 0x80000000 (which corresponds to INT_MIN).

if (exp == 0xFF)
  return 0x80000000;

A value preceded by 0x, such as 0xFF, denotes a hexadecimal number.
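Within this group, the frac field tells the two cases apart: it is zero for infinity and nonzero for NaN. The converter does not need the distinction, but a small sketch may make the encoding clearer. The function names is_special and is_nan_bits are hypothetical.

```c
typedef unsigned float_bits;

/* 1 if the exp field is all ones (NaN or infinity), 0 otherwise */
int is_special(float_bits f) {
  return (f >> 23 & 0xFF) == 0xFF;
}

/* 1 only for NaN: exp all ones and a nonzero frac field */
int is_nan_bits(float_bits f) {
  return is_special(f) && (f & 0x7FFFFF) != 0;
}
```

For 0x7F800000 (positive infinity), is_special returns 1 and is_nan_bits returns 0. For 0x7FC00000 (a NaN), both return 1.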

Denormalized values

Denormalized values are those in which the exp field is all zeros. In floating-point representation, they are used to encode values that are very close to zero (both positive and negative). All of these values will be rounded to zero.

if (exp == 0)
    return 0;
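To see why returning 0 is safe, consider the largest denormalized value, whose bit pattern is 0x007FFFFF. The sketch below uses a hypothetical helper, bits_to_float, and assumes a 32-bit float.

```c
#include <string.h>

typedef unsigned float_bits;

/* Reinterpret a 32-bit pattern as a float (assumes 32-bit float) */
float bits_to_float(float_bits bits) {
  float value;
  memcpy(&value, &bits, sizeof value);
  return value;
}
```

bits_to_float(0x007FFFFF) is slightly below 2^-126 (about 1.18e-38), so casting it to int gives 0.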

Normalized values

Finally, we can focus on normalized values. In this case, the exp field is neither all ones nor all zeros. For this group of values, the exponent E is encoded as:

E = exp - Bias

where exp is the unsigned 8-bit number represented by the exp field and Bias = 127. We can get the exponent E as follows:

int E = exp - BIAS;
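Since exp is unsigned, this subtraction is performed in unsigned arithmetic and the result is then converted to int. A slightly more explicit sketch converts exp to int first. The function name unbiased_exponent is an assumption made for illustration.

```c
#define BIAS 127

typedef unsigned float_bits;

/* Return E = exp - Bias for a normalized bit pattern */
int unbiased_exponent(float_bits f) {
  int exp = (int) (f >> 23 & 0xFF); /* always in the range 0..255 */
  return exp - BIAS;
}
```

For example, 1.0f (0x3F800000) gives E = 0, 8.0f (0x41000000) gives E = 3, and 0.5f (0x3F000000) gives E = -1.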

Values less than one

First, let’s consider values that are in the range -1 to 1 (not including -1 and 1). All of these values will be rounded to zero. Positive values will be rounded down, and negative values will be rounded up. In both cases, the result will be 0. These cases will take place when the exponent E is less than 0:

if (E < 0)
  return 0;

For instance, if E = -1, the encoded float will be a binary fraction in the form:

1.XXX × 2^-1 = 0.1XXX

where XXX represents the fractional part of the significand, encoded by the frac field. 0.1XXX is a binary fraction less than one.

Overflowing

The range of int values is between INT_MIN = -2^31 = -2147483648 and INT_MAX = 2^31 - 1 = 2147483647. For float values beyond that range, we’ll return 0x80000000 as a way of encoding overflow.

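The int datatype represents magnitudes using 31 bits (the remaining bit is used for the sign). This means that the binary representation of our value can be at most 31 bits long. This limit case will take place when E = 30:

1.X...X × 2^30 = 1X...X0...0

X...X are the 23 bits of the frac field. The value 1X...X0...0 will be 31 bits long: a leading one + 23 frac bits + 7 zeros.

We can conclude that float will overflow when E > 30. The one exception is -2^31 (negative sign, E = 31, frac = 0), which equals INT_MIN; since the overflow value 0x80000000 is also the bit pattern of INT_MIN, the result is still correct: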

if (E > 30)
  return 0x80000000;
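The boundary can be checked with two bit patterns: 0x4EFFFFFF is the largest float below 2^31 (its value is 2147483520), and 0x4F000000 is exactly 2^31. A minimal sketch, using the hypothetical name exceeds_int_range and assuming a normalized input:

```c
#define BIAS 127

typedef unsigned float_bits;

/* 1 if a normalized pattern has E > 30, i.e. a magnitude of at least 2^31 */
int exceeds_int_range(float_bits f) {
  int E = (int) (f >> 23 & 0xFF) - BIAS;
  return E > 30;
}
```

exceeds_int_range(0x4EFFFFFF) returns 0 because E = 30, while exceeds_int_range(0x4F000000) returns 1 because E = 31.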

Normalized values in the range of int

Finally, we’re left with the float values that can be rounded to an int other than zero and that won’t overflow.

The frac field is 23 bits long. Remember that the exponent E and frac encode a value in the form:

1.XXX...X × 2^E

where each X is one of the 23 binary digits of the frac field. They represent the digits that come after the binary point (fractional part).

A positive value of E will shift the binary point E places to the right. For instance, if E = 2:

1.XXY...Y × 2^2 = 1XX.Y...Y

Here we’re using the symbol Y for the 21 frac binary digits that come after the binary point.

As mentioned above, when we cast a float to int, C will truncate the fractional part:

1XX.Y...Y → 1XX

That means that the rightmost 23 - 2 = 21 bits of the frac field are discarded.

On the other hand, if E is larger than 23, the binary point will be shifted beyond the bits of the frac field and we’ll have extra zeros to the right:

1.X...X × 2^E = 1X...X0...0

We will have E - 23 trailing zeros.

These conditions can be fulfilled by the appropriate shifting operations. We shift frac to the right (>>) in order to discard the least significant bits when E < 23, and we shift to the left (<<) to add trailing zeros otherwise.

int x = 1 << E;
if (E < 23)
  x |= frac >> (23 - E);
else
  x |= frac << (E - 23);

Here x is the resulting bit-level int representation. First, we shift the leading one E places to the left. Then, we get the integer part of the float value with the shifts described above. Note that we use the bitwise OR operator (|) as a way of “adding” the leading one and the lower-order bits of the frac field (e.g. 100 | 001 = 101).
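As a worked example, take 6.5 (binary 110.1), which has E = 2 and frac = 0x500000. The sketch below wraps the snippet in a hypothetical function, magnitude_from_fields, and assumes 0 <= E <= 30.

```c
/* Integer part of 1.frac × 2^E, assuming 0 <= E <= 30 */
int magnitude_from_fields(int E, unsigned frac) {
  int x = 1 << E;            /* leading one */
  if (E < 23)
    x |= frac >> (23 - E);   /* drop the 23 - E fractional bits */
  else
    x |= frac << (E - 23);   /* append E - 23 trailing zeros */
  return x;
}
```

magnitude_from_fields(2, 0x500000) computes 100 | 10 = 110 in binary, which is 6. For the other branch, magnitude_from_fields(24, 1) yields 16777218, that is, 2^24 plus the single frac bit shifted one place to the left.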

Negative values

Finally, we modify the bit encoding for negative values.

if (s == 1)
  x = ~x + 1;

The expression ~x + 1 is a bitwise negation followed by an increment, which yields -x in two’s complement representation.
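A quick sketch of this identity, using a hypothetical negate function. In the converter, x is at most 2^31 - 128, so the negated result always fits in an int.

```c
/* Two's complement negation without the unary minus operator.
 * Assumes x is not INT_MIN, whose negation does not fit in an int. */
int negate(int x) {
  return ~x + 1;
}
```

For instance, negate(6) flips 0...0110 into 1...1001 and adds one, giving 1...1010, which is -6.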

Complete program and testing

Putting all of the pieces together in the float_f2i function, we get:

#define NAN 0x80000000
#define BIAS 127
#define K 8
#define N 23

typedef unsigned float_bits;

/* Compute (int) f.
 * If conversion causes overflow or f is NaN, return 0x80000000
 */
int float_f2i(float_bits f) {
  unsigned s = f >> (K + N);
  unsigned exp = f >> N & 0xFF;
  unsigned frac = f & 0x7FFFFF;
  
  /* Denormalized values round to 0 */
  if (exp == 0)
    return 0;
  /* f is NaN or infinity */
  if (exp == 0xFF)
    return NAN;
  /* Normalized values */
  int x;
  int E = exp - BIAS;
  /* Normalized value with magnitude less than 1, return 0 */
  if (E < 0)
    return 0;
  /* Overflow condition */
  if (E > 30)
    return NAN;
  x = 1 << E;
  if (E < N)
    x |= frac >> (N - E);
  else
    x |= frac << (E - N);

  /* Negative values */
  if (s == 1)
    x = ~x + 1;

  return x;  
}

In order to handle the bit-level representation of float values with bitwise operators, we use the unsigned datatype, which we call float_bits. Additionally, we named some of the numerical constants: K = 8 for the length of the exp field and N = 23 for the length of the frac field.

Testing

The program was tested as follows:

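#include <assert.h>
#include <limits.h>
#include <string.h>

typedef unsigned float_bits;

int float_f2i(float_bits f);

int main(void) {
  float_bits bits = 0;
  float f;

  /* Visit every 32-bit pattern exactly once, including 0xFFFFFFFF */
  do {
    /* Copy the bits into a float; memcpy avoids pointer-type punning */
    memcpy(&f, &bits, sizeof f);
    /* (int) f is undefined in C for NaN, infinity and out-of-range
     * values; this test relies on the platform returning 0x80000000 */
    assert((int) f == float_f2i(bits));
  } while (bits++ != UINT_MAX);

  return 0;
}

We used the variable bits to generate all the possible bit-level combinations. Each pattern was copied into a float with memcpy, and we verified that the result of casting that float to int was the same as the return value of the float_f2i function. Strictly speaking, C leaves the cast undefined for NaN, infinity, and out-of-range values, so the comparison for those patterns relies on the platform’s behavior.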

When we compiled and ran the program, we received no error output, which means that the converter worked for all of the tested values.
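A complementary check that stays within defined behavior is to compare a handful of known values instead of every bit pattern. The sketch below repeats the logic of float_f2i in compact form so that it compiles on its own. The helper convert and the chosen values are our own assumptions, not part of the exercise.

```c
#include <string.h>

typedef unsigned float_bits;

/* Same logic as the float_f2i function above, repeated for self-containment */
int float_f2i(float_bits f) {
  unsigned s = f >> 31;
  unsigned exp = f >> 23 & 0xFF;
  unsigned frac = f & 0x7FFFFF;
  int E = (int) exp - 127;
  int x;

  if (exp == 0 || E < 0)      /* denormalized or magnitude below 1 */
    return 0;
  if (exp == 0xFF || E > 30)  /* NaN, infinity or overflow */
    return (int) 0x80000000u;
  x = 1 << E;
  if (E < 23)
    x |= frac >> (23 - E);
  else
    x |= frac << (E - 23);
  return s ? ~x + 1 : x;
}

/* Hypothetical helper: run the converter on a float value */
int convert(float value) {
  float_bits bits;
  memcpy(&bits, &value, sizeof bits);
  return float_f2i(bits);
}
```

With these definitions, convert(6.5f) is 6, convert(-6.5f) is -6, convert(0.75f) is 0, and convert(16777218.0f) is 16777218.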
