Floating point operations with no library

Question

Asked by andrei filip on 20 August 2025 at 08:08

I am looking for an efficient way to do mathematical operations on floating-point values. As I am working in embedded C, I don't want to use any extra library for the float data type.

As far as I understand, the correct way here would be to treat a floating-point value as raw binary (sign, exponent, mantissa) and do the operations on that representation. But I cannot find any examples of how exactly that works.

I am looking for an explanation of how to do the following with no float data type, given a variable int x that can have values from 0 to 10000:

y = x * 0.720 + 84.234;
y = y / 2.5;

Thank you for your time internet

**Clifford** · Answer 1

Floating-point libraries are not required for the example operations you have suggested. While avoiding floating-point code on an embedded system without an FPU is often advisable, doing so by implementing your own floating-point encoding will save you nothing, and will likely be less efficient, less comprehensible and more error-prone than using the compiler's built-in FP support.

Instead, avoid floating-point code entirely and use a fixed-point encoding. In many cases that can be done ad hoc for individual expressions, but if your application is math-intensive (involving trig, logs, sqrt or exponentiation, for example) you might choose a fixed-point library or implement your own.

Floating-point dependency is trivially eradicated in the examples you have suggested; for example:

// y = x * 0.720 + 84.234
// Where x_x1000 = real value * 1000
int y_x1000 = (x_x1000 * 720) / 1000 + 84234 ;

or more efficiently using binary-fixed-point and a 10 bit fractional part:

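// y = x * 0.720 + 84.234
// Where x_q10 = real value * 1024
// 737 = round(0.720 * 1024), 86256 = round(84.234 * 1024)
// The shift must be parenthesised: + binds tighter than >>
int32_t y_q10 = ((x_q10 * 737) >> 10) + 86256 ;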

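Note that for x up to 10000 the intermediate product in both examples exceeds the range of a 32-bit integer, so you might consider int64_t for greater numeric range - in which case you might also use more fractional bits for greater precision too.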
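As a rough sketch of how both encodings might be carried through the second step of the question (y = y / 2.5), the functions below use dividing by 2.5 as multiplying by 2 and dividing by 5. The function names are made up for illustration, the range checks assume x stays within 0 to 10000 as stated in the question, and results are truncated rather than rounded.

```c
#include <assert.h>
#include <stdint.h>

/* Decimal scaling: takes an unscaled integer x (0..10000) and
 * returns the final y multiplied by 1000. Hypothetical helper name. */
static int32_t y_x1000_from_x(int32_t x)
{
    assert(x >= 0 && x <= 10000);
    /* x * 0.720 + 84.234, scaled by 1000; at most 7284234, fits in int32_t. */
    int32_t y_x1000 = x * 720 + 84234;
    /* y / 2.5 is the same as y * 2 / 5; integer division truncates. */
    return (y_x1000 * 2) / 5;
}

/* Binary scaling (Q10): input and result are real value * 1024.
 * Hypothetical helper name. */
static int32_t y_q10_from_x_q10(int32_t x_q10)
{
    assert(x_q10 >= 0 && x_q10 <= INT32_C(10240000));
    /* 737 ~= 0.720 * 1024; widen to 64 bits so the product cannot overflow. */
    int64_t product = (int64_t)x_q10 * 737;
    /* 86256 ~= 84.234 * 1024; the input is non-negative, so >> is well defined. */
    int32_t y_q10 = (int32_t)(product >> 10) + 86256;
    /* y / 2.5 as y * 2 / 5 again. */
    return (y_q10 * 2) / 5;
}
```

With x = 10000 the decimal version yields 2913693 (that is, 2913.693), while the Q10 version yields 2982502, roughly 2912.6 after dividing by 1024. The difference comes from rounding 0.720 * 1024 = 737.28 down to 737, which is the kind of error that more fractional bits would reduce.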

If you are doing a lot of intensive fixed-point maths, you would do well to consider a library, or implement one using CORDIC algorithms. An example of such a library can be found at https://www.justsoftwaresolutions.co.uk/news/optimizing-applications-with-fixed-point-arithmetic.html, although it is C++. The clear advantage of C++ here is that, by defining a fixed class with extensive operator overloading, existing floating-point code can largely be converted to fixed point by replacing the double or float keywords with fixed and compiling as C++, even if the code is otherwise non-OOP and entirely C-like.
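If you stay in plain C and implement your own, the core of such a library might look something like the sketch below. The Q16.16 format, the q16_t type and the q16_* function names are assumptions chosen for illustration and are not taken from the linked library. Division is used instead of a right shift so that negative values behave predictably (truncation toward zero).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical Q16.16 type: 16 integer bits, 16 fractional bits. */
typedef int32_t q16_t;
#define Q16_ONE INT32_C(65536)

/* Convert an integer in roughly -32768..32767 to Q16.16. */
static q16_t q16_from_int(int32_t n)
{
    return (q16_t)(n * Q16_ONE);
}

/* Multiply: the raw product carries 32 fractional bits, so scale back by 2^16. */
static q16_t q16_mul(q16_t a, q16_t b)
{
    return (q16_t)(((int64_t)a * b) / Q16_ONE);
}

/* Divide: pre-scale the dividend by 2^16 so the quotient keeps 16 fractional bits. */
static q16_t q16_div(q16_t a, q16_t b)
{
    assert(b != 0);
    return (q16_t)(((int64_t)a * Q16_ONE) / b);
}
```

Under these assumptions, the expression from the question could be written as q16_div(q16_mul(q16_from_int(x), 47186) + 5520359, 163840), where the three constants are 0.720, 84.234 and 2.5 each multiplied by 65536 and rounded.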
