What are the approaches to SW floating-point implementation?


I need to be able to use floating-point arithmetic in C under my dev environment (CPU: ~12 MHz Motorola 68000). The standard library is not present: this is bare-bones C, and no, it isn't gcc, due to several other issues.

I tried to get the SoftFloat library to compile, along with one other 68k-specific FP library (whose name escapes me at the moment), but their dependencies cannot be resolved on this platform, mostly because of libc deficiencies. I spent about 8 hours trying to overcome the linking issues before reaching a point where I knew I couldn't get any further.

However, it took a mere half an hour to come up with and implement the following set of functions, which emulate floating-point functionality sufficiently for my needs.

The basic idea is that both the fractional and non-fractional parts are 16-bit integers, so there is no bit manipulation. The non-fractional part has a range of [-32767, 32767] and the fractional part [-0.9999, +0.9999], which gives 4 digits of precision (good enough for my floating-point needs, albeit wasteful).

It looks to me like this could be used to make a faster, smaller (just 2 bytes) alternative version of a float with ranges [-99, +99] and [-0.9, +0.9].
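For what it's worth, one hypothetical way to sketch that 2-byte variant (the names and layout here are my own assumptions, not taken from any library) is to store the whole value in a single int16_t scaled by 100, which covers roughly [-99.99, +99.99] with 2 fractional digits:

```c
#include <assert.h>
#include <stdint.h>

/* hypothetical 2-byte float substitute: value stored as value * 100 */
typedef int16_t tiny_fix;

/* pack a whole part and a 2-digit fractional part, e.g. (-12, 34) -> -12.34 */
tiny_fix tiny_pack(int whole, int frac)
{
    return (tiny_fix)(whole * 100 + (whole < 0 ? -frac : frac));
}

/* addition and subtraction then work directly on the packed values */
tiny_fix tiny_add(tiny_fix a, tiny_fix b) { return (tiny_fix)(a + b); }
```

One caveat with this packing: a value like -0.34 has no sign in the whole part, so the zero-whole negative case would need separate handling.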

The question here is: what techniques, other than IEEE, are there for implementing basic floating-point functionality (+ - * /) on top of fixed-point arithmetic?

Later on I will need some basic trigonometry, but there are lots of resources on the net for that.

  • Since the HW has 2 MB of RAM, I don't really care about saving 2 bytes per soft-float (say, by reserving 9 vs 7 bits in an int). Thus, 4 bytes are good enough.
  • Also, from a brief look at the 68k instruction manual (and the cycle cost of each instruction), I made a few early observations:
    • bit-shifting is slow, and unless performance is of critical importance (not the case here), I'd prefer easy debugging of my soft-float library over 5-cycles-faster code. Besides, since this is C and not 68k ASM, speed is obviously not a critical factor.
    • 8-bit operands are as slow as 16-bit ones (give or take a cycle in most cases), so it does not make much sense to compress floats for the sake of performance.

What improvements / approaches would you propose to implement floating-point in C using fixed-point without any dependency on other library/code?

Perhaps it would be possible to use a different approach and do the operations on frac & non-frac parts at the same time?
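One way to operate on both parts at once, sketched here as an assumption rather than taken from any library, is to keep the whole value in a single 32-bit integer scaled by 10000; the carry between the fractional and integer digits then happens automatically:

```c
#include <assert.h>
#include <stdint.h>

/* hypothetical: 123.1652 is stored as 1231652 (value * 10000) */
typedef int32_t dec_fix;
#define DEC_SCALE 10000L

dec_fix dec_add(dec_fix a, dec_fix b) { return a + b; }  /* carry is automatic */

dec_fix dec_mul(dec_fix a, dec_fix b)
{
    /* widen before multiplying to avoid overflow */
    return (dec_fix)(((int64_t)a * b) / DEC_SCALE);
}

dec_fix dec_div(dec_fix a, dec_fix b)
{
    /* pre-scale the dividend so the fractional digits survive; no b == 0 check */
    return (dec_fix)(((int64_t)a * DEC_SCALE) / b);
}
```

For example, 123.1652 - 18.9752 becomes `dec_add(1231652, -189752)`, giving 1041900, i.e. 104.1900, with no explicit borrow logic at all.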

Here is the code (tested only with a calculator); please ignore the C++-like declarations and initialization in the middle of functions (I will reformat that to C style later):

inline int Pad (int f) // Pad the fractional part to 4 digits (assumes 0 <= f <= 9999)
{
    if (f < 10)        return f * 1000;
    else if (f < 100)  return f * 100;
    else if (f < 1000) return f * 10;
    else               return f;
}

//  We assume fractional parts are padded to full 4 digits 
inline void Add (int & b1, int & f1,  int b2, int f2)
{
    b1 += b2;
    f1 += f2;
    if      (f1 >  9999) { b1++; f1 -= 10000; }
    else if (f1 < -9999) { b1--; f1 += 10000; }
    // f1 is already in 1/10000 units here; re-padding it would corrupt
    // small fractions (e.g. turn .0100 into .1000)
}

inline void Sub (int & b1, int & f1,  int b2, int f2)
{
    // 123.1652 - 18.9752 = 104.1900
    b1 -= b2; // 105
    f1 -= f2; // -8100
    if (f1 < 0) { b1--; f1 += 10000; } // borrow: 104 and 1900
    // f1 is already in 1/10000 units; no re-padding needed
}

    // ToDo: Implement a multiplication by float
inline void Mul (int & b1, int & f1, int num)
{
    // 123.9876 * 251 = 31120.8876
    b1 *= num;                        // 30873
    long q = (long)f1 * num;          // 2478876 (cast first: int may be 16-bit here)
    int add = (int)(q / 10000);       // 247
    b1 += add;                        // 31120
    f1 = (int)(q - (long)add * 10000); // 8876
    // f1 is already in 1/10000 units; no re-padding needed
}
    // ToDo: Implement a division by float
inline void Div (int & b1, int & f1, int num)
{
    // 123.9876 / 25 = 4.959504
    int b2 = b1 / num;                     // 4
    long q = b1 - (b2 * num);              // 23 (the remainder)
    f1 = (int)(((q * 10000) + f1) / num);  // (230000 + 9876) / 25 = 9595
    b1 = b2;
    // f1 is already in 1/10000 units; no re-padding needed
}

There are 2 best solutions below


You are thinking in the wrong base for a simple fixed-point implementation. It is much easier if you use binary for the fractional part as well, e.g. 16 bits for the integer part and 16 bits for the fraction (range [-32768, 32767], precision of 1/2^16, which is much higher than what you have).

The best part is that addition and subtraction are simple: just add or subtract the two numbers whole, with no separation into parts. Multiplication is a little trickier: you have to be aware of overflow, so it helps to do the multiplication in 64 bits, and you also have to shift the result right afterwards (by however many fractional bits you chose).

#include <stdint.h>   /* a freestanding header, available even without libc */

typedef int32_t fixed16;

fixed16 mult_f(fixed16 op1, fixed16 op2)
{
         /* you may need to do something tricky with upper and lower halves if
          * you don't have native 64 bit, but the compiler might do it for us
          * if we are lucky
          */
         int64_t tmp;
         tmp = ((int64_t)op1 * op2) >> 16;  /* widen BEFORE multiplying */

         /* add in error handling for overflow if you wish - this just wraps */
         return (fixed16)tmp;
}

Division is similar.
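A sketch of that division, under the same Q16.16 assumptions (the function name is mine): pre-shift the dividend up by the fractional bit count, then divide.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fixed16;

fixed16 div_f(fixed16 op1, fixed16 op2)
{
    /* shift the dividend up by 16 first, otherwise the fractional bits of
     * the result are lost to integer truncation; no check for op2 == 0 */
    return (fixed16)(((int64_t)op1 << 16) / op2);
}
```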

Someone might already have implemented almost exactly what you need (or something that can be hacked to make it work): it's called libfixmath.


If you decide to use fixed point, the whole number (i.e. both the integer and fractional parts) should be in the same base. Using binary for the integer part and decimal for the fractional part, as above, is suboptimal and slows down the calculation. With binary fixed point you only need to shift by an appropriate amount after each operation, instead of the long adjustments your idea requires. If you want Q16.16, then libfixmath, as dave mentioned above, is a good choice. If you want a different precision or radix-point position, such as Q14.18 or Q19.13, then write your own library or modify an existing one for your own use.
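As a sketch of what such a hand-rolled Qm.n library might look like (Q19.13 chosen arbitrarily; all names here are assumptions):

```c
#include <assert.h>
#include <stdint.h>

/* hypothetical Q19.13: 19 integer bits, 13 fractional bits */
#define FRAC_BITS 13
typedef int32_t q19_13;

q19_13 q_from_int(int n)         { return (q19_13)((int32_t)n << FRAC_BITS); }
q19_13 q_mul(q19_13 a, q19_13 b) { return (q19_13)(((int64_t)a * b) >> FRAC_BITS); }
q19_13 q_div(q19_13 a, q19_13 b) { return (q19_13)(((int64_t)a << FRAC_BITS) / b); }
```

Changing FRAC_BITS moves the radix point; addition and subtraction need no adjustment at all, so only the multiply and divide shift amounts depend on the format.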

See also What's the best way to do fixed-point math?

If you want a larger range then floating point may be the better choice. Write a library to your own requirements: choose a format that is easy to implement and easy to make fast in software. There is no need to follow the IEEE 754 specification (which is only fast with hardware implementations, due to its odd bit counts and the strange position of the exponent bits) unless you intend to exchange data with other devices. For example, take a format of exp.sign.significand: 7 exponent bits, followed by a sign bit, then 24 bits of significand. The exponent doesn't need to be biased, so extracting it takes only an arithmetic shift right by 25, and it comes out sign-extended. But if shifting is slower than a subtraction, then excess-n is better.
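To illustrate that shift (the layout and function name are my assumptions; note that right-shifting a negative signed value is technically implementation-defined in C, although it is an arithmetic shift on typical 68k compilers):

```c
#include <assert.h>
#include <stdint.h>

/* hypothetical layout: bits 31..25 = unbiased exponent, bit 24 = sign,
 * bits 23..0 = significand */
int32_t sf_exponent(int32_t f)
{
    /* arithmetic shift: the 7-bit exponent comes out already sign-extended */
    return f >> 25;
}
```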