Fast Math Firmware

This is the continuation of an issue that has been discussed in the Flashstorage branch topic.

For some time now (too long!) we’ve been working on implementing fast, accurate lookup-based approximations of exp, pow, log and related functions. The problem has been balancing speed, size, accuracy, portability and overheads. And ease of use. And it has to work for all types of patches - C++, FAUST, Gen, Pd - and all targets: device, web, VST. Phew.

Minimising overheads means that, ideally, if your patch doesn’t use these functions it shouldn’t have to carry any extra luggage, either in code or data.

Portability here means that when we deploy a solution, new patches should still run on older firmware versions, and old patches should run on the new firmware.

I think I’ve come up with a solution that fits the bill.

The implementation is based on http://www.hxa.name/articles/content/fast-pow-adjustable_hxa7241_2007.html
It involves lookup tables where higher precision means better accuracy and (exponentially!) bigger tables: each extra bit of index precision doubles the table size.

What we can do is generate tables of a decent size to put in the next version of the firmware. The patch will be able to request these tables, and if they’re not available it will itself supply some smaller, default tables. And if the fast math functions aren’t used, then the default tables don’t get compiled into the patch at all.
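
In rough terms, the table selection could look something like this (an illustrative sketch with made-up names, not the actual OwlProgram API): at start-up the patch asks for the firmware’s tables, and if nothing is reported it falls back to the smaller default table compiled into the patch, which only gets linked in when the fast functions are actually used.

    #include <stddef.h>

    #define DEFAULT_LOG_PRECISION 10                  /* small fallback: 2^10 entries */
    static const float default_log_table[1 << DEFAULT_LOG_PRECISION] = {
        0.0f /* ... generated at build time ... */
    };

    static const float* log_table;
    static unsigned int log_precision;

    /* Called once at patch start-up with whatever tables the firmware reports. */
    void fast_log_select_table(const float* firmware_table, unsigned int firmware_precision) {
        if (firmware_table != NULL) {          /* new firmware: large, accurate table */
            log_table = firmware_table;
            log_precision = firmware_precision;
        } else {                               /* old firmware: the patch's own default table */
            log_table = default_log_table;
            log_precision = DEFAULT_LOG_PRECISION;
        }
    }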

So with new patches on new firmware you get speed and accuracy. With new patches on old firmware you get speed but reduced accuracy.

I think that initially the patches should be compiled with fast math functions as an ‘optional extra’, i.e. they have to be explicitly called from C++ (with e.g. fast_expf(x) instead of expf(x)). Once most people have migrated to the new firmware version, we can make this the default, with the option of de-selecting the fast maths in the patch.

What’yall think?

Some stats pulled from the device:

SpeedTestPow
with std::powf Device stats: CPU: 9365% Memory: 1096 bin size: 9972
with fast_powf Device stats: CPU: 612% Memory: 1096 bin size: 6292
fast_powf uses about 6.5% of the CPU load of std::powf, and the program is 3680 bytes smaller

SpeedTestLog
with std::logf Device stats: CPU: 4955% Memory: 1096 bin size: 6276
with fast_logf Device stats: CPU: 792% Memory: 1096 bin size: 5524
fast_logf uses about 16% of the CPU load of std::logf, and the program is 752 bytes smaller

My technical knowledge of what you’re asking about is unfortunately limited. So I’m reluctant to say much. But I do wanna say thank you for your efforts, and I’m excited to see the end result!

Thanks @cdbailey, good to hear.

The new functions are:

   // fast lookup-based exponentials
   float fast_powf(float x, float y);
   float fast_expf(float x);
   float fast_exp2f(float x);
   float fast_exp10f(float x);

   // fast lookup-based logarithms
   float fast_logf(float x);
   float fast_log2f(float x);
   float fast_log10f(float x);
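
For example, a typical pitch calculation in a C++ patch only needs the function name swapped (illustrative snippet; it assumes the fast_* declarations are pulled in via basicmaths.h, which is where the top-level functions live):

    #include "basicmaths.h"   /* declares the fast_* functions for OwlProgram patches */

    /* MIDI note number to frequency: 440 * 2^((note - 69) / 12).
       Swapping exp2f for fast_exp2f is the only change needed. */
    static float noteToFrequency(float note) {
        return 440.0f * fast_exp2f((note - 69.0f) / 12.0f);
    }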

The firmware is now released and the implementation has been merged into OwlProgram (master branch) and deployed to the live server.

If you are writing C++ patches you can try out these functions right away, using the live server (or master branch). If not, then you can either try out the fastpow branch of OwlProgram or wait until later in the week when we’ll deploy the last piece, which will make fast math the default across the board. And yes, we’ll have the same exponentials and logarithms for Gen and Pd, which should make a fairer comparison possible!

I’ve pushed the changes to the live server so fast math is now the default. Patches that make heavy use of exponentials and/or logarithms should see a big performance improvement, whether C++, FAUST, Pd or Gen.

For example, the simple Compressor patch was running at 48% CPU; now it is at 21%.

Go fast, go exponential, and have fun!

Very elegant solution indeed!
Can’t wait to try the new firmware.

Well, just to say this: for me it matters most that those lookup tables deliver proper pitches, when using for instance a fast exp to calculate semitones, quartertones etc. - so generally, as long as the math is musically consistent, I am supporting your move :slight_smile:

Yes, the measured error is less than 0.0247% for log and less than 0.0378% for pow. For pitch purposes, a 0.0378% error in a frequency ratio works out to roughly 1200·log2(1.000378) ≈ 0.65 cents.

Do you have a patch that we can do a side-by-side subjective test with to hear any pitch differences?

Hi Martin,

This looks extremely interesting! On the Faust side, we would be interested in testing these fast math functions in the context of WebAssembly generation (see this post http://faust.grame.fr/news/2017/09/15/backend-benchmarks.html)

Where is the source code for the fast_powf function and the like? (I could not locate it on GitHub…) Thanks.

It’s a bit spread out partly because the tables are split between static (firmware) and dynamic (patch) code.

In OwlProgram/Tools you’ve got two simple utilities that generate tables.
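
For example, a generator of that kind could look roughly like this (an illustrative sketch of an ICSI-style natural-log table, not the actual Tools code): for precision n the table has 2^n entries, and entry i holds ln(1 + i/2^n).

    #include <stdio.h>
    #include <math.h>

    /* Emit a C array of 2^precision natural-log values; entry i is ln(1 + i/2^precision). */
    int main(void) {
        const unsigned precision = 16;                /* illustrative size: 65536 entries */
        const unsigned size = 1u << precision;
        printf("const float log_table[%u] = {\n", size);
        for (unsigned i = 0; i < size; ++i)
            printf("    %.9ef,\n", log(1.0 + (double)i / size));
        printf("};\n");
        return 0;
    }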

In OwlProgram/LibSource you’ll find fastlog.h/.c and fastpow.h/.c which implement the lookups. And in basicmaths.h/.c are the top level functions and their dirty defines.

Or you can go straight to the source:
http://www.hxa.name/articles/content/fast-pow-adjustable_hxa7241_2007.html

Really interesting work in that link, Stephane. I wasn’t aware of WAVM. So it seems that WebAssembly can almost be used as a kind of platform-independent byte code… Crazy! What would you think of making a VST plugin that dynamically loads WebAssembly and runs it in WAVM?

Thanks for the links.

“WebAssembly can almost be used as a kind of platform independent byte code” yes, this was actually part of the initial design, see: http://webassembly.org/docs/non-web/

We could certainly imagine all kinds of use cases for the WAVM runtime. In the context of Faust, we could make the following glue code (WAVM/Source/Programs at faust · sletz/WAVM · GitHub) work in a VST plugin or something similar. This could even be linked with a pure Web Faust service (like the recently developed one, http://faust.grame.fr/editor/) that would compile Faust DSP to wasm, then send the wasm code to a remote Faust WAVM runtime…

I can work on an audio/pitch example next week to check for correct pitches with the new fast functions.

You mentioned the “live server”: have you compiled and deployed the fast math versions with Emscripten and WebAssembly?

Another question: all of these functions use float. Would changing float to double everywhere in the implementation (including the tables) be enough for a double version of all functions?

It seems that the fast_log10f and fast_log2f implementations are incorrect: M_LN10 and M_LN2 should be used:

    float fast_log10f(float x)
    {
        /* log10 (x) equals log (x) / log (10). */
        return icsi_log(x, log_table, log_precision) / M_LN10;
    }

    float fast_log2f(float x)
    {
        /* log2 (x) equals log (x) / log (2). */
        return icsi_log(x, log_table, log_precision) / M_LN2;
    }

And fast_atan2f is not really usable AFAICS; it was too imprecise in some tests I did.

Edit: in response to fast math for web build:
No, only for the ARM builds, since that is really where performance is critical. I wasn’t planning on using them with the emscripten build, though there’s no real reason not to apart from maybe size.

Edit: in response to double precision:
I haven’t tried it. What precision do you need for the lookup tables?

Looking at the icsi_log function, the float argument is split into exponent and mantissa in a way that would have to be changed for double precision:
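
Roughly, the float version does something like this (a sketch of the general ICSI-log idea with an illustrative name, not the actual OwlProgram source): the IEEE-754 bits are reinterpreted as an integer, the 8-bit exponent contributes exponent * ln(2), and the top bits of the 23-bit mantissa index the table.

    #include <stdint.h>
    #include <string.h>

    /* Sketch only: assumes x > 0 and a table where entry i holds ln(1 + i/2^precision). */
    float icsi_log_sketch(float x, const float* lookup_table, uint32_t precision) {
        uint32_t bits;
        memcpy(&bits, &x, sizeof bits);                                  /* reinterpret float as bits */
        const int32_t exponent = (int32_t)((bits >> 23) & 0xff) - 127;   /* unbiased 8-bit exponent */
        const uint32_t index = (bits & 0x007fffff) >> (23 - precision);  /* top mantissa bits */
        return 0.69314718f * (float)exponent + lookup_table[index];      /* e*ln(2) + ln(1.mantissa) */
    }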

I think that since the exponent is not used to index the lookup, you should be able to cover the range of double precision without increasing the table size, keeping the same precision.
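
A hypothetical double variant along those lines might look like the sketch below (untested, illustrative only): the 11-bit exponent handles the wider range, and the table is still indexed by just the top precision bits of the 52-bit mantissa, so it keeps the same number of entries.

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical double version of the same idea: table size stays 2^precision,
       but the table holds doubles and the bit fields change to 11/52. */
    double icsi_log_double_sketch(double x, const double* lookup_table, uint32_t precision) {
        uint64_t bits;
        memcpy(&bits, &x, sizeof bits);
        const int64_t exponent = (int64_t)((bits >> 52) & 0x7ff) - 1023;           /* 11-bit exponent */
        const uint64_t index = (bits & 0x000fffffffffffffULL) >> (52 - precision); /* top mantissa bits */
        return 0.6931471805599453 * (double)exponent + lookup_table[index];
    }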

The powFastLookup function would have to be changed too, but it is less obvious how.
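
For context, here is a rough sketch of the kind of bit packing a fast exp2/pow involves (a simplified illustration of the general idea, not the actual powFastLookup code from the hxa7241 article): the integer part of the exponent is written straight into the float’s exponent field and the fractional part is corrected with a small 2^f table. It is exactly this packing of the exponent field that would need rethinking for the 11-bit exponent and 52-bit mantissa of a double.

    #include <stdint.h>
    #include <string.h>
    #include <math.h>

    #define EXP2_PRECISION 11                          /* correction table of 2^11 entries */
    static float exp2_table[1 << EXP2_PRECISION];

    /* Fill the correction table with 2^f for f in [0, 1). */
    void exp2_table_init(void) {
        for (int i = 0; i < (1 << EXP2_PRECISION); ++i)
            exp2_table[i] = powf(2.0f, (float)i / (1 << EXP2_PRECISION));
    }

    /* Approximate 2^y for moderate y: build 2^floor(y) directly from the
       exponent bits and multiply by the tabulated 2^frac. */
    float fast_exp2_sketch(float y) {
        const int whole = (int)floorf(y);
        const float frac = y - (float)whole;
        const uint32_t index = (uint32_t)(frac * (1 << EXP2_PRECISION));
        uint32_t bits = (uint32_t)(whole + 127) << 23;   /* IEEE-754 exponent field */
        float base;
        memcpy(&base, &bits, sizeof base);
        return base * exp2_table[index];
    }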