OWL DSP library updates

I think you’re oversimplifying this, because:

  1. this ignores details of the ARM EABI, since in this case you’re doubling the number of registers used (before the stack gets utilized). From the “Procedure Call Standard for the ARM Architecture” PDF:
5.5 Parameter Passing
The base standard provides for passing arguments in core registers (r0-r3) and on the stack. For subroutines that take a small number of parameters, only registers are used, greatly reducing the overhead of a call.
  2. GCC can do it on its own when it makes sense:
-fipa-sra

    Perform interprocedural scalar replacement of aggregates, removal of unused parameters and replacement of parameters passed by reference by parameters passed by value.

    Enabled at levels -O2, -O3 and -Os.
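
To illustrate the register trade-off (Span here is just a stand-in for FloatArray, not library code): on a 32-bit ARM target an 8-byte aggregate passed by value takes two of the four argument registers, while a reference takes one but forces the callee to load the fields from memory first.

  #include <cstddef>

  // Hypothetical stand-in for FloatArray: pointer + size, 8 bytes on a
  // 32-bit target.
  struct Span { float* data; size_t size; };

  float sum_by_value(Span s);      // s arrives in r0 (data) and r1 (size)
  float sum_by_ref(const Span& s); // r0 holds &s; the callee must load
                                   // data and size from memory before use

This is the kind of call that -fipa-sra can rewrite on its own for internal functions, without anyone changing the signature by hand.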

Yeah, it’s not required for sync; reset is sufficient. I’ve figured it’s better to make a PR, but I can explain either way:

  1. I’m porting a collection of various BLAMP oscillators (but the same applies to other stuff in OwlProgram - sine, polyBLEP, the naive non-interpolated wavetable that nobody uses ;-))

  2. I’ll add a class that holds a list of base oscillators and uses a parameter to play two of them, interpolated

  3. Only two need to be rendered at once; the remaining ones don’t need to be generated

  4. We still want to stay in phase when we start morphing with another osc.

The phase in the current oscillators is in radians everywhere, but polyBLEP didn’t have a function for that in the code borrowed from stmlib. So I’ll be testing whether it works as expected, but I think it should be doing that already.
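
If the stmlib-derived code keeps a normalized phase in [0, 1), the adapter should be thin; a sketch, with the class and member names assumed:

  #include <cmath>

  // Radians API over a normalized-phase core; 'phase' in [0, 1) is
  // assumed to be the internal representation, as in stmlib.
  class PolyBlepOscillator {
    float phase = 0.0f;
  public:
    void setPhase(float radians){
      phase = radians / (2.0f * (float)M_PI);
    }
    float getPhase(){
      return phase * 2.0f * (float)M_PI;
    }
  };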

It seems like I have to reuse most of the sine oscillator code related to phase processing and signal generation for my BLAMP classes. It really belongs in the base oscillator class, so that in most cases only a different render() implementation would need to be provided (outputting a sample at the current phase). But care must be taken if that is done, i.e. the current code relies on the renderer being called (sin in this case) to wrap the phase on its own:

https://github.com/pingdynasty/OwlProgram/blob/feature/librefactor/LibSource/SineOscillator.h#L46-L54

https://github.com/pingdynasty/OwlProgram/blob/feature/librefactor/LibSource/Oscillator.h#L33
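
Something like this, as a rough sketch (hypothetical names, not the current interface), with the base class wrapping the phase explicitly so render() no longer has to:

  #include <cmath>

  // Phase handling lives in the base class; subclasses only provide
  // render(), which outputs one sample at the given phase.
  class PhaseOscillator {
  protected:
    float phase = 0.0f; // radians, as elsewhere in the library
    float incr = 0.0f;  // phase increment per sample
  public:
    void setFrequency(float freq, float sampleRate){
      incr = 2.0f * (float)M_PI * freq / sampleRate;
    }
    virtual float render(float phase) = 0;
    float getNextSample(){
      float sample = render(phase);
      phase += incr;
      if(phase >= 2.0f * (float)M_PI) // explicit wrap, instead of
        phase -= 2.0f * (float)M_PI;  // relying on the renderer
      return sample;
    }
  };

  // A sine oscillator then reduces to a one-line renderer:
  class SineRenderer : public PhaseOscillator {
  public:
    float render(float phase) override { return sinf(phase); }
  };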

This probably shouldn’t take any parameters

Quite right, thank you!

And that’s why it’s faster! Great research and references, but I’m not sure they prove your point. If you mean that with a reference you’re winning because you only need one register, it doesn’t help that it’s a register, because the data still has to be loaded from wherever it is stored.

In any case the idea with FloatArray is that it replaces passing a float* and size, without adding extra baggage. And for that I think it works fine.

The current state of affairs is:
FloatArray, ShortArray, IntArray, ComplexFloatArray, ComplexShortArray… all inherit from SimpleArray<typename>.
This works well enough; I’ve added unit tests to verify that the object size does not grow.
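
The check is essentially a static assertion along these lines (simplified sketch; the actual test code may differ):

  #include <cstddef>
  #include "FloatArray.h"

  // SimpleArray<T> should stay a bare pointer + size pair; an accidental
  // extra member (or vtable pointer) makes this fail at compile time.
  static_assert(sizeof(FloatArray) == sizeof(float*) + sizeof(size_t),
                "FloatArray must not grow beyond pointer + size");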

BiquadFilter and StateVariableFilter both have a multi-channel version which can be used very flexibly:

  MultiStateVariableFilter* filter;
  StateVariableFilterPatch(){
    filter = MultiStateVariableFilter::create(getSampleRate(), getNumberOfChannels());
  }
  void processAudio(AudioBuffer &buffer){
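    // fc and q are assumed to be set elsewhere, e.g. from patch parameters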
    filter->setLowPass(fc, q);
    filter->process(buffer, buffer);
  }

This will work with both stereo and multichannel firmwares.

And yes I think you’re right, the PR is pretty much ready to merge.

Sounds great!

tl;dr: Posting some performance results here for posterity.

I did some tests with array copy functions with the intention to see how much faster the CMSIS functions are compared to plain old memcpy. It turns out: not at all.

Actually I wanted to use the same optimisation method but adapt it for use in our SimpleArray template. So I replaced calls to e.g. arm_copy_f32 with this:

  /**
   * Optimised array copy for datatype T.
   * Copy four at a time to minimise loop overheads and allow SIMD optimisations.
   */
  static void copy(T* dst, T* src, size_t len){
    size_t blocks = len >> 2u;
    T a, b, c, d;
    while(blocks){
      a = *src++;
      b = *src++;
      c = *src++;
      d = *src++;
      *dst++ = a;
      *dst++ = b;
      *dst++ = c;
      *dst++ = d;
      blocks--;
    }
    blocks = len & 0x3;
    while(blocks){
      *dst++ = *src++;
      blocks--;
    }
  }

It turns out this is actually slightly faster than the old code which called arm_copy_f32, probably due to inlining. But it is way slower than a memcpy!

My results, testing with a 12k FloatArray on a Lich.

FloatArray::create(1024*12);
CPU: 51% Memory: 98920 memcpy
CPU: 71% Memory: 98920 SimpleArray::copy
CPU: 82% Memory: 98920 master branch: arm_copy_f32

This is the time it takes to do a copyFrom() and a copyTo() in each 64-sample process block.
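
For reference, the test patch looks roughly like this (a sketch; the two 12k arrays are consistent with the memory figure above):

  class CopyBenchPatch : public Patch {
    FloatArray a, b;
  public:
    CopyBenchPatch(){
      a = FloatArray::create(1024*12);
      b = FloatArray::create(1024*12);
    }
    void processAudio(AudioBuffer &buffer){
      a.copyFrom(b); // one copy in each direction per block;
      a.copyTo(b);   // CPU load is read off the device stats
    }
  };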

I also thought I’d compare it to a simpler elementwise copy:

  static void copy(T* dst, T* src, size_t len){
    while(len--)
      *dst++ = *src++;
  }

This blew up (maxed out the CPU), so I had to reduce the array size.

FloatArray::create(1024*8);
CPU: 86% Memory: 66152 elementwise copy

I guess we can surmise that it is at least 50% slower than the ‘optimised’ version.

I was quite surprised by this, as the nanolib memcpy that we use is not supposed to be lightning fast. I thought at least with smaller object sizes the optimised version would show an improvement, but nope.

ShortArray::create(1024*12);
CPU: 27% Memory: 49768 memcpy
CPU: 71% Memory: 49768 SimpleArray::copy

Same with bigger ones (a ComplexFloat is the size of two floats; again I had to reduce the array size).

ComplexFloatArray::create(1024*6);
CPU: 51% Memory: 98920 memcpy
CPU: 60% Memory: 98920 SimpleArray::copy

I also tested with odd array sizes (i.e. not divisible by four) but it didn’t make any big difference to the results.

However, all of these results are with memory allocated in internal SRAM. In order to test using external SDRAM, I allocated bigger arrays and processed a smaller subarray.

FloatArray::create(1024*256).subArray(0, 1024*2)
CPU: 75% Memory: 2097768 memcpy
CPU: 39% Memory: 2097768 SimpleArray::copy

ShortArray::create(1024*256).subArray(0, 1024*2)
CPU: 39% Memory: 1049192 memcpy
CPU: 32% Memory: 1049192 SimpleArray::copy

ComplexFloatArray::create(1024*256).subArray(0, 1024*1);
CPU: 60% Memory: 4194920 memcpy
CPU: 31% Memory: 4194920 SimpleArray::copy

So, conclusions? For FloatArray, which is by far the most used, arrays allocated on internal SRAM perform almost 30% better with memcpy, while on external SDRAM it runs almost 50% faster with an ‘ARM optimised’ method.

It does make you wonder just why memcpy is sometimes faster.

My results are all on a Cortex-M4 with no data caching. It will be interesting to follow up in the future with some testing on an M7.

Meanwhile I’m going to commit a code change so that the custom copy method (which runs faster on external RAM) can be accessed with a static method, i.e. with something like:

FloatArray::copy(dst, src, len);

All other copy methods (copyFrom(), copyTo(), and insert()) will use memcpy instead of arm_copy_xyz.

This looks very interesting. But your first tests include only the copy, while the SDRAM batch is create + subArray. It would be more useful to first allocate enough buffers in a for loop to fill SRAM, and then measure the same create call, which would then be guaranteed to use SDRAM.

Other ideas:

  • use 8-element blocks to see if that makes any difference (see the sketch after this list). If this code uses VFP registers for floats, there are 32 available, so we can get higher parallelism.
  • check if the same performance gap is still present between the template and CMSIS versions with -O3 - it might start inlining things more aggressively.
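
The 8-element variant from the first bullet could look like this (an untested sketch, same structure as SimpleArray::copy above, just a wider block so the compiler has more independent loads and stores to schedule):

  static void copy(T* dst, T* src, size_t len){
    size_t blocks = len >> 3u; // 8 elements per block
    T a, b, c, d, e, f, g, h;
    while(blocks){
      a = *src++; b = *src++; c = *src++; d = *src++;
      e = *src++; f = *src++; g = *src++; h = *src++;
      *dst++ = a; *dst++ = b; *dst++ = c; *dst++ = d;
      *dst++ = e; *dst++ = f; *dst++ = g; *dst++ = h;
      blocks--;
    }
    blocks = len & 0x7; // residual 0..7 elements
    while(blocks){
      *dst++ = *src++;
      blocks--;
    }
  }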

And yes, those optimizations might become irrelevant on the M7: caching changes everything, and things become more unpredictable depending on IO access patterns.

No, the timings are done the same:

I.e. create and subArray are called in the patch constructor; the copy methods are called in processAudio().

There are lots more investigations that could be done, and collating the results in a comparable way would certainly be useful (e.g. average operations per element copied). I’d also be quite interested in looking at the nanolib and newlib memcpy implementations to see how they are done, to work out why the results come out this way. But that’s all for another day.

Why save it for another day? GCC-ARM has this in the newlib sources:

#include <_ansi.h>
#include <string.h>
#include "local.h"

/* Nonzero if either X or Y is not aligned on a "long" boundary.  */
#define UNALIGNED(X, Y) \
  (((long)X & (sizeof (long) - 1)) | ((long)Y & (sizeof (long) - 1)))

/* How many bytes are copied each iteration of the 4X unrolled loop.  */
#define BIGBLOCKSIZE    (sizeof (long) << 2)

/* How many bytes are copied each iteration of the word copy loop.  */
#define LITTLEBLOCKSIZE (sizeof (long))

/* Threshhold for punting to the byte copier.  */
#define TOO_SMALL(LEN)  ((LEN) < BIGBLOCKSIZE)

void *
__inhibit_loop_to_libcall
memcpy (void *__restrict dst0,
	const void *__restrict src0,
	size_t len0)
{
#if defined(PREFER_SIZE_OVER_SPEED) || defined(__OPTIMIZE_SIZE__)
  char *dst = (char *) dst0;
  char *src = (char *) src0;

  void *save = dst0;

  while (len0--)
    {
      *dst++ = *src++;
    }

  return save;
#else
  char *dst = dst0;
  const char *src = src0;
  long *aligned_dst;
  const long *aligned_src;

  /* If the size is small, or either SRC or DST is unaligned,
     then punt into the byte copy loop.  This should be rare.  */
  if (!TOO_SMALL(len0) && !UNALIGNED (src, dst))
    {
      aligned_dst = (long*)dst;
      aligned_src = (long*)src;

      /* Copy 4X long words at a time if possible.  */
      while (len0 >= BIGBLOCKSIZE)
        {
          *aligned_dst++ = *aligned_src++;
          *aligned_dst++ = *aligned_src++;
          *aligned_dst++ = *aligned_src++;
          *aligned_dst++ = *aligned_src++;
          len0 -= BIGBLOCKSIZE;
        }

      /* Copy one long word at a time if possible.  */
      while (len0 >= LITTLEBLOCKSIZE)
        {
          *aligned_dst++ = *aligned_src++;
          len0 -= LITTLEBLOCKSIZE;
        }

       /* Pick up any residual with a byte copier.  */
      dst = (char*)aligned_dst;
      src = (char*)aligned_src;
    }

  while (len0--)
    *dst++ = *src++;

  return dst0;
#endif /* not PREFER_SIZE_OVER_SPEED */
}

So it doesn’t use temporary vars (they might be optimized away either way), copies with long ints, and aligns to them. I guess the idea is that using double the width cuts the number of copies in half.

Newlib-nano code is not part of the archive for the ARM toolchain sources, but I’ve found this. It seems to differ only in the macros used at the function header.

I’ve run into something that requires making an oscillator that outputs a ComplexFloatArray (it’s a polygonal VCO). While the imaginary axis is not very interesting (it contains just a phase-shifted copy of the real signal), it’s still used for visualizing the VCO output on a display.

So maybe we can have a base oscillator template with array and element type parameters. It would also contain most of the phase/SR related stuff that is defined in SineOscillator at the moment.
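
Something along these lines, as a sketch with hypothetical names:

  #include <cmath>
  #include <cstddef>

  // Sample and array types become template parameters, so the same
  // phase/sample-rate machinery can back both real and complex
  // oscillators.
  template<typename Sample, typename Array>
  class OscillatorTemplate {
  protected:
    float sr = 48000.0f;
    float phase = 0.0f; // radians
    float incr = 0.0f;  // phase increment per sample
  public:
    void setSampleRate(float sampleRate){ sr = sampleRate; }
    void setFrequency(float freq){ incr = 2.0f * (float)M_PI * freq / sr; }
    virtual Sample generate() = 0; // one sample, advancing the phase
    void generate(Array output){
      for(size_t i = 0; i < output.getSize(); ++i)
        output[i] = generate();
    }
  };

The polygonal VCO would then derive from OscillatorTemplate<ComplexFloat, ComplexFloatArray>, and a sine from OscillatorTemplate<float, FloatArray>.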

I think this could also be used for some kind of FFT-driven wavetable generator, and we could have signal processors operating on complex numbers too, before finally running the inverse FFT.

I can make a PR with proposed changes if that would be more convenient.

Also noticed a potential issue here:

Runtime polymorphism doesn’t work in calls made from constructors, so it should probably be written as SineOscillator::setFrequency(freq); to make it more explicit. It would matter only if we inherit from this class, but like I’ve said, it’s better to move the freq/phase processing to the parent Oscillator class.
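
A minimal illustration of the pitfall (generic C++, not library code): during Base’s constructor the dynamic type is still Base, so the virtual call never reaches the derived override.

  #include <cstdio>

  struct Base {
    Base(){ setFrequency(440.0f); } // resolves to Base::setFrequency, always
    virtual void setFrequency(float freq){ printf("Base %g\n", freq); }
  };

  struct Derived : Base {
    void setFrequency(float freq) override { printf("Derived %g\n", freq); }
  };

  int main(){
    Derived d; // prints "Base 440"
    return 0;
  }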