tl;dr: Posting some performance results here for posterity.
I did some tests with array copy functions, with the intention of seeing how much faster the CMSIS functions are compared to plain old memcpy. It turns out: not at all.
Actually I wanted to use the same optimisation method but adapt it for use in our SimpleArray template. So I replaced calls to e.g. arm_copy_f32 with this:
/**
 * Optimised array copy for datatype T.
 * Copy four at a time to minimise loop overheads and allow SIMD optimisations.
 */
static void copy(T* dst, T* src, size_t len){
  size_t blocks = len >> 2u;
  T a, b, c, d;
  while(blocks){
    a = *src++;
    b = *src++;
    c = *src++;
    d = *src++;
    *dst++ = a;
    *dst++ = b;
    *dst++ = c;
    *dst++ = d;
    blocks--;
  }
  blocks = len & 0x3;
  while(blocks){
    *dst++ = *src++;
    blocks--;
  }
}
It turns out this is actually slightly faster than the old code which called arm_copy_f32, probably due to inlining. But it is way slower than memcpy!
My results, testing with a 12k FloatArray on a Lich.
FloatArray::create(1024*12);
CPU: 51% Memory: 98920 memcpy
CPU: 71% Memory: 98920 SimpleArray::copy
CPU: 82% Memory: 98920 master branch: arm_copy_f32
This is the time it takes to do a copyFrom() and copyTo() each 64-sample process block.
I also thought I’d compare it to a simpler elementwise copy:
static void copy(T* dst, T* src, size_t len){
  while(len--)
    *dst++ = *src++;
}
This overran the available CPU time, so I had to reduce the array size.
FloatArray::create(1024*8);
CPU: 86% Memory: 66152 elementwise copy
I guess we can surmise that it is at least 50% slower than the ‘optimised’ version.
I was quite surprised by this, as the nanolib memcpy that we use is not supposed to be lightning fast. I thought at least with smaller object sizes the optimised version would show an improvement, but nope.
ShortArray::create(1024*12);
CPU: 27% Memory: 49768 memcpy
CPU: 71% Memory: 49768 SimpleArray::copy
Same with bigger ones (a ComplexFloat is the size of two floats, again I had to reduce the array size).
ComplexFloatArray::create(1024*6);
CPU: 51% Memory: 98920 memcpy
CPU: 60% Memory: 98920 SimpleArray::copy
I also tested with odd array sizes (i.e. not divisible by four), but it didn't make any big difference to the results.
However, all of these results are with memory allocated in internal SRAM. In order to test using external SDRAM, I allocated bigger arrays and processed a smaller subarray.
FloatArray::create(1024*256).subArray(0, 1024*2)
CPU: 75% Memory: 2097768 memcpy
CPU: 39% Memory: 2097768 SimpleArray::copy
ShortArray::create(1024*256).subArray(0, 1024*2)
CPU: 39% Memory: 1049192 memcpy
CPU: 32% Memory: 1049192 SimpleArray::copy
ComplexFloatArray::create(1024*256).subArray(0, 1024*1)
CPU: 60% Memory: 4194920 memcpy
CPU: 31% Memory: 4194920 SimpleArray::copy
So, conclusions? For FloatArray, which is by far the most used, arrays allocated in internal SRAM perform almost 30% better with memcpy, while on external SDRAM the copy runs almost 50% faster with an 'ARM optimised' method.
It does make you wonder just why memcpy is sometimes faster.
My results are all on a Cortex-M4 with no data caching. It will be interesting to follow up in the future with some testing on an M7.
Meanwhile I’m going to commit a code change so that the custom copy method (which runs faster on external RAM) can be accessed with a static method, i.e. with something like:
FloatArray::copy(dst, src, len);
All other copy methods, copyFrom(), copyTo(), and insert(), will use memcpy instead of arm_copy_xyz.