Performance of the ARM

Hi Owl developers!

Great product, great idea, I am considering the Owl pedal.

However, I want to get a better understanding of the computational capacity of the ARM, so I have a few questions about the processor load of the published patches, since I have some ideas for interesting, more demanding algorithms:

(1) What is the sampling rate for these patches?
(2) For almost all (if not all) patches, there is a control part of the algorithm (e.g. sliders, knobs, and the related computation of variables which feed into the "DSP core"). How does this relate to the published processor load?

Thanks in advance!

Meanwhile I have analyzed the patches a bit more, and have a few additional questions:

  1. I noticed that the "Freeverb" patch has a relatively high CPU load (89%). Freeverb is not a very heavy algorithm in terms of required computational resources. I could imagine that the CPU is suffering from long round-trip delays induced by delay-line reads from far (external) memory. Is that correct?

  2. A similar question about the "Lorenz Attractor". This is also not a particularly heavy algorithm, and it has a small data memory footprint. Could you explain why this patch consumes 60% of the available 3500 cycles (i.e. 2100 cycles per iteration of the algorithm)?

  3. Is the configured buffer size identical for all the published patches? Is it freely configurable by the programmer? I presume that the associated latency in the audio chain is Tlatency = buffersize / Fsample, is that correct?

  4. I would be interested to see how convolvers map onto the Owl. Did you ever try mapping a large FIR? If the ARM were capable of one tap per cycle (is it?), then the theoretical maximum FIR filter length would be 3500 taps, correct?

  5. How does the "Octave Down" patch sound when it processes a plugged-in guitar signal? I have some difficulty judging the quality of the frequency-tracking part of the algorithm.

  6. I really like the Faust programming interface concept. There is a lot of exciting stuff out there. Any current developments in this field?

Thanks!

Hi horruh,

thanks for reposting your questions here, I think they’re of general interest.

  1. The default patch sampling rate is 48kHz at 24 bits, although the codec is capable of 96kHz (and 32kHz, and 8kHz).

  2. The percentage shows how much of the theoretical maximum (3500 operations per sample) the audio processing block consumes. With the OWL it is possible to come very close to 100% without audio
    drop-outs, thanks to a very efficient bare-bones implementation. There are no other tasks and no scheduling. USB events happen on an interrupt. If you send lots of MIDI messages to the device, it will affect performance (slightly: one or two percent), but if you don't, there's no overhead.
    The ADC (knobs and expression pedal) and the codec are handled by DMA and so incur no processing time.
    Thanks to the ARM Cortex core we can get the exact number of operations used in an audio block, and this is what we use to calculate the percentage (see the sketch below).
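
As an aside, here is a minimal sketch of how per-block cycle counts can be read on a Cortex-M4 using the DWT cycle counter. This only illustrates the measurement principle, not the actual OWL firmware code, and processBlock() is a hypothetical stand-in for the patch's audio callback:

```cpp
// Sketch only: measuring cycles around an audio block with the DWT cycle
// counter. Register addresses are standard ARMv7-M; the OWL firmware may
// do this differently.
#include <stdint.h>

#define DEMCR      (*(volatile uint32_t*)0xE000EDFC)
#define DWT_CTRL   (*(volatile uint32_t*)0xE0001000)
#define DWT_CYCCNT (*(volatile uint32_t*)0xE0001004)

extern void processBlock(float* in, float* out, uint32_t blockSize); // hypothetical audio callback

void cyccntInit(){
  DEMCR |= 1u << 24;   // TRCENA: enable the DWT/ITM debug blocks
  DWT_CYCCNT = 0;
  DWT_CTRL |= 1u;      // CYCCNTENA: start the free-running cycle counter
}

int cpuPercentForBlock(float* in, float* out, uint32_t blockSize){
  uint32_t start = DWT_CYCCNT;
  processBlock(in, out, blockSize);
  uint32_t used = DWT_CYCCNT - start;
  // Budget: 3500 operations per sample, as quoted above (168MHz / 48kHz).
  return (int)(100u * used / (3500u * blockSize));
}
```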

  1. I think the factory patch has 8 comb and 8 allpass filters. I’m sure the performance could be massively improved by using the optimised biquads in the library [1]
    There’s also a FAUST version which is actually faster.

  2. Looking at the code [2] I can see I've used 64-bit doubles for the calculations; I'm not sure what impact this has on performance. It would seem (at a glance) that there are only about 14 calculations per sample. You'd expect some overhead from loop setup etc., but not to that extent. I'll run some experiments and report back!

  3. The buffer size can be changed using the OwlControl software (or MIDI sysex) from 2 to 1024 samples, with latency from <1ms to 6.24ms - see this discussion [3].

  4. Not one FIR tap per operation - there is a multiply-and-accumulate instruction which I think executes in one cycle, but the ARM Cortex M4F is not pipelined like a DSP chip. Interesting papers here [4] and here [5] (by Paul Beckmann, who wrote the ARM-optimised DSP code that we use in OwlLib). tl;dr: I think you can expect 12 operations per FIR tap (a usage sketch follows at the end of this list).

  5. I’m not so familiar with this patch. You know you can run the patch in the browser, right? Go to [6], click the Test tab, select your mic input and hit start. In the library we’ve got two pitch detectors: zero-crossing and FFT.

  6. Faust is awesome, and there's so much good stuff out there. Not least from Stanford Uni / CCRMA. They have supported the OWL as a target for quite a while.
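
Regarding point 4: here is a minimal sketch of what a block-based FIR looks like with the CMSIS-DSP routine arm_fir_f32 mentioned above. The tap count and block size are illustrative only, and this assumes the CMSIS-DSP headers are available to the patch:

```cpp
// Sketch only: block-based FIR filtering with CMSIS-DSP.
#include "arm_math.h"

#define NUM_TAPS   64
#define BLOCK_SIZE 128

static float32_t firCoeffs[NUM_TAPS];                 // fill with designed coefficients
static float32_t firState[NUM_TAPS + BLOCK_SIZE - 1]; // state size required by CMSIS-DSP
static arm_fir_instance_f32 fir;

void firInit(){
  arm_fir_init_f32(&fir, NUM_TAPS, firCoeffs, firState, BLOCK_SIZE);
}

void firProcess(float32_t* in, float32_t* out){
  // One call per audio block; CMSIS-DSP unrolls the MAC loop internally,
  // which is where the effective operations-per-tap figure comes from.
  arm_fir_f32(&fir, in, out, BLOCK_SIZE);
}
```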

[1] http://www.hoxtonowl.com/docs/classBiquadFilter.html
[2] https://github.com/marsus/ChaosPatches/blob/master/LorenzAttractorPatch.hpp
[3] http://hoxtonowl.com/forums/topic/owl-latency/
[4] http://www.dspconcepts.com/sites/default/files/white-papers/PD8_Beckmann.pdf
[5] http://www.arm.com/files/pdf/DSPConceptsM4Presentation.pdf
[6] http://hoxtonowl.com/patch-library/patch/Octave_Down/

Interesting! Changing doubles to floats in LorenzAttractor brings the CPU load down to 12%. Changing the heap allocation from external SRAM to CCM (closely coupled memory internal to the MCU) takes it to 4%, most of which is block-related overhead (the Template patch runs at 3%).
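
For illustration, the kind of per-sample update involved is roughly the following sketch in single-precision floats. This is not the actual LorenzAttractorPatch code; the constants and step size are illustrative:

```cpp
// Sketch only: a forward-Euler Lorenz step in single precision,
// a dozen or so float operations per sample.
struct LorenzState {
  float x = 1.0f, y = 1.0f, z = 1.0f;
};

inline float lorenzTick(LorenzState& s, float dt){
  const float sigma = 10.0f, rho = 28.0f, beta = 8.0f/3.0f;
  float dx = sigma * (s.y - s.x);
  float dy = s.x * (rho - s.z) - s.y;
  float dz = s.x * s.y - beta * s.z;
  s.x += dx * dt;
  s.y += dy * dt;
  s.z += dz * dt;
  return s.x;   // scale/offset to audio range as needed
}
```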

You can try this yourself with OwlProgram [1] and this make command:
make LDSCRIPT=Source/ccmheap.ld PATCHNAME=LorenzAttractor run

[1] https://github.com/pingdynasty/OwlProgram

That makes a lot of sense, yes!

I have one more question/remark about the FreeVerb patch.
Indeed, there is a "FaustVerb" patch available, and it looks like a FreeVerb implementation.
Still, I think the CPU load for this algorithm is considerable: 64%.

The FreeVerb algorithm consists of lowpass-feedback comb filters and allpass filters. The lowpass filter inside the comb's feedback path is a first-order IIR structure, and so is the allpass filter. Neither of these is a biquad structure.
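
To illustrate the point, a lowpass-feedback comb filter update is only a handful of arithmetic operations per sample; roughly something like this sketch (an illustrative C++ sketch, not the actual patch code):

```cpp
// Sketch of a lowpass-feedback comb filter (the Freeverb building block).
// The arithmetic is a few operations per sample; the dominant cost is the
// delay-line memory access when the buffer lives in external memory.
struct LowpassCombFilter {
  float* buffer;            // delay line, typically 1000+ samples
  int size;
  int index = 0;
  float filterStore = 0.0f;
  float feedback = 0.84f;
  float damp = 0.2f;

  float process(float input){
    float output = buffer[index];                               // delay-line read
    filterStore = output * (1.0f - damp) + filterStore * damp;  // first-order lowpass
    buffer[index] = input + filterStore * feedback;             // delay-line write
    if(++index >= size) index = 0;
    return output;
  }
};
```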

Still, there must be an explanation for the heavy CPU load. Could it also be the memory wall that this patch is hitting, similar to the earlier implementation of the Lorenz attractor?
Obviously the algorithm needs to access the external memory because of the relatively large delay-line buffers.
So if many cycles are burnt hitting the memory wall, then I wonder whether this algorithm can be optimized.

Your thoughts?

One additional remark, or idea:
If it is indeed the case that the FreeVerb patch suffers from lots of idle cycles due to delay-line buffer reads (exposed round-trip latency), then it might be a good idea to introduce a secondary thread which prefetches the memory data and stores the fetched delay-line outputs in shared variables. These variables could then be accessed by the algorithm thread, which could then run at full speed.

Or is the concept of relieving the main thread of exposed memory latencies already supported in some way?

There are three types of RAM available to the OWL: internal SRAM and CCM, and external SRAM.
The external memory is a very fast 10ns 8Mbit SRAM which incurs minimal wait states - still, there are always overheads with external memory.

From the AN2784 application note:
  Write cycle time: 12ns
  Read cycle time: 12ns
  Write Enable low pulse width: 8ns
  Address access time: 12ns

The memory is split between the main firmware and the loaded patch. The patch has 32kB of CCM, 64kB of internal SRAM and the whole external SRAM available, which can be split up in different ways for heap and stack use. It is also possible to assign variables explicitly to CCM in the patch code.
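
As an illustration of the explicit CCM assignment mentioned above: with GCC this can be done via a section attribute, assuming the link script in use defines an output section placed in CCM. The section name below is an assumption and must match the actual .ld file:

```cpp
// Sketch: pinning a patch buffer to CCM with a GCC section attribute.
// ".ccmdata" is a placeholder name; it must correspond to an output section
// that the link script actually places in CCM RAM.
__attribute__((section(".ccmdata")))
static float delayBuffer[4096];   // note: CCM is not reachable by the DMA peripheral
```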

With the default link script, CCM is used for stack and external SRAM for heap.

Since there is no scheduler and no threads, having a pre-fetch thread is not possible. However, it should be possible (and much preferable) to use DMA for background pre-fetching from external to internal SRAM (CCM is not supported by the DMA peripheral). I'm not sure exactly what a useful API would look like, but it could be a very powerful solution since DMA incurs no processing overhead - apart from setup.
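
Purely as a thought experiment, a patch-facing API for such DMA prefetching might look something like the sketch below. None of these names exist in OwlLib; this is just to make the idea concrete:

```cpp
// Hypothetical sketch of a DMA-assisted delay line: kick off a background copy
// of the next block's worth of samples from external SRAM into an internal
// staging buffer, process the previous block while the transfer runs, then
// write results back. Purely illustrative; no such API exists today.
class PrefetchedDelayLine {
public:
  // Start a DMA copy of 'blockSize' samples ending 'delay' samples in the past;
  // returns immediately while the transfer runs in the background.
  void prefetch(int delay, int blockSize);
  // Wait (briefly) for the transfer to complete and return the staged samples.
  const float* read();
  // Queue the current block to be written back to the external delay buffer.
  void write(const float* block, int blockSize);
};
```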

I’ve added another link script which assigns heap to internal SRAM, sramheap.ld. With this, FreeVerb uses 18% CPU. Nice speed up!
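
You can try this yourself in the same way as above, e.g. (assuming sramheap.ld lives alongside ccmheap.ld in Source/ and the patch builds under the name FreeVerb):
make LDSCRIPT=Source/sramheap.ld PATCHNAME=FreeVerb run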

Indeed this is a nice speedup!

The FreeVerb patch requires ~48kByte of delay-line buffer space. Is this allocated on the heap?
I read in the Owl spec that the on-board SRAM is 192kbit → 24kByte.
Does this mean that the SRAM is entirely used up, and that half of the delay-line buffer is still allocated in external memory?

DMA is indeed the normal way to hide memory latency from the processor.
From a programmer's point of view, however, I would not like to have the DMA mechanism exposed (by an API) because of the complexity.

I suspect that the "Octave Down" patch can also do a lot better. It consumes 16% of the CPU cycles, according to the patch information.

I would be interested to see whether this figure also drops when the heap is assigned to internal SRAM.
Can you check this?

No, that's 192kByte, split into 64k of CCM and 128k of SRAM, divided on the OWL between firmware and patch as described above.

Note that reducing CPU load doesn't do anything for actual audio performance: as long as you're below about 99% you will not have any drop-outs. It would be theoretically possible to reduce the clock speed, and therefore power consumption, to match the CPU load, but I think the only practical advantage of increased code efficiency is if you wanted to combine several patches into one.

OctaveDown with sram heap: 6% CPU
OctaveDown with ccm heap: 6% CPU

Ok, this way the OctaveDown is more in line with my expectations.

There is a big practical advantage in increased code efficiency…
Combining patches, or running more complex patches, should be possible on the Owl.
In my opinion, the computational capacity of the ARM seems quite limited.
A reverberation algorithm like FreeVerb is one of the most primitive ones, a relatively small improvement over the good old Schroeder reverb.

It would be very nice to see more complex and natural-sounding reverb models (e.g. zita_rev1 or 3D models based on sparse digital waveguide modelling) running on the Owl.

Concerning the price of the pedal, I think that the Owl should be capable of running more impressive patches. The published patches all seem to be basic algorithms or experiments, most of them with a relatively high CPU load, which gives the impression that the computational capacity is too limited to do really interesting or higher-quality stuff.

Don’t get me wrong: I really like the concept of an open programmable pedal and think that the Owl implementation is truly versatile. However, the published patches are not convincing.

Your ideas on this?

I was thinking of more complex patches, like amp/cab models combined with reverberation.
I have severe doubts whether the ARM is capable of running such algorithms - what do you think?

Well I think that FreeVerb running at 18% of CPU tells you that you can do a lot more if you want to. Like 120 biquads (87%) or 680 FIR taps (96%).

But writing good DSP code is not primarily about the performance of the platform, but about what you do with it. Consider all the famous digital reverbs from the 80s and 90s: the hardware they ran on was extremely limited.

I hear what you're saying though: you'd like more performance for your money than what you feel you get with the OWL. These days you can get a Raspberry Pi Zero for $5 which has (on paper) more than 5 times the CPU and many times the memory. But what you get with the Zero is a low-powered Linux computer on a board, whereas the OWL is a high-powered, fully integrated embedded audio platform. There are fundamental differences which are not just about how you configure the system. If you don't believe me, try to run FreeVerb on a RPi with less than 5ms latency - and listen to the drop-outs.

The OWL is unique in that it allows you to write high performance DSP code in a standard programming language, even polyglot, with a fully open source toolchain. No assembly required, no expensive development kit. Compile it and you get a high quality, low latency audio effect that you can pop into your gig bag and take on stage without worries it might crash.

As for the quality of the patches in the library, I think we have some really brilliant contributions and many which are totally unique. It's true that they're not all great, and it can be difficult to sort the wheat from the chaff. It's on our to-do list to implement a recommendation system to make it easier to find the gems, but resources are short and we've not had the time yet. But it's coming! Meanwhile, for a good stereo verb, try out the JotReverb. For a great effect that you won't find on any other pedal, try the DroneBox. I'll publish the current list of factory patches somewhere; they're all good. And there are 40 of them.

If you want to have a go at porting zita reverb I’ll happily test it for you. And regarding amp/cab models: have a look at the Guitarix project. Porting to OWL is trivial as it is written in FAUST.


Oh and another thing about performance while I think of it -

One thing we've not explored yet is fixed-point patches. At the moment the 24-bit samples are converted to 32-bit floats and all processing is done in floating point, before being converted back to 24-bit integers. However, working in ints could lead to some serious performance gains. Worth considering!
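
For the sake of illustration, per-sample fixed-point processing could look something like this sketch (plain C++ on Q1.31 samples; a real implementation would more likely use the CMSIS-DSP q31 routines, and saturation is omitted for brevity):

```cpp
#include <stdint.h>

// Sketch: a one-pole lowpass on Q1.31 fixed-point samples, avoiding the
// round trip through floating point. The coefficient 'a' is also Q1.31.
static inline int32_t mul_q31(int32_t x, int32_t y){
  return (int32_t)(((int64_t)x * y) >> 31);   // Q1.31 * Q1.31 -> Q1.31
}

int32_t onePoleLowpass(int32_t in, int32_t a /* Q1.31 */){
  static int32_t state = 0;
  state += mul_q31(a, in - state);            // state += a * (in - state)
  return state;
}
```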

It was not my intention to offend patch developers.

Surely there is great stuff among the published patches.
But like you said, it's a bit difficult to distinguish the high-quality ones from the more experimental "try-out" ones, which are obviously also valuable.

My advice would be to classify these patches, and also to optimize them. If it is really that easy (for perhaps the majority of the patches) to crank up the performance simply by allocating the heap to fast memory, then it will really pay off to do this and republish the performance figures.

One of the great aspects of the Owl is the relatively low learning curve for programming the pedal. This is way easier than programming an old-school DSP.
The ideas you mentioned to improve performance (e.g. DMA and fixed point) do, however, make it harder to program the pedal, unless such aspects can be automated (which is not trivial) and therefore not exposed to the programmer.