SOUL Patches on OWL

Seeing the recent SOUL v1 release, I decided to have a go at making an OWL wrapper.

I had a few concerns, seeing that they use std::vector, exceptions, and other C++ features that add overhead to binary size and execution times. But it seems it’s not a big issue, and telling the compiler to use C++14 instead of C++11, plus a few little tricks, was all that was needed to compile the code generated from some of their examples.

But then when I tried running it I noticed two things: firstly the audio does not seem to be processed in any way, and secondly the processing time is extremely low. So I’m probably doing something wrong with how I’m calling the code.

I opened a topic on the SOUL forum and will report back when I make more progress.

Meanwhile, if anyone would like to experiment I’ve put my code here, and I’ve pushed my C++14 changes to the OwlProgram develop branch.

This deserves a link to their online IDE

I do hope that performance issue is something fixable, but it may be a problem in their current implementation. If I understood correctly, it uses JIT-compiled bytecode. So maybe current SOUL code is just not optimized very well yet. It’s targeted to run on desktop, and the first release may have goals other than high DSP performance. Also, I wouldn’t be surprised if they use some SIMD optimizations, but not for ARM Cortex MCUs.

No, I meant it runs too fast! The DiodeClipper runs in 3% on a Wizard. I suspect I’m calling the code wrongly though, and it’s not really running at all.

The JIT-compiler probably doesn’t apply to the C++ code gen.

I’ve noticed that its FAUST integration uses numAudioInputChannels / numAudioOutputChannels variables instead of array sizes. I wonder if you could be rendering a patch with 0 channels because the array is still empty. Its size is the number of elements that are actually stored, and you don’t assign anything initially. Using .max_size() would work too in that case.

The array is fixed size and is initialised in the generated code, e.g. for DiodeClipper which is mono we have:

        std::array<const FloatType*, 1> inputChannels;
        std::array<FloatType*, 1> outputChannels;

Reverb, which is mono in, stereo out:

        std::array<const FloatType*, 1> inputChannels;
        std::array<FloatType*, 2> outputChannels;
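
To make the point about fixed sizes concrete, here is a minimal sketch contrasting std::array with std::vector. The three helper functions are made up for illustration and are not part of the generated code; only the array declarations mirror the snippets above.

```cpp
#include <array>
#include <cstddef>
#include <vector>

using FloatType = float;

// Mono input, as in both snippets above. The pointer is still null,
// but the slot exists, so size() is fixed by the type.
inline std::size_t inputChannelCount()
{
    std::array<const FloatType*, 1> inputChannels {};
    return inputChannels.size();   // always 1
}

// Stereo output, as in the Reverb snippet.
inline std::size_t outputChannelCount()
{
    std::array<FloatType*, 2> outputChannels {};
    return outputChannels.size();  // always 2
}

// A std::vector behaves the way the question assumed:
// size() is 0 until something is pushed into it.
inline std::size_t emptyVectorCount()
{
    std::vector<FloatType*> channels;
    return channels.size();
}
```

So with std::array the channel count can be read from size() even before any pointers are assigned, and size() and max_size() give the same answer.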

Either way, I’ve compiled it and won’t be posting things that don’t work instead of trying them myself.

They use an IR called heart that is then compiled to LLVM IR or C++. The LLVM IR can be optimised at JIT time using standard LLVM optimisation passes, and SIMD code may be created at this level with auto-vectorisation passes. The same probably occurs in the C++ path using standard C++ compiler optimisation settings.

There seems to have been some kind of problem with how I was initialising RenderContext::numFrames; it works now.
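
For anyone hitting the same symptom, here is a rough sketch of why a wrong numFrames gives "audio untouched, almost no CPU". The RenderContext below is a stripped-down stand-in that I am assuming resembles the generated one (only the channel arrays and numFrames are taken from this thread), and render() and processBlock() are hypothetical placeholders, not SOUL functions.

```cpp
#include <array>
#include <cstdint>

// Stand-in for the generated context (assumed shape, mono in / mono out).
template <typename FloatType = float>
struct RenderContext
{
    std::array<const FloatType*, 1> inputChannels {};
    std::array<FloatType*, 1> outputChannels {};
    std::uint32_t numFrames = 0;
};

// Hypothetical processor: halves the input, standing in for a real patch.
inline void render(RenderContext<float>& context)
{
    for (std::uint32_t i = 0; i < context.numFrames; ++i)
        context.outputChannels[0][i] = 0.5f * context.inputChannels[0][i];
}

// Fills a context for one mono block and renders it.
inline void processBlock(const float* in, float* out, std::uint32_t frames)
{
    RenderContext<float> context;
    context.inputChannels[0] = in;
    context.outputChannels[0] = out;
    context.numFrames = frames;   // has to match the block size
    render(context);
}
```

If numFrames ends up as 0, the loop body never runs: the output buffer keeps whatever it held before and the call returns almost immediately.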

Compiling your own SOUL patch might require commenting out this line in the generated code:
#define SOUL_CPP_ASSERT(x) assert (x)
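
An alternative to commenting the line out could be to replace it with a no-op definition, so that any remaining uses of the macro still compile. This is a generic C++ pattern rather than something taken from the SOUL generator, and checkedIndex() is a hypothetical call site made up for the example.

```cpp
#include <cstddef>

// No-op replacement: the check compiles away entirely.
#define SOUL_CPP_ASSERT(x) ((void) 0)

// Hypothetical call site. With the no-op macro an out-of-range index
// is no longer trapped; it is simply returned unchanged.
inline std::size_t checkedIndex(std::size_t index, std::size_t size)
{
    SOUL_CPP_ASSERT(index < size);
    (void) size;   // avoids an unused-parameter warning once the check is gone
    return index;
}
```

The trade-off is that range errors in the patch are no longer caught at run time.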

Got a few of the examples compiling and loading, but performance is pretty poor.
Reverb.soul, which is described as a “more complex implementation of freeverb”, runs in… oh no, more than 100%… but I’ve had to disable our fast-math optimisations which is no doubt part of the problem.

In the case of the reverb, excessive SDRAM access is likely playing a huge part in the performance loss.

Some improvements could be achievable on OWL2 if we limit patch size to 80kB and use the remaining 64kB of SRAM for dynamic allocation. Basically, use the same patch size for OWL1/2/3. That should help if SOUL allocates memory dynamically and each buffer gets allocated separately.

I don’t think the Reverb uses any complex math operations that could benefit from fast-math optimisation?
The Reverb.soul code does parameter smoothing for all 5 parameters, which may explain part of the CPU usage?
Here is a simpler version with fixed parameters: Reverb_fixed.soul · GitHub. It would be great to have the CPU usage of this one.
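
To illustrate why smoothing has a cost, here is a generic linear-ramp smoother. It is only a sketch of the general technique, not the code SOUL generates, and the SmoothedValue name and its members are invented for the example.

```cpp
#include <cstdint>

// Generic linear-ramp smoother (illustration only).
struct SmoothedValue
{
    float current = 0.0f;
    float increment = 0.0f;
    std::uint32_t stepsLeft = 0;

    // Start a ramp towards `target` lasting `steps` samples.
    void setTarget(float target, std::uint32_t steps)
    {
        if (steps == 0)
        {
            current = target;
            increment = 0.0f;
            stepsLeft = 0;
            return;
        }
        increment = (target - current) / static_cast<float>(steps);
        stepsLeft = steps;
    }

    // Called once per sample: a branch on every call, plus an add and a
    // decrement while a ramp is active.
    float next()
    {
        if (stepsLeft > 0)
        {
            current += increment;
            --stepsLeft;
        }
        return current;
    }
};
```

With five parameters, next() would run five times per sample on top of the reverb itself, which is the overhead the fixed-parameter version avoids.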

I’m getting CPU: 68% Memory: 119080 bytes on that version.

Compared to CPU: 46% Memory: 76568 with this FAUST version:

However when I rebuild the same FAUST patch locally I get much worse performance: CPU: 64% Memory: 75744. This is quite comparable to the SOUL version.

I think this could be due to a change in memory management: for a while now we have allocated the patch object itself on the heap, and if it is over a certain size, it ends up in external SDRAM.

I’m going to try with a different firmware which makes more of the internal RAM available as heap.

Comparing like for like, on OpenWare v21.1:

Reverb_fixed.soul: binary 9628 bytes, CPU 33%, heap memory 119080 bytes
FaustVerb.dsp: binary 11628 bytes, CPU 24%, heap memory 75744 bytes

They are both using internal RAM here, because the total size of heap and static memory is less than 144kB.

Tried FaustVerb: OWL2, v21.0 - 62%, this is with FAUST built yesterday. Then tried a few DLT settings:

1024 - 26% (not a typo!) - 48k
2048 - 62% - 73k
4096 - 62% - 73k

And on Daisy it uses 5% CPU

The Faust reverb you are using still has sliders, so it does some computation depending on their values and is not completely comparable to the Reverb_fixed.soul file.

Here is the Faust version with fixed slider values: freeverb_fixed.dsp · GitHub
It would be interesting to get the updated CPU usage values.

@sletz I get the same performance as before:
CPU: 24% Memory: 75640
The difference is 9 percentage points (24% vs 33%); the FAUST FreeVerb uses roughly 27% less CPU than the SOUL FreeVerb in this particular setup.

Worth noting that a pass-through OWL patch that does nothing still uses approx. 3% CPU, just from converting to/from float and other block processing overheads.

The OWL-killing version of reverb patch uses 15% CPU / 226KB SRAM on Daisy

@antisvin: do you mean the SOUL version takes 15% when the FaustVerb one takes 5% on Daisy?

That’s correct. But note that the H7 MCU has an L1 cache, and FAUST generates a patch three times smaller in this case. This should make a huge difference.

OK, so I understand that the generated code quality is probably one part of the equation, but the memory layout and usage is probably even more relevant on embedded devices. This is something we’ll think about, in order to improve the custom memory manager model that you are already using on the OWL (AFAICS) and to make it more flexible.

Current OWL boards use a simpler MCU, but in the case of Daisy (and the upcoming version of OWL) we’re dealing with a more complex one. That means more RAM sections with different access speeds. It’s safe to assume that the faster ones would be used first and will be more limited in size. So improvements can come from:

  1. Allocating objects that would be used more frequently first

  2. Allocating larger objects first - might reduce memory fragmentation and increase utilization for faster sections
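
The second idea could be sketched roughly like this: sort the allocation requests by size and place each in the fastest section that still has room. Region and placeLargestFirst() are hypothetical names for the example and not the OWL memory manager; real code would also have to deal with alignment and with allocations arriving one at a time.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical description of one RAM section; regions are listed fastest first.
struct Region
{
    std::size_t capacity;
    std::size_t used;
};

// Handles the largest requests first and puts each one in the fastest region
// that still has room. Returns the region index per request, or -1 if none fits.
inline std::vector<int> placeLargestFirst(const std::vector<std::size_t>& sizes,
                                          std::vector<Region>& regions)
{
    // Indices of the requests, ordered by descending size.
    std::vector<std::size_t> order(sizes.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&sizes](std::size_t a, std::size_t b) { return sizes[a] > sizes[b]; });

    std::vector<int> placement(sizes.size(), -1);
    for (std::size_t request : order)
    {
        for (std::size_t r = 0; r < regions.size(); ++r)
        {
            if (regions[r].capacity - regions[r].used >= sizes[request])
            {
                regions[r].used += sizes[request];
                placement[request] = static_cast<int>(r);
                break;
            }
        }
    }
    return placement;
}
```

For the first idea, the sort key could be swapped for an estimate of how often each object is accessed, assuming such an estimate is available when the patch is set up.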