SOUL Patches on OWL

In the case of the reverb, excessive SDRAM access is likely playing a huge part in the performance loss.

Some improvement could be achievable on OWL2 if we limit the patch size to 80kB and use the remaining 64kB of SRAM for dynamic allocation. Basically, use the same patch size for OWL1/2/3. That should help if SOUL allocates memory dynamically and each buffer gets allocated separately.

I don’t think the Reverb uses any complex math operations that could benefit from fast-math optimisation?
The Reverb.soul code does parameter smoothing for all 5 parameters, which may explain part of the CPU usage?
Here is a simpler version with fixed parameters: Reverb_fixed.soul · GitHub. It would be great to have the CPU usage of this one.
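For context, per-sample parameter smoothing typically boils down to a one-pole filter per parameter, evaluated on every sample. The following is only a rough C++ sketch of that idea (not the actual SOUL-generated code), to show where the extra per-sample work comes from:

    // Illustrative one-pole parameter smoother, not the SOUL-generated code.
    // Running one of these per sample for each of the 5 parameters adds a
    // multiply-add and some state traffic on top of the reverb itself.
    struct SmoothedParam {
        float current = 0.0f;   // smoothed value actually used by the DSP
        float target  = 0.0f;   // value last set from the host/UI
        float coeff   = 0.999f; // smoothing coefficient (sets the ramp time)

        float next() {
            current += (1.0f - coeff) * (target - current); // step towards target
            return current;
        }
    };

With fixed parameters these per-sample updates disappear entirely, which is part of what Reverb_fixed.soul is meant to isolate.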

I’m getting CPU: 68% Memory: 119080 bytes on that version.

Compared to CPU: 46% Memory: 76568 with this FAUST version: https://www.rebeltech.org/patch-library/patch/FaustVerb

However when I rebuild the same FAUST patch locally I get much worse performance: CPU: 64% Memory: 75744. This is quite comparable to the SOUL version.

I think this could be due to a change in memory management: since a while back we allocate the patch object itself on the heap, and if it is over a certain size, it ends up in external SDRAM.

I’m going to try with a different firmware which makes more of the internal RAM available as heap.

Comparing like for like, on OpenWare v21.1:

Reverb_fixed.soul: binary 9628 bytes, CPU: 33%, Heap Memory: 119080 bytes
FaustVerb.dsp: binary 11628 bytes, CPU: 24%, Heap Memory: 75744 bytes

They are both using internal RAM here, because the total size of heap and static memory is less than 144kB.

Tried FaustVerb on OWL2, v21.0: 62% CPU, with FAUST built yesterday. Then tried a few DLT settings:

DLT - CPU - RAM
1024 - 26% (not a typo!) - 48k
2048 - 62% - 73k
4096 - 62% - 73k

And on Daisy it uses 5% CPU

The Faust reverb you are using still has sliders, so it does some computation depending on their values, and is not completely comparable to the Reverb_fixed.soul file.

Here is the Faust version with fixed slider values: freeverb_fixed.dsp · GitHub
It would be interesting to get the updated CPU usage values.

@sletz I get the same performance as before:
CPU: 24% Memory: 75640
The difference is 9 percentage points (33% vs 24%): in this particular setup, the FAUST FreeVerb uses roughly 27% less CPU than the SOUL FreeVerb.

Worth noting that a pass-through OWL patch that does nothing still uses approximately 3% CPU, just from converting to/from float and other block processing overheads.
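For reference, that baseline cost is roughly the per-block sample format conversion sketched below; this is a hypothetical illustration (the buffer format and scaling are assumptions), not the actual OwlProgram code:

    // Hypothetical sketch of what a pass-through patch still pays per block:
    // convert fixed-point codec samples to float and back again.
    #include <cstdint>
    #include <cstddef>

    static void passThroughBlock(const int32_t* in, int32_t* out, float* scratch, size_t len) {
        const float toFloat = 1.0f / 2147483648.0f;   // scale 32-bit fixed point into [-1, 1)
        for (size_t i = 0; i < len; ++i)
            scratch[i] = (float)in[i] * toFloat;      // int -> float
        // ...the patch would process scratch[] here; a pass-through does nothing...
        for (size_t i = 0; i < len; ++i)
            out[i] = (int32_t)(scratch[i] * 2147483648.0f); // float back to fixed point (clipping omitted)
    }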

The OWL-killing version of the reverb patch uses 15% CPU / 226kB SRAM on Daisy.

@antisvin: do you mean the SOUL version takes 15% when the FaustVerb one takes 5% on Daisy?

That’s correct. But note that the H7 MCU has an L1 cache, and FAUST generates a patch three times smaller in this case. This should make a huge difference.

OK, so I understand that the generated code quality is probably one part of the equation, but the memory layout and usage is probably even more relevant on embedded devices. This is something we’ll think about, so as to improve the custom memory manager model that you are already using on the OWL (AFAICS), and to make it more flexible.

On current OWL boards a simpler MCU is used, but in the case of Daisy (and the upcoming version of OWL) we’re dealing with a more complex one. That means more RAM sections with different access speeds. It’s safe to assume that the faster ones would be used first and will be more limited in size. So improvements can come from:

  1. Allocating the most frequently used objects first

  2. Allocating larger objects first, which might reduce memory fragmentation and increase utilization of the faster sections (see the sketch below)
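A minimal sketch of the second point, assuming the allocator gets to see all requests up front (the types and the helper below are made up for the example):

    // Illustrative only: serve the largest requests first, so big delay lines
    // have the best chance of fitting contiguously in the fastest RAM section.
    #include <algorithm>
    #include <cstddef>
    #include <cstdlib>
    #include <vector>

    struct Request { void** slot; size_t size; };

    // Hypothetical stand-in for "allocate from the fastest section that still fits".
    static void* allocateFromFastestAvailableSection(size_t size) { return std::malloc(size); }

    static void allocateLargestFirst(std::vector<Request>& requests) {
        std::sort(requests.begin(), requests.end(),
                  [](const Request& a, const Request& b) { return a.size > b.size; });
        for (auto& r : requests)
            *r.slot = allocateFromFastestAvailableSection(r.size);
    }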

Our idea for an improved custom memory manager is the following:

  1. expand the memory manager API so that it receives the 1) type, 2) size, and 3) access/usage information for the different memory zones of the DSP: delay lines, static tables, the DSP struct itself…
  2. have the -mem option make the generated C++ code use this extended memory manager API

Then use it in three steps:

  1. allocate a dummy DSP so that the memory manager can gather all the relevant information about the DSP’s memory needs
  2. then the memory manager can decide on the best allocation strategy using its 1) knowledge of the board capability (the different RAM sections with their sizes and access speeds) and 2) the requirement needs of the given DSP
  3. finally, really allocate the DSP object by placing the DSP struct, delay lines, and static tables at the proper locations in memory (a rough sketch of such an API follows below).
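For illustration, here is a hedged sketch of what such an extended memory manager API could look like; the type and method names and the MemoryInfo fields are assumptions for the example, not an actual Faust or SOUL API:

    // Illustrative interface only, not an existing API.
    #include <cstddef>

    // Hypothetical description of one memory zone of a DSP: a delay line,
    // a static table, or the DSP struct itself.
    struct MemoryInfo {
        enum class Zone { DspStruct, DelayLine, StaticTable } zone;
        size_t size;           // bytes required
        size_t readsPerFrame;  // rough access frequency, to rank zones by "hotness"
        size_t writesPerFrame;
    };

    struct ExtendedMemoryManager {
        virtual ~ExtendedMemoryManager() {}

        // Pass 1: the dummy DSP only *declares* its zones, so the manager can
        // plan placement across the available RAM sections.
        virtual void begin(size_t zoneCount) = 0;
        virtual void declare(const MemoryInfo& info) = 0;
        virtual void end() = 0;

        // Pass 2: the real DSP is allocated; the manager returns addresses in
        // whichever RAM section it chose for each zone (SRAM, DTCM, SDRAM, ...).
        virtual void* allocate(const MemoryInfo& info) = 0;
        virtual void destroy(void* ptr) = 0;
    };

The generated C++ code would call declare() for each zone while instantiating the dummy DSP, and allocate() while instantiating the real one.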

Do you think this strategy would fit your needs?

Sounds a bit too complex, and some of the things described may not be necessary. “delay lines, static tables, DSP struct itself” - actually, static tables would be loaded into static RAM as part of the patch. And the DSP struct most likely should go to the fastest memory section, because placing it in SDRAM would be the worst case scenario. Does that mean that we should only care about the order of the delay lines in this case?

The current allocator that we use is fairly simple. It has one specific requirement: sections must be defined in order of their memory addresses, which works well for us, as this MCU has its faster memories at lower addresses. So we end up allocating to faster memories first, but we have to choose the allocation order accordingly.
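Roughly, such a section-ordered allocator can be sketched as follows; this is a simplified illustration (addresses and sizes are made up), not the actual OwlProgram allocator:

    // Simplified illustration of a section-ordered allocator, not the real code:
    // sections are listed in address order, fastest first, and each request is
    // served from the first section that still has enough room.
    #include <cstddef>
    #include <cstdint>

    struct Section {
        uint8_t* base;  // start address of this RAM section
        size_t   size;  // total size of the section
        size_t   used;  // bytes already handed out
    };

    // Example layout with the faster memories at the lower addresses.
    static Section sections[] = {
        { (uint8_t*)0x20000000,  64 * 1024, 0 },        // fast internal SRAM
        { (uint8_t*)0x20010000,  80 * 1024, 0 },        // slower internal SRAM
        { (uint8_t*)0xC0000000, 8u * 1024 * 1024, 0 },  // external SDRAM
    };

    static void* sectionAlloc(size_t bytes) {
        for (auto& s : sections) {
            if (s.size - s.used >= bytes) {  // first section with enough room wins
                void* ptr = s.base + s.used;
                s.used += bytes;
                return ptr;
            }
        }
        return nullptr;  // no section can satisfy the request
    }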

“knowledge of the board capability”

The allocator itself doesn’t know anything other than the address and size of each section. This is probably sufficient for performing allocations starting with the largest objects (which could reduce fragmentation and place more data in faster memory).

“requirement needs of the given DSP”

It’s problematic to do that at runtime. But if we can detect at compile time which objects need updating once per block of audio rather than once per sample, we can treat them as second class and allocate them later (in slower memory).

Thanks for your feedback. The point of this somewhat complex strategy is to be as general as possible, ready not only for the OWL but also for the Daisy board, ESP32 boards, the Teensy board, and possibly FPGAs (with the Fast project that is just starting here: https://fast.grame.fr).

Anyway, I’ll come back for more concrete feedback as soon as this new custom memory manager starts to be usable.

I’ve merged in the OwlProgram branch and deployed to the online compiler, so… OWL now supports SOUL!

Just upload your SOUL files (.soul and/or .soulpatch) and select compilation type soul.

We don’t support sample loading yet, and there are probably lots of things that don’t work, but setting parameters and receiving MIDI should be functional.

Also note that performance is not great; some examples that don’t work in realtime include:

  • Reverb
  • PadSynth
  • clarinetMIDI

Is clarinetMIDI this one: SOUL/examples/patches/clarinetMIDI at master · soul-lang/SOUL · GitHub?
If yes, this is actually the Faust version automatically transpiled to SOUL in polyphonic mode (using a polyphonic wrapper written in SOUL). Does the original Faust version run in realtime then?

Faust can be transpiled to SOUL using the faust2soul tool here: faust/architecture/soul at master-dev · grame-cncm/faust · GitHub, or directly from the Faust Web IDE by choosing the “soul” platform in the export button.

Yes it is, and we have a copy in our patch library: Clarinet

It’s an old version, with old comments, and we run it monophonic so it can’t be directly compared to the SOUL example. On my Wizard here it uses 20% CPU.

OK. The new one is built using Romain Michon’s reworked Physical Modeling library, now part of the Faust libraries: physmodels - Faust Libraries