Granular patch for Lich (inspired by Clouds)

damikyu · January 12, 2022, 5:28am

After sitting on this for a while, I’ve finally released V1 of a patch for the Lich that does real-time granular processing that is similar to Clouds. I find it pretty satisfying to play around with, but I really, really wish I could get the performance good enough to support more than 16 grains. If anyone has any suggestions after looking at the source, I’d love to hear them!

The patch is here: https://www.rebeltech.org/patch-library/patch/Grainz

Here’s a video processing drums: https://youtu.be/Me8_Ct1VJQk
And here’s a video processing me playing Debussy on my Rhodes: https://youtu.be/hLjwOQfqmmo

antisvin · January 12, 2022, 9:35am

Awesome! Haven’t tried the patch yet, but here are some suggestions from looking at code:

    float nextAttack = daisysp::fmax(0.01f, daisysp::fmin(env, 0.99f));

There’s no need to use daisysp here, because any call to std::min/max with floats gets compiled into a single hardware instruction on ARM. Also, in this case you could use std::clamp (I imagine that would do the same thing, but look cleaner).
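A minimal sketch of the std::clamp version, assuming the toolchain is built with C++17 or later. The function name clampAttack is only for illustration and is not from the patch:

```cpp
#include <algorithm>

// Keep the envelope value inside [0.01, 0.99], the same range the
// daisysp::fmax/fmin pair enforces. std::clamp needs C++17.
inline float clampAttack(float env)
{
  return std::clamp(env, 0.01f, 0.99f);
}
```

One difference to keep in mind: std::clamp takes the value first and the bounds after it, so the argument order is not the same as in the nested fmin/fmax call.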

    while (preDelay && outLen)
    {
      ++outL;
      ++outR;
      --preDelay;
      --outLen;
    }

This probably could work faster if you do it like this:

    int skipSamples = std::min(preDelay, outLen);
    if (skipSamples) {
        outL += skipSamples;
        outR += skipSamples;
        preDelay -= skipSamples;
        outLen -= skipSamples;
    }

Pass data by value for ::interpolated call to reduce amount of array lookups:

    interpolated(left[i], left[j], t);
    // ...
    inline float interpolated(float a, float b, float t) const
    {
      return a + t * (b - a);
    }

You could optimize this part by using buffer size that is a power of 2 and then using bit mask to truncate data:

      const int i = ((int)pos) % bufferSize;
      const int j = (i + 1) % bufferSize;
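Something like this is what the bit mask version could look like. The buffer size of 1 << 16 and the names kBufferSize, kBufferMask and wrapIndices are made up for the example; the only real requirement is that the size is a power of 2 and that pos is never negative:

```cpp
#include <cstddef>

// Hypothetical buffer size; the mask trick only works for powers of two.
constexpr std::size_t kBufferSize = 1u << 16;
constexpr std::size_t kBufferMask = kBufferSize - 1;
static_assert((kBufferSize & kBufferMask) == 0, "buffer size must be a power of two");

struct WrappedIndices
{
  std::size_t i; // index of the sample at or before pos
  std::size_t j; // index of the next sample, wrapped
};

// Replaces the two % operations with a bitwise AND.
// Assumes pos >= 0.
inline WrappedIndices wrapIndices(float pos)
{
  const std::size_t i = static_cast<std::size_t>(pos) & kBufferMask;
  const std::size_t j = (i + 1) & kBufferMask;
  return { i, j };
}
```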

Those are trivial things and it’s likely that the compiler does some of them already. Even if it doesn’t, there’s not much to gain from math optimizations here. The bigger issue is that you need to access buffer data several times per grain and this buffer can only fit in SDRAM, which is relatively slow. So what you could do to improve performance is split the recording buffer into multiple pages (something like 4k each) that are allocated separately. This would move some data to SRAM, which is much faster to use. Not sure if it’s worth the effort, as there could still be a pathological case where all grains read from pages in SDRAM.
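A rough sketch of the paging idea, not something tested on the device. The PagedBuffer class, the page size of 4096 samples and the use of std::make_unique are all assumptions for illustration; where each page actually ends up (SRAM or SDRAM) would depend on how the firmware's allocator hands out memory:

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Recording buffer split into separately allocated pages.
// Page size is an assumed 4096 samples (a power of two, so the
// page and offset can be found with a shift and a mask).
class PagedBuffer
{
public:
  static constexpr std::size_t kPageShift = 12;
  static constexpr std::size_t kPageSize = 1u << kPageShift;
  static constexpr std::size_t kPageMask = kPageSize - 1;

  explicit PagedBuffer(std::size_t numPages)
  {
    pages.reserve(numPages);
    for (std::size_t p = 0; p < numPages; ++p)
      pages.push_back(std::make_unique<float[]>(kPageSize)); // zero-initialised
  }

  std::size_t size() const { return pages.size() * kPageSize; }

  // index must be < size(); no wrapping is done here.
  float& at(std::size_t index)
  {
    return pages[index >> kPageShift][index & kPageMask];
  }

private:
  std::vector<std::unique_ptr<float[]>> pages;
};
```

The extra pointer lookup per access is the price for the split, so this only pays off if enough pages really do land in the faster memory.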

damikyu · January 12, 2022, 7:28pm

Thanks for those suggestions @antisvin, I will try them out and see how it goes. One thing I considered was having each grain render its entire output in trigger, but figured this is probably too slow, especially for longer grains. But, I wonder if it would be fast enough to simply copy the section it needs into an internal array and then read from that in process. This would mean that each grain would have its own array that was only as long as the largest grain size, which is 1 second, so I think that’s about 1.5k per grain? Presumably those could all go in SRAM, although I need to look up how much SRAM there is. If we never trigger more than a couple of grains per block, maybe the copy from SDRAM to SRAM would be tolerable?

Another thing I was considering was having each grain totally overwrite the buffer it is given in process so that FloatArray methods could be used to apply leftScale and rightScale. Then the accumulation into the final grain array would be done in the Patch class.

antisvin · January 12, 2022, 8:25pm

It’s unlikely that this would work. Firstly there’s not enough SRAM, I believe this is 64kb in one memory section + (144kb minus patch code size) in another one. Second problem is that you still write and read to SDRAM - while contiguous reads should be more efficient, it won’t improve things much as you’ll still have to read data from another memory section later.

Yeah, adjusting gain after mixing would save a few CPU cycles compared to doing it for every grain. When you’re mixing grains you should multiply audio sum by normalization values of 1 / sqrt(number of grains) to get constant power output. And you can use FloatArray::scale method to linearly interpolate gain from previous value to current.
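To illustrate the normalization and the gain ramp in plain C++ (this does not use FloatArray, and the helper names constantPowerGain and rampGain are hypothetical), a sketch could look like this:

```cpp
#include <cmath>
#include <cstddef>

// Constant-power normalization: 1 / sqrt(number of grains).
// Returns unity gain when no grains are active.
inline float constantPowerGain(int activeGrains)
{
  return activeGrains > 0
    ? 1.0f / std::sqrt(static_cast<float>(activeGrains))
    : 1.0f;
}

// Apply a gain that moves linearly from prevGain to nextGain over one
// block, so a change in grain count does not produce a step in level.
inline void rampGain(float* buf, std::size_t len, float prevGain, float nextGain)
{
  if (len == 0)
    return;
  const float step = (nextGain - prevGain) / static_cast<float>(len);
  float gain = prevGain;
  for (std::size_t n = 0; n < len; ++n)
  {
    gain += step;
    buf[n] *= gain;
  }
}
```

The idea would be to call rampGain once per block on the mixed output, with prevGain being the value used at the end of the previous block.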

Also, it’s probably worth trying to store data in CircularShortBuffer - this would reduce number of bytes read/written by half. Loss of 8 bits of audio precision would be unnoticeable (especially with higher grain count). However performance gain may not be that great as you’ll still have the same amount of SDRAM IO operations and have to convert data when reading/writing. Still, might give a few extra grains.

mars · January 13, 2022, 5:36am

Amazing, it sounds wonderful - thank you so much for sharing this!

Lovely code too. I had a go at using InterpolatingCircularBuffer inside your Grain class. It makes the code look a bit neater, the loop becomes:

      *outL++ += left.readAt(pos) * env * leftScale;
      *outR++ += right.readAt(pos) * env * rightScale;

but it actually makes performance slightly worse :oops:

What does help performance is reducing the record buffer size, but for it to have an effect it must come down to 1 << 13, ie 8192 samples, which is only 170 ms - so this was not very useful either.

It’s tricky to optimise because there’s not really any heavy computations, but a lot of moving data.

I wonder if copying the samples from the record buffer into a smaller working memory could work, ie before interpolation. The buffer size required would be genLen / grainSpeed, right? Does grainSpeed have a lower bound?
hmm tried that too with less than fabulous results: with grainSpeed >= 0.25 (ie quarter speed) I could just about squeeze another 1-2 %-units out of it.

Another thing I was considering was having each grain totally overwrite the buffer it is given in process so that FloatArray methods could be used to apply leftScale and rightScale.

You can try it out quickly by simply changing the in-loop mul to a FloatArray::multiply() after the loop. The results won’t be correct, but the timing will be fairly indicative. Unfortunately, it doesn’t do much for performance either :sadface:

damikyu · January 13, 2022, 7:20am

Thanks for the additional feedback, y’all. And @mars so cool that you took the time to actually test a few of those ideas, they are all things I’ve attempted (except copying record buffer into smaller working memory) with same results that you saw of actually performing worse than what was already there. After making the small changes originally suggested by @antisvin, plus a few other probably very minor optimizations, I’m able to do 18 grains and get 93% use. I think there’s probably room there for 1 more grain, but I don’t want to push it so close to limit.

I’m going to try the CircularShortBuffer suggestion, but suspect the extra multiplies required to convert from float to int and back will make it work out to about the same.

damikyu · January 14, 2022, 1:55am

Well, I tried switching over the RecordBuffer to a CircularShortBuffer but it’s much slower than using CircularFloatBuffer, I expect on account of the extra multiplications and implicit casts of int16_t to float. I’ve left the code in with an easy way to try shorts by changing #if 1 to #if 0 at the top of Grain.hpp. And you’ll also want to reduce max grains to like 10 or else the patch will crash. I’m pretty sure I’m doing the conversions between data-types as simply as possible, but let me know if there is a better way.

antisvin · January 14, 2022, 11:04am

Hmm I suspected that this may happen. I think that this means that reading 16 or 32 bits takes more or less the same amount of time, so we don’t gain anything from shorter values while still adding extra processing.

Something else that’s worth trying is storing sample data in recording buffer in interleaved format. I mean that instead of writing samples like l1, l2, …ln and r1, r2, … rn you would use l1, r1, l2, r2, … ln, rn. With i16 samples you can read a 32bit frame and unpack it into 2 samples. This reduces amount of buffer reads by half, but no idea if it’s enough to overcome the extra sample conversion overhead.
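A sketch of what packing and unpacking an i16 stereo frame could look like. The function names and the choice of putting the left channel in the low 16 bits are assumptions for the example, not something taken from the patch:

```cpp
#include <cstdint>

// Pack one stereo frame into 32 bits: left in the low half, right in
// the high half (an arbitrary choice for this sketch).
inline uint32_t packFrame(int16_t left, int16_t right)
{
  return static_cast<uint32_t>(static_cast<uint16_t>(left))
       | (static_cast<uint32_t>(static_cast<uint16_t>(right)) << 16);
}

// Unpack a frame read with a single 32-bit load and convert both
// channels to floats in [-1, 1).
inline void unpackFrame(uint32_t frame, float& left, float& right)
{
  constexpr float kScale = 1.0f / 32768.0f;
  // Narrowing back to int16_t relies on two's complement wraparound,
  // which holds for ARM GCC and is guaranteed from C++20 on.
  left = static_cast<int16_t>(static_cast<uint16_t>(frame & 0xFFFFu)) * kScale;
  right = static_cast<int16_t>(static_cast<uint16_t>(frame >> 16)) * kScale;
}
```

This shows where the conversion overhead comes from: each frame still needs two int-to-float conversions and two multiplies after the single read.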

With f32 buffer amount of IO won’t change, but you will perform half of your reads sequentially when processing stereo audio. This should work faster with SDRAM as it has IO buffer that makes sequential IO faster than random. It may be better to stick to raw pointers for iteration instead of using array lookups.

There are 2 possible approaches for using interleaved format and I’m not sure if there’s any performance gains in either of them:

use FloatArray and manage addresses accordingly (left and right channel data are stored as sequential samples)
use ComplexFloatArray (L/R channel data is stored as real/imaginary values in a ComplexFloat)

There’s “ComplexSignalGenerator” class for working with complex data in packed format if you want to try that. And you can use ::copyFrom and ::copyTo methods in ComplexFloatArray to convert between interleaved/non-interleaved channels.
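For the first approach (plain floats with manual addressing), a rough sketch of an interpolated stereo read from an interleaved buffer might look like the following. The function name and signature are hypothetical, and it uses raw pointers rather than any FloatArray method:

```cpp
#include <cstddef>

// Read one linearly interpolated stereo frame from an interleaved
// buffer laid out as l0, r0, l1, r1, ...
// numFrames is the number of stereo frames; pos must be in [0, numFrames).
inline void readInterleaved(const float* buf, std::size_t numFrames, float pos,
                            float& left, float& right)
{
  const std::size_t i = static_cast<std::size_t>(pos);
  const std::size_t j = (i + 1 == numFrames) ? 0 : i + 1; // wrap at the end
  const float t = pos - static_cast<float>(i);

  // Left and right of the same frame sit next to each other in memory,
  // so each pair of reads is sequential.
  const float* a = buf + 2 * i;
  const float* b = buf + 2 * j;
  left = a[0] + t * (b[0] - a[0]);
  right = a[1] + t * (b[1] - a[1]);
}
```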

damikyu · January 14, 2022, 7:50pm

Ah, yes, these are great ideas, I’ll report back when I have some results!

antisvin · January 14, 2022, 7:52pm

Tried my own complex float suggestion and it doesn’t improve things enough to make a difference.

But I can squeeze in another grain if I use a precomputed array of 1 / sqrtf values for normalizing. And it looks like another one can be rendered with several hacks:

don’t render inactive grains (obviously gives huge CPU jumps and you have to make sure that there’s no overruns)
don’t clear grainLeft/grainRight in advance, instead of that convert Grain::generate into template method that can either write or append data and track when first write occurs.
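The two hacks above could be sketched roughly like this. It is not the code from the PR: MixMode, renderGrain and mixGrains are hypothetical names, the grains are reduced to plain sample arrays, and `if constexpr` assumes C++17:

```cpp
#include <cstddef>

enum class MixMode { Write, Append };

// Render one grain into out. In Write mode the output is overwritten,
// so the caller does not have to clear it first; in Append mode the
// grain is added on top of what is already there.
template <MixMode Mode>
inline void renderGrain(const float* src, float* out, std::size_t len, float gain)
{
  for (std::size_t n = 0; n < len; ++n)
  {
    const float sample = src[n] * gain;
    if constexpr (Mode == MixMode::Write)
      out[n] = sample;
    else
      out[n] += sample;
  }
}

// Skip inactive grains; the first active grain writes, the rest append.
// Returns false if no grain was active, in which case out is untouched
// and the caller still has to clear it.
inline bool mixGrains(const float* const* grains, const bool* active,
                      std::size_t numGrains, float* out, std::size_t len, float gain)
{
  bool written = false;
  for (std::size_t g = 0; g < numGrains; ++g)
  {
    if (!active[g])
      continue;
    if (!written)
    {
      renderGrain<MixMode::Write>(grains[g], out, len, gain);
      written = true;
    }
    else
    {
      renderGrain<MixMode::Append>(grains[g], out, len, gain);
    }
  }
  return written;
}
```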

Also, we can generate ~75 grains with this patch on OWL3, just to put things into perspective.

antisvin · January 14, 2022, 8:10pm

I’ve left you a PR, feel free to close it if not needed.

I’ll be experimenting with this a bit more, but it seems like on OWL2 you would get more mileage by using a mono granulator + reverb. In such a case the second audio input could be used for V/Oct control. Btw, the reverb from clouds/rings takes 23-25% CPU.

damikyu · January 14, 2022, 8:53pm

Ah, thanks, I’m not sure how my approach to using complex floats differs from the one you tried, but I’m able to add two more grains by recording to a complex float buffer. I didn’t implement ComplexSignalGenerator though and it wouldn’t surprise me if the additional function calling overhead negates the win from faster data retrieval. I’ve merged your PR and we’ll see if I can get two more!

antisvin · January 14, 2022, 9:14pm

Keep’em coming!

It’s not like CFA didn’t work, but I ended up with the same performance as with an optimized float buffer. So if you mean that you had 20 grains running, then we had the same results.

Also, those ::generate/::process functions are virtual, which means that there’s an extra level of indirection from using the vtable to resolve the method dynamically at runtime rather than at compile time. This lets us store different generators/processors in an array, but may have slight overhead. It’s usually negligible, unless the compiler would get more opportunities for optimization or code inlining without them.

damikyu · January 15, 2022, 12:18am

Ok, I merged in @antisvin’s PR, which improved things a little bit, but not enough to add another grain after the interleaved audio change. But, since the record buffer is now interleaved and we can do contiguous reads when generating grains, I implemented the scratch buffer idea that @mars suggested by using memcpy to move the section of the buffer that will be sampled into a static scratch buffer and then access the scratch buffer in the while loop. This means we don’t have to wrap indices when reading samples in the while loop. And even though we read more data out of the buffer at higher playback rates, this appears to perform better than reading from the buffer as needed.
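For anyone following along, the copy into the scratch buffer could be sketched like this. The function name and signature are hypothetical and it is simplified to a flat float ring; the point is that a region crossing the end of the buffer needs two memcpy calls while the common case needs one:

```cpp
#include <cstddef>
#include <cstring>

// Copy count floats starting at index start out of a circular buffer
// into a linear scratch buffer.
// Assumes start < ringSize and count <= ringSize.
inline void copyToScratch(const float* ring, std::size_t ringSize, std::size_t start,
                          float* scratch, std::size_t count)
{
  const std::size_t untilEnd = ringSize - start;
  const std::size_t first = (count < untilEnd) ? count : untilEnd;
  std::memcpy(scratch, ring + start, first * sizeof(float));
  if (first < count) // region wraps: copy the remainder from the start
    std::memcpy(scratch + first, ring, (count - first) * sizeof(float));
}
```

After the copy, the inner loop can index the scratch buffer directly with no wrapping at all.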

The only (potential) issue with the scratch buffer is that it assumes a block size of 64. If that’s larger on some OWL devices, this version of the patch will crash them when pitch is turned higher than 1x.

I was able to increase max grains to 24 and with profiling turned off, performance seems to hit 94-95% in the worst case that I can find tweaking knobs. I worry a little bit that there is a very rare possibility that all grains will need to read from the record buffer twice because the data they need wraps across the end of the buffer, which will be enough of a performance hit to crash the device. But maybe the moment would be short enough to just generate a click without crashing.

damikyu · January 15, 2022, 1:32am

75 grains is awesome! Which devices are running OWL3?

mars · January 15, 2022, 3:03am

Amazing work!

Also, we can generate ~75 grains with this patch on OWL3

I got 100 grains at 82% on an OWL3 even without the recent optimisations… I wasn’t going to mention it, since, well, we don’t sell them yet. Made of pure unobtanium.

I have to say though, I thought the original patch with 16 grains was already quite wonderful - more is not always better!

damikyu · January 15, 2022, 4:39am

Ah, yes, fair enough! I was also pretty satisfied with 16 and had it in that state for many months before finally making it public and sharing here. It’s been really fun to try to optimize the code and I really appreciate y’all’s help with that.

antisvin · January 15, 2022, 8:00am

I suspect that with 100 grains you might be getting buffer overruns under some combinations of settings or not spawning enough grains to fully load patch. However I didn’t run it on OWL exactly, will make another go to confirm how it behaves.

Generally, patch performance under heavy load on H7 becomes dependent on cache utilization and more variable, anything accessing SDRAM makes things worse as cache misses get more expensive.

It would be interesting to have a service call for asynchronously copying data using MDMA for things like this scratchbuffer approach.

tIB · January 22, 2022, 10:07am

Am I right in thinking the lich patches will run fine on a witch?

mars · January 22, 2022, 7:11pm

Yes absolutely. In some rare cases a patch might bump the Witch over the cpu limit.

You can always just try it out: click LOAD to upload and run the patch from RAM. Worst it can do is reboot your device.