Granular patch for Lich (inspired by Clouds)

damikyu · January 13, 2022, 7:20am

Thanks for the additional feedback, y’all. And @mars so cool that you took the time to actually test a few of those ideas, they are all things I’ve attempted (except copying record buffer into smaller working memory) with same results that you saw of actually performing worse than what was already there. After making the small changes originally suggested by @antisvin, plus a few other probably very minor optimizations, I’m able to do 18 grains and get 93% use. I think there’s probably room there for 1 more grain, but I don’t want to push it so close to limit.

I’m going to try the CircularShortBuffer suggestion, but suspect the extra multiplies required to convert from float to int and back will make it work out to about the same.

damikyu · January 14, 2022, 1:55am

Well, I tried switching the RecordBuffer over to a CircularShortBuffer, but it’s much slower than using CircularFloatBuffer, I expect on account of the extra multiplications and implicit casts of int16_t to float. I’ve left the code in with an easy way to try shorts by changing #if 1 to #if 0 at the top of Grain.hpp. You’ll also want to reduce max grains to something like 10, or else the patch will crash. I’m pretty sure I’m doing the conversions between data types as simply as possible, but let me know if there is a better way.
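For reference, the per-sample work being described looks roughly like the sketch below. It is not the code from Grain.hpp; the function names and the 32767 scale factor are assumptions chosen for illustration. It shows one multiply and one cast in each direction, which is the overhead a float buffer avoids.

```cpp
#include <cstdint>

// Hypothetical scale factors: one multiply in each direction.
constexpr float kFloatToShort = 32767.0f;
constexpr float kShortToFloat = 1.0f / 32767.0f;

// Store a float sample as int16_t. Clamping to [-1, 1] guards against overflow.
inline int16_t floatToShort(float x) {
    if (x > 1.0f) x = 1.0f;
    if (x < -1.0f) x = -1.0f;
    return static_cast<int16_t>(x * kFloatToShort);
}

// Read an int16_t sample back as a float in roughly [-1, 1].
inline float shortToFloat(int16_t s) {
    return static_cast<float>(s) * kShortToFloat;
}
```

Every grain read pays for `shortToFloat`, so with many grains reading many samples per block the conversion cost is multiplied accordingly.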

antisvin · January 14, 2022, 11:04am

Hmm I suspected that this may happen. I think that this means that reading 16 or 32 bits takes more or less the same amount of time, so we don’t gain anything from shorter values while still adding extra processing.

Something else that’s worth trying is storing sample data in recording buffer in interleaved format. I mean that instead of writing samples like l1, l2, …ln and r1, r2, … rn you would use l1, r1, l2, r2, … ln, rn. With i16 samples you can read a 32bit frame and unpack it into 2 samples. This reduces amount of buffer reads by half, but no idea if it’s enough to overcome the extra sample conversion overhead.
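A minimal sketch of the 32-bit frame idea, assuming the left sample sits in the low 16 bits and the right sample in the high 16 bits (the layout and the names `packFrame`, `unpackFrame` and `StereoFrame` are illustrative, not from the patch):

```cpp
#include <cstdint>

struct StereoFrame {
    int16_t left;
    int16_t right;
};

// Pack two 16-bit samples into one 32-bit word: left in the low half,
// right in the high half (an assumed layout).
inline uint32_t packFrame(int16_t left, int16_t right) {
    return static_cast<uint32_t>(static_cast<uint16_t>(left)) |
           (static_cast<uint32_t>(static_cast<uint16_t>(right)) << 16);
}

// One 32-bit read yields both channels of a frame.
inline StereoFrame unpackFrame(uint32_t word) {
    StereoFrame f;
    f.left = static_cast<int16_t>(static_cast<uint16_t>(word & 0xFFFFu));
    f.right = static_cast<int16_t>(static_cast<uint16_t>(word >> 16));
    return f;
}
```

The unpacked samples still need the int16-to-float conversion afterwards, which is the overhead in question.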

With f32 buffer amount of IO won’t change, but you will perform half of your reads sequentially when processing stereo audio. This should work faster with SDRAM as it has IO buffer that makes sequential IO faster than random. It may be better to stick to raw pointers for iteration instead of using array lookups.
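To make the interleaved float layout concrete, here is a small sketch using raw pointers. The function names are hypothetical, and buffer wrap-around is deliberately left out to keep it short.

```cpp
#include <cstddef>

// Write two channel arrays as l1, r1, l2, r2, ... ln, rn.
void interleave(const float* left, const float* right, float* out, size_t frames) {
    for (size_t i = 0; i < frames; ++i) {
        *out++ = left[i];
        *out++ = right[i];
    }
}

// Mix `frames` stereo frames, starting at frame index `start`, into outL/outR.
// Left and right samples of each frame are adjacent in memory, so the reads
// walk the source buffer sequentially. No wrap handling here.
void addFrames(const float* interleaved, size_t start, size_t frames,
               float gain, float* outL, float* outR) {
    const float* src = interleaved + 2 * start;
    for (size_t i = 0; i < frames; ++i) {
        *outL++ += gain * *src++;
        *outR++ += gain * *src++;
    }
}
```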

There are 2 possible approaches for using interleaved format and I’m not sure if there’s any performance gains in either of them:

- use FloatArray and manage addresses accordingly (left and right channel data are stored as sequential samples)
- use ComplexFloatArray (L/R channel data is stored as real/imaginary values in a ComplexFloat)

There’s “ComplexSignalGenerator” class for working with complex data in packed format if you want to try that. And you can use ::copyFrom and ::copyTo methods in ComplexFloatArray to convert between interleaved/non-interleaved channels.

damikyu · January 14, 2022, 7:50pm

Ah, yes, these are great ideas, I’ll report back when I have some results!

antisvin · January 14, 2022, 7:52pm

Tried my own complex float suggestion and it doesn’t improve things enough to make a difference.

But I can squeeze in another grain if I use a precomputed array of 1 / sqrtf values for normalizing. And it looks like another one can be rendered with a couple of hacks:

- don’t render inactive grains (obviously gives huge CPU jumps and you have to make sure that there are no overruns)
- don’t clear grainLeft/grainRight in advance; instead, convert Grain::generate into a template method that can either write or append data, and track when the first write occurs.
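A rough sketch of how those two hacks could fit together, under stated assumptions: `MixMode`, `renderGrain`, `GrainSlot` and `renderActive` are hypothetical names, and each grain's output is stood in for by a plain sample array rather than the real Grain::generate.

```cpp
#include <cstddef>

enum class MixMode { Overwrite, Accumulate };

// The mode is a template parameter, so the branch is resolved at compile time
// and each instantiation contains only one of the two loop bodies.
template <MixMode Mode>
void renderGrain(const float* grain, float gain, float* out, size_t size) {
    for (size_t i = 0; i < size; ++i) {
        if (Mode == MixMode::Overwrite)
            out[i] = gain * grain[i];
        else
            out[i] += gain * grain[i];
    }
}

struct GrainSlot {
    const float* samples;  // stand-in for the grain's generated output
    float gain;
    bool active;
};

// Skips inactive grains. The first active grain overwrites the output, so the
// buffer is not cleared in advance; later grains accumulate into it.
// Returns the number of grains rendered.
size_t renderActive(const GrainSlot* grains, size_t count, float* out, size_t size) {
    bool first = true;
    size_t rendered = 0;
    for (size_t g = 0; g < count; ++g) {
        if (!grains[g].active) continue;
        if (first) {
            renderGrain<MixMode::Overwrite>(grains[g].samples, grains[g].gain, out, size);
            first = false;
        } else {
            renderGrain<MixMode::Accumulate>(grains[g].samples, grains[g].gain, out, size);
        }
        ++rendered;
    }
    if (first) {
        // No grain wrote anything, so the output still has to be cleared.
        for (size_t i = 0; i < size; ++i) out[i] = 0.0f;
    }
    return rendered;
}
```

The fallback clear matters: without it, a block with no active grains would output whatever was left in the buffer.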

Also, we can generate ~75 grains with this patch on OWL3, just to put things into perspective.

antisvin · January 14, 2022, 8:10pm

I’ve left you a PR, feel free to close it if not needed.

I’ll be experimenting with this a bit more, but it seems like on OWL2 you would get more mileage by using a mono granulator + reverb. In that case the second audio input could be used for V/Oct control. Btw, the reverb from clouds/rings takes 23-25% CPU.

damikyu · January 14, 2022, 8:53pm

Ah, thanks, I’m not sure how my approach to using complex floats differs from the one you tried, but I’m able to add two more grains by recording to a complex float buffer. I didn’t implement ComplexSignalGenerator though and it wouldn’t surprise me if the additional function calling overhead negates the win from faster data retrieval. I’ve merged your PR and we’ll see if I can get two more!

antisvin · January 14, 2022, 9:14pm

Keep’em coming!

It’s not like CFA didn’t work, but I ended up with the same performance as with an optimized float buffer. So if you mean that you had 20 grains running, then we had the same results.

Also, those ::generate/::process functions are virtual, which means that there’s an extra level of indirection: the method is resolved through the vtable dynamically at runtime rather than at compile time. This lets us store different generators/processors in an array, but may have a slight overhead. It’s usually negligible, unless avoiding virtual calls gives the compiler extra opportunities for optimization or code inlining.
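The difference can be sketched with a toy generator. These classes are illustrative stand-ins, not the OWL library's own types: the first function calls through a base-class reference, the second is a template that knows the concrete type at compile time.

```cpp
#include <cstddef>

// Toy base class with a virtual method, for illustration only.
struct Generator {
    virtual ~Generator() {}
    virtual float generate() = 0;
};

// `final` tells the compiler no further overrides exist.
struct Ramp final : Generator {
    float value = 0.0f;
    float step = 0.25f;
    float generate() override {
        float v = value;
        value += step;
        return v;
    }
};

// Dynamic dispatch: each call is resolved through the vtable at runtime.
float sumDynamic(Generator& g, size_t n) {
    float total = 0.0f;
    for (size_t i = 0; i < n; ++i) total += g.generate();
    return total;
}

// Static dispatch: the concrete type G is known at compile time, so the
// compiler is free to devirtualize and inline generate() into the loop.
template <typename G>
float sumStatic(G& g, size_t n) {
    float total = 0.0f;
    for (size_t i = 0; i < n; ++i) total += g.generate();
    return total;
}
```

Both versions produce the same result; whether the static one is measurably faster depends on the compiler and on how hot the loop is.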

damikyu · January 15, 2022, 12:18am

Ok, I merged in @antisvin’s PR, which improved things a little bit, but not enough to add another grain after the interleaved audio change. But, since the record buffer is now interleaved and we can do contiguous reads when generating grains, I implemented the scratch buffer idea that @mars suggested by using memcpy to move the section of the buffer that will be sampled into a static scratch buffer and then access the scratch buffer in the while loop. This means we don’t have to wrap indices when reading samples in the while loop. And even though we read more data out of the buffer at higher playback rates, this appears to perform better than reading from the buffer as needed.
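A minimal sketch of the scratch-buffer copy, assuming the record buffer is a plain float array treated as a ring (`copyToScratch` is a hypothetical name, not the function in the patch). The common case is a single memcpy; a second one is only needed when the requested span crosses the end of the ring.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Copy `count` floats starting at index `start` of a circular buffer `ring`
// (holding `ringSize` floats) into `scratch`, which is then read linearly
// with no index wrapping. Returns how many memcpy calls were needed (1 or 2).
int copyToScratch(const float* ring, size_t ringSize, size_t start,
                  float* scratch, size_t count) {
    assert(start < ringSize && count <= ringSize);
    size_t untilEnd = ringSize - start;
    if (untilEnd >= count) {
        std::memcpy(scratch, ring + start, count * sizeof(float));
        return 1;
    }
    // The span wraps: copy the tail of the ring, then continue from its start.
    std::memcpy(scratch, ring + start, untilEnd * sizeof(float));
    std::memcpy(scratch + untilEnd, ring, (count - untilEnd) * sizeof(float));
    return 2;
}
```

With interleaved stereo data, `start` and `count` would be expressed in floats, i.e. twice the frame index and frame count.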

The only (potential) issue with the scratch buffer is that it assumes a block size of 64. If that’s larger on some OWL devices, this version of the patch will crash them when pitch is turned higher than 1x.
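One way to turn that assumption into an explicit check is sketched below. The numbers are assumptions, not values from the patch: a 64-frame block, a hypothetical maximum playback rate of 4x, and two extra frames of margin for interpolation.

```cpp
#include <cstddef>

// Hypothetical sizing: 64-frame block, 4x maximum rate, 2 frames of margin.
constexpr size_t kScratchFrames = 64 * 4 + 2;

// Frames a grain has to read from the record buffer to render one block at
// the given playback rate (margin of 2 frames is an assumption).
inline size_t framesNeeded(size_t blockSize, float rate) {
    return static_cast<size_t>(blockSize * rate) + 2;
}

// True if the requested read fits into a scratch buffer of `scratchFrames`.
inline bool fitsScratch(size_t blockSize, float rate, size_t scratchFrames) {
    return framesNeeded(blockSize, rate) <= scratchFrames;
}
```

A patch could call `fitsScratch` with the actual block size at startup and either clamp the rate or fall back to direct buffer reads instead of overrunning the scratch buffer.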

I was able to increase max grains to 24 and with profiling turned off, performance seems to hit 94-95% in the worst case that I can find tweaking knobs. I worry a little bit that there is a very rare possibility that all grains will need to read from the record buffer twice because the data they need wraps across the end of the buffer, which will be enough of a performance hit to crash the device. But maybe the moment would be short enough to just generate a click without crashing.

damikyu · January 15, 2022, 1:32am

75 grains is awesome! Which devices are running OWL3?

mars · January 15, 2022, 3:03am

Amazing work!

antisvin: “Also, we can generate ~75 grains with this patch on OWL3”

I got 100 grains at 82% on an OWL3 even without the recent optimisations… I wasn’t going to mention it, since, well, we don’t sell them yet. Made of pure unobtanium.

I have to say though, I thought the original patch with 16 grains was already quite wonderful - more is not always better!

damikyu · January 15, 2022, 4:39am

Ah, yes, fair enough! I was also pretty satisfied with 16 and had it in that state for many months before finally making it public and sharing here. It’s been really fun to try to optimize the code and I really appreciate y’all’s help with that.

antisvin · January 15, 2022, 8:00am

I suspect that with 100 grains you might be getting buffer overruns under some combinations of settings or not spawning enough grains to fully load patch. However I didn’t run it on OWL exactly, will make another go to confirm how it behaves.

Generally, patch performance under heavy load on H7 becomes dependent on cache utilization and more variable, anything accessing SDRAM makes things worse as cache misses get more expensive.

It would be interesting to have a service call for asynchronously copying data using MDMA for things like this scratchbuffer approach.

tIB · January 22, 2022, 10:07am

Am I right in thinking the lich patches will run fine on a witch?

mars · January 22, 2022, 7:11pm

Yes absolutely. In some rare cases a patch might bump the Witch over the cpu limit.

You can always just try it out: click LOAD to upload and run the patch from RAM. The worst it can do is reboot your device.

Befaco · January 27, 2022, 4:14pm

Lovely. Just LOVELY!!!

Stibbons · March 14, 2022, 1:11am

Is it correct that the Dry/Wet and Feedback are controlled through MIDI only?

Any idea if the OWL allows for a button-hold + pot combination?

It would be great to have e.g. Button 1 + A pot to control Feedback and Button 2 + B pot to control Dry/Wet

Enjoying my time with this patch so far regardless so thanks!

damikyu · March 14, 2022, 1:56am

Yes unfortunately. This particular patch definitely wants more than four parameters available on the panel and I opted for what felt like the most important ones. I could do something like what you’ve suggested where the knob assignment changes based on button presses, but it wouldn’t be possible to separate that from the Freeze and Trigger functions that are already on the buttons. With Trigger this may not be such a big deal, but since Freeze is a toggle, I could see that getting annoying.

antisvin · March 14, 2022, 9:59am

Well that’s MIDI only on Lich, but for instance Genius gives you access to all 40 parameters in its UI and a modulation matrix routing CV for 2 inputs / outputs to any patch parameters. Or we have 20 CV channels that can be set as input or output on Magus (it has no gate inputs though, so patch would have to be adjusted to use parameters instead of buttons).

Stibbons · March 14, 2022, 1:38pm

Ah makes sense I completely overlooked Freeze as a toggle 🤦