Porting Openware to Daisy

Managed to get serial MIDI working. Old callback that we had for other devices wouldn’t work as UART register differs on H7. First attempt to use similar register fields wasn’t working for me either, but apparently the same functionality is covered by HAL functions. So this code got UART in order:

  /* An idle line means the sender has paused, so hand over what was received */
  if(__HAL_UART_GET_FLAG(&huart1, UART_FLAG_IDLE)){
    __HAL_UART_CLEAR_IDLEFLAG(&huart1);
    HAL_UART_RxCpltCallback(&huart1);
  }

I think this should be compatible with F4 and likely most if not all other STM MCUs.

Pinging @mars , I think I need some feedback on the following.

Looks like currently OwlProgram is able to process only stereo patches. I’m trying to make a new multi-channel alternative to SampleBuffer class that would be able to handle:

  1. more than 2 channels of audio (I guess something that Noctua would also need)
  2. data exchange with more than one codec without having to copy their data into a single buffer in firmware.

The latter would mean that we will skip parts of audio stream to handle double buffering correctly. I will generalize this to using up to 4 codecs, since it would only require reserving an extra bit in the audio format descriptor. Something like this would be used:

#define AUDIO_FORMAT_24B16_2X       0x10
#define AUDIO_FORMAT_24B24_2X       0x18
#define AUDIO_FORMAT_24B32          0x20
#define AUDIO_FORMAT_24B32_2X       0x22
#define AUDIO_FORMAT_24B32_4X       0x24
#define AUDIO_FORMAT_24B32_8X       0x28

#define AUDIO_CODEC_DUAL            0x40
#define AUDIO_CODEC_TRIPLE          0x80
#define AUDIO_CODEC_QUAD            0xC0

/*
 * This would work correctly only with 24B32* formats!
 * Others have inconsistent channels mask.
 */
#define AUDIO_CHANNELS_MASK         0x0F
#define AUDIO_CODEC_MASK            0xC0
#define AUDIO_FORMAT_MASK           0x3F
/* Codec bits store (count - 1), so a format with no codec flag means one codec */
#define AUDIO_CODECS(FORMAT)        ((((FORMAT) & AUDIO_CODEC_MASK) >> 6) + 1)
#define AUDIO_FORMAT(FORMAT)        ((FORMAT) & AUDIO_FORMAT_MASK)
#define AUDIO_CODEC_CHANNELS(FORMAT) ((FORMAT) & AUDIO_CHANNELS_MASK)
#define AUDIO_TOTAL_CHANNELS(FORMAT) (AUDIO_CODEC_CHANNELS(FORMAT) * AUDIO_CODECS(FORMAT))
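
As a quick sanity check of the descriptor layout above, here is a minimal sketch that decodes one value on a host machine. It assumes the two codec bits store the codec count minus one, so that a descriptor with no codec flag still means a single codec. The `EX_` constants and `ex_` functions are hypothetical names used only for this illustration and are not part of OpenWare.

```c
#include <stdint.h>

/* Values copied from the proposal above. */
#define EX_FORMAT_24B32_2X  0x22
#define EX_CODEC_DUAL       0x40

#define EX_CHANNELS_MASK    0x0F
#define EX_CODEC_MASK       0xC0

/* Assumption: bits 6-7 hold (number of codecs - 1). */
static inline unsigned ex_codecs(uint8_t format) {
  return ((format & EX_CODEC_MASK) >> 6) + 1u;
}

/* Channels per codec; only meaningful for the 24B32* formats. */
static inline unsigned ex_codec_channels(uint8_t format) {
  return format & EX_CHANNELS_MASK;
}

/* Channels summed over all codecs. */
static inline unsigned ex_total_channels(uint8_t format) {
  return ex_codec_channels(format) * ex_codecs(format);
}
```

With this encoding, `EX_FORMAT_24B32_2X | EX_CODEC_DUAL` (0x62) decodes to 2 codecs with 2 channels each, so 4 channels in total.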

Then the loop in PatchProgram would have to do something like this:

    for(;;){
      pv->programReady();
      for (int i = 0; i < AUDIO_CODECS(pv->audio_format); i++) {
        samples->setStartChannel(i * AUDIO_CODEC_CHANNELS(pv->audio_format));
        samples->split32(pv->audio_input, pv->audio_blocksize);
      }
      processor.setParameterValues(pv->parameters);
      processor.patch->processAudio(*samples);
      for (int i = 0; i < AUDIO_CODECS(pv->audio_format); i++) {
        samples->setStartChannel(i * AUDIO_CODEC_CHANNELS(pv->audio_format));
        samples->comb32(pv->audio_output);
      }
    }
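
To make the role of the start channel in the loop above more concrete, here is a rough host-side sketch of what a per-codec split could look like. It is not the OwlProgram implementation: `ex_buffer_t` and `ex_split32` are hypothetical, and the sample layout (a 24-bit value in the low bits of each 32-bit word, interleaved by channel) is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

#define EX_MAX_CHANNELS 8

/* Hypothetical multi-channel buffer: one float array per channel. */
typedef struct {
  float *channels[EX_MAX_CHANNELS];
  size_t start_channel; /* first channel written by the next split */
} ex_buffer_t;

/*
 * Deinterleave one codec's block into consecutive channels of `buf`,
 * beginning at buf->start_channel.
 * Assumption: each 32-bit word carries a 24-bit sample in its low bits.
 */
static void ex_split32(ex_buffer_t *buf, const int32_t *input,
                       size_t codec_channels, size_t blocksize) {
  const float scale = 1.0f / 2147483648.0f;
  for (size_t i = 0; i < blocksize; i++) {
    for (size_t ch = 0; ch < codec_channels; ch++) {
      /* Shift left by 8 so the 24-bit sign bit lands in bit 31. */
      int32_t s = (int32_t)((uint32_t)input[i * codec_channels + ch] << 8);
      buf->channels[buf->start_channel + ch][i] = (float)s * scale;
    }
  }
}
```

Calling it once per codec, with `start_channel` advanced by the codec's channel count in between, fills channels 0-1 from the first codec and 2-3 from the second without any intermediate merged buffer.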

This shouldn’t affect older devices - they would still be processed as stereo by old SampleBuffer class. We could theoretically also use new code for 32bit stereo processing, but I don’t think there’s any reason to do it.

Exposing codecs number overcomplicates things, so I went with plan B and just store codec outputs in a single merged buffer. This works and I can get results as 4 channel stream. This is based on just visualizing buffers with the scope UI. Of course I will still need a multi-channel aware StreamBuffer replacement to process this data in patches.

For some reason, I get a SAI DMA error on startup with 2 codecs, so I will look into this - maybe I’ll have to replace HAL tick based delay with NOP loop like most codecs do. There are no visible problems from this (could be a few buffers lost on startup). Once I solve this, it would be time to finally start work on patch loading.

Somehow I can’t get firmware to run after bootloader jump. Linker script is edited, VTOR is set to symbol exported from LD (spotted this in Magus sources). The FW runs only if used with a different linker script and stored on flash. I think there could be some peripheral init issue that I’ve missed in FW, otherwise it could be something wrong in LD script.

However, now I’m thinking that read-only QSPI is too much of a limitation:

  1. Can’t write patches from application code, need to go back to bootloader

  2. Requires separate settings storage on flash

  3. Filling that settings storage would require overwriting bootloader in order to erase their shared sector on defrag

So I will try to convert it to loading FW as BootROM, luckily there are cube sample projects for both use cases.

This would require allocating more RAM as we’ll have to load full FW image there. But this is probably acceptable. Flash would only be used by bootloader in such case, potentially we could have a bigger bootloader with additional features - loading FW ROM from SD card, backup FW ROM slot, display support, etc.

Another interesting side effect from BootROM support is that we could overwrite FW image on QSPI flash even from running application.

Flash storage code (converted to template usable with QSPI too) doesn’t handle junk data very well. Ran into case when it got hard fault due to reading data that was previously used for storing firmware image. Looks like it can dereference header variable that can point to invalid address in certain cases. There was a check that header address is less than storage end, but beginning wasn’t checked.

Besides this invalid address issue, alignment for block headers was checked only when they were written. This was the source of another hard fault, caused by dereferencing addresses without proper alignment.

With those issues fixed, junk data can be properly discarded if it ever reaches patch storage.
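
The two checks described above can be sketched roughly as follows. This is only an illustration of the idea, not the actual storage code: `ex_storage_t` and `ex_header_is_valid` are hypothetical names, and the required alignment is passed in as a parameter because the real value is not stated here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical storage region description. */
typedef struct {
  uintptr_t start;     /* first valid address of the storage area */
  uintptr_t end;       /* one past the last valid address */
  size_t    alignment; /* required block header alignment, power of two */
} ex_storage_t;

/*
 * Returns true only if a block header of `header_size` bytes at `addr`
 * lies fully inside the storage area and is correctly aligned, so it
 * can be dereferenced safely.
 */
static bool ex_header_is_valid(const ex_storage_t *s, uintptr_t addr,
                               size_t header_size) {
  if (addr < s->start || addr >= s->end)
    return false;                       /* lower and upper bound */
  if (header_size > s->end - addr)
    return false;                       /* header would run past the end */
  if ((addr & (s->alignment - 1)) != 0)
    return false;                       /* misaligned access would fault */
  return true;
}
```

Running a check like this before every header dereference, on reads as well as writes, is what lets junk data be skipped instead of followed.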

Dynamic patch loading works (in glorious quad channels!). Next stop - QSPI storage for patches (code is written, but probably needs some love to start working).

Patch storing / loading works with QSPI storage. The stack overflow I ran into was solved by increasing flash task stack size from 512 to 1024 words; it was most likely caused by QSPI writes being made in 256 byte pages.

Currently I’m using a trivial patch that copies inputs to outputs just to confirm that it runs. I can’t run serious DSP code, because FW is built without fast math tables due to limited space. Adding LUTs would require using bootrom that I’ve tried to get working earlier. I think that it was not working due to FW issues that got fixed later, so I’ll return to this at the very end.

Next milestones:

  • check if defragmentation code works

  • enable caching, which would require using separate memory section for DMA buffers with caching disabled

Some numbers about cache efficiency. I’ve measured CPU load for a trivial patch that basically copies inputs to outputs. It was very obvious that H7 core is severely throttled by IO from running code and data in D1 domain RAM.

  • No cache - 14% load
  • Instruction cache ON - 13% load
  • Data cache ON ~3.5% load
  • Instruction + data cache ON < 1% load

Which shows that data cache gives most improvements. Instruction cache is not particularly effective without data cache, but helps a lot when data cache is enabled.

Now, DMA exchange bypasses caching completely, so for all DMA buffers we have to do one of the following:

  1. Use a separate memory section that is not using cache (configured by MPU settings) - this will be done for most large buffers (audio data, probably also MIDI and digital bus)

  2. Use cache, but discard old value before reading - this is done for ADC values

  3. Use cache, but write it to memory before reading (using clean and invalidate call) - this is done for graphics params array. Haven’t fully understood how it interacts with cache yet, but this approach is the only way that I could get it to work correctly.

For option #2 we must align data to cache lines (32 bytes) in order to not invalidate data belonging to something else. For #3 this alignment is not mandatory, but it’s better to use it to avoid evicting data that is stored near the graphics parameters.
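
A small host-side sketch of the alignment arithmetic behind this, assuming the 32 byte line size mentioned above. The helper names and `ex_adc_values_t` are hypothetical; the actual macros used in the firmware may look different.

```c
#include <stddef.h>
#include <stdint.h>

#define EX_CACHE_LINE 32u /* data cache line size in bytes */

/* Round an address down to the start of its cache line. */
static inline uintptr_t ex_line_floor(uintptr_t addr) {
  return addr & ~(uintptr_t)(EX_CACHE_LINE - 1u);
}

/* Number of bytes, in whole cache lines, covering [addr, addr + size). */
static inline size_t ex_line_span(uintptr_t addr, size_t size) {
  uintptr_t first = ex_line_floor(addr);
  uintptr_t last = ex_line_floor(addr + size + EX_CACHE_LINE - 1u);
  return (size_t)(last - first);
}

/*
 * A buffer that owns its cache line: the alignment request also rounds
 * sizeof() up to 32, so nothing else can share the line.
 */
typedef struct {
  _Alignas(32) uint16_t values[4];
} ex_adc_values_t;
```

An 8 byte object at an unaligned address can straddle two lines, so invalidating it would also discard 56 bytes of neighbouring data. On the target, the floored address and the span are what would be handed to a CMSIS cache maintenance call such as `SCB_InvalidateDCache_by_Addr`.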

I’ve added 2 macros for changing object alignment and for moving to non-cacheable section.

I’ve compared that patch’s performance on Magus - 3% CPU used. Besides being 3 times faster, Daisy was processing twice as many channels. But performance on larger patches would likely drop along with caching efficiency.

I’ve finally run out of things that need fixing in firmware port, it’s time to start dealing with last major task - setting up bootrom loading. It’s sort of written, but wasn’t functional last time. Turns out that I was booting broken FW, so maybe not that much is left to get everything running.

Getting a usable bootloader turned out too complicated. It can’t be done with the simpler approach of execution from QSPI, because it can either read and execute code or be written to, but not both at once. Using QSPI as ROM and uploading data to external memory is another approach suggested by ST. This should work, but initialization hangs when it resets RCC clock. I’m not sure how to solve this yet, plus it seems to require initializing FMC with registers before HAL is started.

So for now I will try to fit everything in flash - currently at 88k out of 128k. In order to allocate math LUTs I’ve written a script to convert binary data to resources (i.e. add resource header and use little endian format). It can also parse data from C headers, so I can convert pow/log tables.

Once I confirm that this script works correctly, I will add some code to preload data from resources to RAM. And that should be the last major step left till FW is usable.

Made some experiments with using ITCM/DTCM memory. ITCM can be used for loading most common code via linker script + custom startup, confirmed that this works. DTCM will store data/bss/stack and maybe I will use half of it for fast heap section like it was done for CCM on older devices.

Figured out that a nice progress bar would be useful, especially since we can have larger patches/resources (up to 512k in theory)

Guess I’ll be porting this to Magus too.

Made UI page for gates state visualization:

Output state can be triggered with encoder click

Made comparison for CPU usage of stock patches from Magus running on Daisy:

Kickbox: 15% vs 55%
LorenzFM: 8% vs 27%

So around x3.5 better CPU performance. This is with cache enabled, obviously.

My idea to write bootloader based on firmware header worked, it can boot finally. Remaining minor issues:

  • roll back some stuff intended to use LUTs from resources - this is no longer needed

  • find a way to store settings only when no patch is running - either defer saving until patch is stopped or pause it. this is to prevent read errors when QSPI is not in memory mapped mode

  • there seems to be some issue when default parameter values are set from the patch. might be a bug introduced in UI controller.

  • debug build only runs if I run it when flashing with the programmer, but not on reboot. I suspect it could be related to semihosting being enabled.

  • also, debug build gets stuck when booting if the USB cable is plugged

It looks like I’ve fixed all outstanding issues. The USB cable thing required turning off OTG_FS_IRQn in bootloader before jumping to application; it looks like not doing this causes an interrupt storm that prevents device from booting. I don’t think anything similar is happening on F4 devices. I’m using the same thing I’ve seen in Magus to set VTOR - SCB->VTOR = (uint32_t)&_ISR_VECTOR; and made sure that it gets changed, so not sure why the old interrupt is firing before it’s enabled by the application itself.

@mars I’ll probably announce this on Daisy forums within a week from now and make unofficial firmware releases. There would likely appear a certain number of clueless users here after that. Let me know if you think that it’s ok. I’ll start cleaning up final code version and preparing PRs once the dust settles.

Cool, I’ll have to give it a go - haven’t unpacked my Daisy yet!

I think it’s better to keep discussions about running OpenWare on the Daisy platform over on the Daisy forum, as it is really for Daisy users.

I’ve had some weirdness switching context on the H7 too, it seems a bit finicky sometimes. They recommend clearing pending interrupts before jumping, see e.g. ST Community

and you’ll want to flush the cache first, too!

Yeah, I agree that discussing the port should be done on the other forum, but I’m not so sure about people writing patches and having typical newbie questions. It makes sense to consolidate such info here. So if that doesn’t sound like a bad idea, I could suggest that they ask questions about writing patches here (or do the opposite if that’s inconvenient).

So which device have you got? This port is for Patch specifically. I imagine it won’t take much effort to make a separate project for using it with just Seed. Or it’s possible to add an encoder and a SSD1309 OLED to “upgrade” it to something that can be used on a breadboard with current FW (there’s support for building it with only first SAI).

What’s interesting about your link is the part setting “NVIC->ICER” - I think it wasn’t present in the sample bootloader app from Cube that I’ve seen. I will test with this code instead of disabling only USB interrupt, but this looks more reliable.
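
A minimal sketch of that kind of NVIC cleanup, written so it can be exercised on a host. `ex_quiesce_irqs` is a hypothetical helper, and it assumes the usual CMSIS layout where `NVIC->ICER` and `NVIC->ICPR` are arrays of 8 words on Cortex-M7. On real hardware these registers are write-1-to-clear; plain arrays on a host simply store the written value.

```c
#include <stddef.h>
#include <stdint.h>

#define EX_NVIC_WORDS 8 /* assumed number of ICER/ICPR words */

/*
 * Disable every interrupt and drop anything already pending.
 * Writing 1 to a bit in ICER disables that IRQ; writing 1 to a bit
 * in ICPR clears its pending state. Zero bits have no effect.
 */
static void ex_quiesce_irqs(volatile uint32_t *icer, volatile uint32_t *icpr) {
  for (size_t i = 0; i < EX_NVIC_WORDS; i++) {
    icer[i] = 0xFFFFFFFFu;
    icpr[i] = 0xFFFFFFFFu;
  }
}
```

In a bootloader this would be called as `ex_quiesce_irqs(NVIC->ICER, NVIC->ICPR);` right before setting VTOR and jumping, which covers the USB interrupt along with everything else.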

As for cache, I’ve decided it’s better not to enable it in bootloader, it normally takes a fraction of second to load FW and I don’t want to risk any surprises from caching at this stage.

Built all Fascination Machine patches without problems. MCU usage is 20-33% depending on the patch.

Unfortunately I’m getting hardfaults 1-2 times per hour, it wasn’t happening with simpler patches (or maybe I didn’t let FW run long enough to notice this). I think this could be something cache-related.

I’ve caught assertion error a few times in FascinationMachine patches (or at least FM IV) after patch runs for some time: "Heavy assertion failed in void __hv_tabwrite_f(SignalTabwr line 40". This is raised by the following line:

  hv_assert((o->head + HV_N_SIMD) <= hTable_getSize(o->table)); // assert that the table bounds are respected

I suspect this may be an issue with HVCC itself, so will try to run the same patches on Magus for a while to see if it can be reproduced.

Agreed - patch discussions are very welcome here!