Faust memory allocator

Very nice !

About the memory hierarchy issue (fast SRAM and slower SDRAM, yes?), we are currently improving the Faust-generated C/C++ code so that parts of the DSP structure can stay in fast SRAM, while other parts (like long delay lines) go in the slower SDRAM.

This is done in the context of the FAST project, so running Faust code on FPGA with a one-sample model. But we could extend it to also work in the regular “buffer of some samples” model.

Faust can indeed generate double-aware C/C++ code by using the -double option at compile time, and by adapting the input/output audio buffers given to compute in the architecture file.
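For illustration, a minimal sketch of that buffer adaptation on the host side. Here `mydsp` is a stand-in for the Faust-generated class (compiled with -double, so `compute` takes `double**` buffers), not actual generated code, and `compute_from_float` is a hypothetical helper name:

```cpp
#include <vector>

// Stand-in for the Faust-generated class when compiled with -double.
struct mydsp {
    void compute(int count, double** inputs, double** outputs) {
        // Placeholder processing: halve the input.
        for (int i = 0; i < count; ++i) outputs[0][i] = inputs[0][i] * 0.5;
    }
};

// Convert the host's float buffers to double, call compute, convert back.
void compute_from_float(mydsp& dsp, int count, float* in, float* out) {
    std::vector<double> din(count), dout(count);
    for (int i = 0; i < count; ++i) din[i] = in[i];
    double* ins[1]  = { din.data() };
    double* outs[1] = { dout.data() };
    dsp.compute(count, ins, outs);
    for (int i = 0; i < count; ++i) out[i] = static_cast<float>(dout[i]);
}
```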

It would be interesting to check how far this can improve things. The current situation is that on OWL, dynamic memory allocation happens in order of increasing physical address. And on the MCUs used by both old and new OWL platforms, this mostly matches RAM access speed. I.e. on OWL3 (STM32H7 MCU) we have some smaller areas at lower addresses that connect directly to the MCU core, followed by a few larger areas connected via fast buses, with a slower bus used for SDRAM access at the end.

I obviously don’t understand all the details of FAUST internals, but I’d say that if it’s possible to “predict” which objects may need more frequent memory access and allocate them earlier, there would be some improvement due to those buffers ending up in faster memory sections.

Such memory address ordering is obviously specific to ARM Cortex-M, but even on other architectures it should be possible to allocate to faster areas if allocations are prioritized in advance and memory is heterogeneous.

This sort of thing happens automatically now, because long delay lines won’t fit in internal SRAM (the largest internal SRAM area is 512kb, others are smaller). I’m rather thinking about the opposite scenario, where we partially fill that 512kb chunk with something infrequently used and have to move a delay line to SDRAM even though it would have fit if its allocation had happened earlier.

Also, there’s a theoretical possibility of adding an extra parameter to the memory allocator used on OWL that specifies “allocation priority” or something like that. It could then be used to reallocate previously allocated low-priority data to slower sections and place objects that need faster access in their place. However, this would make allocation rather slow (it looks like quadratic growth in the number of checks needed for reallocation) and feels too hacky.

Yes, we can do that: for each array (used for recursive computations or delay lines), we know the number of R/W accesses and the array size. So we could compute a (R/W accesses)/size ratio and sort and allocate all arrays using this criterion. Or imagine a better model to sort the arrays.
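As a sketch, sorting by that ratio could look like this (`ArrayInfo` and its fields are hypothetical names for illustration, not actual compiler structures):

```cpp
#include <algorithm>
#include <vector>

// Hypothetical descriptor for a Faust-generated array (delay line or recursion buffer).
struct ArrayInfo {
    const char* name;
    size_t size;       // bytes
    size_t rw_access;  // R/W accesses per computed frame
};

// Sort so that arrays with the highest (access / size) ratio come first,
// i.e. they get the fastest remaining memory region when allocated in order.
void sort_by_density(std::vector<ArrayInfo>& arrays) {
    std::sort(arrays.begin(), arrays.end(),
              [](const ArrayInfo& a, const ArrayInfo& b) {
                  return (double)a.rw_access / (double)a.size >
                         (double)b.rw_access / (double)b.size;
              });
}
```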

What do you mean? Does this happen in OWL architecture itself? Any code to show ?

Ok I see that I’ve misinterpreted a bit what you were asking about. My overall point was mostly that if allocations are prioritized, slower memory areas will get used later. So it sounds like such prioritization is possible and this is very encouraging.

This definitely needs some experiments to choose good heuristics, i.e. the /size part means we’re always deprioritizing larger arrays compared to smaller ones with the same access count. And this brings up the problem I’ve mentioned before: we might end up not giving a buffer a chance to be allocated in fast memory simply because it got trashed by smaller buffers, but we would be able to place both of them in SRAM if the allocation order was different.

While we do allocate memory in order of speed, the amount of available memory in each region can differ, and in certain cases it becomes important to consider buffer size and access frequency separately rather than as a ratio. So the overall ideal process looks like this to me:

  1. medium-size buffers (<512kb minus a few bytes) will benefit from being allocated first, unless their R + W access count is very low
  2. smaller (probably <64-128kb) but frequently used buffers should follow, and for them using a plain (R + W access) / size formula may be the way to go.
  3. everything else can be allocated later and would fill the remaining SRAM or go to SDRAM without making much difference. Using the ratio is probably ok, but it’s possible that allocating them simply by ascending size would place the maximum number of objects in SRAM.
  4. large buffers end up in SDRAM either way; it doesn’t matter when you try to allocate them

So it seems like being able to specify a few parameters for setting allocation thresholds (used in points 1 & 2 above) could be beneficial for adapting allocation to heterogeneous memory on different architectures. This is kind of similar to the array size rounding for faster bounds checks that was added a while ago.
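A rough sketch of the tiered ranking in points 1-4 above, with purely illustrative threshold constants (not actual OWL firmware values), could look like:

```cpp
#include <cstddef>

// Illustrative thresholds only; real values would be tuned per OWL revision.
constexpr size_t kFastRegionLimit  = 512 * 1024; // largest internal SRAM area
constexpr size_t kSmallBufferLimit = 128 * 1024;

// Lower rank = allocated earlier. A crude encoding of points 1-4 above.
double allocation_rank(size_t size, size_t rw_access) {
    if (size >= kFastRegionLimit)
        return 1e9;            // 4. large buffers end up in SDRAM either way
    if (size >= kSmallBufferLimit && rw_access > 0)
        return 0;              // 1. medium buffers go first (unless barely accessed)
    if (rw_access > 0)         // 2. small buffers, ranked by access density
        return 1.0 / ((double)rw_access / (double)size + 1.0);
    return 100.0 + (double)size; // 3. the rest, by ascending size
}
```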

We had a similar discussion a while ago. I suggest we continue to refine this custom memory manager idea, by adding a notion like access frequency in the allocate function? Anything more that would be needed?

Would it be possible to add an extra method to allow the custom allocator to specify allocation priority based on our internal logic (like what I’ve described in the post above)? The ::allocate method itself doesn’t look like a good fit, since it just stores data sequentially in the order it gets called. Simply providing extra data would require constantly checking previous allocations and reallocating old data whenever we determine that a new allocation should happen in faster memory. But if we can separate allocation priority estimation from the actual allocation, we can probably improve things by ensuring that allocations happen in the desired order.

Could you possibly specify what you have in mind directly as a reworked C++ version? This would definitely help.

Well the general idea is to have an extra method like this:

struct dsp_memory_manager {
    // ...
    virtual size_t get_allocation_priority(size_t size, size_t io_access) { return 0; }
    // ...
};

That would be called in advance for every expected allocation, and the results would be used to sort the allocation order later. The default behavior implies no change to the sorting order, but a custom allocator could override it.
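For example, a custom manager could override the proposed method like this (a hypothetical OWL-side sketch repeating the proposed base class; the ranking formula is just an illustration):

```cpp
#include <cstddef>
#include <cstdint>

// Proposed base class, as discussed above (not the final Faust API).
struct dsp_memory_manager {
    virtual ~dsp_memory_manager() {}
    virtual size_t get_allocation_priority(size_t size, size_t io_access) { return 0; }
};

// Hypothetical OWL-side override: rank allocations by access density so that
// frequently accessed buffers are allocated (and placed in fast memory) first.
struct owl_memory_manager : dsp_memory_manager {
    size_t get_allocation_priority(size_t size, size_t io_access) override {
        if (size == 0) return SIZE_MAX;            // nothing to place
        return SIZE_MAX - io_access * 1024 / size; // denser access -> earlier
    }
};
```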

So as antisvin says, our current approach allocates heap roughly from fast to slow, and from small to big, on a first-come, first-served basis. So allocating high access frequency memory first would already go a long way towards optimising it, for us at least.

However, to fully optimise the memory layout, the allocator would have to know in advance what all the required allocations are, along with their access frequencies.

It then becomes a fairly complex optimisation, which still won’t yield perfect results, because of a number of additional complications:

  • asymmetric read/write/erase times
  • variations in read (or write) block sizes
  • fifo sizes, burst reads, L1 cache, pipelines…

And then to consider that there are many different potential memory types: CCM, SRAM, SDRAM, memory mapped QSPI flash, SD cards…

The ideal solution would take all this into account, preferably at compile time :slight_smile:
Meanwhile I think what we have is pretty good.

To refine it we could either:

  1. sequence allocations by priority (higher access frequency first), or
  2. add a pre-allocation callback which tells the memory manager what the access frequency and allocations are

The first option could be done either, as Stas suggests, by letting the memory manager decide the priority, or simply by some agreed algorithm. But it probably requires the calling code to do a lot more heavy lifting.
The second option might be less work: all the allocations could be called through twice, first as a dry run to tell the memory manager what is coming (including the access frequency and any other available metadata, like write-once or write-many). The second run would just have to be done in the same order, and then the custom memory manager can allocate each one, having already decided where to put it.
(Either way, memory would have to be allocated to manage the allocations :exploding_head:)

I don’t know how feasible either is from the FAUST point of view.

Hold your horses, we’re only talking about dynamic memory allocation here :wink:

Not really always from small to big! The full order of memory on chip would be something like this:

Region    Address     Size
ITCMRAM   0x00001000  60k
DTCMRAM   0x2000c000  80k, or (80k - patch size), depending on whether we link to D2
D1 SRAM   0x24000000  (512k - patch size), or 512k, depending on whether we link to D2
D2 SRAM   0x30000000  256k
D3 SRAM   0x38000000  64k, unused ATM
SDRAM     0xD0000000  8M

As you can see, size does not always go up. But access speed getting slower is consistent with the address direction. So ultimately we want to consider both object size and the amount of IO access to decide which allocations happen earlier.

I think that we would get the best results by considering the actual memory segment info when the patch runs. This way we would be able to choose the allocation order dynamically based on the hardware used. At compile time this is not known yet, and we would have to target some hardcoded thresholds used by a particular OWL version, giving suboptimal results if the same patch runs on another platform version. So my conclusion is that doing it all at compile time would be unfeasible.

The calling code itself (if you mean FAUST internals) shouldn’t have to do that much heavy lifting; it just needs to make an extra pass to ask our allocator for the allocation order, based on the info that it can provide, then reorder its allocations as we tell it.

This was basically my point in last year’s description: a 2-pass model, giving all the info that the compiler can provide, and letting the memory manager do the hard job (knowing the memory layout + the access information given by the first pass).
After this discussion I have a clearer understanding of the kind of info the Faust compiler can provide, that is R/W access statistics, and size of course.
I’ll now work on this idea.

WIP here: Custom memory manager - Google Docs

Can you @mars and @antisvin have a look, test and give feedback? (feel free to criticise and suggest improvements :sweat_smile:)

This matches what we’ve discussed. I’ll soon start work on an implementation that uses the memory segment info used by our allocator.

    virtual void info(size_t size, size_t io_access) {}

I would say it should be something like

    virtual size_t info(size_t size, size_t io_access) { return 0; }

if we want to provide default implementation that does nothing.

I don’t know how much detail can be provided to the allocator; maybe read and write access numbers can be separated if this is known in advance. We will likely end up summing them on OWL, but there might be some architectures where writes are much more expensive than reads.

I don’t see the point of having size_t as return value, can you explain?

Yes, I can distinguish read from write.

I thought the point of this method is to return estimated allocation cost to whatever calls it, is it supposed to do something else instead?

Distinguish read from write => done

    /**
     * Inform the Memory Manager with information on a given memory zone
     * when used to compute one frame.
     * @param size - the size in bytes of the allocated zone
     * @param read_access - the number of Read accesses to the zone
     * @param write_access - the number of Write accesses to the zone
     */
    virtual void info(size_t size, size_t read_access, size_t write_access) {}

The point of this info function is to give information to the MM on a given memory zone. The MM is then supposed to do something with that, and estimate the memory zone’s use cost itself.

Look at the C++ generated code using -mem.

I still don’t quite understand what “prepare memory layout for allocation” is supposed to mean. My idea was to look at the available memory segments and return some value based on the requested size, the available segments’ sizes, and the amount of IO that is expected. We naturally need to return that value from this virtual method to set the sorting order.

Have you looked at the C++ generated code ? We can also continue the discussion on Faust Slack for faster interaction.

I’ve seen the code in that branch, but it’s obvious that I’m missing something about how this method should be implemented. It would help if you could post some pseudo-code in the #owl channel in Slack.

I think the idea here is that the sorting is done entirely by the Memory Manager. Therefore the information is provided before allocation, and no return value is required.
To optimise, the MM will need to create a list of allocations, sort it, and assign them to different segments. Then when the allocations come in, it looks them up in its list.
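A minimal sketch of that scheme, with hypothetical names (matching incoming allocations to planned zones by size alone, and using malloc instead of real per-segment placement, are simplifications):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <vector>

struct MyManager {
    struct Zone { size_t size, reads, writes; void* addr = nullptr; };
    std::vector<Zone> zones;

    // Called once per memory zone before any allocation (the proposed info() API).
    void info(size_t size, size_t read_access, size_t write_access) {
        zones.push_back({size, read_access, write_access});
    }

    // Called after all info() calls: sort by access density and pre-place every zone.
    void begin() {
        std::stable_sort(zones.begin(), zones.end(),
                         [](const Zone& a, const Zone& b) {
                             return (double)(a.reads + a.writes) / (double)a.size >
                                    (double)(b.reads + b.writes) / (double)b.size;
                         });
        // Stand-in for assigning the fastest segments to the densest zones.
        for (auto& z : zones) z.addr = std::malloc(z.size);
    }

    // Allocations arrive later; serve each from the matching planned zone.
    void* allocate(size_t size) {
        for (auto& z : zones)
            if (z.addr && z.size == size) { void* p = z.addr; z.addr = nullptr; return p; }
        return std::malloc(size); // fallback for unannounced allocations
    }
};
```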

@sletz one thing that would be very useful would be an initial call to say how many allocations, i.e. how many info() calls, will be made.