Tensor packaging — staged constants & multi-arena layout
This how-to is a practical companion to the tensor backing model reference. It covers when to keep weights cold, when to stage them into TCM/SRAM, how multi-arena layouts compose, and how to drive hydration from your application or DMA path.
TL;DR
- All tensors (scratch, persistent, constant) live in per-memory arenas. There are no per-tensor static symbols.
- A constant is cold when source memory == runtime memory (kernels read in place; one read-only blob).
- A constant is staged when those memories differ (constant_destination_memory: set to a different memory): a cold source blob is copied into a writable runtime arena before inference.
- All constants destined for the same memory share one arena with one source memory. That guarantees a single contiguous source blob → a single bulk transfer per arena (DMA-friendly).
- Hydration is driven from <prefix>_model_init between <prefix>_context_init and the operator init loop. Override the weak <prefix>_hydrate_constants symbol when you need DMA / async pre-stage / model-swap behavior.
When to choose cold vs staged
| Scenario | Recommendation |
|---|---|
| Weights fit in MRAM (XIP-capable internal flash) and the kernel can read in place at acceptable energy | Cold. Simpler, no hydration step, no RAM cost. |
| Weights live in PSRAM/external storage with high per-access energy or no XIP window | Staged → SRAM/DTCM. Pay one bulk transfer at boot (or model-swap), then read from low-energy memory inside the inference loop. |
| Operator is matmul-/conv-heavy and reads weights every cycle | Staged → DTCM if the weight working set fits; otherwise staged → SRAM. |
| Weights are tiny scalars/biases shared across ops | Leave cold in MRAM; the staging cost per arena is fixed and tiny constants don't justify a separate arena. |
The energy win from staging concentrates in the read-many-times-per-inference case. For one-shot reads (e.g. a single small bias), cold is usually cheaper.
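A rough break-even estimate makes the read-many-times argument concrete. The sketch below assumes a simple per-byte energy model; the struct, the function names, and any figures you feed it are hypothetical and not part of the generated module, so substitute numbers measured on your own platform.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-byte energy costs in picojoules (assumed model). */
typedef struct {
    uint32_t cold_read_pj;  /* read one byte in place from the cold memory */
    uint32_t hot_read_pj;   /* read one byte from the staged runtime memory */
    uint32_t hot_write_pj;  /* write one byte into the runtime memory while staging */
} energy_model_t;

/* Energy to serve `reads_per_byte` reads of a `size`-byte constant in place. */
static uint64_t cold_energy_pj(const energy_model_t *m, uint64_t size,
                               uint64_t reads_per_byte) {
    return size * reads_per_byte * m->cold_read_pj;
}

/* Energy for one bulk transfer (one cold read plus one hot write per byte)
 * followed by the same number of reads served from the runtime memory. */
static uint64_t staged_energy_pj(const energy_model_t *m, uint64_t size,
                                 uint64_t reads_per_byte) {
    uint64_t transfer = size * ((uint64_t)m->cold_read_pj + m->hot_write_pj);
    return transfer + size * reads_per_byte * m->hot_read_pj;
}

/* True when staging costs less total energy than reading in place. */
static bool staging_pays_off(const energy_model_t *m, uint64_t size,
                             uint64_t reads_per_byte) {
    return staged_energy_pj(m, size, reads_per_byte) <
           cold_energy_pj(m, size, reads_per_byte);
}
```

Here reads_per_byte is the total number of times each weight byte is read between hydrations, so a bias read once per boot lands on the cold side while a conv kernel read on every inference lands on the staged side.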
Recipe 1 — Stage all weights from MRAM to DTCM
The simplest two-tier setup. Bias and weights both come from MRAM (cold internal flash) and are staged into DTCM at boot.
What this produces:
- One cold blob <prefix>_arena_const_dtcm__source[] in .ddr_rodata-class storage, sized to the sum of all constant tensors after alignment padding.
- One writable runtime arena <prefix>_arena_const_dtcm[] in DTCM.
- A <prefix>_arena_const_dtcm_buffer static (when allocate_arenas=true) sized exactly to match.
- <prefix>_hydrate_constants(&ctx) does a single memcpy(arena_const_dtcm, arena_const_dtcm__source, size).
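The single bulk copy can be pictured with stand-in buffers. This is a minimal sketch, not the generated code: the demo_ names, the 32-byte size, and the blob contents are made up for illustration, and the real symbols are placed by your linker script rather than declared like this.

```c
#include <stdint.h>
#include <string.h>

enum { DEMO_CONST_SIZE = 32 };  /* hypothetical arena size */

/* Stand-in for the cold, read-only source blob (e.g. in MRAM). */
static const uint8_t demo_arena_const_dtcm__source[DEMO_CONST_SIZE] = {
    1, 2, 3, 4, 5, 6, 7, 8  /* remaining bytes default to zero */
};

/* Stand-in for the writable runtime arena (e.g. in DTCM). */
static _Alignas(16) uint8_t demo_arena_const_dtcm[DEMO_CONST_SIZE];

/* One memcpy moves every constant in the arena, because source and
 * destination share the same internal layout. */
static int32_t demo_hydrate_constants(void) {
    memcpy(demo_arena_const_dtcm, demo_arena_const_dtcm__source,
           DEMO_CONST_SIZE);
    return 0;
}
```

After the call, kernels would read weights from demo_arena_const_dtcm only; the source blob is not touched again until the next hydration.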
Verify your case with the verbose log:
$ helia-aot convert --path my_model.yaml --verbose 2
…
Constant arenas (1):
• const_dtcm size=8704 B shape=staged src=mram -> dst=dtcm consts=2
(staged constants hydrated by <prefix>_model_init; ...)
Recipe 2 — Mix cold and staged
Routing is per-tensor. Stage the heavy weights into DTCM but leave the bias cold in MRAM:
memory:
  tensors:
    - type: constant
      id: "1"  # weights tensor id
      attributes:
        memory: MRAM
        constant_destination_memory: DTCM
    - type: constant
      id: "2"  # bias
      attributes:
        memory: MRAM
        # no constant_destination_memory → cold (read in place)
You'll get two constant arenas: one staged const_dtcm (with a
source blob in MRAM) and one cold const_mram (the blob is the
runtime memory; kernels read it directly). The hydrate helper
copies only the staged arena.
Recipe 3 — Caller-supplied arenas (allocate_arenas: false)
The same staging policy, but the application owns the buffers:
memory:
  allocate_arenas: false
  tensors:
    - type: constant
      attributes:
        memory: MRAM
        constant_destination_memory: DTCM
The generated module exposes:
- <prefix>_bind_arena(region, buffer, size): bind one arena. Returns 1 if region is out of range, 2 on NULL buffer, 3 if size is below <prefix>_arena_sizes[region], 4 if buffer is not aligned to at least <prefix>_arena_alignments[region] bytes.
- <prefix>_bind_arenas(buffers, sizes, n): bulk-bind every region; n must equal <prefix>_num_arena_buffers.
- <prefix>_arena_sizes[<prefix>_num_arena_buffers]: minimum capacity per region.
- <prefix>_arena_alignments[<prefix>_num_arena_buffers]: required pointer alignment per region. Each value is a power of two; internal-arena builds satisfy these by emitting alignas(...) on every buffer.
- <prefix>_arena_<mem>_alignment / <prefix>_arena_persistent_<mem>_alignment / <prefix>_arena_const_<mem>_alignment: compile-time #define macros mirroring the runtime arena_alignments[] table, suitable for alignas(...) / __attribute__((aligned(...))) on caller-owned buffers.
- <prefix>_arena_<mem> / <prefix>_arena_persistent_<mem> / <prefix>_arena_const_<mem>: the <prefix>_arena_region_t enumerators identifying each region.

Bind state is module-global, so none of these symbols take the model context.
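The validation behind those status codes can be sketched as follows. This is a mock, not the generated implementation: the demo_ names and table values are hypothetical, and both the order of the checks and the 0-on-success return are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

enum { DEMO_NUM_ARENAS = 2 };

/* Hypothetical per-region tables; the generated module emits its own. */
static const size_t demo_arena_sizes[DEMO_NUM_ARENAS]      = { 256, 8704 };
static const size_t demo_arena_alignments[DEMO_NUM_ARENAS] = { 16, 16 };

/* Module-global bind table, which is why no model context is passed. */
static void *demo_bound[DEMO_NUM_ARENAS];

static int32_t demo_bind_arena(size_t region, void *buffer, size_t size) {
    if (region >= DEMO_NUM_ARENAS) {
        return 1;  /* region out of range */
    }
    if (buffer == NULL) {
        return 2;  /* NULL buffer */
    }
    if (size < demo_arena_sizes[region]) {
        return 3;  /* buffer smaller than the region's minimum capacity */
    }
    /* Alignments are powers of two, so a mask test is sufficient. */
    if (((uintptr_t)buffer & (demo_arena_alignments[region] - 1)) != 0) {
        return 4;  /* buffer not aligned for this region */
    }
    demo_bound[region] = buffer;
    return 0;      /* assumed success code */
}
```

Checking the return value of every bind call at startup is cheap and turns a misaligned or undersized buffer into an explicit boot failure rather than a corrupted inference.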
Application sequence — every region must be bound before
<prefix>_model_init, which calls <prefix>_context_init
internally and rejects unbound regions with a non-zero status:
uint8_t scratch[<prefix>_arena_dtcm_size]
__attribute__((aligned(<prefix>_arena_dtcm_alignment)));
uint8_t persistent[<prefix>_arena_persistent_dtcm_size]
__attribute__((aligned(<prefix>_arena_persistent_dtcm_alignment)));
uint8_t const_dtcm[<prefix>_arena_const_dtcm_size]
__attribute__((aligned(<prefix>_arena_const_dtcm_alignment)));
<prefix>_bind_arena(<prefix>_arena_dtcm, scratch, sizeof(scratch));
<prefix>_bind_arena(<prefix>_arena_persistent_dtcm, persistent, sizeof(persistent));
<prefix>_bind_arena(<prefix>_arena_const_dtcm, const_dtcm, sizeof(const_dtcm));
<prefix>_model_init(&ctx); // wires ctx, memsets persistents, hydrates staged consts, runs op-init
<prefix>_model_run(&ctx);
The cold-storage source blob is staged into the writable runtime
arena by <prefix>_hydrate_constants. In external-arena mode the
caller must bind every region — staged destination arenas as
writable scratch buffers (above), and cold constant arenas to
the bytes loaded from the emitted <prefix>_arena_const_<mem>__blob.bin
sidecar — before calling <prefix>_model_init.
Recipe 4 — DMA-driven hydration
<prefix>_hydrate_constants is declared __attribute__((weak)).
Replace it with a strong symbol that drives DMA / decompression /
pre-staged HBLRAM and you're done — the rest of the contract
(invocation from model_init, blob descriptor table, idempotent
default body) is unchanged:
// In your application translation unit:
int32_t my_module_hydrate_constants(my_module_model_context_t *ctx) {
    for (size_t i = 0; i < my_module_num_constant_blobs; ++i) {
        const my_module_constant_blob_t *b = &my_module_constant_blobs[i];
        // Each blob targets a single destination arena starting at
        // offset 0 (per-arena source contiguity is guaranteed by the
        // planner). ctx->arena_buffers is populated from the bind
        // table by context_init.
        void *dst = (void *)ctx->arena_buffers[b->arena];
        // Submit DMA from b->source -> dst, length b->size, then wait.
        dma_copy_blocking(dst, b->source, b->size);
    }
    my_module_mark_hydrated();  // arm the model_run guard
    return 0;
}
The my_module_mark_hydrated() call is the contract override
authors must respect: my_module_model_run returns 200 until
the latch is set. The default weak helper sets it for you;
strong-symbol replacements must set it explicitly.
If you'd rather call hydration on your own schedule (warm resume,
model swap, lazy first-use), invoke
<prefix>_hydrate_constants(&ctx) directly between
<prefix>_context_init and <prefix>_model_init — the default
body is idempotent, so model_init's call becomes a no-op.
<prefix>_clear_hydrated() can be used to force re-hydration
before the next run (model swap) and <prefix>_is_hydrated()
lets the caller poll the latch.
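The latch contract can be modeled in a few lines. This is a sketch under assumptions, not the generated code: the demo_ names are hypothetical, the bool return of the poll function is a guess, and skipping the copy when the latch is already set is just one way an idempotent default body could behave.

```c
#include <stdbool.h>
#include <stdint.h>

static bool demo_hydrated = false;    /* module-global latch */
static uint32_t demo_copy_count = 0;  /* counts real transfers, for illustration */

static void demo_mark_hydrated(void)  { demo_hydrated = true; }
static void demo_clear_hydrated(void) { demo_hydrated = false; }
static bool demo_is_hydrated(void)    { return demo_hydrated; }

/* Idempotent default body (assumed behavior): a second call is a no-op. */
static int32_t demo_hydrate_constants(void) {
    if (demo_is_hydrated()) {
        return 0;
    }
    demo_copy_count++;     /* the bulk copy or DMA would happen here */
    demo_mark_hydrated();  /* arm the model_run guard */
    return 0;
}

/* model_run refuses to execute until the latch is set. */
static int32_t demo_model_run(void) {
    if (!demo_is_hydrated()) {
        return 200;
    }
    return 0;  /* inference would run here */
}
```

For a model swap, the sequence would be demo_clear_hydrated(), replace the source blob, then demo_hydrate_constants() again before the next run.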
Legacy memory.auto_hydrate_constants flag
The flag is retained for backwards compatibility but is now
a runtime no-op: model_init always hydrates staged-constant
arenas before op-init. This eliminates a class of races where
a kernel's _init hook reads constants before manual
hydration runs.
Recipe 5 — Tighter alignment for DMA-backed paths
Some DMA engines and cache-line-coherent transfers need stronger
alignment than the platform's MVE/Helium floor (typically 16 B).
Bump the per-arena alignment via MemoryConstraint:
memory:
  constraints:
    - name: PSRAM
      arena_alignment: 64  # cache line for the staged blob
    - name: DTCM
      arena_alignment: 32
arena_alignment is applied to both the arena base symbol
(alignas(64) on the buffer) and every tensor slot inside it.
Defaults to max(16, platform.min_alignment) when unset.
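The arithmetic behind these rules is standard power-of-two alignment math. The helpers below are an illustrative sketch; their names are hypothetical and they are not emitted by the tool.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Arena alignments must be powers of two for the mask tricks below. */
static bool is_power_of_two(size_t a) {
    return a != 0 && (a & (a - 1)) == 0;
}

/* Default when arena_alignment is unset: max(16, platform minimum). */
static size_t default_arena_alignment(size_t platform_min_alignment) {
    return platform_min_alignment > 16 ? platform_min_alignment : 16;
}

/* Round `offset` up to the next multiple of `alignment` (a power of two).
 * This is how a tensor slot would be placed inside an aligned arena. */
static size_t align_up(size_t offset, size_t alignment) {
    return (offset + alignment - 1) & ~(alignment - 1);
}

/* Check a caller-owned buffer before handing it to a bind call or a DMA. */
static bool is_aligned(const void *p, size_t alignment) {
    return ((uintptr_t)p & (alignment - 1)) == 0;
}
```

With arena_alignment: 64, a tensor that would otherwise start at offset 100 moves to offset 128, so raising the alignment can grow the arena; check the reported arena sizes after changing it.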
Inspecting placements after planning
The --verbose 2 log prints a residency summary per build. For
tooling, opt in to the JSON report:
This writes <work_path>/<prefix>_residency.json alongside the
emitted module:
{
  "schema_version": 3,
  "module_prefix": "aot",
  "plan_hash": "<16 hex chars>",
  "tensor_layout_hash": "<16 hex chars>",
  "arenas": {
    "scratch": [{"region_id": 0, "memory": "dtcm", "used": 256, "total_size": 256, "alignment": 16}],
    "persistent": [],
    "constant": [
      {"region_id": 1, "memory": "dtcm", "source_memory": "mram", "kind": "staged",
       "used": 8704, "total_size": 8704, "alignment": 16, "tensor_count": 2}
    ]
  },
  "tensors": [
    {"id": "1", "role": "constant", "memory": "dtcm",
     "source_memory": "mram", "offset": 0, "size": 8192}
  ]
}
Schema is stable (schema_version: 3) — wire it into CI to gate
unintended layout changes the same way you gate the linker map.
The two top-level hashes cover disjoint layers:
- plan_hash: arena envelope only (per-region role / memory / source_memory / size / alignment / is_staged). Mirrors the generated <PREFIX>_PLAN_HASH C macro and detects arena-ABI drift across separately-compiled binaries.
- tensor_layout_hash: per-tensor placement (tensor_id / role / memory / offset / size). Catches drift the envelope hash misses (two tensors swapping offsets inside the same arena, a tensor migrating between arenas of identical shape, etc.).
Per-arena region_id correlates each JSON entry with the C
<prefix>_arena_* enumerator.
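To see why two hashes are needed, consider a toy model of both layers. Everything in this sketch is an assumption made for illustration: the record types, the numeric field encodings, and the use of 64-bit FNV-1a are stand-ins, and the tool's actual hash function and serialization are not specified here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical records; the field sets follow the two hash descriptions. */
typedef struct {
    uint32_t role, memory, source_memory, size, alignment, is_staged;
} arena_env_t;

typedef struct {
    uint32_t tensor_id, role, memory, offset, size;
} tensor_place_t;

/* FNV-1a over one 32-bit field, least significant byte first. */
static uint64_t mix_u32(uint64_t h, uint32_t v) {
    for (int i = 0; i < 4; ++i) {
        h ^= (uint8_t)(v >> (8 * i));
        h *= 1099511628211ULL;  /* FNV-1a 64-bit prime */
    }
    return h;
}

/* Envelope hash: sees arenas only, never individual tensors. */
static uint64_t plan_hash(const arena_env_t *a, size_t n) {
    uint64_t h = 14695981039346656037ULL;  /* FNV-1a 64-bit offset basis */
    for (size_t i = 0; i < n; ++i) {
        h = mix_u32(h, a[i].role);
        h = mix_u32(h, a[i].memory);
        h = mix_u32(h, a[i].source_memory);
        h = mix_u32(h, a[i].size);
        h = mix_u32(h, a[i].alignment);
        h = mix_u32(h, a[i].is_staged);
    }
    return h;
}

/* Placement hash: sees every tensor's location. */
static uint64_t tensor_layout_hash(const tensor_place_t *t, size_t n) {
    uint64_t h = 14695981039346656037ULL;
    for (size_t i = 0; i < n; ++i) {
        h = mix_u32(h, t[i].tensor_id);
        h = mix_u32(h, t[i].role);
        h = mix_u32(h, t[i].memory);
        h = mix_u32(h, t[i].offset);
        h = mix_u32(h, t[i].size);
    }
    return h;
}
```

If two equally sized tensors swap offsets inside one arena, the arena records are byte-for-byte identical, so plan_hash cannot change, while tensor_layout_hash sees different offset fields and does.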
Multi-arena layout — guarantees and limits
Per-arena guarantees the planner enforces:
- One source memory per destination arena. All constants staged into the same memory must come from the same source memory (otherwise the planner errors at validation time). This makes the source blob a single contiguous _source[] symbol and the hydration a single bulk transfer.
- Source/destination relative offsets match. The hydrate helper can therefore use a single memcpy(dst, src, size) per arena rather than per-tensor copies.
- Padding charged symmetrically. Per-tensor alignment padding is reflected in both the source blob and the destination arena so the bulk transfer never undershoots or overshoots.
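These three guarantees fit together as sketched below: if one offset plan is used for both the source blob and the destination arena, a single copy relocates every tensor. The function names and the zero-filled padding are assumptions made for illustration; this is not the planner's code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Round v up to a multiple of a (a must be a power of two). */
static size_t align_up(size_t v, size_t a) {
    return (v + a - 1) & ~(a - 1);
}

/* Assign each tensor an aligned offset and return the bytes used.
 * The same plan describes the source blob and the destination arena. */
static size_t plan_offsets(const size_t *sizes, size_t n, size_t alignment,
                           size_t *offsets) {
    size_t cursor = 0;
    for (size_t i = 0; i < n; ++i) {
        cursor = align_up(cursor, alignment);  /* padding before tensor i */
        offsets[i] = cursor;
        cursor += sizes[i];
    }
    return cursor;
}

/* Build the source blob: payloads at planned offsets, padding zeroed
 * (zero fill is an assumption; only the padding size matters). */
static void pack_source(uint8_t *blob, size_t blob_size,
                        const uint8_t *const *payloads, const size_t *sizes,
                        const size_t *offsets, size_t n) {
    memset(blob, 0, blob_size);
    for (size_t i = 0; i < n; ++i) {
        memcpy(blob + offsets[i], payloads[i], sizes[i]);
    }
}
```

Because padding is counted on both sides, memcpy(dst, blob, used) with the value returned by plan_offsets leaves each tensor at its planned offset in the destination without any per-tensor bookkeeping.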
Limits that are not (yet) supported:
- Multiple destination arenas backed by the same memory (e.g. two DTCM constant arenas with different source memories). The destination memory keys the arena, full stop.
- Cross-role aliasing (e.g. reusing a scratch arena to hold staged constants between inferences). Each role owns its arena.
- Lazy / on-demand hydration. Hydration is one shot at <prefix>_model_init (or whenever you call the helper); paged or on-first-use weight residency is out of scope.
Linker-map verification
When allocate_arenas=true, the source blob is a const symbol
landing in the cold memory's read-only section. To assert it
didn't drift to a writable section in CI, grep the .map:
For a tighter assertion, see
tests/e2e/test_e2e.py::test_linker_map_staged_constants,
which walks back to the most recent section header and verifies it
matches the per-source-memory expectation table (e.g. MRAM →
.ddr_rodata, SRAM → .sram_data, DTCM → .dtcm_data).