Tensor packaging — staged constants & multi-arena layout

This how-to is a practical companion to the tensor backing model reference. It covers when to keep weights cold, when to stage them into TCM/SRAM, how multi-arena layouts compose, and how to drive hydration from your application or DMA path.

TL;DR

All tensors (scratch, persistent, constant) live in per-memory arenas. There are no per-tensor static symbols.
A constant is cold when source memory == runtime memory (kernels read in place; one read-only blob).
A constant is staged when those memories differ (constant_destination_memory names a memory other than memory): a cold source blob is copied into a writable runtime arena before inference.
All constants destined for the same memory share one arena with one source memory. That guarantees a single contiguous source blob → a single bulk transfer per arena (DMA-friendly).
Hydration is driven from <prefix>_model_init between <prefix>_context_init and the operator init loop. Override the weak <prefix>_hydrate_constants symbol when you need DMA / async pre-stage / model-swap behavior.

When to choose cold vs staged

Scenario	Recommendation
Weights fit in `MRAM` (XIP-capable internal flash) and the kernel can read in place at acceptable energy	Cold. Simpler, no hydration step, no RAM cost.
Weights live in `PSRAM`/external storage with high per-access energy or no XIP window	Staged → SRAM/DTCM. Pay one bulk transfer at boot (or model-swap), then read from low-energy memory inside the inference loop.
Operator is matmul-/conv-heavy and reads weights every cycle	Staged → DTCM if the weight working set fits; otherwise staged → SRAM.
Weights are tiny scalars/biases shared across ops	Leave cold in `MRAM`; the staging cost per arena is fixed and tiny constants don't justify a separate arena.

The energy win from staging concentrates in the read-many-times-per-inference case. For one-shot reads (e.g. a single small bias), cold is usually cheaper.
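
The trade-off above can be made concrete with a rough break-even model. The sketch below is illustrative only: the type and function names are hypothetical, the per-byte energy costs are placeholders in arbitrary units rather than measurements of any memory, and the model ignores DMA setup cost, leakage, and cache effects.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-byte energy costs in arbitrary units (not measured values). */
typedef struct {
    double cold_read;  /* read one byte from the cold source memory */
    double hot_read;   /* read one byte from the staged destination memory */
    double hot_write;  /* write one byte into the staged destination memory */
} energy_model_t;

/* Energy to serve `reads` full passes over `bytes` of weights in place. */
static double cold_energy(const energy_model_t *m, uint64_t bytes, uint64_t reads) {
    return (double)bytes * (double)reads * m->cold_read;
}

/* One bulk copy (cold read + hot write), then `reads` passes from the hot memory. */
static double staged_energy(const energy_model_t *m, uint64_t bytes, uint64_t reads) {
    double copy = (double)bytes * (m->cold_read + m->hot_write);
    return copy + (double)bytes * (double)reads * m->hot_read;
}

/* True when staging costs less than reading in place under this model. */
static bool staging_pays_off(const energy_model_t *m, uint64_t bytes, uint64_t reads) {
    return staged_energy(m, bytes, reads) < cold_energy(m, bytes, reads);
}
```

Under this model a single read never favors staging, because the bulk copy already includes one cold read of every byte. The advantage appears only once the number of reads per hydration is large enough to amortize the copy.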

Recipe 1 — Stage all weights from MRAM to DTCM

The simplest two-tier setup. Bias and weights both come from MRAM (cold internal flash) and are staged into DTCM at boot.

memory:
  tensors:
    - type: constant
      attributes:
        memory: MRAM
        constant_destination_memory: DTCM

What this produces:

One cold blob <prefix>_arena_const_dtcm__source[] in .ddr_rodata-class storage, sized to the sum of all constant tensors after alignment padding.
One writable runtime arena <prefix>_arena_const_dtcm[] in DTCM.
A <prefix>_arena_const_dtcm_buffer static (when allocate_arenas=true) sized exactly to match.
<prefix>_hydrate_constants(&ctx) does a single memcpy(arena_const_dtcm, arena_const_dtcm__source, size).

Verify your case with the verbose log:

$ helia-aot convert --path my_model.yaml --verbose 2
…
Constant arenas (1):
  • const_dtcm    size=8704 B  shape=staged  src=mram -> dst=dtcm  consts=2
  (staged constants hydrated by <prefix>_model_init; ...)

Recipe 2 — Mix cold and staged

Routing is per-tensor. Stage the heavy weights into DTCM but leave the bias cold in MRAM:

memory:
  tensors:
    - type: constant
      id: "1"                    # weights tensor id
      attributes:
        memory: MRAM
        constant_destination_memory: DTCM
    - type: constant
      id: "2"                    # bias
      attributes:
        memory: MRAM
        # no constant_destination_memory → cold (read in place)

You'll get two constant arenas: one staged const_dtcm (with a source blob in MRAM) and one cold const_mram (the blob is the runtime memory; kernels read it directly). The hydrate helper copies only the staged arena.

Recipe 3 — Caller-supplied arenas (`allocate_arenas: false`)

The same staging policy, but the application owns the buffers:

memory:
  allocate_arenas: false
  tensors:
    - type: constant
      attributes:
        memory: MRAM
        constant_destination_memory: DTCM

The generated module exposes:

<prefix>_bind_arena(region, buffer, size) — bind one arena. Returns 1 if region is out of range, 2 on NULL buffer, 3 if size is below <prefix>_arena_sizes[region], 4 if buffer is not aligned to at least <prefix>_arena_alignments[region] bytes.
<prefix>_bind_arenas(buffers, sizes, n) — bulk-bind every region; n must equal <prefix>_num_arena_buffers.
<prefix>_arena_sizes[<prefix>_num_arena_buffers] — minimum capacity per region.
<prefix>_arena_alignments[<prefix>_num_arena_buffers] — required pointer alignment per region. Each value is a power of two; internal-arena builds satisfy these by emitting alignas(...) on every buffer. Bind state is module-global, so none of these symbols take the model context.
<prefix>_arena_<mem>_alignment / <prefix>_arena_persistent_<mem>_alignment / <prefix>_arena_const_<mem>_alignment — compile-time #define macros mirroring the runtime arena_alignments[] table, suitable for alignas(...) / __attribute__((aligned(...))) on caller-owned buffers.
<prefix>_arena_<mem> / <prefix>_arena_persistent_<mem> / <prefix>_arena_const_<mem> — the <prefix>_arena_region_t enumerators identifying each region.

Application sequence — every region must be bound before <prefix>_model_init, which calls <prefix>_context_init internally and rejects unbound regions with a non-zero status:

uint8_t scratch[<prefix>_arena_dtcm_size]
    __attribute__((aligned(<prefix>_arena_dtcm_alignment)));
uint8_t persistent[<prefix>_arena_persistent_dtcm_size]
    __attribute__((aligned(<prefix>_arena_persistent_dtcm_alignment)));
uint8_t const_dtcm[<prefix>_arena_const_dtcm_size]
    __attribute__((aligned(<prefix>_arena_const_dtcm_alignment)));

<prefix>_bind_arena(<prefix>_arena_dtcm, scratch, sizeof(scratch));
<prefix>_bind_arena(<prefix>_arena_persistent_dtcm, persistent, sizeof(persistent));
<prefix>_bind_arena(<prefix>_arena_const_dtcm, const_dtcm, sizeof(const_dtcm));

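<prefix>_model_context_t ctx;
// model_init wires ctx, memsets persistents, hydrates staged consts, runs op-init.
// It returns a non-zero status if any region is still unbound.
if (<prefix>_model_init(&ctx) == 0) {
    <prefix>_model_run(&ctx);
}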

The cold-storage source blob is staged into the writable runtime arena by <prefix>_hydrate_constants. In external-arena mode the caller must bind every region before calling <prefix>_model_init: staged destination arenas to writable caller-owned buffers (as above), and cold constant arenas to the bytes loaded from the emitted <prefix>_arena_const_<mem>__blob.bin sidecar.
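
The status codes documented for <prefix>_bind_arena lend themselves to a small decoding helper in the application's startup path. The sketch below is self-contained and does not call the generated module. demo_bind_check is a hypothetical stand-in that reproduces only the four documented failure checks, and demo_bind_status_str is an illustrative helper name. It assumes that 0 means success and that the checks run in the order the codes are listed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in mirroring the documented bind_arena checks.
 * The real function also records the binding; this one only validates. */
static int demo_bind_check(size_t region, size_t num_regions,
                           const size_t *min_sizes, const size_t *alignments,
                           const void *buffer, size_t size) {
    if (region >= num_regions) return 1;              /* region out of range */
    if (buffer == NULL) return 2;                     /* NULL buffer */
    if (size < min_sizes[region]) return 3;           /* below minimum capacity */
    if (((uintptr_t)buffer % alignments[region]) != 0)
        return 4;                                     /* pointer not aligned */
    return 0;                                         /* assumed success value */
}

/* Map a bind status to a short message for boot-time logging. */
static const char *demo_bind_status_str(int status) {
    switch (status) {
    case 0:  return "ok";
    case 1:  return "region out of range";
    case 2:  return "buffer is NULL";
    case 3:  return "buffer too small";
    case 4:  return "buffer misaligned";
    default: return "unknown status";
    }
}
```

Logging the decoded status at boot makes a misaligned or undersized caller-owned buffer visible immediately, rather than surfacing later as a non-zero status from <prefix>_model_init.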

Recipe 4 — DMA-driven hydration

<prefix>_hydrate_constants is declared __attribute__((weak)). Replace it with a strong symbol that drives DMA / decompression / pre-staged HBLRAM and you're done — the rest of the contract (invocation from model_init, blob descriptor table, idempotent default body) is unchanged:

// In your application translation unit:
int32_t my_module_hydrate_constants(my_module_model_context_t *ctx) {
    for (size_t i = 0; i < my_module_num_constant_blobs; ++i) {
        const my_module_constant_blob_t *b = &my_module_constant_blobs[i];
        // Each blob targets a single destination arena starting at
        // offset 0 (per-arena source contiguity is guaranteed by the
        // planner). ctx->arena_buffers is populated from the bind
        // table by context_init.
        void *dst = (void *)ctx->arena_buffers[b->arena];
        // Submit DMA from b->source -> dst, length b->size, then wait.
        dma_copy_blocking(dst, b->source, b->size);
    }
    my_module_mark_hydrated();  // arm the model_run guard
    return 0;
}

Calling my_module_mark_hydrated() is the one obligation override authors must respect: my_module_model_run returns 200 until the latch is set. The default weak helper sets it for you; strong-symbol replacements must set it explicitly.

If you'd rather call hydration on your own schedule (warm resume, model swap, lazy first-use), invoke <prefix>_hydrate_constants(&ctx) directly between <prefix>_context_init and <prefix>_model_init — the default body is idempotent, so model_init's call becomes a no-op. <prefix>_clear_hydrated() can be used to force re-hydration before the next run (model swap) and <prefix>_is_hydrated() lets the caller poll the latch.
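
The latch contract can be modelled in a few lines. The following is a toy model, not the generated code: every demo_-prefixed name is hypothetical, the four-byte blob is a placeholder, and implementing idempotence through the latch is an assumption. Only the behaviours stated above are reproduced: run returns 200 until the latch is set, a repeated default hydrate is a no-op, and clearing the latch forces a fresh copy.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

static const uint8_t demo_source[4] = {1, 2, 3, 4};  /* stands in for the cold blob */
static uint8_t demo_arena[4];                        /* stands in for the staged arena */
static bool demo_hydrated = false;                   /* the latch */
static int demo_copy_count = 0;                      /* bulk copies actually performed */

static void demo_mark_hydrated(void)  { demo_hydrated = true; }
static void demo_clear_hydrated(void) { demo_hydrated = false; }
static bool demo_is_hydrated(void)    { return demo_hydrated; }

/* Idempotent default body: copy once, then do nothing until the latch is cleared. */
static int32_t demo_hydrate_constants(void) {
    if (demo_is_hydrated()) return 0;
    memcpy(demo_arena, demo_source, sizeof(demo_arena));
    ++demo_copy_count;
    demo_mark_hydrated();
    return 0;
}

/* Guarded run: refuses to execute until constants are resident. */
static int32_t demo_model_run(void) {
    if (!demo_is_hydrated()) return 200;
    return 0;
}
```

For a model swap, the sequence in this toy model is demo_clear_hydrated(), then demo_hydrate_constants(), then demo_model_run(). Calling run between the clear and the hydrate returns 200 again.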

Legacy memory.auto_hydrate_constants flag

The flag is retained for backwards compatibility but is now a runtime no-op: model_init always hydrates staged-constant arenas before op-init. This eliminates a class of races where a kernel's _init hook reads constants before manual hydration runs.

Recipe 5 — Tighter alignment for DMA-backed paths

Some DMA engines and cache-line-coherent transfers need stronger alignment than the platform's MVE/Helium floor (typically 16 B). Bump the per-arena alignment via MemoryConstraint:

memory:
  constraints:
    - name: PSRAM
      arena_alignment: 64        # cache line for the staged blob
    - name: DTCM
      arena_alignment: 32

arena_alignment is applied to both the arena base symbol (alignas(64) on the buffer) and every tensor slot inside it. Defaults to max(16, platform.min_alignment) when unset.
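
When the application owns the buffers, a quick alignment check before binding can catch mistakes early. The sketch below is illustrative: demo_effective_alignment restates the documented default rule, demo_is_aligned is a generic pointer check, and all names are hypothetical (the generated module exposes its own alignment macros and table). Representing "unset" as 0 is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Default rule from the text: max(16, platform minimum) when arena_alignment
 * is unset (modelled here as 0). An explicit value is returned unchanged;
 * whether the tool clamps it to the platform floor is not modelled. */
static size_t demo_effective_alignment(size_t requested, size_t platform_min) {
    if (requested != 0) return requested;
    return platform_min > 16 ? platform_min : 16;
}

/* True when `a` is a non-zero power of two. */
static bool demo_is_pow2(size_t a) {
    return a != 0 && (a & (a - 1)) == 0;
}

/* True when `p` is aligned to `a` bytes; rejects non-power-of-two `a`. */
static bool demo_is_aligned(const void *p, size_t a) {
    return demo_is_pow2(a) && ((uintptr_t)p & (a - 1)) == 0;
}
```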

Inspecting placements after planning

The --verbose 2 log prints a residency summary per build. For tooling, opt in to the JSON report:

memory:
  dump_residency_json: true

This writes <work_path>/<prefix>_residency.json alongside the emitted module:

{
  "schema_version": 3,
  "module_prefix": "aot",
  "plan_hash": "<16 hex chars>",
  "tensor_layout_hash": "<16 hex chars>",
  "arenas": {
    "scratch":    [{"region_id": 0, "memory": "dtcm", "used": 256, "total_size": 256, "alignment": 16}],
    "persistent": [],
    "constant":   [
      {"region_id": 1, "memory": "dtcm", "source_memory": "mram", "kind": "staged",
       "used": 8704, "total_size": 8704, "alignment": 16, "tensor_count": 2}
    ]
  },
  "tensors": [
    {"id": "1", "role": "constant", "memory": "dtcm",
     "source_memory": "mram", "offset": 0, "size": 8192}
  ]
}

Schema is stable (schema_version: 3) — wire it into CI to gate unintended layout changes the same way you gate the linker map. The two top-level hashes cover disjoint layers:

plan_hash — arena envelope only (per-region role / memory / source_memory / size / alignment / is_staged). Mirrors the generated <PREFIX>_PLAN_HASH C macro and detects arena-ABI drift across separately-compiled binaries.
tensor_layout_hash — per-tensor placement (tensor_id / role / memory / offset / size). Catches drift the envelope hash misses (two tensors swapping offsets inside the same arena, a tensor migrating between arenas of identical shape, etc.).

Per-arena region_id correlates each JSON entry with the C <prefix>_arena_* enumerator.

Multi-arena layout — guarantees and limits

Per-arena guarantees the planner enforces:

One source memory per destination arena. All constants staged into the same memory must come from the same source memory (otherwise the planner errors at validation time). This makes the source blob a single contiguous _source[] symbol and the hydration a single bulk transfer.
Source/destination relative offsets match. The hydrate helper can therefore use a single memcpy(dst, src, size) per arena rather than per-tensor copies.
Padding charged symmetrically. Per-tensor alignment padding is reflected in both the source blob and the destination arena so the bulk transfer never undershoots or overshoots.
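
These guarantees combine as follows: if every tensor offset is rounded up to the arena alignment and the same offsets are used in both the source blob and the destination arena, one copy of the padded total moves everything. The sketch is a simplified model, not the planner's algorithm. demo_align_up and demo_plan_offsets are hypothetical names, and padding the total up to the alignment is an assumption.

```c
#include <stddef.h>

/* Round `x` up to the next multiple of `align` (align must be a power of two). */
static size_t demo_align_up(size_t x, size_t align) {
    return (x + align - 1) & ~(align - 1);
}

/* Assign each tensor an aligned offset and return the padded arena size.
 * The same offsets apply to the source blob and the destination arena,
 * so a single bulk copy of the returned size covers every tensor. */
static size_t demo_plan_offsets(const size_t *sizes, size_t n, size_t align,
                                size_t *offsets_out) {
    size_t cursor = 0;
    for (size_t i = 0; i < n; ++i) {
        cursor = demo_align_up(cursor, align);  /* padding before this tensor */
        offsets_out[i] = cursor;
        cursor += sizes[i];
    }
    return demo_align_up(cursor, align);        /* padded total */
}
```

With hypothetical tensor sizes of 8192 and 512 bytes and a 16-byte alignment, the offsets are 0 and 8192 and the padded total is 8704 bytes, the same figure as the sample log. The 512-byte tensor is an invented value chosen to match that total.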

Limits that are not (yet) supported:

Multiple destination arenas backed by the same memory (e.g. two DTCM constant arenas with different source memories). The destination memory keys the arena, full stop.
Cross-role aliasing (e.g. reusing a scratch arena to hold staged constants between inferences). Each role owns its arena.
Lazy / on-demand hydration. Hydration is one-shot at model_init (or whenever you call the helper); paged or on-first-use weight residency is out of scope.

Linker-map verification

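When allocate_arenas=true, the source blob is a const symbol landing in the cold memory's read-only section. To assert in CI that it has not drifted into a writable section, list the matching symbols from the linked ELF. GNU nm prints a type letter for each symbol: R or r marks read-only data, D or d marks initialized writable data.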

arm-none-eabi-nm --print-size --size-sort --radix=d main.elf | \
    grep '_arena_const_.*__source'

For a tighter assertion, see tests/e2e/test_e2e.py::test_linker_map_staged_constants, which walks back to the most recent section header and verifies it matches the per-source-memory expectation table (e.g. MRAM → .ddr_rodata, SRAM → .sram_data, DTCM → .dtcm_data).