Kernel Registry

Overview

Our kernel registry is the catalog and chooser for all operator implementations used by the AOT compiler. It’s built around highly optimized kernels from our fork of CMSIS-NN (ns-cmsis-nn), plus a small set of utility and fallback implementations. By centralizing metadata about each kernel (supported data types, layouts, shapes, stride/dilation limits, required CPU features, memory needs, etc.), the registry lets AOT pick the best implementation for each layer on a given target.

What the registry gives you

  • Best-in-class performance via ns-cmsis-nn fast paths (e.g., depthwise 3×3, 1×1 pointwise, optimized GEMM).
  • Multiple choices per op. Many operators (Conv2D, DepthwiseConv2D, FC, Pooling, Softmax, etc.) have several candidate kernels with different trade-offs.
  • Compile-time selection (AOT). The AOT pipeline evaluates all eligible kernels for a layer and chooses one based on ranking—no runtime dispatch overhead.
  • Predictable fallbacks. If no optimized kernel is suitable, AOT falls back to a portable/reference implementation so builds always succeed.
  • Portability. Kernels are annotated with feature flags (e.g., M55/M85, DSP, M-Profile Vector Extension/Helium, FP support), so the chooser only considers implementations your platform actually supports.
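To make the capability annotations concrete, here is a minimal sketch of what a registry entry might look like. All names here (`KernelEntry`, the feature strings, the example kernel symbols) are illustrative assumptions, not the actual registry API:

```python
# Hypothetical sketch of a registry entry pairing a kernel with its
# capability metadata. Names and fields are illustrative only.
from dataclasses import dataclass

@dataclass(frozen=True)
class KernelEntry:
    name: str            # symbol AOT would emit at the call site
    op: str              # operator it implements, e.g. "Conv2D"
    dtypes: frozenset    # data types the kernel supports
    features: frozenset  # CPU features required (empty = portable)
    rank: int            # higher = preferred when eligible

# A portable fallback requires no CPU features, so it is always eligible.
conv_fallback = KernelEntry("conv2d_ref", "Conv2D",
                            frozenset({"int8", "float32"}),
                            frozenset(), rank=0)

# An optimized path requires MVE/Helium and only handles int8.
conv_mve = KernelEntry("conv2d_s8_mve", "Conv2D",
                       frozenset({"int8"}),
                       frozenset({"MVE"}), rank=10)
```

Because the entries are plain data, the chooser never needs operator-specific logic; it only reads the metadata.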

How selection works (in short)

  1. Eligibility filter. For the layer’s config (tensor ranks/shapes, padding/stride/dilation, data type, quantization scheme, layout, activation fuse) and the platform profile (CPU, available instructions, memory placement constraints), the registry filters kernels that can legally run the op.

  2. Ranking. Remaining kernels are scored using a simple, transparent ordering:
     • Prefer tighter matches (exact data type/layout, fast-path patterns like 1×1 or DW 3×3).
     • Prefer CPU-feature matches (e.g., MVE/Helium, DSP) when available.
     • Prefer better memory locality (DTCM/SRAM usage that fits the layer’s buffers/arena).
     • Apply tie-breakers (e.g., lower estimated cycles/byte or smaller code size if requested).

  3. Emission. AOT emits the chosen kernel’s call site and any required pre/post transforms (im2col, reorder, quant/dequant) and links the corresponding object files from ns-cmsis-nn.
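The filter-then-rank flow above can be sketched in a few lines. This is a simplified model, not the real chooser: the dictionary keys, the `scratch_bytes` tie-breaker (standing in for cycle/code-size estimates), and the kernel names are all assumptions for illustration:

```python
# Hypothetical sketch of the selection pipeline: eligibility filter,
# then ranking with a tie-breaker. Not the real chooser implementation.
def select_kernel(kernels, layer, platform):
    # 1. Eligibility: the layer's dtype must be supported and every CPU
    #    feature the kernel requires must exist on the target.
    eligible = [k for k in kernels
                if layer["dtype"] in k["dtypes"]
                and k["features"] <= platform["features"]]
    if not eligible:
        return None  # no legal kernel; AOT would surface an error
    # 2. Ranking: highest registered rank wins; ties go to the kernel
    #    with the smaller scratch buffer (a stand-in for cycle/size cost).
    return max(eligible, key=lambda k: (k["rank"], -k["scratch_bytes"]))

kernels = [
    {"name": "conv2d_ref",    "dtypes": {"int8", "float32"},
     "features": set(),       "rank": 0,  "scratch_bytes": 0},
    {"name": "conv2d_s8_mve", "dtypes": {"int8"},
     "features": {"MVE"},     "rank": 10, "scratch_bytes": 2048},
]
layer = {"dtype": "int8"}
m55 = {"features": {"MVE", "DSP"}}   # MVE available: fast path eligible
m0  = {"features": set()}            # no MVE: only the fallback survives
```

On the M55-like profile the optimized kernel wins; on the feature-less profile the same call transparently falls back to the reference path.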

Note: The registry is data-driven. Adding a new kernel is as simple as registering it with capabilities and a rank. The chooser will automatically consider it for future builds.
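Data-driven registration might look like the following sketch, where adding a kernel is a single call and the chooser needs no changes. The `register` helper and the Softmax kernel names are hypothetical:

```python
# Hypothetical sketch: registering a kernel is just appending a
# capability record under its operator; the chooser is untouched.
REGISTRY = {}

def register(op, name, rank, **caps):
    REGISTRY.setdefault(op, []).append({"name": name, "rank": rank, **caps})

# Existing portable kernel plus a newly added DSP variant.
register("Softmax", "softmax_ref",    rank=0, dtypes={"int8"}, features=set())
register("Softmax", "softmax_s8_dsp", rank=5, dtypes={"int8"}, features={"DSP"})
```

Every future build that queries `REGISTRY["Softmax"]` automatically sees both candidates.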

When multiple kernels exist for the same op

  • Example (Conv2D): You might see a general im2col+GEMM path, a 1×1 optimized path, and a DW 3×3 path (for depthwise). AOT selects the highest-ranked kernel that is eligible for your layer’s shape/params and supported on the target CPU.
  • Quantization: Kernels advertise their quant scheme (e.g., per-channel int8, int16 accum). The chooser rejects kernels if the layer’s quantization doesn’t match.
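The quantization-matching rule can be shown with a tiny example. The candidate names and the idea that a 1×1 fast path supports only per-channel quantization are assumptions chosen for illustration:

```python
# Hypothetical sketch: quantization scheme is part of eligibility, so a
# layer's quant config can rule out an otherwise higher-ranked kernel.
candidates = [
    {"name": "conv2d_im2col_gemm", "quant": {"per_channel", "per_tensor"}, "rank": 1},
    {"name": "conv2d_1x1_s8",      "quant": {"per_channel"},               "rank": 8},
]

def pick(candidates, layer_quant):
    ok = [k for k in candidates if layer_quant in k["quant"]]
    return max(ok, key=lambda k: k["rank"])["name"] if ok else None
```

A per-channel layer gets the high-ranked 1×1 path; a per-tensor layer is legally served only by the general im2col+GEMM path.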

Memory awareness

The registry records temporary and persistent buffer requirements. During selection, AOT checks your arena placements (DTCM/SRAM/DDR) and sizes; kernels that exceed the available region are dropped. This helps avoid surprises like hidden spills to slower memory.
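The memory check amounts to one more filter before ranking, sketched below. The buffer sizes and kernel names are made up, and a real implementation would check per-region (DTCM vs. SRAM vs. DDR) placements rather than a single free-byte count:

```python
# Hypothetical sketch: kernels whose temporary buffers exceed the
# available arena region are dropped before ranking.
kernels = [
    {"name": "conv2d_ref",    "scratch_bytes": 0,     "rank": 0},
    {"name": "conv2d_im2col", "scratch_bytes": 16384, "rank": 5},
]

def select(kernels, arena_free_bytes):
    ok = [k for k in kernels if k["scratch_bytes"] <= arena_free_bytes]
    return max(ok, key=lambda k: k["rank"])["name"]
```

With a small arena the im2col path is dropped and the zero-scratch fallback wins; with enough headroom the faster path is chosen, with no hidden spill to slower memory.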

Practical impact

  • You write standard models. The registry + AOT do the heavy lifting to map them to the fastest available primitives on your Cortex-M target.
  • Predictable builds. If an optimized kernel can’t be used (shape/limits/feature mismatch), you still get a working binary via fallback.
  • Scalable optimization. As ns-cmsis-nn evolves, new kernels (or better variants) can be registered and will be automatically considered by the chooser.