Cortex-M Accelerators

heliaCORE maps CMSIS-NN-compatible operators onto the accelerator features available on Ambiq Cortex-M targets.

In this guide, "Cortex-M accelerators" refers to the following kernel paths and hardware features:

  • scalar C baseline kernels,

  • the Cortex-M DSP extension,

  • MVE/Helium vector instructions,

  • and FPU/FP16 support where a kernel and device can use it.

MVE is the primary high-throughput path on Cortex-M55-class Apollo devices. DSP paths remain important for Apollo-class devices that do not include MVE.

For a concise family-by-family summary, see Operator & Kernel Coverage. For exact function prototypes and per-kernel behavior, see API.

Why coverage matters

Arm CMSIS-NN provides the trusted foundation for efficient neural-network kernels on Cortex-M. Ambiq HELIA profiling showed that real model graphs spend time in more than just the largest convolution or matrix-multiply layers.

The practical gaps are often around:

  • padding and layout changes,

  • LeakyReLU and other activations,

  • reductions and comparisons,

  • quantization and classifier-tail operators,

  • and wrapper/buffer helpers needed by firmware integration.

heliaCORE adds 200+ accelerated operators and variants so that these smaller operators at the edges of the graph do not fall back to generic paths on Ambiq firmware targets.

What the acceleration paths are

  • Scalar C (baseline): Portable kernels for compatibility and for smaller Cortex-M0-class devices without specialized math extensions.

  • DSP extension (fixed-point): Packed scalar instructions, selected with `ARM_MATH_DSP`, for efficient int8, int16, and int32 quantized kernels.

  • MVE / Helium (vector): Arm M-Profile Vector Extension, selected with MVE feature macros such as `ARM_MATH_MVEI`, for high-throughput Cortex-M55 kernels.

  • FPU / FP16 (floating point): Hardware and compiler support for fp16/fp32 variants. It complements DSP or MVE rather than replacing them.
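The macro-based selection described above can be sketched as a compile-time choice. This is a minimal illustration, not heliaCORE source: `kernel_path_t`, `selected_kernel_path`, and `dot_s8_reference` are hypothetical names, and only the `ARM_MATH_DSP` and `ARM_MATH_MVEI` macros come from this guide. The scalar function stands in for the portable baseline that any accelerated variant would need to match.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical enum for this sketch; not a heliaCORE or CMSIS-NN type. */
typedef enum { PATH_SCALAR_C = 0, PATH_DSP = 1, PATH_MVE = 2 } kernel_path_t;

/* Reports which path the feature macros select at compile time.
 * MVE is checked first so that a build defining both macros
 * prefers the vector path. */
static kernel_path_t selected_kernel_path(void)
{
#if defined(ARM_MATH_MVEI)
    return PATH_MVE;
#elif defined(ARM_MATH_DSP)
    return PATH_DSP;
#else
    return PATH_SCALAR_C;
#endif
}

/* Scalar reference: int8 dot product with a 32-bit accumulator.
 * An accelerated variant should return the same value for the same inputs. */
static int32_t dot_s8_reference(const int8_t *a, const int8_t *b, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    return acc;
}
```

On a host build where neither macro is defined, `selected_kernel_path()` returns `PATH_SCALAR_C`, which mirrors the portable fallback role of the Scalar C path.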

Why MVE is a big step

MVE: 128-bit vector engine

MVE, also known as Helium, lets Cortex-M55 process many narrow values at once: 16 int8 lanes, 8 int16 or fp16 lanes, or 4 int32 or fp32 lanes per vector.
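The lane counts follow directly from dividing the 128-bit vector width by the element width. The helper names below are hypothetical and exist only to show the arithmetic, including how many vector iterations a buffer of `n` elements needs when the last iteration may be partial.

```c
#include <stddef.h>

/* MVE vectors are 128 bits wide. */
enum { MVE_VECTOR_BITS = 128 };

/* Lanes per vector for a given element width in bits (8, 16, or 32). */
static unsigned mve_lanes(unsigned element_bits)
{
    return MVE_VECTOR_BITS / element_bits;
}

/* Vector iterations needed to cover n elements, rounding up so a
 * partial final vector still counts as one iteration. */
static size_t mve_vector_iterations(size_t n, unsigned element_bits)
{
    const size_t lanes = mve_lanes(element_bits);
    return (n + lanes - 1) / lanes;
}
```

For example, 100 int8 values need 7 vector iterations (6 full vectors plus a 4-element tail), while 100 int32 values need 25.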

DSP: Packed scalar math

The DSP extension accelerates fixed-point math with packed 8-bit and 16-bit operations, dual 16-bit MAC instructions, saturation, and fast accumulates.
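A dual 16-bit MAC can be modeled in portable C to show what "packed" means here. The sketch below mimics the arithmetic of an instruction such as SMLAD (two signed 16x16 products added to a 32-bit accumulator); it is a host-side model with hypothetical function names, not the intrinsic itself, and it assumes the accumulator does not overflow.

```c
#include <stddef.h>
#include <stdint.h>

/* Packs two signed 16-bit values into one 32-bit word (low half first). */
static uint32_t pack_s16x2(int16_t lo, int16_t hi)
{
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

/* Model of a dual 16-bit MAC: acc + x.lo * y.lo + x.hi * y.hi.
 * The uint16_t to int16_t casts rely on two's-complement conversion,
 * which common compilers provide. */
static int32_t dual_mac_model(uint32_t x, uint32_t y, int32_t acc)
{
    const int16_t x_lo = (int16_t)(uint16_t)(x & 0xFFFFu);
    const int16_t x_hi = (int16_t)(uint16_t)(x >> 16);
    const int16_t y_lo = (int16_t)(uint16_t)(y & 0xFFFFu);
    const int16_t y_hi = (int16_t)(uint16_t)(y >> 16);
    return acc + (int32_t)x_lo * y_lo + (int32_t)x_hi * y_hi;
}

/* int16 dot product that consumes two elements per modeled MAC,
 * then handles an odd trailing element with a plain scalar multiply. */
static int32_t dot_s16_dual_mac(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        acc = dual_mac_model(pack_s16x2(a[i], a[i + 1]),
                             pack_s16x2(b[i], b[i + 1]), acc);
    }
    if (i < n) {
        acc += (int32_t)a[i] * b[i];
    }
    return acc;
}
```

The loop does half as many multiply-accumulate steps as a one-element-at-a-time scalar loop, which is the kind of saving the DSP path provides for fixed-point kernels.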

Flow: Whole-graph speed

Good kernels keep lanes busy across convolution, fully connected, activation, pooling, quantization, and layout operators, not only the obvious MAC-heavy layers.
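The whole-graph argument is the familiar Amdahl's-law calculation. The sketch below uses a hypothetical helper and hypothetical fractions chosen only for illustration; they are not measurements from this guide.

```c
/* Whole-graph speedup when only a fraction of run time is accelerated.
 * accelerated_fraction: share of baseline time spent in accelerated
 *                       operators, in the range 0.0 to 1.0.
 * kernel_speedup:       speedup of those operators alone. */
static double whole_graph_speedup(double accelerated_fraction,
                                  double kernel_speedup)
{
    return 1.0 / ((1.0 - accelerated_fraction)
                  + accelerated_fraction / kernel_speedup);
}
```

With an assumed 8x kernel speedup, accelerating 60% of the run time gives about 2.1x overall, while accelerating 95% gives about 5.9x. The difference comes entirely from covering the smaller operators.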

MVE is not just a different intrinsic name. It changes the unit of work:

  • DSP improves scalar fixed-point loops with packed instructions.

  • MVE runs 128-bit vector operations with predication, vector MAC, dot-product patterns, saturation, and lane-wise data movement.

  • Expected gain depends on the kernel and tensor shape. On favorable MAC-heavy int8 paths, a speedup of up to 8x over DSP-style implementations is plausible when data layout and memory traffic cooperate.
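Predication and saturation from the list above can be illustrated with a portable model of a 16-lane int8 loop. This is a host-side sketch with hypothetical names, not MVE intrinsics: the inner loop stands in for one vector instruction, and the `lane < active` test plays the role of the predicate that disables lanes past the end of the buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* 128-bit vector with 8-bit elements gives 16 lanes. */
enum { LANES_S8 = 16 };

/* Clamps a 32-bit value into the int8 range. */
static int8_t sat_s8(int32_t v)
{
    if (v > INT8_MAX) { return INT8_MAX; }
    if (v < INT8_MIN) { return INT8_MIN; }
    return (int8_t)v;
}

/* Saturating int8 add, modeled one "vector" at a time.
 * Returns the number of modeled vector iterations. */
static size_t add_s8_predicated_model(const int8_t *a, const int8_t *b,
                                      int8_t *out, size_t n)
{
    size_t iterations = 0;
    for (size_t base = 0; base < n; base += LANES_S8) {
        /* Predicate: only lanes [0, active) are enabled in this iteration. */
        const size_t remaining = n - base;
        const size_t active = remaining < LANES_S8 ? remaining : LANES_S8;
        for (size_t lane = 0; lane < LANES_S8; ++lane) {
            if (lane < active) {
                out[base + lane] =
                    sat_s8((int32_t)a[base + lane] + b[base + lane]);
            }
        }
        ++iterations;
    }
    return iterations;
}
```

A 20-element buffer takes two modeled iterations: one full vector and one with 4 active lanes. No separate scalar tail loop is needed, and nothing past element 19 is written.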

That is why heliaCORE treats MVE coverage as a primary design axis for Cortex-M55-class Ambiq devices, while still preserving DSP paths for Apollo devices that rely on them.

Coverage pressure from real model graphs

These internal counts explain why heliaCORE invests in both MAC-heavy kernels and the smaller operators that connect them.

  • Ambiq field-like suite (53 operator types): 247 unique operator configurations and 963 total operator instances across internal HELIA-oriented model graphs.

  • MLPerf Tiny baseline (7 operator types): 34 unique operator configurations and 80 total instances in a familiar public benchmark suite used here only for scale.

  • Coverage target (200+ optimized operators): Accelerated variants cover both the high-cost math kernels and the glue operators that can dominate real deployment latency.

How to read the counts:

  • Operator types are distinct operator families.

  • Unique operator configurations count each distinct combination of dtype, shape pattern, or variant within an operator type.

  • Operator instances are total appearances across model graphs.
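The three counts can be made concrete with a small tally over a list of operator instances. The struct, function, and sample records below are hypothetical and only illustrate the definitions; they are not Ambiq's tooling or data.

```c
#include <stddef.h>
#include <string.h>

/* One appearance of an operator in a model graph (hypothetical record). */
typedef struct {
    const char *type;   /* operator family, e.g. "CONV_2D" */
    const char *config; /* dtype, shape pattern, or variant label */
} op_instance_t;

typedef struct {
    size_t types;          /* distinct operator families */
    size_t unique_configs; /* distinct (type, config) pairs */
    size_t instances;      /* total appearances */
} coverage_counts_t;

/* Counts types, unique configurations, and instances.
 * Quadratic in n, which is fine for a short illustrative list. */
static coverage_counts_t count_coverage(const op_instance_t *ops, size_t n)
{
    coverage_counts_t c = {0, 0, n};
    for (size_t i = 0; i < n; ++i) {
        int new_type = 1;
        int new_config = 1;
        for (size_t j = 0; j < i; ++j) {
            if (strcmp(ops[i].type, ops[j].type) == 0) {
                new_type = 0;
                if (strcmp(ops[i].config, ops[j].config) == 0) {
                    new_config = 0;
                }
            }
        }
        c.types += (size_t)new_type;
        c.unique_configs += (size_t)new_config;
    }
    return c;
}
```

For a list with two 3x3 int8 convolutions, one 1x1 int8 convolution, and two identical int8 pads, the tally is 2 types, 3 unique configurations, and 5 instances.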

MLPerf Tiny is included only as a familiar public scale reference.

Note

Scope: This coverage view is based on Ambiq’s internal model suite. It is not a performance benchmark and is not a statement about the broader Arm CMSIS-NN ecosystem. For vendor-neutral Cortex-M kernel work, Arm CMSIS-NN remains the upstream ecosystem reference.

For measured cycle counts comparing MVE, DSP, and Scalar C on Apollo hardware, see Kernel Benchmarks.