Toolchains
heliaRT supports three toolchains for Cortex-M targets. All three are tested in CI across every release.
Comparison
| Toolchain | ID | License | Typical Perf vs GCC | Best For |
|---|---|---|---|---|
| GCC (arm-none-eabi-gcc) | gcc | Open source | Baseline | Default, broadest availability |
| Arm Compiler 6 (armclang) | armclang | Commercial | ~5–15 % faster | Keil MDK shops |
| ATfE (Arm Toolchain for Embedded) | atfe | Open source | up to 25 % more efficient¹ | Recommended |
Recommended: ATfE
ATfE is LLVM-based, fully open-source, and actively maintained by Arm. On Cortex-M55 + Helium workloads it delivers fewer cycles and more inferences per Joule than GCC — a compounding win for battery-powered devices.
Why ATfE
ATfE (Arm Toolchain for Embedded) is Arm's LLVM-based, open-source toolchain for bare-metal embedded targets. On Cortex-M55 + Helium workloads it consistently outperforms arm-none-eabi-gcc for three reasons:
- MVE auto-vectorization. LLVM's loop vectorizer targets the M-Profile Vector Extension (MVE / Helium) more aggressively than GCC at the same optimization level, generating predicated vector code for inner loops that GCC still emits as scalar.
- Picolibc over newlib-nano. ATfE ships Picolibc, a modernized C library that is smaller, faster, and tuned for embedded LLVM workflows.
- compiler-rt builtins. Arm-tuned soft-float and integer helpers replace libgcc, typically with better register usage on M-profile cores.
Measured performance
We profiled the MLPerf Tiny v1.1 reference suite on the Apollo510 EVB (Cortex-M55 + Helium @ 192 MHz) using heliaRT v1.13.1 with heliaPROFILER for latency and a Joulescope for energy.
Configuration
| Field | Value |
|---|---|
| heliaRT version | v1.13.1 |
| Hardware | Apollo510 EVB — Cortex-M55, Helium MVE @ 192 MHz |
| Compilers | ATfE 22.1 vs arm-none-eabi-gcc 14.2 |
| Models | MLPerf Tiny v1.1 — Keyword Spotting (DS-CNN), Visual Wake Words (MobileNetV1), Anomaly Detection (Deep Autoencoder), Image Classification (ResNet) |
| Build | release, -O3, MVE enabled |
| Iterations | 10 per configuration, mean reported |
| Latency | Derived from PMU cycle counts ÷ 192 MHz |
| Energy | Joulescope capture, normalized per inference (latency × average power) |
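The Latency and Energy rows reduce to two short formulas. The following is a minimal sketch of that arithmetic; the cycle count and average power are placeholder values chosen for illustration, not measurements from this benchmark, and the function names are hypothetical.

```python
CORE_CLOCK_HZ = 192_000_000  # Apollo510 core clock from the configuration table


def latency_seconds(pmu_cycles: int, clock_hz: int = CORE_CLOCK_HZ) -> float:
    """Convert a PMU cycle count into wall-clock latency."""
    return pmu_cycles / clock_hz


def energy_per_inference_joules(latency_s: float, avg_power_w: float) -> float:
    """Energy per inference = latency x average power over the capture."""
    return latency_s * avg_power_w


# Placeholder inputs for illustration only (not measured values).
cycles = 1_920_000   # hypothetical PMU cycle count for one inference
avg_power_w = 0.005  # hypothetical 5 mW average power

latency_s = latency_seconds(cycles)
energy_j = energy_per_inference_joules(latency_s, avg_power_w)
print(f"latency: {latency_s * 1e3:.3f} ms, energy: {energy_j * 1e6:.1f} uJ")
```

Because the clock is fixed at 192 MHz, any reduction in cycle count translates directly into the same relative reduction in latency.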
Per-model results
| Model | Latency reduction | Energy reduction | Efficiency (inf / Joule) |
|---|---|---|---|
| Keyword Spotting (DS-CNN) | −9.7 % | −5.9 % | +6.3 % |
| Visual Wake Words (MobileNetV1) | −12.5 % | −15.9 % | +19.0 % |
| Anomaly Detection (Deep Autoencoder) | −4.4 % | −12.1 % | +13.7 % |
| Image Classification (ResNet) | −10.5 % | −19.6 % | +24.4 % |
All values are ATfE relative to GCC; negative is better for latency and energy, positive is better for efficiency.
Across the four reference models, ATfE delivered 4 %–13 % fewer cycles, 6 %–20 % less energy per inference, and 6 %–25 % more inferences per Joule than the same code built with GCC. The headline "up to 25 %" refers to the inferences-per-Joule improvement on Image Classification (the most demanding model in the suite). Critically, no model regressed on any metric — ATfE is strictly better than GCC across this benchmark.
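The efficiency column can be cross-checked against the energy column, assuming inferences per Joule is simply the reciprocal of energy per inference. The sketch below applies that relationship to the energy reductions from the table; the function name is hypothetical, and because the table inputs are already rounded, the results can differ from the published efficiency figures by about 0.1 percentage point.

```python
def efficiency_gain_pct(energy_reduction_pct: float) -> float:
    """Gain in inferences per Joule implied by an energy-per-inference reduction."""
    remaining = 1.0 - energy_reduction_pct / 100.0  # fraction of GCC energy still used
    return (1.0 / remaining - 1.0) * 100.0


# Energy reductions from the per-model table (ATfE relative to GCC).
energy_reduction = {
    "Keyword Spotting": 5.9,
    "Visual Wake Words": 15.9,
    "Anomaly Detection": 12.1,
    "Image Classification": 19.6,
}

for model, reduction in energy_reduction.items():
    print(f"{model}: +{efficiency_gain_pct(reduction):.1f} % inferences per Joule")
```

This also explains why the efficiency gain is always somewhat larger than the energy reduction: a 19.6 % drop in energy per inference leaves 80.4 % of the original energy, and 1 / 0.804 is roughly 1.244.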
When to expect the biggest gains
Speedup tracks how vectorizable a model is on MVE. Heavily quantized int8 convolutional and fully-connected layers benefit most (ResNet, MobileNetV1). Models dominated by very small kernels, or by operators that fall to HELIA hand-tuned paths (Anomaly Detection), see smaller compute-side wins but still benefit from ATfE's tighter code generation. On Anomaly Detection this shows up as an energy reduction (12.1 %) that is larger than the latency reduction (4.4 %).
Trade-offs
- Newer toolchain. ATfE is younger than GCC; expect occasional rough edges around uncommon linker-script directives or proprietary SDK glue code.
- Picolibc instead of newlib. Most projects work without changes, but if you rely on newlib-specific behavior (e.g. certain _sbrk patterns) you may need a small shim.
- No Arm Compiler 5 compatibility shims. ATfE follows the modern LLVM toolchain conventions; legacy armcc-era assembly may need updates.
For a full build walkthrough on Apollo510 + Zephyr, see Zephyr + heliaRT → Build.
Installation
- GCC (arm-none-eabi-gcc): can be downloaded from Arm Developer.
- Arm Compiler 6 (armclang): install Keil MDK or Arm Development Studio; armclang is included. Requires a commercial license.
- ATfE: download from the Arm Toolchain for Embedded releases.
Using with heliaRT
Each release ships archives for all three toolchains:
helia-rt-<tag>/cortex-m55/gcc/release/
helia-rt-<tag>/cortex-m55/armclang/release/
helia-rt-<tag>/cortex-m55/atfe/release/
Link the .a that matches your project's toolchain.
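For build scripts that pick the archive directory programmatically, the layout above can be expressed as a small helper. This is an illustrative sketch only: `archive_dir` is a hypothetical function, not part of heliaRT, and it assumes that other architectures and build types follow the same `<arch>/<toolchain>/<build type>` layout shown for cortex-m55 release builds. The release root keeps the `<tag>` placeholder because the actual tag depends on the release you downloaded.

```python
from pathlib import PurePosixPath

# Toolchain IDs from the comparison table.
SUPPORTED_TOOLCHAINS = ("gcc", "armclang", "atfe")


def archive_dir(
    release_root: str,
    toolchain: str,
    arch: str = "cortex-m55",
    build_type: str = "release",
) -> PurePosixPath:
    """Return the directory expected to hold the archive for one toolchain.

    Hypothetical helper: assumes the <arch>/<toolchain>/<build type> layout.
    """
    if toolchain not in SUPPORTED_TOOLCHAINS:
        raise ValueError(f"unknown toolchain ID: {toolchain!r}")
    return PurePosixPath(release_root) / arch / toolchain / build_type


# "helia-rt-<tag>" is a placeholder; substitute the extracted release directory.
print(archive_dir("helia-rt-<tag>", "atfe"))
```

Rejecting unknown toolchain IDs early avoids silently linking an archive built with a different compiler.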
Zephyr toolchain selection is handled by ZEPHYR_TOOLCHAIN_VARIANT and is independent of heliaRT.
The prebuilt module auto-selects the matching archive for gcc or atfe.
For full build commands and per-toolchain flags, see Zephyr + heliaRT → Build.
Quick reference:
    # GCC (default, no extra flags)
    west build -b apollo510_evb -s app/helia_rt_app

    # ATfE
    west build -b apollo510_evb -s app/helia_rt_app -- \
      -DZEPHYR_TOOLCHAIN_VARIANT=host \
      -DTOOLCHAIN_VARIANT_COMPILER=llvm \
      -DLLVM_TOOLCHAIN_PATH=/path/to/ATfE \
      -DCONFIG_LLVM_USE_LLD=y -DCONFIG_COMPILER_RT_RTLIB=y
CI Matrix
The release workflow builds 18 combinations (2 architectures × 3 toolchains × 3 build types):
| Arch | Toolchain | Build types |
|---|---|---|
| cortex-m4+fp | gcc, armclang, atfe | debug, release, release_with_logs |
| cortex-m55 | gcc, armclang, atfe | debug, release, release_with_logs |
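The 18 combinations are the Cartesian product of the three columns. A short sketch that enumerates them is shown here; it only illustrates the arithmetic and is not the project's actual workflow definition.

```python
from itertools import product

ARCHS = ["cortex-m4+fp", "cortex-m55"]
TOOLCHAINS = ["gcc", "armclang", "atfe"]
BUILD_TYPES = ["debug", "release", "release_with_logs"]

# Every (arch, toolchain, build type) triple built by the release workflow.
matrix = list(product(ARCHS, TOOLCHAINS, BUILD_TYPES))

for arch, toolchain, build_type in matrix:
    print(f"{arch}/{toolchain}/{build_type}")

print(f"{len(matrix)} combinations")
```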
Next Steps
- SPEED vs SIZE — choose the build variant
- Kernel Selection — choose the backend

¹ Measured across the MLPerf Tiny v1.1 reference suite on the Apollo510 EVB (Cortex-M55 + Helium @ 192 MHz, 10 iterations) using heliaRT v1.13.1. Latency derived from PMU cycles; energy captured with a Joulescope. Compilers: ATfE 22.1 vs arm-none-eabi-gcc 14.2. Headline "up to 25 %" refers to the inferences-per-Joule improvement on Image Classification (ResNet, +24.4 %, rounded). Every model also ran with lower latency under ATfE (4 %–13 % fewer cycles) and lower energy per inference (6 %–20 %).