Kernel Benchmarks
Apollo510 EVB · Cortex-M55 + Helium MVE
heliaRT is up to 706× faster than upstream LiteRT for Micro
Across 36 operators benchmarked on real hardware, heliaRT matches or beats upstream LiteRT for Micro on every single operator — with dramatic gains on activations, reductions, and data-movement ops that LiteRT for Micro leaves unoptimized.
Speedup at a Glance
50–706× — LiteRT for Micro falls back to scalar C reference · 1.3–13× — heliaRT MVE-optimized paths · ~1× — Both use CMSIS-NN / Helium MVE
Detailed Results
| # | Operator | heliaRT Cycles | LiteRT Cycles | Speedup |
|---|---|---|---|---|
| 1 | CONV_2D |
1,621,810 | 1,642,809 | 1.01× |
| 2 | DEPTHWISE_CONV_2D |
613,204 | 636,011 | 1.04× |
| 3 | FULLY_CONNECTED |
20,844 | 27,950 | 1.34× |
| 4 | TRANSPOSE_CONV |
358,543 | 359,397 | 1.00× |
| 5 | AVERAGE_POOL_2D |
98,581 | 98,590 | 1.00× |
| 6 | SOFTMAX |
9,379 | 9,385 | 1.00× |
| 7 | ADD |
218,395 | 218,369 | 1.00× |
| 8 | MUL |
95,203 | 132,152 | 1.39× |
| 9 | LOGISTIC |
1,015 | 53,208 | 52.4× |
| 10 | PAD |
6,357 | 6,354 | 1.00× |
| 11 | RELU |
98,760 | 985,349 | 10.0× |
| 12 | HARD_SWISH |
65,938 | 870,531 | 13.2× |
| 13 | SUB |
218,337 | 1,921,818 | 8.8× |
| 14 | CONCATENATION |
38,745 | 93,984 | 2.4× |
| 15 | SPLIT |
251,288 | 801,614 | 3.2× |
| 16 | STRIDED_SLICE |
2,250 | 13,900 | 6.2× |
| 17 | MEAN |
22,408 | 2,230,860 | 99.6× |
| 18 | REDUCE_MAX |
3,808 | 2,688,873 | 706× |
| 19 | BATCH_MATMUL |
214,753 | 290,100 | 1.35× |
| 20 | MAX_POOL_2D |
38,405 | 38,405 | 1.00× |
| 21 | PADV2 |
6,377 | 6,450 | 1.01× |
| 22 | TRANSPOSE |
21,820 | 21,701 | ~1.00× |
| 23 | MAXIMUM |
7,811 | 9,861 | 1.26× |
| 24 | MINIMUM |
7,809 | 9,860 | 1.26× |
| 25 | RELU6 |
8,544 | 197,167 | 23.1× |
| 26 | TANH |
82,301 | 7,945,776 | 96.5× |
| 27 | LEAKY_RELU |
143,929 | 968,936 | 6.7× |
| 28 | EQUAL |
420,227 | 1,547,914 | 3.7× |
| 29 | GREATER |
512,563 | 1,498,903 | 2.9× |
| 30 | LESS |
518,818 | 1,512,053 | 2.9× |
| 31 | RESHAPE |
5,425 | 9,596 | 1.77× |
| 32 | SPLIT_V |
233,019 | 760,585 | 3.3× |
| 33 | PACK |
10,697 | 98,739 | 9.2× |
| 34 | NOT_EQUAL |
115,516 | 115,544 | 1.00× |
| 35 | FILL |
2,422 | 33,143 | 13.7× |
| 36 | ZEROS_LIKE |
8,581 | 16,840 | 1.96× |
Why It Matters
No silent fallbacks
Upstream LiteRT for Micro only optimizes ~14 operators with CMSIS-NN. The rest — activations
like RELU and LOGISTIC, reductions like MEAN and REDUCE_MAX, data-movement
ops like SPLIT and CONCATENATION — fall back to scalar C reference code that
ignores Helium MVE entirely. These "silent fallbacks" can dominate inference time
in real models, even though they look cheap on paper.
heliaRT closes this gap with dedicated MVE kernels for 36 operators — every CMSIS-NN-optimized op plus 22 more. The result: no operator is left behind, and your model runs at full hardware capability from end to end.
What this means for your product
Lower latency
Operators like MEAN (99.6×) and REDUCE_MAX (706×) often dominate
post-convolution pooling and classification stages. heliaRT eliminates
these hidden bottlenecks.
Longer battery life
Fewer cycles per inference means less active time and more time in sleep. Every cycle saved is energy saved.
Drop-in compatible
heliaRT uses the same LiteRT API and .tflite models. Switch backends
without changing your application code.
No regressions
Across all operators, heliaRT matches or beats LiteRT for Micro. Zero trade-offs, zero surprises.
Test Environment
| Parameter | Value |
|---|---|
| Board | Apollo510 EVB (Cortex-M55 + Helium MVE) |
| Toolchain | arm-none-eabi-gcc v15.2.1 |
| heliaRT | v1.16.0 |
| LiteRT for Micro | upstream LiteRT for Micro (2.3a snapshot) |
| Iterations | 100 (10 warmup) |
| Quantization | int8 (all models) |
Methodology
Each operator is exercised by a single-operator int8 TFLite model (input
shape [1,32,32,16] for spatial ops, appropriate shapes for non-spatial ops).
All cycle counts are median values over 100 iterations after 10 warmup
iterations. The same Apollo510 EVB and GCC toolchain are used for all runs.
Speedup is calculated as litert_cycles / helia_cycles.
Toolchain Comparison (ATfE vs GCC)
For the first published benchmark — ATfE 22.1 vs arm-none-eabi-gcc 14.2 across the MLPerf Tiny v1.1 suite on Apollo510 — see Toolchains → Why ATfE. That section includes the full methodology, a per-model results table, and a Chart.js plot of latency, energy, and efficiency improvements.