Kernel Benchmarks

Apollo510 EVB · Cortex-M55 · 96 MHz LP mode

Peak speedups at a glance

heliaCORE’s optimized kernels deliver large gains on real hardware. The summary figures below are the peak speedups, measured in average cycle count, across the kernels benchmarked on this page.

MVE vs REF: up to 11.98× faster than the REF path
MVE vs DSP: up to 7.83× faster than the DSP path
DSP vs REF: up to 3.26× faster than the REF path

This page shows how heliaCORE performs on Apollo510 (Cortex-M55) across the portable C reference (REF), DSP, and MVE execution paths. Because all three paths run on the same SoC at the same clock speed, the results give an apples-to-apples view of the performance gain available from each ISA optimization path.

Note

The REF path represents the portable C implementation of each kernel compiled for Cortex-M55. Compiler optimizations may still generate DSP or other architecture-specific instructions where profitable. REF therefore reflects the practical baseline performance achievable from portable source code on M55, rather than a strictly scalar instruction stream.

Convolution & matrix-multiply kernels

The largest MVE gains appear in MAC-dominated kernels, where Helium’s 128-bit vector engine amortizes unpacking, memory access, and accumulation overhead across wide vector operations. The DSP path improves throughput on every s8 and s16 kernel in this table (1.18x to 3.26x), but three of the four s4 kernels run slower than REF (0.77x to 0.86x) because they remain dominated by packing and unpacking overhead.

The x-axis reports speedup relative to REF on a logarithmic scale; values greater than 1.0x indicate lower average cycle count than REF.

| Kernel | Shape | REF (GCC, cycles) | DSP (GCC, cycles) | MVE (GCC, cycles) | DSP speedup | MVE speedup |
|---|---|---|---|---|---|---|
| arm_convolve_s8 (contains full im2col) | 32x32x64 k3 oc64 | 90,776,688 | 59,320,711 | 7,577,201 | 1.53x | 11.98x |
| arm_convolve_s4 (contains full im2col) | 32x32x64 k3 oc64 | 99,947,570 | 130,619,816 | 18,468,361 | 0.77x | 5.41x |
| arm_convolve_s16 (contains full im2col) | 32x32x64 k3 oc64 | 96,570,332 | 82,166,983 | 31,488,224 | 1.18x | 3.07x |
| arm_convolve_1x1_s8_fast (contains simplified im2col) | 32x32x64 oc64 | 11,885,922 | 8,966,843 | 1,707,351 | 1.33x | 6.96x |
| arm_depthwise_conv_s8_opt (contains input packing) | 32x32x64 k3 | 17,841,554 | 7,509,340 | 1,741,530 | 2.38x | 10.24x |
| arm_depthwise_conv_s4_opt (contains input packing) | 32x32x64 k3 | 7,984,346 | 9,767,607 | 2,188,508 | 0.82x | 3.65x |
| arm_depthwise_conv_fast_s16 (contains input packing) | 32x32x64 k3 | 18,069,189 | 7,571,340 | 3,861,216 | 2.39x | 4.68x |
| arm_nn_mat_mult_nt_t_s8 | 64x512 × 256x512 | 19,959,308 | 13,725,751 | 2,086,979 | 1.45x | 9.56x |
| arm_nn_mat_mult_nt_t_s4 | 64x512 × 256x512 | 26,937,786 | 26,723,602 | 4,197,172 | 1.01x | 6.42x |
| arm_nn_vec_mat_mult_t_s8 | 512 × 256 | 491,641 | 331,411 | 69,076 | 1.48x | 7.12x |
| arm_nn_vec_mat_mult_t_s16 | 512 × 256 | 562,008 | 421,257 | 99,807 | 1.33x | 5.63x |
| arm_nn_vec_mat_mult_t_s4 | 512 × 256 | 504,554 | 588,599 | 88,622 | 0.86x | 5.69x |
| arm_avgpool_s8 | 32x32x64 k3 | 13,228,812 | 4,057,857 | 2,135,472 | 3.26x | 6.19x |
| arm_avgpool_s16 | 32x32x64 k3 | 6,517,842 | 4,187,804 | 2,408,673 | 1.56x | 2.71x |
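The speedup columns above are consistent with dividing the REF average cycle count by the average cycle count of the path being compared. The following is a minimal sketch of that arithmetic, not code from the benchmark harness; the function name `speedup_vs_ref` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: speedup of a path relative to a baseline.
 * Baseline cycles divided by path cycles, so values above 1.0 mean the
 * path used fewer cycles than the baseline. */
double speedup_vs_ref(uint32_t ref_cycles, uint32_t path_cycles)
{
    assert(path_cycles > 0u); /* avoid division by zero */
    return (double)ref_cycles / (double)path_cycles;
}
```

With the arm_convolve_s8 row, `speedup_vs_ref(90776688u, 7577201u)` gives roughly 11.98 (MVE vs REF) and `speedup_vs_ref(90776688u, 59320711u)` gives roughly 1.53 (DSP vs REF). Passing the DSP count as the baseline, `speedup_vs_ref(59320711u, 7577201u)`, gives roughly 7.83, which matches the MVE vs DSP summary figure.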

Elementwise & scalar kernels

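Elementwise kernels are purely data-parallel and benefit from MVE’s ability to process 16 int8 or 8 int16 values per vector instruction. MVE speedups here range from 2.71x to 4.72x, while the DSP path stays between 0.73x and 1.28x of REF.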
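The lane counts quoted above follow from the 128-bit vector width. As a rough illustration, the sketch below derives lanes per vector and the number of vector-loop iterations needed for a buffer of `n` elements. The function names are hypothetical, and the code models only the arithmetic, not how the heliaCORE kernels are actually written.

```c
#include <assert.h>
#include <stdint.h>

#define MVE_VECTOR_BITS 128u /* Helium vector register width in bits */

/* Elements that fit in one 128-bit vector for a given element width. */
uint32_t lanes_per_vector(uint32_t element_bits)
{
    assert(element_bits == 8u || element_bits == 16u || element_bits == 32u);
    return MVE_VECTOR_BITS / element_bits;
}

/* Vector-loop trip count for n elements, rounded up so that a partial
 * final vector still counts as one iteration. */
uint32_t vector_iterations(uint32_t n, uint32_t element_bits)
{
    uint32_t lanes = lanes_per_vector(element_bits);
    return (n + lanes - 1u) / lanes;
}
```

For the n4096 shape used in this section, that works out to 256 vector iterations for int8 data and 512 for int16 data. A length that is not a multiple of the lane count adds one partial iteration, which Helium can cover with tail predication.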

The x-axis reports speedup relative to REF on a logarithmic scale; values greater than 1.0x indicate lower average cycle count than REF.

| Kernel | Shape | REF (GCC, cycles) | DSP (GCC, cycles) | MVE (GCC, cycles) | DSP speedup | MVE speedup |
|---|---|---|---|---|---|---|
| arm_elementwise_add_s8 | n4096 | 250,005 | 285,797 | 61,575 | 0.87x | 4.06x |
| arm_elementwise_add_s16 | n4096 | 252,004 | 252,078 | 53,370 | 1.00x | 4.72x |
| arm_elementwise_mul_s8 | n4096 | 98,358 | 105,093 | 22,647 | 0.94x | 4.34x |
| arm_elementwise_mul_s16 | n4096 | 77,914 | 79,962 | 28,755 | 0.97x | 2.71x |
| arm_elementwise_sub_s8 | n4096 | 254,030 | 285,757 | 61,632 | 0.89x | 4.12x |
| arm_elementwise_mul_acc_s16 | n4096 | 86,175 | 90,198 | 31,851 | 0.96x | 2.71x |
| arm_elementwise_mul_s16_s8 | n4096 | 86,098 | 84,088 | 28,778 | 1.02x | 2.99x |
| arm_elementwise_mul_s16_batch_offset | n4096 | 82,032 | 82,025 | 28,780 | 1.00x | 2.85x |
| arm_add_scalar_s8 | n4096 | 131,165 | 179,461 | 42,129 | 0.73x | 3.11x |
| arm_sub_scalar_s8 | n4096 | 172,164 | 197,856 | 42,161 | 0.87x | 4.08x |
| arm_mul_scalar_s8 | n4096 | 90,169 | 94,308 | 20,087 | 0.96x | 4.49x |
| arm_mul_scalar_s16 | n4096 | 77,911 | 77,908 | 27,728 | 1.00x | 2.81x |
| arm_comparison_s8 | n4096 | 266,441 | 208,747 | 78,071 | 1.28x | 3.41x |
| arm_comparison_s16 | n4096 | 270,764 | 237,720 | 77,986 | 1.14x | 3.47x |

Test conditions

| Parameter | Value |
|---|---|
| Board | Apollo510 EVB (Cortex-M55) |
| Clock | 96 MHz (LP mode) |
| Timer | DWT->CYCCNT |
| ISA paths | Scalar C (REF), DSP SIMD, MVE / Helium |
| Toolchain | arm-none-eabi-gcc 14.3.0 |
| Optimization | -O3, CMSIS_NN_USE_REQUANTIZE_INLINE_ASSEMBLY |
| Iterations | 100 per kernel |
| Metric | Average CPU cycles |
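The conditions above describe averaging a cycle counter over 100 runs per kernel. One way such a measurement could be structured is sketched below. This is an assumption-laden illustration and not the actual harness: the names `average_cycles`, `cycles_to_ms`, `cycle_read_fn`, and `kernel_fn` are hypothetical, and the source does not say whether each run is timed separately or the whole loop is timed once. The counter is passed in as a function pointer so the sketch also builds off-target. On the board, that function would return `DWT->CYCCNT` once the cycle counter has been enabled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hooks: on target, a cycle_read_fn would return DWT->CYCCNT
 * and a kernel_fn would wrap one call to the kernel under test. */
typedef uint32_t (*cycle_read_fn)(void);
typedef void (*kernel_fn)(void *ctx);

/* Average cycles per call, timing each of `iterations` runs separately. */
uint32_t average_cycles(kernel_fn kernel, void *ctx,
                        cycle_read_fn read_cycles, uint32_t iterations)
{
    assert(kernel != NULL && read_cycles != NULL && iterations > 0u);
    uint64_t total = 0u;
    for (uint32_t i = 0u; i < iterations; ++i) {
        uint32_t start = read_cycles();
        kernel(ctx);
        uint32_t end = read_cycles();
        /* Unsigned subtraction stays correct across one 32-bit wrap. */
        total += (uint32_t)(end - start);
    }
    return (uint32_t)(total / iterations);
}

/* Convert a cycle count to milliseconds at a given core clock. */
double cycles_to_ms(uint32_t cycles, uint32_t clock_hz)
{
    assert(clock_hz > 0u);
    return (double)cycles * 1000.0 / (double)clock_hz;
}

/* Host-side stand-ins (not part of any real harness): a fake counter
 * that starts close to the 32-bit limit so it wraps during the run, and
 * a fake kernel that "costs" exactly 250 cycles per call. */
uint32_t fake_counter = UINT32_MAX - 1000u;
uint32_t fake_read_cycles(void) { return fake_counter; }
void fake_kernel(void *ctx) { (void)ctx; fake_counter += 250u; }
```

With the stand-ins, `average_cycles(fake_kernel, NULL, fake_read_cycles, 100u)` returns 250 even though the fake counter wraps partway through. At the 96 MHz clock listed above, `cycles_to_ms(7577201u, 96000000u)` puts the MVE arm_convolve_s8 result at roughly 79 ms per call.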