Optimizing using Autodeploy - Case Studies
This document offers a quick guide to using neuralSPOT's Autodeploy tool to profile, deploy, and package a TFLite model for Apollo5 platforms.
Relevant Documents
- Autodeploy Usage - installation and usage of Autodeploy and other neuralSPOT tools
- From TF to EVB - deep dive into Autodeploy
- Optimization Basics - Very basic power optimization notes
Basics
Autodeploy analyzes, profiles, packages, and deploys TFLite model files on Apollo devices.
Installing Autodeploy Dependencies
cd ns-mirror
pip install .
Getting the Hardware Ready
Autodeploy uses RPC-over-USB or RPC-over-UART to communicate with the device. Optionally, it can use a Joulescope to measure power automatically - this requires GPIO connections between the EB/EVB and the Joulescope.
Development Platform | Transport | USB Connections |
---|---|---|
Apollo3P, Apollo4 Lite (including BLE variants) | UART | 1 (the Jlink connection) |
Apollo4P | USB or UART | 2 (one for Jlink, one for USB) |
Apollo510 (with J17 selecting Jlink, which is the default) | USB or UART | 2 (one for Jlink, one for USB) |
Apollo510 (with J17 selecting AP5) | USB or UART | 1 (the one labeled USB_AP5) |
The RPC transport mode is selected using the --transport command line parameter.
Running Autodeploy
ns_autodeploy --model-name denoise --tflite-filename denoise.tflite
Autodeploy will then run for a while, generating numerous reports, source packages, and binaries.
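When Autodeploy is driven from a script, it can help to assemble the command line in one place. The sketch below only uses the flags named in this document (--model-name, --tflite-filename, --transport, --max-arena-size, --joulescope). The helper name `build_autodeploy_command` is hypothetical, and the values each flag accepts (for example, the spelling of the transport name or the unit of the arena size) are assumptions that should be checked against the tool's help output.

```python
import shlex


def build_autodeploy_command(model_name, tflite_filename,
                             transport=None, max_arena_size=None,
                             joulescope=False):
    """Return an argv list for ns_autodeploy (hypothetical helper).

    Only flags mentioned in this document are emitted; optional flags
    are left out entirely so Autodeploy falls back to its defaults.
    """
    argv = ["ns_autodeploy",
            "--model-name", model_name,
            "--tflite-filename", tflite_filename]
    if transport is not None:
        # Assumption: the value is passed through unchanged.
        argv += ["--transport", transport]
    if max_arena_size is not None:
        argv += ["--max-arena-size", str(max_arena_size)]
    if joulescope:
        argv.append("--joulescope")
    return argv


cmd = build_autodeploy_command("denoise", "denoise.tflite", joulescope=True)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` without going through a shell.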
Case Study - ECG Denoiser
We'll use Ambiq's ECG Denoising model to illustrate how we use Autodeploy as part of our development process. The denoise.tflite file must be compatible with TFLite for Microcontrollers (TFLM), though Autodeploy will print out useful(ish) errors if this is not the case.
Fitting on the Device
Models require somewhere to store the model weights, and an arena in which to place activations and other tensors/state. We prefer placing these in the 'smallest place possible'. In the case of this model, both the weights (22kB) and arena (20kB) fit within the large DTCM (data tightly-coupled-memory) on the Apollo510. Autodeploy allows the placement of weights and arena in TCM (default), SRAM, MRAM (for weights), and PSRAM. Here, we'll leave the defaults as-is, but the MobilenetV3 scenario requires memory placement settings.
Results
Autodeploy will produce:
- A per-layer profile
- High resolution power and performance measurements
- A minimal example which can be linked into non-neuralSPOT codebases
- A minimal AmbiqSuite example which can be compiled from within AmbiqSuite.
Here, we will focus on the first two.
Per-Layer Profile
Event | Tag | uSeconds | Est MACs | ARM_PMU_MVE_INST_RETIRED | ARM_PMU_MVE_INT_MAC_RETIRED | ARM_PMU_INST_RETIRED | ARM_PMU_BUS_CYCLES |
---|---|---|---|---|---|---|---|
0 | RESHAPE | 8 | 0 | 2 | 0 | 395 | 1124 |
1 | DEPTHWISE_CONV_2D | 300 | 1792 | 14350 | 2112 | 57151 | 74782 |
2 | DEPTHWISE_CONV_2D | 290 | 1792 | 14350 | 2112 | 57153 | 72199 |
3 | CONV_2D | 224 | 2048 | 12813 | 2560 | 35451 | 55559 |
4 | DEPTHWISE_CONV_2D | 351 | 14336 | 24782 | 4224 | 70854 | 87128 |
5 | MEAN | 1399 | 0 | 2 | 0 | 261755 | 348604 |
6 | CONV_2D | 6 | 32 | 49 | 5 | 609 | 1281 |
7 | CONV_2D | 5 | 32 | 85 | 10 | 730 | 1079 |
8 | MINIMUM | 13 | 0 | 10 | 0 | 1260 | 3466 |
9 | RELU | 7 | 0 | 2 | 0 | 700 | 1384 |
10 | MUL | 662 | 0 | 6 | 0 | 138755 | 164742 |
11 | CONV_2D | 367 | 32768 | 25613 | 5120 | 67338 | 91408 |
12 | DEPTHWISE_CONV_2D | 4339 | 28672 | 12 | 0 | 873508 | 1081163 |
13 | MEAN | 2765 | 0 | 2 | 0 | 520042 | 689073 |
14 | CONV_2D | 5 | 128 | 85 | 10 | 730 | 1078 |
15 | CONV_2D | 6 | 128 | 157 | 20 | 972 | 1348 |
16 | MINIMUM | 12 | 0 | 10 | 0 | 1756 | 2734 |
17 | RELU | 7 | 0 | 2 | 0 | 1124 | 1447 |
18 | MUL | 1272 | 0 | 6 | 0 | 268353 | 316995 |
19 | CONV_2D | 538 | 98304 | 38413 | 7680 | 99356 | 134023 |
20 | DEPTHWISE_CONV_2D | 6442 | 43008 | 12 | 0 | 1299717 | 1605489 |
21 | MEAN | 4136 | 0 | 2 | 0 | 778197 | 1030730 |
22 | CONV_2D | 6 | 288 | 169 | 27 | 899 | 1276 |
23 | CONV_2D | 8 | 288 | 229 | 30 | 1214 | 1619 |
24 | MINIMUM | 14 | 0 | 10 | 0 | 2252 | 3259 |
25 | RELU | 8 | 0 | 2 | 0 | 1508 | 1810 |
26 | MUL | 1893 | 0 | 6 | 0 | 398078 | 471465 |
27 | CONV_2D | 816 | 196608 | 71693 | 18432 | 152251 | 203093 |
28 | CONV_2D | 85 | 57344 | 10154 | 3646 | 16962 | 21124 |
29 | RESHAPE | 3 | 0 | 2 | 0 | 400 | 627 |
This characterization table is exported as a CSV for further analysis. Just 'eyeballing' it, we can see that MVE events roughly track estimated MACs per layer, though in some cases the MVE operation count is much lower than the MAC count, suggesting good targets for TFLM optimization.
Of special interest, however, are layers 12 and 20. Here, there are almost no MVE operations, and these are pretty 'chunky' layers. Deeper inspection of the model reveals that these layers use dilation, as opposed to other depthwise convolution layers in this model.
The data scientist can now use this information to slightly alter the neural architecture (using '0's between the filters, resulting in a larger model but possibly faster computation), look into why TFLM isn't using MVE here, or re-architect the model to not use dilation.
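The 'eyeballing' step can also be automated once the CSV is exported. The sketch below flags layers that have MACs to compute but retire almost no MVE instructions. It assumes the CSV column names match the table headers shown here; the embedded sample is a subset of the rows from that table, and the `min_mve_per_mac` threshold is an arbitrary illustrative value, not something Autodeploy defines.

```python
import csv
import io

# A few rows copied from the per-layer profile table.
# Assumption: the exported CSV uses these same column names.
PROFILE_CSV = """Event,Tag,uSeconds,Est MACs,ARM_PMU_MVE_INST_RETIRED
4,DEPTHWISE_CONV_2D,351,14336,24782
5,MEAN,1399,0,2
11,CONV_2D,367,32768,25613
12,DEPTHWISE_CONV_2D,4339,28672,12
20,DEPTHWISE_CONV_2D,6442,43008,12
27,CONV_2D,816,196608,71693
"""


def find_unvectorized_layers(csv_text, min_mve_per_mac=0.01):
    """Return (event, tag, uSeconds) for layers with MACs but few MVE ops."""
    flagged = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        macs = int(row["Est MACs"])
        mve = int(row["ARM_PMU_MVE_INST_RETIRED"])
        # Layers with zero estimated MACs (MEAN, MUL, ...) are skipped:
        # the ratio is undefined for them.
        if macs > 0 and mve / macs < min_mve_per_mac:
            flagged.append((int(row["Event"]), row["Tag"],
                            int(row["uSeconds"])))
    return flagged


for event, tag, usec in find_unvectorized_layers(PROFILE_CSV):
    print(f"layer {event} ({tag}): {usec} us, almost no MVE instructions")
```

On this sample the only layers flagged are 12 and 20, the two dilated depthwise convolutions. Layers with zero estimated MACs, such as MEAN and MUL, need a different check (for example, time per layer), since this ratio does not apply to them.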
Power and Performance
If a Joulescope is wired up, Autodeploy can use it to provide power and performance measurements by adding --joulescope to the command line:
ns_autodeploy --model-name denoise --tflite-filename denoise.tflite --joulescope
Case Study - Mobilenet V3 Small
Mobilenet is an off-the-shelf model that is often adapted to image classification tasks. It is quite large for an edge model, so we'll use it to illustrate how to run large models.
A priori, we only know the model size (2333232 bytes, about 2.3 MB), which can be guessed from the size of the file, but which Autodeploy can determine exactly via analysis of the network.
The arena size is a different matter - its size is only known when the model's tensors are allocated by TFLM. This means we have to 'guess' at an arena size, and hope TFLM doesn't fail at instantiation time. This is less than ideal, but the experiments are fast, and it usually only takes a couple of iterations to find the right size.
If an arena size is not specified on the command line via --max-arena-size, Autodeploy will select the largest SRAM size available to the target. For AP510 EVBs, there is also an option to put models and/or the arena in PSRAM (see the Very Large Models section of the Autodeploy document). That same document details how Autodeploy attempts to find the optimal settings for deploying the model.
Note: The arena size is only unknown the first time Autodeploy is run on a model. Autodeploy will record the actual size after the first successful run, and can use that for subsequent runs.
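The 'guess and retry' process for a first run can be sketched as a doubling search followed by a bisection. This is not Autodeploy's own strategy (that is described in the Autodeploy document); it only illustrates why a handful of attempts is usually enough. The `fits` callback is a hypothetical stand-in for 'run once with this arena size and report whether TFLM instantiated successfully', and the sizes in the usage example are made up.

```python
def find_min_arena_size(fits, start=64, limit=4096):
    """Return the smallest size in (0, limit] for which fits(size) is True.

    fits: hypothetical callback, True if the model allocates at that size.
    Assumes fits is monotonic: if a size works, every larger size works.
    """
    low, high = 0, start          # low: largest size known to fail (or 0)
    while not fits(high):         # phase 1: double until allocation succeeds
        if high >= limit:
            raise ValueError("model does not fit within the limit")
        low, high = high, min(high * 2, limit)
    while high - low > 1:         # phase 2: bisect between fail and success
        mid = (low + high) // 2
        if fits(mid):
            high = mid
        else:
            low = mid
    return high


# Usage with a fake target that needs at least 300 units of arena.
attempts = []

def fake_fits(size):
    attempts.append(size)
    return size >= 300

print(find_min_arena_size(fake_fits), "found in", len(attempts), "attempts")
```

In practice only the first phase is needed: once any run succeeds, Autodeploy records the actual arena size, so the bisection is shown here only for completeness.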
This is a large model with a large arena - Autodeploy will attempt to find the best fit for both weights and activations by default.
ns_autodeploy --model-name mnv3sm --tflite-filename mobilenet_v3_sm_a075_quant.tflite
This command will result in the model being placed in MRAM and the arena in TCM (for an Apollo510, at least).
Per Layer Profiling
We'll skip this for MobilenetV3 because of its size. Autodeploy does export all 115 layers with MACs, PMU, and cycle time details which can be subsequently analyzed.
Detailed Optimization
For this model, we used the automatic Joulescope instrumentation to measure power and timing details, and we ran the same model in both high performance and low power modes, using GCC13 and Armclang compilers.
Model | Platform | Revision | Board | Compiler | Date | Model Size (bytes) | Arena Size | Model Location | Arena Location | MACs | HP Time | HP Energy | HP Power | LP Time | LP Energy | LP Power |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
mobilenet_v3_sm_a075_quant.tflite | AP510 | RevB PCM1.0 | EVB | GCC13 | 8-Oct | 2333232 | 535 | MRAM | SRAM | 43399344 | 849.883 | 21708.563 | 25.543 | 2182.874 | 17120.087 | 7.843 |
mobilenet_v3_sm_a075_quant.tflite | AP510 | RevB PCM1.0 | EVB | Armclang | 8-Oct | 2333232 | 535 | MRAM | SRAM | 43399344 | 800.151 | 20395 | 25.489 | 2045.89 | 16433 | 8.03 |
Here, we can see an interesting result: Armclang produces code that is roughly 5 to 6% faster in both modes. We find this is true across multiple model architectures.
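The comparison can be reproduced from the table with a few lines of arithmetic. The sketch below copies the two timing values per compiler (the 12th and 15th columns) and computes how much less time the Armclang build needs. Reading those columns as inference time in the two power modes is an interpretation of the table, and the dictionary keys `mode_a` and `mode_b` are placeholder names, not Autodeploy terminology.

```python
# Timing values copied from the 12th and 15th columns of the results table.
# Assumption: both columns are inference times in the same unit, one per
# power mode; "mode_a" and "mode_b" are placeholder labels.
times = {
    "GCC13":    {"mode_a": 849.883, "mode_b": 2182.874},
    "Armclang": {"mode_a": 800.151, "mode_b": 2045.89},
}


def percent_less_time(baseline, candidate):
    """Percentage by which candidate is faster than baseline."""
    return 100.0 * (baseline - candidate) / baseline


for mode in ("mode_a", "mode_b"):
    gain = percent_less_time(times["GCC13"][mode], times["Armclang"][mode])
    print(f"{mode}: Armclang needs {gain:.1f}% less time than GCC13")
```

Because both builds run the same model on the same board, the ratio is meaningful even without knowing the absolute unit of the timing columns.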
Note: Autodeploy doesn't make 'special efforts' to minimize power - the numbers it measures are for model optimization, and what matters there is relative performance. Data scientists may want to know how much power different models use in an apples-to-apples situation, but they'll leave the deep power optimization to the platform experts. That being said, neuralSPOT does have many power optimization tools that are not leveraged by Autodeploy.