MulticoreWare

Case Studies

Optimizing a CNN Model on a Low-Power Vision DSP

March 27, 2024

The Client

The customer is an IP company specializing in vision DSPs used for imaging, computer vision, and AI applications.

Challenge

The goal of the project was to run end-to-end inference of the Inception-V3 CNN image classification model on the customer’s Vision DSP.

Solution

The project used a range of tools and techniques, including C/C++, quantization, CNN inference, DSP intrinsics, DMA, and data tiling.

We started from a floating-point Inception-V3 model pre-trained on the ImageNet dataset, with Top-1 and Top-5 accuracy of 74% and 91.62% respectively. We then quantized the float model to the INT8 data type using McW’s custom quantization algorithm. Finally, we implemented an x86-based reference Inception-V3 pipeline for the INT8 data type.
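The case study does not describe McW’s custom quantization algorithm. Purely as a generic illustration of what converting float values to INT8 involves, the sketch below uses standard symmetric per-tensor quantization, which is an assumption and not the method used in the project. The function names and the sample weights are hypothetical.

```python
# Generic symmetric per-tensor INT8 quantization (illustrative only;
# NOT the custom algorithm referred to in the case study).

def compute_scale(values, num_bits=8):
    """Map the largest absolute value to the largest positive integer code."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0

def quantize_int8(values, scale):
    """Round each value to the nearest integer code and clamp to [-128, 127]."""
    return [max(-128, min(127, round(v / scale))) for v in values]

def dequantize(codes, scale):
    """Recover approximate float values from integer codes."""
    return [c * scale for c in codes]

# Hypothetical weights, chosen only to keep the arithmetic easy to follow.
weights = [-0.62, -0.1, 0.0, 0.25, 1.27]
scale = compute_scale(weights)
codes = quantize_int8(weights, scale)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, max_error)
```

With this scheme the round-trip error of any value inside the clamping range is at most half the scale, which is why a well-chosen scale can keep the accuracy loss small.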

Top-5 / Top-1 Classification Accuracy for Float vs. 8-Bit Quantized Graph

MulticoreWare hand-optimized various layers and operations of the Inception-V3 model for the Vision DSP, building an end-to-end intrinsics-based pipeline whose accuracy matches that of the x86-based INT8 pipeline. Because Inception-V3 has many layers and the DSP has limited on-chip data memory, we designed and implemented DMA and data tiling algorithms to move data efficiently from external memory to on-chip memory.
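To make the idea of data tiling concrete, here is a minimal sketch of how a feature map might be split into row tiles that fit an on-chip memory budget. It assumes a stride-1 convolution without padding, INT8 data, and a budget that covers only the input tile (output and weight buffers are ignored). The function name, the 64 KiB budget, and the 35x35x192 feature map are hypothetical and are not taken from the project.

```python
def plan_row_tiles(height, width, channels, kernel, budget_bytes):
    """Split a feature map into row tiles whose input slice fits the budget.

    Returns (in_start, in_end, out_start, out_end) tuples with exclusive ends.
    Adjacent input slices overlap by kernel - 1 rows (the halo).
    """
    row_bytes = width * channels          # INT8: one byte per element
    max_in_rows = budget_bytes // row_bytes
    if max_in_rows < kernel:
        raise ValueError("budget too small for even one output row")
    out_rows_per_tile = max_in_rows - kernel + 1
    out_height = height - kernel + 1      # stride 1, no padding
    tiles = []
    for out_start in range(0, out_height, out_rows_per_tile):
        out_end = min(out_start + out_rows_per_tile, out_height)
        in_start = out_start
        in_end = out_end + kernel - 1     # extra rows needed by the kernel
        tiles.append((in_start, in_end, out_start, out_end))
    return tiles

# Hypothetical layer and memory budget.
tiles = plan_row_tiles(height=35, width=35, channels=192,
                       kernel=3, budget_bytes=64 * 1024)
for tile in tiles:
    print(tile)
```

Each tuple describes one DMA transfer: the input rows to fetch from external memory and the output rows that can be computed from them.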

Custom Quantization Logic:

MulticoreWare’s solution featured custom quantization logic with minimal loss in Top-1 and Top-5 classification accuracy for the quantized model. We hand-optimized approximately 94 layers of the Inception-V3 model using DSP intrinsics, achieving performance close to theoretical estimates. Our team also tiled the inputs, outputs, and weights and built an end-to-end optimized Inception-V3 pipeline that effectively hides DMA data transfer latency.
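A common way to hide DMA latency is double (ping-pong) buffering, where the transfer of the next tile overlaps the computation on the current one. The case study does not say which scheme was used, so the timing model below is only a sketch under that assumption, with two on-chip buffers and hypothetical cycle counts.

```python
def serial_cycles(dma, compute):
    """No overlap: every tile is loaded, then processed."""
    return sum(d + c for d, c in zip(dma, compute))

def double_buffered_cycles(dma, compute):
    """Two buffers: the load of tile i+1 can run while tile i is processed.

    A load may start once the previous load has finished and the buffer it
    targets has been released by the compute step two tiles earlier.
    """
    load_end, comp_end = [], []
    for i, (d, c) in enumerate(zip(dma, compute)):
        prev_load = load_end[i - 1] if i >= 1 else 0
        buffer_free = comp_end[i - 2] if i >= 2 else 0
        load_end.append(max(prev_load, buffer_free) + d)
        prev_comp = comp_end[i - 1] if i >= 1 else 0
        comp_end.append(max(load_end[i], prev_comp) + c)
    return comp_end[-1] if comp_end else 0

# Hypothetical per-tile costs in clock cycles.
dma = [100] * 5
compute = [400] * 5
serial = serial_cycles(dma, compute)
overlapped = double_buffered_cycles(dma, compute)
print(serial, overlapped)
```

In this model, when compute time per tile exceeds transfer time, only the first transfer remains exposed and the rest are hidden behind computation.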

CNN Model: Inception-V3 (Pre-Trained with ImageNet Dataset)

Convolutional Neural Network Architecture Details

Number of Convolution layers: 94
Number of Concatenation layers: 11
Number of Pooling layers: 14

Business Impact

MulticoreWare’s work enabled the customer to reach a processing speed of 30 FPS on 299x299x3 input images while keeping Top-1 and Top-5 accuracy close to that of the float model. This served as an excellent demonstration for the customer to showcase to their clients.

Memory Modeling - DDR latency [clock cycles]    FPS
100                                             30.42
0                                               31.09

Performance Achieved Based On Memory Modeling Type (With Tiling And DMA)
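As a quick reading of the figures above, the short calculation below converts the two FPS values into per-frame time and expresses the cost of the modeled DDR latency as a relative throughput drop. Only the two FPS values come from the table; the variable and function names are illustrative.

```python
# FPS figures from the memory modeling results above.
fps_no_latency = 31.09    # DDR latency modeled as 0 clock cycles
fps_with_latency = 30.42  # DDR latency modeled as 100 clock cycles

def frame_time_ms(fps):
    """Time spent on one frame, in milliseconds."""
    return 1000.0 / fps

extra_ms = frame_time_ms(fps_with_latency) - frame_time_ms(fps_no_latency)
drop_pct = 100.0 * (fps_no_latency - fps_with_latency) / fps_no_latency
print(f"{extra_ms:.2f} ms extra per frame, {drop_pct:.2f}% lower throughput")
```

The modeled 100-cycle DDR latency costs roughly 0.7 ms per frame, or about 2% of throughput, which is consistent with most of the transfer latency being hidden by tiling and DMA.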

Conclusion

This case study highlights MulticoreWare’s expertise in quantization and DSPs. For a more comprehensive understanding of our solutions and services, please contact us at info@multicorewareinc.com.
