MulticoreWare

Case Studies

Optimising CNN Model on Low Power Vision DSP

March 27, 2024

The Client

The customer, an IP company, specializes in vision-based DSPs utilized for Imaging, Computer Vision, and AI applications.

Challenge

The project aimed to run end-to-end inference of the Inception-V3 CNN image classification model on the customer’s Vision DSP.

Solution

The project used a range of tools and techniques, including C/C++, quantization, CNN inference, DSP intrinsics, DMA, and data tiling.

We identified an Inception-V3 floating-point model trained on the ImageNet dataset, with Top-1 and Top-5 accuracy of 74% and 91.62% respectively. We then quantized the float model to the INT8 data type using McW’s custom quantization algorithm and implemented an x86-based reference Inception-V3 pipeline for the INT8 data type.
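The custom quantization algorithm itself is not described here. As a stand-in, the sketch below shows the widely used symmetric per-tensor INT8 scheme, in which a real value is approximated by a scale multiplied by an 8-bit integer. The function names and the sample weights are illustrative assumptions, not part of the project.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: value is approximated by scale * q.

    q is clamped to [-127, 127]. This is a generic scheme used for
    illustration, not the custom algorithm mentioned in the case study.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map INT8 codes back to approximate real values."""
    return [scale * x for x in q]


# Hypothetical weights, chosen only to make the arithmetic easy to follow.
weights = [0.5, -1.27, 0.0, 0.64]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, scale)
```

With this scheme the reconstruction error of each unclamped value is at most half a quantization step, which is one reason Top-1 and Top-5 accuracy can stay close to the float model.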

Top-5 / Top-1 Classification Accuracy for Float vs. 8-Bit Quantized Graph

MulticoreWare hand-optimized various layers and operations of the Inception-V3 model for the Vision DSP, creating an end-to-end intrinsics-based pipeline whose accuracy matches that of the x86-based INT8 pipeline. Because Inception-V3 has many layers and the DSP has limited on-chip data memory, we designed and implemented DMA and data tiling algorithms to manage data transfer from external to on-chip memory efficiently.
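The tiling idea can be illustrated with a small planner that splits a convolution layer into row tiles sized to fit a memory budget. This is a minimal sketch under stated assumptions: stride 1, no padding, one byte per INT8 element, input and output tiles resident at the same time, and weights ignored. The layer shape and the 128 KiB budget are hypothetical; the case study does not state the DSP's memory size or its tiling strategy.

```python
def plan_row_tiles(height, width, in_ch, out_ch, kernel, budget_bytes):
    """Return row tiles as (in_start, in_stop, out_start, out_stop) tuples.

    Row ranges are half-open. Adjacent input tiles overlap by kernel - 1
    rows (the halo) so every output row sees its full receptive field.
    """
    out_height = height - kernel + 1
    out_width = width - kernel + 1

    def tile_bytes(out_rows):
        in_rows = out_rows + kernel - 1
        return in_rows * width * in_ch + out_rows * out_width * out_ch

    # Largest number of output rows whose input and output tiles both fit.
    rows = out_height
    while rows > 1 and tile_bytes(rows) > budget_bytes:
        rows -= 1
    if tile_bytes(rows) > budget_bytes:
        raise ValueError("even a single output row does not fit the budget")

    tiles = []
    for start in range(0, out_height, rows):
        stop = min(start + rows, out_height)
        tiles.append((start, stop + kernel - 1, start, stop))
    return tiles


# Hypothetical 35x35x192 INT8 input, 3x3 kernel, 64 output channels.
tiles = plan_row_tiles(35, 35, 192, 64, 3, 128 * 1024)
for tile in tiles:
    print(tile)
```

Each tuple describes one DMA transfer in (the input rows) and one transfer out (the output rows), so the list doubles as a transfer schedule for the layer.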

Custom Quantization Logic:

MulticoreWare’s solution featured custom quantization logic with minimal loss in Top-1 and Top-5 classification accuracy for the quantized model. We hand-optimized approximately 94 layers of the Inception-V3 model using DSP intrinsics, with measured performance close to theoretical estimates. Our team also tiled the input, output, and weight data and built an end-to-end optimized Inception-V3 pipeline that hides DMA data transfer latency.
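One common way to hide DMA latency is double (ping-pong) buffering: while the DSP computes on one tile, the DMA engine loads the next tile into a second buffer. The cycle model below is a simplified sketch of that idea, not the pipeline built in this project. It models input loads only, and the tile count and cycle figures are made-up values for illustration.

```python
def pipeline_cycles(num_tiles, dma_cycles, compute_cycles, double_buffered):
    """Estimate total cycles to process all tiles of one layer.

    Without double buffering, every load and compute step runs back to back.
    With it, only the first load is exposed; afterwards each step costs
    whichever is longer, computing tile i or loading tile i + 1.
    """
    if num_tiles == 0:
        return 0
    if not double_buffered:
        return num_tiles * (dma_cycles + compute_cycles)

    total = dma_cycles  # the first tile must arrive before compute can start
    for i in range(num_tiles):
        has_next = i + 1 < num_tiles
        total += max(compute_cycles, dma_cycles) if has_next else compute_cycles
    return total


# Hypothetical figures: 8 tiles, 400 cycles per load, 1000 cycles per compute.
serial = pipeline_cycles(8, 400, 1000, double_buffered=False)
overlapped = pipeline_cycles(8, 400, 1000, double_buffered=True)
print(serial, overlapped)
```

In this model, as long as computing a tile takes longer than loading the next one, the transfer time disappears from the total except for the first load.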

CNN Model: Inception-V3 (Pre-Trained with ImageNet Dataset)

Convolutional Neural Network Architecture Details

Number of Convolution layers: 94
Number of Concatenation layers: 11
Number of Pooling layers: 14

Business Impact

MulticoreWare’s efforts resulted in the customer achieving a processing speed of 30 FPS for input images sized at 299x299x3 while maintaining Top-1 and Top-5 accuracy levels similar to the float accuracy. This served as an excellent demonstration for the customer to showcase to their clients.

Memory Modeling - DDR latency [clock cycles]    FPS
100                                             30.42
0                                               31.09

Performance Achieved Based On Memory Modeling Type (With Tiling And DMA)
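The two FPS figures above can be converted into per-frame times to show how little a 100-cycle DDR latency costs once tiling and DMA are in place. The short calculation below uses only the reported numbers; the variable names are illustrative.

```python
# Reported throughput for the two memory models.
fps_zero_latency = 31.09  # DDR latency modeled as 0 clock cycles
fps_ddr_latency = 30.42   # DDR latency modeled as 100 clock cycles

# Time per frame in milliseconds for each case.
frame_ms_zero = 1000.0 / fps_zero_latency
frame_ms_ddr = 1000.0 / fps_ddr_latency

# Extra time per frame and relative throughput loss due to DDR latency.
extra_ms = frame_ms_ddr - frame_ms_zero
slowdown_pct = 100.0 * (fps_zero_latency - fps_ddr_latency) / fps_zero_latency

print(f"{frame_ms_zero:.2f} ms vs {frame_ms_ddr:.2f} ms per frame")
print(f"extra {extra_ms:.2f} ms per frame, {slowdown_pct:.1f}% lower FPS")
```

The difference is about 0.7 ms per frame, roughly a 2% drop in throughput, which is consistent with most of the transfer latency being hidden.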

Conclusion

This case study highlights MulticoreWare’s expertise in Quantization and DSPs. For a more comprehensive understanding of our solutions and services, please contact us at info@multicorewareinc.com
