Optimising a CNN Model on a Low-Power Vision DSP

March 27, 2024

The Client

The customer, an IP company, specializes in vision-based DSPs utilized for Imaging, Computer Vision, and AI applications.

Challenge

The project aimed to run end-to-end inference of the Inception-V3 CNN image classification model on the customer’s Vision DSP.

Solution

The project used a range of tools and techniques, including C/C++, quantization, CNN inference, DSP intrinsics, DMA, and data tiling.

We first selected an Inception-V3 floating-point model trained on the ImageNet dataset, which achieves Top-1 and Top-5 accuracies of 74% and 91.62% respectively. We then quantized the float model to the INT8 data type using MulticoreWare’s (McW’s) custom quantization algorithm. Subsequently, we implemented an x86-based reference Inception-V3 pipeline for the INT8 data type.
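The custom quantization algorithm itself is not described in this case study. As a rough illustration of what float-to-INT8 conversion involves, the sketch below uses plain symmetric per-tensor quantization, which is a generic textbook scheme and an assumption here, not MulticoreWare’s method. The function names and the sample weights are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: real ~= scale * q, with q in [-127, 127].

    Generic scheme for illustration only; not the custom algorithm
    referred to in the case study.
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map INT8 codes back to approximate real values."""
    return [scale * x for x in q]


# Hypothetical weights, chosen only to make the arithmetic easy to follow.
weights = [0.50, -1.27, 0.003, 1.00]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

With this scheme the rounding error per value is bounded by half the scale, which is why small weights such as 0.003 collapse to zero while the overall tensor stays close to the original.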

Top-5 / Top-1 Classification Accuracy for Float vs. 8-Bit Quantized Graph
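For readers unfamiliar with the metric, Top-k accuracy counts a prediction as correct when the true label is among the k highest-scoring classes. The sketch below shows that standard definition on a tiny made-up example; the scores and labels are hypothetical and unrelated to the ImageNet results quoted above.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


# Three hypothetical samples over four classes.
scores = [
    [0.1, 0.6, 0.2, 0.1],
    [0.5, 0.3, 0.1, 0.1],
    [0.2, 0.2, 0.5, 0.1],
]
labels = [1, 1, 3]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
print(top1, top2)
```

Running the same function on the float and the quantized model outputs, with k set to 1 and 5, is one way to produce a float vs. INT8 comparison of this kind.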

MulticoreWare hand-optimized various layers and operations of the Inception-V3 model for the Vision DSP, creating an end-to-end intrinsics-based pipeline that matches the accuracy of the x86-based INT8 pipeline. Because Inception-V3 has many layers and the DSP has limited on-chip data memory, we designed and implemented DMA and data tiling algorithms to move data efficiently from external memory to on-chip memory.
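The tiling scheme used in the project is not detailed here. A minimal sketch of the general idea, assuming row-band tiling of an INT8 input for a stride-1 "valid" KxK convolution, could look like the following. The 64 KiB on-chip budget and the function name are hypothetical, and a real planner would also budget for weights and output buffers.

```python
def plan_row_tiles(height, width, channels, kernel, budget_bytes):
    """Split an INT8 HxWxC input into row bands that fit in budget_bytes.

    Each band includes kernel-1 halo rows so a stride-1 'valid' KxK
    convolution can produce its output rows from on-chip data alone.
    Returns a list of (in_start, in_end, out_start, out_end) row ranges,
    with end indices exclusive.
    """
    row_bytes = width * channels  # one byte per INT8 element
    max_rows = budget_bytes // row_bytes
    out_rows_per_tile = max_rows - (kernel - 1)
    if out_rows_per_tile < 1:
        raise ValueError("budget too small for even one output row")
    out_height = height - kernel + 1
    tiles = []
    for out_start in range(0, out_height, out_rows_per_tile):
        out_end = min(out_start + out_rows_per_tile, out_height)
        tiles.append((out_start, out_end + kernel - 1, out_start, out_end))
    return tiles


# 299x299x3 input as in the case study; the 3x3 kernel and the
# 64 KiB budget are assumptions for illustration.
tiles = plan_row_tiles(299, 299, 3, kernel=3, budget_bytes=64 * 1024)
for tile in tiles:
    print(tile)
```

Each tuple describes one DMA transfer: the input rows to fetch from external memory and the output rows that can then be computed entirely from on-chip data.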

Custom Quantization Logic:

MulticoreWare’s solution featured custom quantization logic with minimal loss in Top-1 and Top-5 classification accuracy for the quantized model. We hand-optimized approximately 94 layers of the Inception-V3 model using DSP intrinsics, with performance closely matching theoretical estimates. Additionally, our team tiled the input, output, and weight data and constructed an end-to-end optimized Inception-V3 pipeline that effectively hides DMA data transfer latency.
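One common way to hide DMA latency is double buffering (ping-pong buffers): while the DSP computes on one tile, the DMA engine fetches the next. The case study does not say which scheme was used, so the following is only a simplified cycle-count model under that assumption. The cycle numbers are hypothetical and output write-back is ignored.

```python
def serial_cycles(dma, compute):
    """Every tile is fetched, then processed, with no overlap."""
    return sum(dma) + sum(compute)


def double_buffered_cycles(dma, compute):
    """Fetch of tile i+1 overlaps with compute on tile i.

    Only the first fetch is fully exposed; afterwards each step costs
    whichever is longer, the compute or the next fetch.
    """
    total = dma[0]
    for i in range(len(compute)):
        next_dma = dma[i + 1] if i + 1 < len(dma) else 0
        total += max(compute[i], next_dma)
    return total


# Hypothetical per-tile costs in clock cycles.
dma = [100, 100, 100, 100]
compute = [400, 400, 400, 400]

print(serial_cycles(dma, compute))           # no overlap
print(double_buffered_cycles(dma, compute))  # overlap after the first fetch
```

In this model the transfer cost disappears whenever compute per tile takes at least as long as the fetch of the next tile, which is the condition a tiling plan would try to satisfy.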

CNN Model: Inception-V3 (Pre-Trained with ImageNet Dataset)

Convolutional Neural Network Architecture Details
Number of Convolution layers	94
Number of Concatenation layers	11
Number of Pooling layers	14

Business Impact

MulticoreWare’s work enabled the customer to reach a processing speed of about 30 FPS for 299x299x3 input images while keeping Top-1 and Top-5 accuracy close to that of the float model. The result served as a demonstration the customer could showcase to their own clients.

Memory Modeling - DDR latency [clock cycles]	FPS
100	30.42
0	31.09

Performance Achieved Based On Memory Modeling Type (With Tiling And DMA)
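The two FPS figures in the table can be turned into per-frame times to see how much the modeled DDR latency costs. The short calculation below uses only the table values. Reading the gap as memory latency that tiling and DMA did not hide is an interpretation, not something the case study states.

```python
# FPS values taken from the table above.
fps_latency_100 = 30.42  # modeled DDR latency of 100 clock cycles
fps_latency_0 = 31.09    # modeled DDR latency of 0 clock cycles

frame_ms_100 = 1000.0 / fps_latency_100
frame_ms_0 = 1000.0 / fps_latency_0

extra_ms = frame_ms_100 - frame_ms_0
slowdown_pct = 100.0 * (fps_latency_0 - fps_latency_100) / fps_latency_0

print(f"frame time at latency 100: {frame_ms_100:.2f} ms")
print(f"frame time at latency 0:   {frame_ms_0:.2f} ms")
print(f"extra time per frame:      {extra_ms:.2f} ms")
print(f"throughput loss:           {slowdown_pct:.2f} %")
```

The difference works out to roughly 0.7 ms per frame, or about 2% of throughput, so the pipeline's frame rate is nearly independent of the modeled memory latency.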

Conclusion

This case study highlights MulticoreWare’s expertise in quantization and DSPs. For a more comprehensive understanding of our solutions and services, please contact us at info@multicorewareinc.com.
