MulticoreWare

Enabling PyTorch 2.0 Models on Next-Gen AI Accelerator

April 9, 2025

Client

The customer is a next-generation computing company specializing in AI hardware. Their mission is to provide cost-effective, scalable computing systems optimized for AI workloads. Their hardware ecosystem natively supports PyTorch and ONNX, enabling researchers and developers to deploy AI models with minimal friction.

Challenge

With the release of PyTorch 2.0, the customer sought to enable seamless execution of AI models on their custom AI accelerator while maintaining compatibility with standard PyTorch workflows. The main challenges were:

  1. Lack of Native Support for Custom Hardware:
    • PyTorch 2.0 introduced torch.compile, which requires a backend that efficiently maps computations to hardware.
    • The customer’s AI accelerator did not have a dedicated PyTorch backend, making model execution inefficient or infeasible.
  2. Bridging PyTorch’s Computational Graph with Custom Hardware:
    • PyTorch 2.0 models are expressed in ATen IR before being lowered to hardware-specific execution.
    • The customer needed to implement ATen IR transformations to map PyTorch operations to their hardware’s operations.
  3. Minimizing Code Changes for Developers:
    • AI researchers and developers prefer out-of-the-box compatibility with PyTorch.
    • The goal was to allow execution on the AI accelerator without requiring developers to rewrite or refactor models significantly.

Solution

To enable seamless execution of PyTorch 2.0 models on the customer’s AI accelerator, a custom PyTorch backend was designed and implemented. This ensured that models could run efficiently on the hardware with minimal code modifications while maintaining PyTorch’s ease of use. The solution comprised the following key components:

1. Custom Backend
A dedicated PyTorch backend was developed to seamlessly integrate with torch.compile and translate PyTorch operations into hardware-specific operators. This backend enabled efficient model execution by leveraging the accelerator’s computational capabilities while maintaining compatibility with PyTorch’s standard workflows.
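
The shape of such a backend can be sketched with the public torch.compile interface, which accepts any callable that takes a captured torch.fx.GraphModule plus example inputs and returns a callable. The sketch below is illustrative only: the names accelerator_backend and captured_ops are hypothetical, and the backend simply records the ops it sees and runs the graph unchanged, where a real backend would lower them to device operators.

```python
from typing import Callable, List

import torch
import torch.fx
import torch.nn as nn

# Hypothetical log of the ops handed to the backend.
captured_ops: List[str] = []


def accelerator_backend(
    gm: torch.fx.GraphModule, example_inputs: List[torch.Tensor]
) -> Callable:
    """Toy torch.compile backend: inspect the captured graph, then run it as-is."""
    for node in gm.graph.nodes:
        if node.op in ("call_function", "call_method", "call_module"):
            captured_ops.append(str(node.target))
    # Assumption: a real backend would return a callable that dispatches
    # to accelerator kernels instead of the unmodified graph.
    return gm.forward


model = nn.Sequential(nn.Linear(8, 4), nn.ReLU())
compiled = torch.compile(model, backend=accelerator_backend)

x = torch.randn(2, 8)
reference = model(x)   # eager PyTorch result
result = compiled(x)   # result routed through the custom backend
```

The model code itself is untouched; only the torch.compile call names the backend, which is what keeps developer-side changes small.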

2. ATen IR Transformation Passes
A range of ATen ops were processed and lowered to ops in the customer’s operator library. The transformation passes add support for ops missing from that library across several categories, including unary, binary, and reduction ops. ATen ops such as LayerNorm, AvgPool, Softmax, expand, squeeze, and slice were lowered in these passes.
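
As an illustration of how an op that is absent from a device library can be expressed through unary, binary, and reduction primitives, the sketch below decomposes a 1-D softmax. The dev_* functions are hypothetical stand-ins for device operators, written in plain Python; the customer's actual operator library is not described in this case study.

```python
import math
from typing import List

# Hypothetical device primitives, one or two per op category.
def dev_exp(xs: List[float]) -> List[float]:                    # unary
    return [math.exp(x) for x in xs]

def dev_sub_scalar(xs: List[float], s: float) -> List[float]:   # binary
    return [x - s for x in xs]

def dev_div_scalar(xs: List[float], s: float) -> List[float]:   # binary
    return [x / s for x in xs]

def dev_reduce_max(xs: List[float]) -> float:                   # reduction
    return max(xs)

def dev_reduce_sum(xs: List[float]) -> float:                   # reduction
    return sum(xs)


def lowered_softmax(xs: List[float]) -> List[float]:
    """Softmax built only from the primitives above."""
    shifted = dev_sub_scalar(xs, dev_reduce_max(xs))  # subtract max for stability
    exps = dev_exp(shifted)
    return dev_div_scalar(exps, dev_reduce_sum(exps))


probs = lowered_softmax([1.0, 2.0, 3.0])
```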

3. Data Movement Passes
Several optimization passes were implemented, including data movement to and from the device, constant folding, memory usage analysis, and eviction strategies, each designed to improve execution efficiency and resource utilization on the target hardware.
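
Of these passes, constant folding is the simplest to illustrate. The sketch below uses a toy graph representation (the Node class and fold_constants function are hypothetical, not the backend's actual data structures) and replaces any operation whose inputs are all compile-time constants with its precomputed value, so the work is not repeated on the device at run time.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Node:
    name: str
    op: str                          # "input", "const", or "call"
    fn: Optional[Callable] = None    # function applied by "call" nodes
    args: Tuple[str, ...] = ()       # names of input nodes
    value: Optional[float] = None    # payload of "const" nodes


def fold_constants(nodes: List[Node]) -> List[Node]:
    """Fold call nodes whose inputs are all constants (nodes in topological order)."""
    consts: Dict[str, float] = {}
    folded: List[Node] = []
    for node in nodes:
        if node.op == "const":
            consts[node.name] = node.value
            folded.append(node)
        elif node.op == "call" and all(a in consts for a in node.args):
            value = node.fn(*(consts[a] for a in node.args))
            consts[node.name] = value
            folded.append(Node(node.name, "const", value=value))
        else:
            folded.append(node)  # depends on a runtime input; keep as-is
    return folded


graph = [
    Node("x", "input"),
    Node("two", "const", value=2.0),
    Node("three", "const", value=3.0),
    Node("scale", "call", fn=operator.mul, args=("two", "three")),
    Node("y", "call", fn=operator.mul, args=("x", "scale")),
]
folded = fold_constants(graph)
```

Here the node named scale becomes a constant, while y stays a call because it depends on the runtime input x.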

4. Validation & Accuracy Testing
To verify correctness, transformed operations were validated against PyTorch’s reference implementations. The PCC metric was used to measure numerical accuracy, confirming minimal deviation from expected results.
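
A minimal sketch of such a check follows, assuming PCC refers to the Pearson correlation coefficient between the flattened reference output and the device output. The sample values and the 0.99 pass threshold are hypothetical; the case study does not state the threshold used.

```python
import math
from typing import Sequence


def pcc(expected: Sequence[float], actual: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(expected) != len(actual) or len(expected) < 2:
        raise ValueError("inputs must have the same length, at least 2")
    n = len(expected)
    mean_e = sum(expected) / n
    mean_a = sum(actual) / n
    cov = sum((e - mean_e) * (a - mean_a) for e, a in zip(expected, actual))
    var_e = sum((e - mean_e) ** 2 for e in expected)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_e == 0 or var_a == 0:
        raise ValueError("PCC is undefined for constant inputs")
    return cov / math.sqrt(var_e * var_a)


reference = [0.10, 0.25, 0.40, 0.90]    # e.g. flattened PyTorch output
device_out = [0.11, 0.24, 0.41, 0.89]   # e.g. flattened accelerator output
score = pcc(reference, device_out)
passed = score >= 0.99                  # hypothetical pass threshold
```

A score close to 1.0 indicates that the device output tracks the reference closely. Because PCC is insensitive to a uniform scale or offset, a check like this is often paired with an absolute or relative tolerance comparison.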

5. Runtime Testing with Real-World Models
To evaluate real-world performance, the backend was tested using:

  • Torchvision and other vision models such as ResNet, MobileNet, and YOLO, for vision-based inference.
  • Large language models such as GPT and BLOOM, to assess execution efficiency for NLP tasks.
  • End-to-end inference benchmarks to validate performance and correctness.

This comprehensive testing approach ensured that the backend was robust, performant, and production-ready for deployment in AI applications.

Solution Overview

To enable seamless execution of PyTorch 2.0 models on the customer’s AI accelerator, key improvements were made in operator support, backend integration, and real-world validation.

Support for 50+ PyTorch Core ATen IR Ops

Transformation and execution logic was implemented for more than 50 commonly used PyTorch operations, ensuring broad model compatibility with minimal intervention. These fall into the following categories:

  • Mathematical & Activation Functions (ReLU, Softmax, Log, Exp)
  • Normalization & Pooling (LayerNorm, BatchNorm, AvgPool)
  • Tensor Manipulation & Reduction Ops (Reshape, Transpose, Sum, Mean)

Seamless Integration with torch.compile API and Model Benchmarking

Enabled developers to use PyTorch’s torch.compile() API for effortless deployment.

  • No manual modifications were required, allowing models to be optimized automatically for the AI accelerator.
  • Performance and correctness of the implementation were validated using several categories of models, including ResNet, MobileNet, VGG, YOLO, Faster R-CNN, BERT, GPT, and Falcon.
  • Each model was tested for accuracy, latency, and stability, ensuring reliable execution.

Business Impact

  • Faster AI Model Deployment – Enables seamless execution of PyTorch models on the customer’s AI accelerator, significantly reducing time-to-market for AI applications.
  • Increased Developer Adoption – Eliminates complex integration efforts, making it easier for AI developers to leverage the hardware, driving ecosystem growth.
  • Stronger Competitive Positioning – Enhances the value proposition of the AI accelerator, offering effortless PyTorch support and positioning it as a strong alternative to existing AI computing platforms.

Conclusion

By integrating a custom PyTorch 2.0 backend with their AI accelerator, the customer has significantly improved the accessibility and usability of their hardware. This effort not only enhances execution efficiency but also positions their AI ecosystem as an attractive solution for AI researchers, enterprises, and developers looking for scalable, high-performance AI computing solutions. The seamless onboarding experience ensures that more AI practitioners can leverage the customer’s hardware without needing to modify their existing PyTorch workflows, ultimately driving greater adoption and ecosystem expansion.

MulticoreWare demonstrated expertise in compilers, Python, PyTorch, and AI accelerators, and our comprehensive approach ensured performance parity and stability. To learn more about our expertise or to discover how we can help your organization achieve innovative and high-performance results, please contact info@multicorewareinc.com.
