MulticoreWare

AI & ROBOTICS

Deploying Vision Language Action (VLA) based AI Models in Robotics: Optimization for Real-Time Edge Inference

June 26, 2025

Selventhiran Rengaraj is an Associate Technical Project Manager in the Mobility & Transportation Business Unit at MulticoreWare. He has hands-on experience in developing Robotics Stacks for Ground & Underwater Robots and is working on cutting-edge AI and ADAS Perception Stack Optimization on leading Automotive Semiconductor platforms.

Introduction: VLA Models in Robotics – The Shift to Multi-Modality

The robotics industry is amid a major paradigm shift, driven by the emergence of foundation models: large-scale, multi-modal AI systems trained to understand vision, language, and action in a unified framework. Models like Google’s RT-2, which translates web-scale vision-language knowledge into robotic actions, and PaLM-E, an embodied language model capable of reasoning across diverse sensor inputs and tasks, are setting new benchmarks for generalization and task versatility. 

However, these models come with a significant trade-off: their size and compute demands make them impractical for real-time deployment on resource-constrained edge platforms and cost-constrained robots. This opens an opportunity for models like CogACT, which strike a balance between multi-modal reasoning and architectural efficiency. In this blog, we dive into the CogACT architecture and the edge optimization of such VLA-style robot models.

CogACT: A General-Purpose Robotic Intelligence Stack

CogACT is a next-generation, large-scale multi-modal model designed to power general-purpose robotic autonomy. It is built around three core modules (Vision, Language, and Action) that work together to perceive, reason, and act in real-world environments. With a total of ~7.6 billion parameters, CogACT brings the scale and generalization of foundation models into robotics, without compromising on task-level execution.

How It Works

Vision Module

Built on high-capacity transformers like DINOv2 and SigLIP, this module processes raw images into perceptual tokens. Trained on large-scale datasets, it captures both spatial layouts and object-level semantics with high fidelity.

Language Module

Powered by the LLaMA-2 large language model (LLM), this module blends visual context with language instructions to understand goals, reason through intent, and ground actions in the environment. It also enables flexible task execution by adapting to diverse natural language prompts, from simple object manipulation to more complex sequential tasks.

Action Module

To generate smooth, multi-step actions, CogACT uses a Diffusion Transformer. It translates cognitive features into temporally consistent motion commands, enabling complex, real-world tasks like grasping, placing, or navigating.
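
The three modules described above can be pictured as a simple data flow: image to visual tokens, tokens plus instruction to a single cognition feature, and cognition feature to a short chunk of future actions. The sketch below is only a toy illustration of that flow in NumPy. Every size, function name, and the fixed-target "denoising" update are assumptions made for readability; a real diffusion action head uses a learned network to predict noise, and none of this reflects CogACT's actual dimensions or code.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are illustrative assumptions, not CogACT's real dimensions.
NUM_PATCHES, TOKEN_DIM = 16, 32   # visual tokens per image, feature width
HORIZON, ACTION_DIM = 4, 7        # future steps, e.g. 6-DoF pose delta + gripper
DENOISE_STEPS = 8

def vision_module(image):
    """Stand-in for the vision encoders: split the image into patches and project."""
    patches = image.reshape(NUM_PATCHES, -1)
    proj = rng.standard_normal((patches.shape[1], TOKEN_DIM)) / np.sqrt(patches.shape[1])
    return patches @ proj                      # (NUM_PATCHES, TOKEN_DIM)

def language_module(visual_tokens, instruction):
    """Stand-in for the LLM: fuse visual and text tokens into one cognition feature."""
    text_tokens = np.stack([np.full(TOKEN_DIM, len(w) / 10.0) for w in instruction.split()])
    fused = np.concatenate([visual_tokens, text_tokens], axis=0)
    return fused.mean(axis=0)                  # (TOKEN_DIM,)

def action_module(cognition):
    """Stand-in for the diffusion action head: refine noise into an action chunk."""
    w = rng.standard_normal((TOKEN_DIM, HORIZON * ACTION_DIM)) / np.sqrt(TOKEN_DIM)
    target = np.tanh(cognition @ w).reshape(HORIZON, ACTION_DIM)
    actions = rng.standard_normal((HORIZON, ACTION_DIM))   # start from pure noise
    for _ in range(DENOISE_STEPS):
        actions = actions + 0.5 * (target - actions)        # toy "denoising" update
    return actions

image = rng.random((64, 64, 3))
tokens = vision_module(image)
cognition = language_module(tokens, "Move the Pepsi can near the orange")
actions = action_module(cognition)
print(tokens.shape, cognition.shape, actions.shape)
```

The point of the sketch is the interface between stages: the action head never sees pixels or text directly, only the compact feature produced by the language module, and it emits several future steps at once rather than a single command.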

A Real-World Example

  • Give CogACT an image of a cluttered tabletop and the instruction: “Move the Pepsi can near the orange”.
  • It will recognize the objects, reason through the instruction, plan a collision-free path, and output a sequence of actions for a robotic arm to physically move the can next to the orange.

[Figure: An overview of the CogACT model]

CogACT has already been used in robotics scenarios like mobile manipulation, indoor navigation, warehouse automation, and multi-agent collaboration: tasks that require high-level understanding and fine-grained action control. Its architecture enables robots to act with context, intent, and temporal coherence.

But with this level of intelligence comes significant computational overhead, making edge deployment a real challenge. That’s where our work at MulticoreWare comes in: transforming powerful but heavy models like CogACT into edge-ready systems without compromising their core capabilities.

Our Approach

CogACT, with its 7.6B parameters and multi-stream architecture, presented a unique challenge. Using a combination of optimization techniques, including quantization, pruning, and model graph tuning, we significantly reduced its inference time. The optimized model runs about 1.3× faster, which translates to a latency reduction of around 26%, all while preserving the model’s original accuracy and behaviour.
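
For readers who want to relate a speedup factor to a latency reduction, the conversion is simple arithmetic: if a model runs s times faster, each inference takes 1/s of the original time, so latency falls by 1 - 1/s. The small helper below is our own illustration (the function names are not from any library). An exact 1.3× speedup works out to about 23%, and a 26% reduction corresponds to about 1.35×, so the two rounded figures quoted above appear to agree up to rounding.

```python
def latency_reduction(speedup: float) -> float:
    """Fractional latency reduction implied by a speedup factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup

def speedup_from_reduction(reduction: float) -> float:
    """Speedup factor implied by a fractional latency reduction (0 <= r < 1)."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)

print(f"{latency_reduction(1.3):.1%}")          # 23.1%
print(f"{speedup_from_reduction(0.26):.2f}x")   # 1.35x
```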

[Figure: Results of the original CogACT model]
[Figure: Results of the MulticoreWare-optimized CogACT model (1.3× faster)]

We successfully deployed the optimized model on real-world edge platforms, proving that even foundation-scale robotics models like CogACT can be made efficient and practical for on-device execution.
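
Comparing an original and an optimized model on a device comes down to measuring per-inference latency in a repeatable way. The sketch below shows one common pattern using only the Python standard library: warm up first, then report the median and a tail percentile rather than a single run. The `benchmark` helper and the `dummy_inference` workload are hypothetical stand-ins, not the harness used for the results above.

```python
import statistics
import time

def benchmark(fn, warmup=3, runs=20):
    """Time fn() and return median and p95 latency in milliseconds."""
    for _ in range(warmup):
        fn()                                   # warm caches before measuring
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {"median_ms": statistics.median(samples), "p95_ms": samples[p95_index]}

def dummy_inference():
    # Hypothetical stand-in for one forward pass of a model.
    return sum(i * i for i in range(10_000))

stats = benchmark(dummy_inference)
print(stats)
```

For robotics workloads the tail latency usually matters as much as the median, because a single slow inference can stall a control loop.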

Applications of VLA Models: Why Optimization Matters

VLA models like CogACT are ushering in a new era of robotic intelligence by enabling machines to understand and act upon complex, high-level instructions. Their potential applications span a wide array of real-world domains:

Warehouse Automation

Robots can understand flexible commands like “Stack all the red boxes near the loading bay,” and figure out object types, spatial relationships, and task sequences on the fly.

Healthcare Robotics

In hospitals or elder care settings, robots powered by VLA models can safely follow spoken instructions, navigate through crowded spaces, and assist with simple fetch-and-carry tasks.

Household Assistance

Whether it’s tidying up or following multi-step instructions like “Put the dishes in the sink and wipe the counter,” VLA-based robots make it easier for humans to interact with machines in a natural way.

Multi-Agent Collaboration

In environments where several robots need to work together, such as drone fleets or warehouse bots, a shared understanding of language and vision helps improve coordination, efficiency, and safety.

But while these models promise general-purpose autonomy, deploying them in the field, especially on low-power, mobile, or real-time systems, requires overcoming steep computational challenges. That’s why optimization is not just beneficial, but essential. Edge optimization ensures:

  • Fast, real-time responses to dynamic environments.
  • Energy efficiency for mobile or battery-powered robots.
  • Compliance with strict memory and compute limits on embedded platforms.
  • Reliable performance for safety-critical tasks.
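
To make the memory constraint concrete, here is a back-of-the-envelope estimate of how much storage the weights of a ~7.6 billion parameter model need at different numeric precisions. This is plain arithmetic (parameters × bytes per parameter); it ignores activations, caches, and runtime overhead, and the dictionary of precisions is our own illustrative choice.

```python
PARAMS = 7.6e9  # approximate parameter count quoted for the model

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Weight storage in gigabytes (10^9 bytes), ignoring activations and caches."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(PARAMS, dtype):.1f} GB")
# fp32: 30.4 GB
# fp16: 15.2 GB
# int8: 7.6 GB
# int4: 3.8 GB
```

Even at 16-bit precision the weights alone exceed the memory of many embedded modules, which is why lower-precision formats are a natural starting point for edge deployment.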

By optimizing VLA models like CogACT, we bridge the gap between foundational intelligence and deployable autonomy, bringing sophisticated reasoning to practical robotics applications, from warehouses to wheels to underwater exploration.

Our Expertise in AI-Powered Edge Solutions

  • Proficiency Across 150+ SOTA AI Models: Custom optimization for CPUs, GPUs, DSPs, NPUs, and low-power edge AI SoCs across modalities.
  • Edge-First BEV (Bird’s Eye View) Algorithms for Diverse Mobility Systems: Tailored BEV pipelines for micro-mobility, two-wheeled, and four-wheeled vehicles, as well as AMR/AGV-based mobile robotics platforms.
  • End-to-End Robotics Perception Stack Development: Experience building modular perception systems including object detection, depth estimation, semantic segmentation, and sensor fusion tailored for robotics use cases.
  • Expertise in BEV & Vision Transformers: Optimized models such as BEVFormer, BEV-SegFormer, and Lift-Splat-Shoot (LSS) for automotive AI accelerators.
  • Advanced Quantization of Fusion Models: INT8 quantization of camera+LiDAR models like DeepFusion, BEV-Det, and DeepInteraction without sacrificing accuracy.
  • SLAM, Mapping & Navigation Algorithms: In-house expertise in visual-inertial SLAM, 3D mapping, and real-time navigation for autonomous robotic systems in GPS-denied and dynamic environments.
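
As a generic illustration of what INT8 quantization does to a weight tensor, the sketch below implements symmetric per-tensor quantization in NumPy: every float is mapped to an 8-bit integer through a single scale factor, and mapped back at inference time. This is a textbook scheme shown under simplifying assumptions, not the calibration or per-channel method used in our production pipelines, and the function names are our own.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: float32 -> (int8 values, scale)."""
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 values back to approximate float32 weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
weights = rng.standard_normal((256, 256)).astype(np.float32)
q, scale = quantize_int8(weights)
error = np.max(np.abs(weights - dequantize(q, scale)))

# Storage drops to a quarter, and rounding error stays within half a step.
print(q.dtype, q.nbytes / weights.nbytes, error <= scale / 2 + 1e-6)
```

The worst-case rounding error is half a quantization step, so tensors with a few large outliers lose more precision under a single scale. That is one reason practical pipelines often use per-channel scales or calibration data.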

Conclusion: From Intelligence to On-Device Autonomy

Our method is optimized for hardware efficiency, prioritizes low latency, and emphasizes high accuracy. It is designed to enable real-world deployment in industries such as autonomous vehicles, warehouse robotics, last-mile delivery, smart infrastructure, and more. At MulticoreWare, we leverage our specialized expertise to enhance and accelerate AI solutions, tailored to the specific demands of your unique use cases. To learn more about how we are building efficient AI solutions, write to us at info@multicorewareinc.com.

How can we help you?

Optimization Expertise

We specialize in hand-optimizing complex, multi-modal AI models for real-time performance on edge hardware. With deep experience across a wide range of SoCs, our team is skilled in tailoring large models to meet the strict latency, power, and memory constraints of embedded platforms without compromising accuracy.
