Agentic AI for RAN Observability, Explainability and Orchestration

April 9, 2026

Customer

A global telecommunications and network infrastructure company that provides advanced software, hardware, and services for building, managing, and optimizing large-scale telecom and enterprise networks. Its solutions leverage AI, automation, and end-to-end visibility to help operators enhance performance, ensure reliability, and efficiently manage complex, multi-domain network environments.

Problem Statement

Radio Access Networks (RANs) are the foundational layer of modern cellular infrastructure, responsible for connecting millions of end-user devices to the core network through a distributed fabric of Radio Units (RUs), Distributed Units (DUs), and Centralized Units (CUs). As RANs evolve into highly distributed and dynamic systems, traditional monitoring dashboards and AI-assisted tools still require significant human involvement.

Operators must continuously observe telemetry streams, manually interpret large volumes of metrics, and determine corrective actions whenever anomalies occur. This reactive approach slows incident resolution and increases operational complexity.

The customer’s existing monitoring framework lacked the intelligence required to autonomously respond to operational issues. Although the platform could collect telemetry from different layers of multiple network elements, it could not effectively correlate events, identify root causes, or orchestrate remediation workflows without human intervention. There was also no automated fault management workflow, and the monitoring pipeline offered neither traceability nor a basis for trusting its outputs.

The platform faced several operational limitations:

  • Limited ability to accurately identify root causes across distributed RAN environments
  • Significant manual effort required to interpret raw telemetry metrics, logs, and event signals
  • Lack of proactive insights to predict service degradation before outages occur
  • Inability to automatically remediate issues without human intervention
  • Limited traceability, observability, and explainability across the monitoring workflow

To address these challenges, the customer required an intelligent, scalable framework capable of continuously analyzing network telemetry, correlating events, and autonomously orchestrating remediation workflows.

Solution Overview

MulticoreWare designed and implemented an Agentic AI–driven framework to enhance RAN observability, explainability, and automated orchestration. The solution integrates real-time telemetry, including performance metrics, logs, and event signals from RAN elements, and processes it through a coordinated ecosystem of AI agents.

The architecture is built on Agent-to-Agent (A2A) communication and the Model Context Protocol (MCP), enabling structured context sharing and coordinated reasoning across agents. Telemetry data is continuously ingested, normalized, and analyzed in near real time to detect anomalies and evaluate network state.

In a large-scale network, most issues surface as raw, unclustered alarms. Resolving them typically requires either direct intervention by a human operator or a Human-in-the-Loop (HITL) flow, in which an operator must review, adjust, and approve the automation's suggestion for every decision or alarm. This process can take anywhere from minutes to hours, depending on the scale of the issue. This solution is designed to close that gap.

The solution addresses two primary RAN fault classes, PCI Collision and RU Failure, through dedicated ML-based detection modules. For RU health monitoring, the system continuously tracks a comprehensive set of hardware and fronthaul signals:

  • Fronthaul link status
  • PA Temperature
  • Fronthaul latency
  • PTP Lock Status
  • VSWR
  • Optical RX Power
  • GPS Lock Status
  • Transmit Power level
  • Output Power per Sector
  • Critical Alarm active
  • Input Voltage level
  • Active UE Count
  • RU availability
  • Fronthaul Packet loss rate

For PCI collision detection, RF KPIs including RSRP, RSRQ, SINR, Cell ID, EARFCN, and intra-frequency handover success rates are ingested per cell. Rolling statistics are then computed to identify cells under active collision impact, based on topology overlap and simultaneous degradation across all four signal-quality dimensions (RSRP, RSRQ, SINR, and handover success rate).
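The rolling-statistics check described above can be illustrated with a minimal Python sketch. It is not the deployed detection module: the `CellSample` fields, the `FLOORS` thresholds, the window size, and the `overlaps` topology map are all assumptions chosen for illustration, and the source's ML-based detector is replaced here by simple threshold rules.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class CellSample:
    cell_id: str
    pci: int
    earfcn: int
    rsrp_dbm: float
    rsrq_db: float
    sinr_db: float
    ho_success_rate: float  # fraction between 0.0 and 1.0


# Hypothetical degradation floors; real values would come from the operator.
FLOORS = {"rsrp_dbm": -110.0, "rsrq_db": -15.0, "sinr_db": 3.0, "ho_success_rate": 0.90}


class RollingCellStats:
    """Keeps the last `window` samples per cell and exposes rolling means."""

    def __init__(self, window: int = 5) -> None:
        self._history = defaultdict(lambda: deque(maxlen=window))

    def add(self, sample: CellSample) -> None:
        self._history[sample.cell_id].append(sample)

    def latest(self, cell_id: str):
        samples = self._history.get(cell_id)
        return samples[-1] if samples else None

    def degraded(self, cell_id: str) -> bool:
        """True only if the rolling mean of every KPI is below its floor."""
        samples = self._history.get(cell_id)
        if not samples:
            return False
        return all(
            mean(getattr(s, kpi) for s in samples) < floor
            for kpi, floor in FLOORS.items()
        )


def collision_candidates(stats, overlaps):
    """Return cell pairs that overlap, share PCI and EARFCN, and are both degraded."""
    pairs = set()
    for cell, neighbours in overlaps.items():
        for other in neighbours:
            a, b = stats.latest(cell), stats.latest(other)
            if a is None or b is None:
                continue
            same_id = a.pci == b.pci and a.earfcn == b.earfcn
            if same_id and stats.degraded(cell) and stats.degraded(other):
                pairs.add(frozenset((cell, other)))
    return pairs
```

Requiring all four KPIs to degrade at once, on both overlapping cells, is what separates a likely collision from ordinary coverage problems that affect a single metric or a single cell.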

A total of 11 specialized AI agents were developed, each responsible for distinct functions such as anomaly detection, validation, contextual reasoning, and remediation. A central orchestrator agent manages workflow execution, maintains context across agents, and ensures synchronized decision-making.
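As a rough illustration of how agents might hand work to one another while carrying shared context, the sketch below uses a small in-process publish/subscribe bus. The `Message` and `AgentBus` classes, the topic names, and the two demo agents are hypothetical; the customer's custom A2A bus and its MCP integration are not described in enough detail to reproduce here.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Message:
    topic: str
    sender: str
    payload: dict
    context: dict = field(default_factory=dict)  # shared context carried between agents


class AgentBus:
    """In-process publish/subscribe bus; a stand-in for the custom A2A bus."""

    def __init__(self) -> None:
        self._subscribers = defaultdict(list)
        self.history = []  # every message, in publication order

    def subscribe(self, topic: str, handler: Callable[[Message], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, message: Message) -> None:
        self.history.append(message)
        for handler in list(self._subscribers.get(message.topic, [])):
            handler(message)


def build_demo_bus() -> AgentBus:
    bus = AgentBus()

    def validator(msg: Message) -> None:
        # Forward only anomalies that identify the affected cell.
        if "cell_id" in msg.payload:
            bus.publish(Message("anomaly.validated", "validator", msg.payload, msg.context))

    def reasoner(msg: Message) -> None:
        # Extend the shared context before proposing a root cause.
        context = {**msg.context, "validated_by": msg.sender}
        bus.publish(Message("rootcause.proposed", "reasoner", msg.payload, context))

    bus.subscribe("anomaly.detected", validator)
    bus.subscribe("anomaly.validated", reasoner)
    return bus
```

Publishing `Message("anomaly.detected", "detector", {"cell_id": "A"})` on the demo bus produces three entries in `bus.history`, one per stage. Because agents only know topic names, new agents can be added by subscribing to existing topics without changing the publishers.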

Key capabilities of the solution include:

  • Multi-agent orchestration: Distributed agents collaborate using A2A messaging, enabling parallel processing of telemetry and faster root cause identification.
  • Real-time telemetry processing: Continuous ingestion and normalization of RAN data enable near real-time anomaly detection and state evaluation.
  • Context-aware reasoning: MCP enables agents to share structured context, improving decision accuracy and consistency across workflows.
  • AI insight generation: An LLM-based engine correlates events and interprets telemetry to generate actionable operational insights.
  • Automated issue reporting: Detected anomalies trigger automated JIRA ticket creation, prioritized based on severity and impact.
  • Autonomous remediation workflows: A remediation agent executes corrective actions, with updates communicated via Microsoft Teams webhook notifications.
  • Modularity: The entire system is modular and scalable, from the inter-agent communication layer to the agents themselves. A2A communication uses a bus architecture inspired by message queues.
  • Mean Time to Resolution (MTTR):
    • Each resolution starts by checking whether the issue is persistent. A corresponding JIRA ticket is then created based on the criticality level of the issue, and finally the issue is remediated by the remediation agent.
    • For the persistence check, the issue is observed for a defined period of time.
    • PCI Collision: 2 min 49 seconds
    • RU Failure:
      • Complete RU Failure: 3 min 51 seconds
      • PA Thermal Failure: 3 min 30 seconds
      • Fronthaul Failure: 2 min 53 seconds
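The resolution sequence in the list above (persistence check, ticket creation by criticality, then remediation) can be sketched as follows. This is an assumption-laden outline: the hold period, the `PRIORITY_BY_CRITICALITY` mapping, and the step names are invented for illustration, and no real ticketing API is called.

```python
from dataclasses import dataclass, field

# Hypothetical mapping from issue criticality to ticket priority.
PRIORITY_BY_CRITICALITY = {"critical": "Highest", "major": "High", "minor": "Medium"}


@dataclass
class PersistenceChecker:
    """Reports an issue as persistent once it has stayed active for hold_seconds."""

    hold_seconds: float
    _first_seen: dict = field(default_factory=dict)

    def observe(self, issue_key: str, active: bool, now: float) -> bool:
        if not active:
            # The issue cleared, so restart the observation window.
            self._first_seen.pop(issue_key, None)
            return False
        started = self._first_seen.setdefault(issue_key, now)
        return now - started >= self.hold_seconds


def plan_resolution(issue_key, criticality, active, now, checker):
    """Return the ordered steps to take, or an empty list while still observing."""
    if not checker.observe(issue_key, active, now):
        return []
    priority = PRIORITY_BY_CRITICALITY.get(criticality, "Low")
    return [f"create_ticket:{priority}", f"remediate:{issue_key}"]
```

With `PersistenceChecker(hold_seconds=60)`, an issue that is active at t=0, t=30, and t=60 yields no steps for the first two observations and both steps on the third. An issue that clears in between starts its window again, which is how transient noise is kept out of the ticket queue.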

Agent Features:

The solution incorporates a structured multi-agent framework designed to ensure explainability, reliability, and coordinated decision-making across the network.

  • Multi-Agent Workflow: Tasks are distributed across validation, reasoning, and remediation agents, enabling efficient root cause analysis and automated orchestration.
  • Agent Explainability: All agent inputs, outputs, and tool interactions are logged, enabling full traceability. AI-based evaluation mechanisms assess the correctness, relevance, and impact of each action.
  • Agent Guardrails: Each agent performs a defined task, with outputs validated by downstream agents to ensure reliability and prevent error propagation.
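The logging and downstream-validation ideas in the list above might look like the following minimal sketch. The `traced` decorator, the in-memory `TRACE` list, and the detection rule are hypothetical stand-ins and do not represent the actual logging stack or agents.

```python
import functools
import time

TRACE = []  # in-memory audit trail; a real system would persist this


def traced(agent_name):
    """Record each call's input, output, and duration for later review."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(payload):
            started = time.perf_counter()
            output = fn(payload)
            TRACE.append({
                "agent": agent_name,
                "input": payload,
                "output": output,
                "seconds": time.perf_counter() - started,
            })
            return output
        return wrapper
    return decorator


@traced("detector")
def detect(sample):
    # Hypothetical rule: flag an RU whose fronthaul link is down.
    if sample.get("fronthaul_link_up") is False:
        return {"fault": "fronthaul_failure", "ru_id": sample.get("ru_id")}
    return None


@traced("validator")
def validate(finding):
    # Downstream check: reject empty findings or findings without an RU id.
    return bool(finding) and finding.get("ru_id") is not None and "fault" in finding


def run_pipeline(sample):
    """Pass a finding on only if the downstream validator accepts it."""
    finding = detect(sample)
    return finding if validate(finding) else None
```

Every call leaves a record in `TRACE`, so a rejected finding can be traced back to the exact input that produced it, and a malformed detector output never reaches later stages.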

Technology Stack

  • Model: Z.AI GLM 4.7
  • Token Generation Rate: ~200 tokens/sec
  • Frameworks and Tools: LangChain, Model Context Protocol (MCP), Custom Agent-to-Agent (A2A) Bus

Business Impact

The Agentic AI framework significantly improves network operations by reducing manual monitoring efforts and accelerating issue resolution. Automated telemetry analysis, event correlation, and remediation workflows enable faster root cause identification and reduce mean-time-to-resolution (MTTR).

The architecture is highly scalable and extensible, supporting integration with additional network devices, APIs, and protocols such as NETCONF and RESTCONF, enabling future-ready deployment across evolving telecom infrastructures.

Conclusion

By deploying an Agentic AI–powered RAN observability and orchestration platform, the customer transformed its traditional monitoring framework into an intelligent, autonomous system capable of real-time analysis and self-directed remediation.

The solution enables proactive issue detection, automated decision-making, and improved operational transparency, setting up a foundation for self-healing network operations.

MulticoreWare’s expertise in AI-driven automation, intelligent observability, and scalable network orchestration enabled this transformation. To learn how we can help your organization build intelligent, self-healing network operations using Agentic AI, please contact info@multicorewareinc.com.

Agent Features in Detail:

The solution incorporates a structured multi-agent framework designed to ensure explainability, reliability, and coordinated decision-making across the network.

  • Agent Explainability
    • Each agent’s inputs and outputs are logged through a custom implementation and LangFuse. For each workflow run, hallucination and correctness scores are computed on demand using a reasoning LLM.
    • Scoring uses fine-tuned system and user prompts, with evaluation criteria that include the tool calls made, the time per tool call, the relevance of tool calls to the agent workflow, and the impact of tool calls on underlying data.
  • Agent Guardrails
    • Tasks are decomposed so that each agent handles a single responsibility, which prevents goal drift and context poisoning in long iterations.
    • Outputs are validated by downstream agents instead of being blindly trusted.
    • For error detection (e.g., PCI collision, RU failure), a verification agent validates each finding before escalation to the Meta-Reasoner (root cause analyzer), ensuring that issues are persistent true positives rather than transient noise.
  • Multi-Agent Workflow: Tasks are split into sub-tasks executed by specialized agents:
    • Enacter agent: Orchestrates inter-agent communication.
    • Validation agent
    • Meta-Reasoner agent: Analyzes metadata and historical data to determine the root cause.
    • Remediation agent: Proposes changes based on alerts and impact, aligned with company policies and subject to governance validation before execution.
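The governance validation step for remediation proposals can be illustrated with a short sketch. The `Proposal` fields, the allowed actions, and the protected targets are hypothetical policy values chosen for the example; the customer's actual policies are not given in the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    action: str
    target: str
    reason: str


# Hypothetical policy: permitted actions and targets that need manual sign-off.
ALLOWED_ACTIONS = {"reassign_pci", "restart_ru", "reroute_fronthaul"}
PROTECTED_TARGETS = {"ru-core-1"}


def governance_check(proposal):
    """Return (approved, note) for a remediation proposal."""
    if proposal.action not in ALLOWED_ACTIONS:
        return False, f"action '{proposal.action}' is not permitted by policy"
    if proposal.target in PROTECTED_TARGETS:
        return False, f"target '{proposal.target}' requires manual approval"
    return True, "approved"


def execute_if_approved(proposal, executor):
    """Run the executor only after the proposal passes the governance check."""
    approved, note = governance_check(proposal)
    if not approved:
        return f"rejected: {note}"
    executor(proposal)
    return f"executed: {proposal.action} on {proposal.target}"
```

Keeping the check separate from the executor means the remediation agent can only propose changes; nothing reaches the network unless the policy gate approves it.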

