Kernel-level · On-premise · Zero cloud calls

Your GPU dashboard
is lying to you.

95% utilisation doesn't mean 95% useful work. Megh reads every layer of the compute stack — from silicon to scheduler — and tells you exactly which job is on which GPU, whether it's doing real work, and what it costs. All on-premise. No SaaS. No cloud dependency.

Megh is a GPU intelligence layer that sits on your existing cluster — reads every signal from kernel to BMC — and tells you, in plain English, why jobs run slow, which GPU-hours are wasted, and what that costs. On-premise. No cloud. No rip-and-replace.

India-built · Air-gap ready · NVIDIA + AMD

The 9-layer compute stack

Job / Workload · slurm sacct · Observe
Framework · torch profiler · Observe
Runtime / Serving · vLLM /metrics · Observe
Container · cgroup v2 · Observe
Scheduler · squeue · Observe
Kernel (Linux) · eBPF agent · Megh's layer
OS Userspace · DCGM · nvidia-smi · Observe
Firmware / BMC · Redfish · IPMI · Observe
Hardware · DCGM · IB · PDU · Observe

Products

One control plane.
Every layer of your cluster.

Megh deploys inside your environment — on-premise or air-gapped — and connects to the tools you already run. No rip-and-replace. No data leaving your network.

Commercial · Edition 1

Observe

Visibility + efficiency

SaaS — agentless

Reads your existing Prometheus / DCGM. No agent to install. Connects in minutes.

$30 / GPU / mo

On-premise — full agent

All 9 telemetry bands including kernel eBPF. Zero data egress. Air-gap compatible.

$35 / GPU / mo

What's included

Every telemetry band fused — fd→GPU attribution, per-tenant usage
MFU vs util · io-wait (idle vs storage-starved)
DCGM / vLLM / scheduler correlation
Dashboards, alerts, and pattern detection library
GPU routing intelligence — NVIDIA, AMD, Intel

Recommended

Commercial · Edition 2

Govern

Observe + the full compliance plane

SaaS

2× Observe SaaS rate. All frameworks included.

$60 / GPU / mo

On-premise

Full kernel agent + complete compliance plane.

$70 / GPU / mo

Everything in Observe, plus —

Compliance frameworks: ISO 42001 · NIST AI RMF · SOC 2 · GxP
Pramaan Score — live governance metric across all frameworks
Declared-vs-detected workload gating
Signed HMAC audit envelopes
Policy + placement enforcement
179 prioritised compliance fixes · gap engine with "Show me how"
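To make "signed HMAC audit envelopes" concrete, here is a minimal sketch of how such an envelope could be built and verified with Python's standard library. The field names (`payload`, `prev`, `sig`), the chaining to the previous signature, and the demo key are assumptions for illustration, not Megh's actual envelope format.

```python
import hashlib
import hmac
import json


def sign_envelope(event: dict, key: bytes, prev_sig: str = "") -> dict:
    """Wrap an audit event in an HMAC-SHA256 envelope chained to the previous one."""
    # Canonical JSON so the same event always serialises to the same bytes.
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    mac = hmac.new(key, (prev_sig + payload).encode("utf-8"), hashlib.sha256)
    return {"payload": payload, "prev": prev_sig, "sig": mac.hexdigest()}


def verify_envelope(envelope: dict, key: bytes) -> bool:
    """Recompute the MAC and compare in constant time."""
    expected = hmac.new(
        key,
        (envelope["prev"] + envelope["payload"]).encode("utf-8"),
        hashlib.sha256,
    ).hexdigest()
    return hmac.compare_digest(expected, envelope["sig"])


# Hypothetical events and key, for illustration only.
DEMO_KEY = b"demo-key-not-for-production"
first = sign_envelope({"job": "1234", "action": "role_change"}, DEMO_KEY)
second = sign_envelope({"job": "1235", "action": "submit"}, DEMO_KEY, prev_sig=first["sig"])
```

Because each envelope folds in the previous signature, editing or deleting an earlier record invalidates every later one, which is the property an auditor typically checks.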

Sovereign Edition

Sovereign Platform

For national defence, government & regulated buyers

Where data sovereignty and provable per-job audit are mandatory — deployable in-country or air-gapped, from an independent vendor of allied / non-aligned origin, outside sanctioned or adversarial jurisdictions.

Proof of Audit — land

Scoped agent on 1–2 nodes, capped at 8 weeks → per-job audit trail + data-movement report + compliance-gap assessment. 100% creditable to year 1.

$10,000 creditable

Sovereign Platform — annual

Behavioral audit + data-movement / USB / egress tracing + forensic query + retained audit trail + all compliance frameworks + air-gap + agent. All bundled.

$30K–60K / site / yr + $2K–3K / node / yr

What a dashboard cannot give —

Per job & per user: which datasets were read, what left the node, where it went
USB / removable-media events — no metric stack records this
Jobs behaving unlike their declared workload — forbidden paths, novel egress
Retained, queryable 12-month audit trail + forensic reconstruction
Every signal attributed at the kernel — deployed in your jurisdiction

⚡ Founding partner pilot: free for up to 5 organisations — conversion terms signed at pilot start — offer expires 31 Dec 2026.

Connects to your existing stack —

Slurm · Kubernetes · Prometheus · DCGM · vLLM · Triton · Redfish / IPMI · Lustre / NFS · InfiniBand · Terraform · Ansible · Pulumi · NVIDIA DCGM · AMD MIGraphX · OpenVINO · Warewulf / xCAT

What's inside

Every module. One screen.

From cluster health to compliance score to inference runtime — all visible, all governed, all on your hardware.

Workloads

Dashboard

Cluster health

GPU utilisation, active nodes, PCIe throughput, ECC errors, active patterns, node status, and recommended actions — in one executive + ops view. Answers three questions: where am I wasting GPU time, am I compliant, are my models healthy.

Patterns

17 detectors

17 pre-built detection patterns across scheduling, memory, thermal, hardware, storage, network, jobs, and accounting. Ships ready — no configuration. Each pattern is status-tracked: ACTIVE / WARNING / CLEAN / PENDING.

Jobs

Kernel attribution

Role-scoped job view with Kernel Attribution — which job is driving or starving each GPU at the kernel level. Includes Pramaan Trust Score per job: flags suspicious workloads before they reach the cluster.

Job Analytics

GPU heatmap · MFU

275 jobs tracked, 56.1% average GPU utilisation. GPU utilisation heatmap by partition and hour. Partition efficiency summary with MFU% — the real model-FLOP utilisation number your CFO wants.

Scheduler

Herd-intelligent

Herd-Intelligent Job Placement — nodes scored via live ECC history, PCIe health, and NCCL herd patterns before every job submission. Prevents placement on degraded hardware.

GPU Routing

NVIDIA · AMD · Intel

Automatic workload-to-GPU routing by job class. LLM inference → HBM2e. GROMACS → AMD MI210 (74% faster). CUDA training → RTX Ada. ECMP elephant flow analysis and per-port queue depth for fabric health.

Intelligence

Inference Runtime

vLLM · Triton · on-prem LLM

Which LLM is powering the explanation engine, on which GPU, at what speed — TTFT, ITL, KV-cache%, queue depth. Supports Mistral, Llama, Gemma. Sovereign Stack: OpenVINO, AMD MIGraphX, Intel Habana. Zero cloud calls.

Consciousness

RAG · IaC · audit-grade

Governance findings aligned to ISO 42001 and NIST AI RMF. Multilingual Knowledge Base (EN/HI/GU/TA/TE) with per-PI audit-signed RAG. IaC Connector showing live Terraform, Ansible, and Pulumi drift.

Digital Twin

What-if · fault injection

Simulate cluster changes before committing them. Adjust GPU load, nodes, queue size, and memory pressure. 85% confidence scoring. Fault injection: node failure or degraded NIC. Recommended actions generated automatically.

Infrastructure

Nodes

Hardware registry · CMDB

Complete per-node hardware visibility: CPU, RAM, GPU (utilisation, VRAM, temp, power, PCIe TX, ECC, CUDA driver), storage, network, BMC. Living hardware registry — the CMDB your cluster never had.

Topology

1,000-node canvas · RDMA

1,000-node canvas map with rack-level drill-down. RDMA latency matrix: microsecond pairwise GPUDirect measurements. Five view modes: Cluster, Rack, Node, GPU Util, Network.

Vulnerabilities

Air-gap CVE · signed bundles

Air-gap-safe CVE scanning via signed .vulnbundle — the only CVE approach that works in DRDO, NIC, and classified environments. CVE × Pattern attribution. PDF export for compliance evidence.

Governance

Compliance Intelligence

Pramaan Score

Pramaan Score maps live infrastructure posture across 10 frameworks — ISO 42001, NIST AI RMF, EU AI Act, DPDPA, CERT-In, SOC 2, GxP, MeitY, CAG, STQC. 179 prioritised fixes. Gap engine with "Show me how" remediation.

Finance Intelligence

₹/$ · kWh · kg CO₂

Your cluster cost ₹45,000 this month. 120 jobs delivered. 1% of compute was productively used. Energy and carbon per job. PI-variant reports with student anonymisation. Pramaan evidence PDF for grant reporting.

Admin & RBAC

13 roles · LDAP · PAM

13 user roles — admin, auditor, finance, pi-lead, cluster-engineer, researcher, supervisor, and more. LDAP-backed. All role changes written to the tamper-evident audit log. Enforced, not optional.

9

Telemetry layers
Job → Framework → Runtime → Container → Scheduler → OS → Kernel → Firmware → Hardware

17

Detection patterns
Ships ready — scheduling, memory, thermal, hardware, storage, network, jobs, accounting

38.8

Pramaan Score — live
SOC 2 Type II · 7 Jun 2026 · on-premise · zero cloud calls

13

User roles
admin · auditor · finance · pi-lead · researcher · supervisor · and more

Capabilities

Everything your cluster needs.
Nothing it doesn't.

Ten capabilities. One agent. Deployed in your environment — not ours.

See the Truth

Kernel-level attribution ties every GPU cycle back to the exact job and tenant. What DCGM, Slurm, and Prometheus structurally cannot show — Megh shows.

fd→GPU · per-tenant · io-wait

Know the Cost

GPU-hours wasted, energy consumed, carbon emitted, and rupee cost — attributed per job, per user, per team. Your CFO finally gets a straight answer.

₹/$ per job · kWh · kg CO₂

Stay Compliant

Pramaan Score maps your live infrastructure posture across ISO 42001, NIST AI RMF, EU AI Act, DPDPA, CERT-In, SOC 2, GxP and more — with 179 prioritised fixes.

10 frameworks · 179 fixes · live evidence

Stay Sovereign

Fully on-premise. Air-gap deployable. Signed offline CVE bundles. No foreign-jurisdiction cloud dependency. India-built, independent vendor of allied origin.

Air-gap · CERT-In · DPDPA · Make-in-India

Detect Failures Early

17 pre-built detection patterns across scheduling, memory, thermal, hardware, storage, network, jobs, and accounting. Ships ready, with no configuration required.

17 patterns · 8 categories · auto-detect

Govern Workloads

Per-job Pramaan Trust Score flags suspicious workloads at admission — before they reach the cluster. Declarations drive policy; the kernel agent drives proof.

Trust scoring · workload gating · HMAC audit

Pattern Detection Library

From GPU Idle Despite Queue to InfiniBand Link Degradation, each pattern is independently tracked, status-flagged (ACTIVE / WARNING / CLEAN / PENDING), and linked to a causal explanation.

P1–P17 · scheduling to network

GPU Routing Intelligence

Automatically routes workloads to the right GPU by job class — LLM inference to HBM2e, CUDA training to CUDA-optimised silicon. NVIDIA, AMD, and Intel in one plane.

Workload routing · NVIDIA + AMD + Intel

Compliance Intelligence

Not a checklist — a live gap engine. Each gap shows which frameworks it closes, which controls it satisfies, the exact remediation step, and API endpoints to verify after the fix.

Gap engine · cross-framework · Show me how

Built for Your Requirements

Every cluster is different. If your environment needs a capability Megh doesn't ship today — a custom integration, a specific compliance framework, a hardware connector, or a bespoke reporting module — we build it with you.

Custom integrations · bespoke modules · your stack · your rules

Custom scheduler connector

Proprietary hardware telemetry

Bespoke compliance framework

Custom reporting & dashboards

Internal audit format export

Your use case →

Tell us what you need →

What your current stack misses

Three ways dashboards lie. Every day.

Your GPU monitoring stack gives you device-level aggregates. It cannot tie kernel-level evidence back to the job or the tenant. That gap is the whole problem.

util% = useful work

→ MFU tells the truth

A GPU pinned at 95% can be spinning, waiting on an NCCL collective, or KV-cache thrashing. DCGM sees the chip is busy. It does not see whether the work is real. MFU (model FLOP utilisation) does.
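As a rough illustration of the gap between util% and MFU, the sketch below uses the common approximation that training a dense transformer costs about 6 FLOPs per parameter per token (forward plus backward, ignoring attention overhead). The model size, throughput, and peak figure are hypothetical inputs; take the real peak from your vendor datasheet. This is not how Megh computes MFU.

```python
def model_flops_utilisation(
    params: float,
    tokens_per_second: float,
    num_gpus: int,
    peak_flops_per_gpu: float,
) -> float:
    """Approximate training MFU as achieved FLOP/s divided by aggregate peak FLOP/s."""
    achieved = 6.0 * params * tokens_per_second  # ~6 FLOPs per parameter per token
    return achieved / (num_gpus * peak_flops_per_gpu)


# Hypothetical job: 7B-parameter model, 20,000 tokens/s across 8 GPUs,
# with an assumed dense peak of 312 TFLOP/s per GPU.
mfu = model_flops_utilisation(7e9, 20_000, 8, 312e12)
print(f"MFU = {mfu:.1%}")
```

In this made-up case the GPUs could all report 95% utilisation while delivering roughly a third of their peak model FLOPs.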

idle% = available

→ io-wait tells the truth

A GPU at 20% during AlphaFold's MSA phase isn't free — it's blocked on Lustre reads. Reclaiming it would kill the job. Only kernel io-wait tells idle apart from storage-starved.
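One way to approximate the idle versus storage-starved distinction from userspace is Linux pressure stall information (PSI), exposed in `/proc/pressure/io` and in cgroup v2 `io.pressure` files. The sketch below parses that text format and applies two thresholds. The thresholds and the sample numbers are assumptions for illustration, and this is a simplification rather than Megh's kernel agent.

```python
def parse_psi(text: str) -> dict:
    """Parse PSI text such as /proc/pressure/io into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {k: float(v) for k, v in (f.split("=") for f in fields)}
    return result


def classify_gpu(
    gpu_util_pct: float,
    io_full_avg10: float,
    util_threshold: float = 30.0,  # hypothetical cut-off for "busy"
    io_threshold: float = 20.0,    # hypothetical cut-off for "blocked on I/O"
) -> str:
    """Label a low-utilisation GPU as idle or storage-starved."""
    if gpu_util_pct >= util_threshold:
        return "busy"
    if io_full_avg10 >= io_threshold:
        return "storage-starved"
    return "idle"


# Made-up sample in the PSI file format.
SAMPLE_PSI = """\
some avg10=45.20 avg60=40.11 avg300=30.02 total=123456789
full avg10=38.75 avg60=33.90 avg300=25.40 total=98765432
"""

psi = parse_psi(SAMPLE_PSI)
verdict = classify_gpu(20.0, psi["full"]["avg10"])
```

Reading the job's own cgroup `io.pressure` file rather than the node-wide one would narrow the signal to a single workload.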

shared = attributed

→ kernel fd-table tells the truth

On a shared, MIG, or opaque GPU, who is actually holding the allocation? Only the kernel's file-descriptor table names the tenant. No downstream Prometheus join can reconstruct it once the aggregate is emitted.
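The idea of reading the file-descriptor table can be approximated from userspace by walking `/proc/<pid>/fd` and looking for links to GPU device nodes. This is a point-in-time sketch, assuming NVIDIA device paths under `/dev/nvidia*` and enough privileges to read other users' processes. It is not the eBPF agent described here, and mapping a PID to a job or tenant is a further step left out.

```python
import os


def gpu_holders(proc_root: str = "/proc", prefix: str = "/dev/nvidia") -> dict:
    """Map each GPU device path to the sorted PIDs holding an open fd on it."""
    holders = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # fd closed between listing and reading
            if target.startswith(prefix):
                holders.setdefault(target, set()).add(int(entry))
    return {dev: sorted(pids) for dev, pids in holders.items()}
```

Calling `gpu_holders()` on a GPU node returns something like a device-to-PID map; the scheduler's cgroup path in `/proc/<pid>/cgroup` is one place to look when turning those PIDs into job IDs.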

Architecture

Megh reads every layer.
It's the only one reading the middle.

Most stacks bolt one tool per band and call it coverage. The runtime and device ends are well-observed. The kernel — where causation actually lives — is a blind spot.

Layer | What it tells you | Standard tool | Coverage
Job / Workload | State, runtime, GPU alloc, exit code | slurm sacct · k8s | Partial
Framework | Loss curves, tokens/s, samples/s | torch profiler | Partial
Runtime / Serving | TTFT · ITL · KV-cache% · MFU | vLLM /metrics | Partial
Container | Image provenance, cgroup pressure | cgroup v2 | Partial
Scheduler | Queue wait, placement, GRES allocation | squeue · kube-sched | Partial
Kernel (Linux) | io-wait · fd→GPU map · per-tenant attribution · declared-vs-actual behaviour | eBPF (Megh's layer) | Megh only
OS Userspace | Device counters, NCCL events, dmesg | DCGM · nvidia-smi | Partial
Firmware / BMC | Power, thermal, fan, PSU, ECC | Redfish · IPMI | Partial
Hardware | GPU util%, temp, NVLink, IB counters | DCGM · IB · PDU | Partial

Use Cases

What the kernel layer
changes in practice

Three real workloads where the standard stack gives you a number. Megh gives you the reason.

USE CASE 01

vLLM inference — TTFT spiking

DCGM: 90% util. Problem unseen.

KV-cache is full. vLLM is preempting and swapping. The GPU looks busy because it is — doing the wrong work. Kernel attribution names the tenant causing the evictions. Add capacity or evict the right job. Don't trust the util gauge.
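A quick way to check for this condition yourself is to read the Prometheus text that vLLM serves on `/metrics`. The sketch below parses that format and flags a full cache with preemptions. The metric names `vllm:gpu_cache_usage_perc` and `vllm:num_preemptions_total` are assumptions that vary between vLLM versions, so confirm them against your own endpoint; the threshold and sample values are invented.

```python
# Assumed metric names; check your vLLM version's /metrics output.
CACHE_METRIC = "vllm:gpu_cache_usage_perc"
PREEMPT_METRIC = "vllm:num_preemptions_total"


def parse_prometheus(text: str) -> dict:
    """Parse Prometheus exposition text into {metric_name: value}.

    Labels are dropped and the last sample per name wins; lines are assumed
    to carry no trailing timestamp.
    """
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks, HELP and TYPE lines
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        values[name] = float(value)
    return values


def kv_cache_pressure(metrics: dict, cache_threshold: float = 0.95) -> bool:
    """True when the KV-cache is nearly full and preemptions have occurred."""
    cache = metrics.get(CACHE_METRIC, 0.0)
    preemptions = metrics.get(PREEMPT_METRIC, 0.0)
    return cache >= cache_threshold and preemptions > 0


# Made-up scrape for illustration.
SAMPLE_METRICS = """\
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="demo"} 0.99
# TYPE vllm:num_preemptions_total counter
vllm:num_preemptions_total{model_name="demo"} 42.0
"""

under_pressure = kv_cache_pressure(parse_prometheus(SAMPLE_METRICS))
```

The preemption metric is a cumulative counter, so a production check would compare two scrapes and look at the rate rather than the raw total.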

USE CASE 02

AlphaFold — the GPU that looks idle

DCGM: 20% util. Operator reclaims it.

MSA search is running: massive sequential reads from BFD and UniRef on Lustre. The GPU is waiting on storage. Kernel io-wait proves it's blocked, not free. The fix is the data path — not handing the GPU to another job.

USE CASE 03

Isaac Sim — right GPU, wrong silicon

Scheduler: GPU assigned. Job runs.

The sim runs on any GPU but quietly collapses on cards without RT cores. An A100 has none. Megh gates placement on RT-core capability and surfaces sim-on-sim contention via per-tenant context-switch monitoring.

Pramaan Score

Governance that's
measured, not claimed.

One number. Ten frameworks. Live infrastructure evidence — not self-assessment checkboxes. Every gap ranked by cross-framework impact with a clear remediation path.

Governance policy document missing. Closes 10 controls across CERT-In CI-5.1, DPDPA 1.2/1.5, EU AI Act AIA-2.1, ISO 42001, NIST AI RMF.

OFI

No CVE scan recorded. Ingest a signed .vulnbundle and run scan. Closes CERT-In CI-2.1, NIST MAN-3.1.

System operations monitoring — Conformant. Hash-chain audit log + Prometheus telemetry + vulnerability intake all verified.

38.8

/100

Pramaan Score — Live · SOC 2 Type II · 7 Jun 2026

Score reflects live infrastructure evidence collected from cluster nodes at report generation. Not a self-assessment — not fabricated. A qualified auditor must review before reliance.

FRAMEWORKS COVERED

ISO 42001

NIST AI RMF

SOC 2 TSC

EU AI Act

DPDPA 2023

CERT-In 2025

GxP / 21 CFR 11

MeitY 2025

CAG IS Audit

STQC AI QA

179 unique fixes · 39 frameworks · ranked by impact

Pricing — V2 · June 2026

~1–3% of the GPU spend it sits on.

One reclaimed H100 pays for roughly 50 GPUs monitored. Per-GPU subscription, no per-node minimums on the metric that matters.

Commercial · SaaS

Observe

Agentless. Reads your existing Prometheus and DCGM.

$30/GPU/mo

$60 node minimum · metered on peak concurrent GPUs

All 9 telemetry bands fused
fd→GPU kernel attribution
MFU vs util · io-wait starvation
Per-tenant usage · dashboards & alerts
DCGM / vLLM / scheduler correlation

Commercial · On-prem

Govern

Full kernel agent + complete compliance plane. All frameworks included.

$70/GPU/mo

On-prem · $35 Observe + $35 compliance · annual prepay = 2 mo free

Everything in Observe, plus —
ISO 42001 · NIST AI RMF · SOC 2 · GxP
Declared-vs-detected workload gating
Signed HMAC audit envelopes
Policy + placement enforcement

Sovereign / Defence

Sovereign

Data sovereignty + provable per-job audit. Air-gapped. India-built.

$30K–60K/site/yr

+ $2K–3K/node/yr · Proof of Audit $10K (creditable)

USB / removable-media forensics
Retained 12-month queryable audit trail
Forensic reconstruction per job/user
Air-gap deployment · geographic attestation
Founding partner pilot: free (5 slots, Dec 2026)

64+ GPUs −10% · 256+ GPUs −20% · 1,000+ custom · Volume tiers on aggregate across clusters
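The list prices and volume tiers above can be turned into a quick estimate. The sketch below assumes the volume discount applies to the whole bill, that annual prepay means paying for 10 months out of 12, and that prepay is available on every edition; the page states it only for on-prem Govern. It ignores the node minimum and Sovereign pricing, so treat the output as a ballpark rather than a quote.

```python
# Per-GPU monthly list prices in USD, taken from the pricing cards.
LIST_PRICE = {
    ("observe", "saas"): 30,
    ("observe", "onprem"): 35,
    ("govern", "saas"): 60,
    ("govern", "onprem"): 70,
}


def volume_discount(gpus: int) -> float:
    """Return the discount fraction for an aggregate GPU count."""
    if gpus >= 1000:
        raise ValueError("1,000+ GPUs is custom pricing")
    if gpus >= 256:
        return 0.20
    if gpus >= 64:
        return 0.10
    return 0.0


def annual_cost(edition: str, deployment: str, gpus: int, annual_prepay: bool = False) -> float:
    """Estimate yearly cost in USD under the assumptions stated above."""
    monthly = LIST_PRICE[(edition, deployment)] * gpus * (1 - volume_discount(gpus))
    months_billed = 10 if annual_prepay else 12  # prepay = 2 months free
    return monthly * months_billed


# Example: 128 GPUs on Govern on-prem, prepaid annually.
estimate = annual_cost("govern", "onprem", 128, annual_prepay=True)
print(f"${estimate:,.0f} per year")
```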

Who We Serve

Built for people who run real clusters.
Not cloud-credit holders.

If you own the hardware, share GPUs across teams, and have ever wondered where the hours actually went — this is for you.

☁️

GPU Clouds & Neoclouds

Per-tenant attribution is directly monetisable. Know which tenant holds which GPU, whether it's idle on I/O, and bill accurately.

→ Reclaim idle · bill precisely · enforce SLAs

🧬

Drug Discovery & Life Sciences

AlphaFold and molecular dynamics pipelines are I/O-starved by design. We prove it. 21 CFR Part 11 GxP audit trails built in.

→ GxP · HIPAA · idle ≠ free

🏛

National Defence & Govt

Sovereignty is a mandatory requirement, not a preference. Air-gap deployment, signed bundles, no foreign-jurisdiction cloud dependency.

→ DRDO · C-DAC · CERT-In · DPDPA

🔬

Research & HPC Centres

Shared infrastructure, per-PI accountability, utilisation reports in 10 minutes instead of 3 weeks. Grant reporting that doesn't require a spreadsheet.

→ IIT · IISc · C-DAC · NSM clusters

Not for you if —

You call a hosted LLM API · you run a single-user box · you use managed serverless GPUs · your infrastructure is 100% hyperscaler-managed. No shame — just not the problem we solve.

About

Built in India.
Sovereign by design.

Megh Communications is an India-based infrastructure software company building the governance and observability layer for on-premise AI compute. We exist because the organisations that need this most — national labs, defence programs, research institutions, regulated enterprises — cannot use foreign-jurisdiction SaaS tools, and no one was building for them.

We are currently in beta with a small number of design partners. We move carefully, we say what we mean, and we build in public where we can.

🇮🇳 Bangalore, India

Beta — pilot stage

On-premise · Air-gap

Independent · non-aligned origin

9

Telemetry layers read, from Job to Hardware

17

Pre-built detection patterns, ships ready

10

Compliance frameworks, from ISO 42001 to DPDPA

179

Unique compliance fixes across 39 frameworks

Get in touch

info@meghcommunications.com

For pilots, partnerships, and early access conversations.

Run the 2-minute test.
Then decide if you need us.

nvidia-smi dmon -s u -d 1

Watch sm% and mem% simultaneously. If high sm% isn't producing the throughput you expect, your dashboards have been hiding where the money goes.
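If you would rather average a capture than watch it scroll, the output of the command above can be saved to a file and summarised. The sketch below assumes the usual `dmon` layout, a `# gpu sm mem ...` header followed by whitespace-separated integer rows with `-` for unsupported fields; column sets differ between driver versions, which is why the parser reads names from the header. The sample values are invented.

```python
def parse_dmon(text: str) -> list:
    """Parse `nvidia-smi dmon` text into a list of {column: int or None} rows."""
    columns = []
    rows = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("#"):
            header = stripped.lstrip("#").split()
            if header and header[0] == "gpu":
                columns = header  # the "# Idx % ..." units line is ignored
            continue
        parts = stripped.split()
        rows.append(
            {name: (None if raw == "-" else int(raw)) for name, raw in zip(columns, parts)}
        )
    return rows


def mean_util(rows: list) -> dict:
    """Average sm% and mem% per GPU index, skipping rows with missing values."""
    totals = {}
    for row in rows:
        if row.get("sm") is None or row.get("mem") is None:
            continue
        acc = totals.setdefault(row["gpu"], [0, 0, 0])
        acc[0] += row["sm"]
        acc[1] += row["mem"]
        acc[2] += 1
    return {gpu: {"sm": s / n, "mem": m / n} for gpu, (s, m, n) in totals.items()}


# Made-up capture in the assumed layout.
SAMPLE_DMON = """\
# gpu     sm    mem    enc    dec
# Idx      %      %      %      %
    0     95     10      0      0
    0     97     14      0      0
"""

summary = mean_util(parse_dmon(SAMPLE_DMON))
```

A consistently high `sm` average next to a low `mem` average is only a hint, not a diagnosis; compare it against the throughput your job actually reports.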

No demo-bot. No funnel. Reply to a real email — or send a DCGM/Prometheus export. Nothing leaves your box beyond the metrics you choose.
