Kernel-level · On-premise · Zero cloud calls

Your GPU dashboard
is lying to you.

95% utilisation doesn't mean 95% useful work. Megh reads every layer of the compute stack — from silicon to scheduler — and tells you exactly which job is on which GPU, whether it's doing real work, and what it costs. All on-premise. No SaaS. No cloud dependency.

Megh is a GPU intelligence layer that sits on your existing cluster — reads every signal from kernel to BMC — and tells you, in plain English, why jobs run slow, which GPU-hours are wasted, and what that costs. On-premise. No cloud. No rip-and-replace.
India-built · Air-gap ready · NVIDIA + AMD
The 9-layer compute stack
Job / Workload | slurm sacct | Observe
Framework | torch profiler | Observe
Runtime / Serving | vLLM /metrics | Observe
Container | cgroup v2 | Observe
Scheduler | squeue | Observe
Kernel (Linux) | eBPF agent | Megh's layer
OS Userspace | DCGM · nvidia-smi | Observe
Firmware / BMC | Redfish · IPMI | Observe
Hardware | DCGM · IB · PDU | Observe
Products

One control plane.
Every layer of your cluster.

Megh deploys inside your environment — on-premise or air-gapped — and connects to the tools you already run. No rip-and-replace. No data leaving your network.

Commercial · Edition 1
Observe
Visibility + efficiency
SaaS — agentless
Reads your existing Prometheus / DCGM. No agent to install. Connects in minutes.
$30 / GPU / mo
On-premise — full agent
All 9 telemetry bands including kernel eBPF. Zero data egress. Air-gap compatible.
$35 / GPU / mo
What's included
  • Every telemetry band fused — fd→GPU attribution, per-tenant usage
  • MFU vs util · io-wait (idle vs storage-starved)
  • DCGM / vLLM / scheduler correlation
  • Dashboards, alerts, and pattern detection library
  • GPU routing intelligence — NVIDIA, AMD, Intel
Sovereign Edition
Sovereign Platform
For national defence, government & regulated buyers
Where data sovereignty and provable per-job audit are mandatory — deployable in-country or air-gapped, from an independent vendor of allied / non-aligned origin, outside sanctioned or adversarial jurisdictions.
Proof of Audit — land
Scoped agent on 1–2 nodes, capped at 8 weeks → per-job audit trail + data-movement report + compliance-gap assessment. 100% creditable to year 1.
$10,000 creditable
Sovereign Platform — annual
Behavioral audit + data-movement / USB / egress tracing + forensic query + retained audit trail + all compliance frameworks + air-gap + agent. All bundled.
$30K–60K / site / yr + $2K–3K / node / yr
What a dashboard cannot give —
  • Per job & per user: which datasets were read, what left the node, where it went
  • USB / removable-media events — no metric stack records this
  • Jobs behaving unlike their declared workload — forbidden paths, novel egress
  • Retained, queryable 12-month audit trail + forensic reconstruction
  • Every signal attributed at the kernel — deployed in your jurisdiction
⚡ Founding partner pilot: free for up to 5 organisations — conversion terms signed at pilot start — offer expires 31 Dec 2026.
Connects to your existing stack —
Slurm
Kubernetes
Prometheus
DCGM
vLLM
Triton
Redfish / IPMI
Lustre / NFS
InfiniBand
Terraform
Ansible
Pulumi
NVIDIA DCGM
AMD MIGraphX
OpenVINO
Warewulf / xCAT
What's inside

Every module. One screen.

From cluster health to compliance score to inference runtime — all visible, all governed, all on your hardware.

Workloads
Dashboard
Cluster health
GPU utilisation, active nodes, PCIe throughput, ECC errors, active patterns, node status, and recommended actions — in one executive + ops view. Answers three questions: where am I wasting GPU time, am I compliant, are my models healthy.
Patterns
17 detectors
17 pre-built detection patterns across scheduling, memory, thermal, hardware, storage, network, jobs, and accounting. Ships ready — no configuration. Each pattern is status-tracked: ACTIVE / WARNING / CLEAN / PENDING.
Jobs
Kernel attribution
Role-scoped job view with Kernel Attribution — which job is driving or starving each GPU at the kernel level. Includes Pramaan Trust Score per job: flags suspicious workloads before they reach the cluster.
Job Analytics
GPU heatmap · MFU
275 jobs tracked, 56.1% average GPU utilisation. GPU utilisation heatmap by partition and hour. Partition efficiency summary with MFU% — the real model-FLOP utilisation number your CFO wants.
Scheduler
Herd-intelligent
Herd-Intelligent Job Placement — nodes scored via live ECC history, PCIe health, and NCCL herd patterns before every job submission. Prevents placement on degraded hardware.
GPU Routing
NVIDIA · AMD · Intel
Automatic workload-to-GPU routing by job class. LLM inference → HBM2e. GROMACS → AMD MI210 (74% faster). CUDA training → RTX Ada. ECMP elephant flow analysis and per-port queue depth for fabric health.
Intelligence
Inference Runtime
vLLM · Triton · on-prem LLM
Which LLM is powering the explanation engine, on which GPU, at what speed — TTFT, ITL, KV-cache%, queue depth. Supports Mistral, Llama, Gemma. Sovereign Stack: OpenVINO, AMD MIGraphX, Intel Habana. Zero cloud calls.
Consciousness
RAG · IaC · audit-grade
Governance findings aligned to ISO 42001 and NIST AI RMF. Multilingual Knowledge Base (EN/HI/GU/TA/TE) with per-PI audit-signed RAG. IaC Connector showing live Terraform, Ansible, and Pulumi drift.
Digital Twin
What-if · fault injection
Simulate cluster changes before committing them. Adjust GPU load, nodes, queue size, and memory pressure. 85% confidence scoring. Fault injection: node failure or degraded NIC. Recommended actions generated automatically.
Infrastructure
Nodes
Hardware registry · CMDB
Complete per-node hardware visibility: CPU, RAM, GPU (utilisation, VRAM, temp, power, PCIe TX, ECC, CUDA driver), storage, network, BMC. Living hardware registry — the CMDB your cluster never had.
Topology
1,000-node canvas · RDMA
1,000-node canvas map with rack-level drill-down. RDMA latency matrix: microsecond pairwise GPUDirect measurements. Five view modes: Cluster, Rack, Node, GPU Util, Network.
Vulnerabilities
Air-gap CVE · signed bundles
Air-gap-safe CVE scanning via signed .vulnbundle — the only CVE approach that works in DRDO, NIC, and classified environments. CVE × Pattern attribution. PDF export for compliance evidence.
Governance
Compliance Intelligence
Pramaan Score
Pramaan Score maps live infrastructure posture across 10 frameworks — ISO 42001, NIST AI RMF, EU AI Act, DPDPA, CERT-In, SOC 2, GxP, MeitY, CAG, STQC. 179 prioritised fixes. Gap engine with "Show me how" remediation.
Finance Intelligence
₹/$ · kWh · kg CO₂
Your cluster cost ₹45,000 this month. 120 jobs delivered. 1% of compute was productively used. Energy and carbon per job. PI-variant reports with student anonymisation. Pramaan evidence PDF for grant reporting.
Admin & RBAC
13 roles · LDAP · PAM
13 user roles — admin, auditor, finance, pi-lead, cluster-engineer, researcher, supervisor, and more. LDAP-backed. All role changes written to the tamper-evident audit log. Enforced, not optional.
9
Telemetry layers
Job → Framework → Runtime → Container → Scheduler → Kernel → OS → Firmware → Hardware
17
Detection patterns
Ships ready — scheduling, memory, thermal, hardware, storage, network, jobs, accounting
38.8
Pramaan Score — live
SOC 2 Type II · 7 Jun 2026 · on-premise · zero cloud calls
13
User roles
admin · auditor · finance · pi-lead · researcher · supervisor · and more
Capabilities

Everything your cluster needs.
Nothing it doesn't.

Ten capabilities. One agent. Deployed in your environment — not ours.

See the Truth
Kernel-level attribution ties every GPU cycle back to the exact job and tenant. What DCGM, Slurm, and Prometheus structurally cannot show — Megh shows.
fd→GPU · per-tenant · io-wait
Know the Cost
GPU-hours wasted, energy consumed, carbon emitted, and rupee cost — attributed per job, per user, per team. Your CFO finally gets a straight answer.
₹/$ per job · kWh · kg CO₂
Stay Compliant
Pramaan Score maps your live infrastructure posture across ISO 42001, NIST AI RMF, EU AI Act, DPDPA, CERT-In, SOC 2, GxP and more — with 179 prioritised fixes.
10 frameworks · 179 fixes · live evidence
Stay Sovereign
Fully on-premise. Air-gap deployable. Signed offline CVE bundles. No foreign-jurisdiction cloud dependency. India-built, independent vendor of allied origin.
Air-gap · CERT-In · DPDPA · Make-in-India
Detect Failures Early
17 pre-built detection patterns across scheduling, memory, thermal, hardware, storage, network, jobs, and accounting. Ships ready, with no configuration required.
17 patterns · 8 categories · auto-detect
Govern Workloads
Per-job Pramaan Trust Score flags suspicious workloads at admission — before they reach the cluster. Declarations drive policy; the kernel agent drives proof.
Trust scoring · workload gating · HMAC audit
Pattern Detection Library
From GPU Idle Despite Queue to InfiniBand Link Degradation: each pattern is independently tracked, status-flagged (ACTIVE / WARNING / CLEAN / PENDING), and linked to a causal explanation.
P1–P17 · scheduling to network
GPU Routing Intelligence
Automatically routes workloads to the right GPU by job class — LLM inference to HBM2e, CUDA training to CUDA-optimised silicon. NVIDIA, AMD, and Intel in one plane.
Workload routing · NVIDIA + AMD + Intel
Compliance Intelligence
Not a checklist — a live gap engine. Each gap shows which frameworks it closes, which controls it satisfies, the exact remediation step, and API endpoints to verify after the fix.
Gap engine · cross-framework · Show me how
Built for Your Requirements
Every cluster is different. If your environment needs a capability Megh doesn't ship today — a custom integration, a specific compliance framework, a hardware connector, or a bespoke reporting module — we build it with you.
Custom integrations · bespoke modules · your stack · your rules
Custom scheduler connector
Proprietary hardware telemetry
Bespoke compliance framework
Custom reporting & dashboards
Internal audit format export
Your use case →
Tell us what you need →
What your current stack misses

Three ways dashboards lie. Every day.

Your GPU monitoring stack gives you device-level aggregates. It cannot tie kernel-level evidence back to the job or the tenant. That gap is the whole problem.

util% = useful work
→ MFU tells the truth
A GPU pinned at 95% can be spinning, waiting on a NCCL collective, or KV-cache thrashing. DCGM sees the chip is busy. It does not see whether the work is real. MFU — model FLOP utilisation — does.
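To make the gap between util% and MFU concrete, here is a minimal sketch of the calculation. It assumes the widely used estimate of about 6 FLOPs per parameter per token for a transformer training step and ignores attention FLOPs. The function name and the example numbers are illustrative and are not taken from Megh.

```python
def model_flops_utilisation(tokens_per_second: float,
                            n_params: float,
                            peak_flops: float) -> float:
    """Return MFU as a fraction of the device's peak FLOP/s.

    Assumption: ~6 FLOPs per parameter per token (forward + backward pass
    of a dense transformer), attention FLOPs ignored.
    """
    if peak_flops <= 0:
        raise ValueError("peak_flops must be positive")
    achieved_flops = 6.0 * n_params * tokens_per_second
    return achieved_flops / peak_flops


# Hypothetical job: a 7e9-parameter model training at 4,000 tokens/s per GPU
# on a device with a 312e12 FLOP/s dense BF16 peak (the A100's published figure).
mfu = model_flops_utilisation(4_000, 7e9, 312e12)
print(f"MFU = {mfu:.1%}")  # prints "MFU = 53.8%"
```

A dashboard could show this GPU near 100% busy while only about half of its peak arithmetic throughput goes into the model.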
idle% = available
→ io-wait tells the truth
A GPU at 20% during AlphaFold's MSA phase isn't free — it's blocked on Lustre reads. Reclaiming it would kill the job. Only kernel io-wait tells idle apart from storage-starved.
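One rough way to see the idle versus storage-starved distinction without a kernel agent is Linux pressure stall information (PSI). The sketch below parses text in the format of /proc/pressure/io (cgroup v2 exposes the same format per cgroup in io.pressure) and combines it with a GPU utilisation reading. The thresholds and function names are hypothetical, and this is not how Megh's eBPF agent works.

```python
def parse_psi(text: str) -> dict:
    """Parse PSI text such as the contents of /proc/pressure/io."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()          # kind is "some" or "full"
        result[kind] = {k: float(v) for k, v in (f.split("=") for f in fields)}
    return result


def classify_gpu(gpu_util_pct: float, io_some_avg10: float,
                 util_threshold: float = 30.0,
                 io_threshold: float = 20.0) -> str:
    """Hypothetical thresholds: low util plus high I/O pressure means starved."""
    if gpu_util_pct >= util_threshold:
        return "busy"
    if io_some_avg10 >= io_threshold:
        return "storage-starved"
    return "idle"


# Illustrative sample, not a real capture.
SAMPLE_PSI = """\
some avg10=41.20 avg60=38.75 avg300=35.10 total=123456789
full avg10=33.00 avg60=30.10 avg300=28.40 total=98765432
"""

psi = parse_psi(SAMPLE_PSI)
print(classify_gpu(20.0, psi["some"]["avg10"]))  # prints "storage-starved"
```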
shared = attributed
→ kernel fd-table tells the truth
On a shared, MIG, or opaque GPU, who is actually holding the allocation? Only the kernel's file-descriptor table names the tenant. No downstream Prometheus join can reconstruct it once the aggregate is emitted.
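The idea of reading the file-descriptor table can be approximated from user space by walking procfs. The sketch below is a procfs approximation, not Megh's kernel agent, and the function name is hypothetical. It lists which process IDs hold open descriptors on NVIDIA device nodes. Note that the /dev/nvidia prefix also matches control nodes such as /dev/nvidiactl and /dev/nvidia-uvm, and that reading other users' descriptors normally requires root.

```python
import os


def gpu_holders(proc_root: str = "/proc",
                device_prefix: str = "/dev/nvidia") -> dict:
    """Map each open NVIDIA device path to the set of PIDs holding it."""
    holders = {}
    if not os.path.isdir(proc_root):
        return holders
    for entry in os.listdir(proc_root):
        if not entry.isdigit():               # only per-process directories
            continue
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:                       # process exited or no permission
            continue
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue
            if target.startswith(device_prefix):
                holders.setdefault(target, set()).add(int(entry))
    return holders


for device, pids in sorted(gpu_holders().items()):
    print(device, sorted(pids))
```

Mapping each PID to a job or tenant would be a further step, for example through the process's cgroup path or the scheduler's job records.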
Architecture

Megh reads every layer.
It's the only one reading the middle.

Most stacks bolt one tool per band and call it coverage. The runtime and device ends are well-observed. The kernel — where causation actually lives — is a blind spot.

Layer | What it tells you | Standard tool | Coverage
Job / Workload | State, runtime, GPU alloc, exit code | slurm sacct · k8s | Partial
Framework | Loss curves, tokens/s, samples/s | torch profiler | Partial
Runtime / Serving | TTFT · ITL · KV-cache% · MFU | vLLM /metrics | Partial
Container | Image provenance, cgroup pressure | cgroup v2 | Partial
Scheduler | Queue wait, placement, GRES allocation | squeue · kube-sched | Partial
Kernel (Linux) | io-wait · fd→GPU map · per-tenant attribution · declared-vs-actual behaviour | eBPF (Megh's layer) | Megh only
OS Userspace | Device counters, NCCL events, dmesg | DCGM · nvidia-smi | Partial
Firmware / BMC | Power, thermal, fan, PSU, ECC | Redfish · IPMI | Partial
Hardware | GPU util%, temp, NVLink, IB counters | DCGM · IB · PDU | Partial
Use Cases

What the kernel layer
changes in practice.

Three real workloads where the standard stack gives you a number. Megh gives you the reason.

USE CASE 01
vLLM inference — TTFT spiking
DCGM: 90% util. Problem unseen.
KV-cache is full. vLLM is preempting and swapping. The GPU looks busy because it is — doing the wrong work. Kernel attribution names the tenant causing the evictions. Add capacity or evict the right job. Don't trust the util gauge.
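The runtime-side half of this diagnosis can be sketched by reading vLLM's Prometheus endpoint. The example below parses Prometheus text exposition and flags KV-cache pressure. The metric names follow some vLLM releases and differ between versions, so treat them as assumptions, along with the threshold and function names. The parser ignores optional timestamps and keeps only the last series per metric name.

```python
def parse_prometheus_text(text: str) -> dict:
    """Return {metric_name: value} from Prometheus text exposition lines."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):      # skip HELP/TYPE comments
            continue
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]   # drop the label set
        values[name] = float(value)
    return values


def kv_cache_pressure(metrics: dict, cache_threshold: float = 0.95) -> bool:
    """Hypothetical rule: cache nearly full while requests are queued."""
    cache = metrics.get("vllm:gpu_cache_usage_perc", 0.0)
    waiting = metrics.get("vllm:num_requests_waiting", 0.0)
    return cache >= cache_threshold and waiting > 0


# Illustrative sample, not a real scrape.
SAMPLE_METRICS = """\
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="demo"} 0.97
vllm:num_requests_waiting{model_name="demo"} 12
vllm:num_preemptions_total{model_name="demo"} 340
"""

metrics = parse_prometheus_text(SAMPLE_METRICS)
print("KV-cache pressure:", kv_cache_pressure(metrics))  # prints "KV-cache pressure: True"
```

This only shows that the cache is under pressure. Naming the tenant responsible is the part the page attributes to kernel-level attribution.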
USE CASE 02
AlphaFold — the GPU that looks idle
DCGM: 20% util. Operator reclaims it.
MSA search is running: massive sequential reads from BFD and UniRef on Lustre. The GPU is waiting on storage. Kernel io-wait proves it's blocked, not free. The fix is the data path — not handing the GPU to another job.
USE CASE 03
Isaac Sim — right GPU, wrong silicon
Scheduler: GPU assigned. Job runs.
The sim runs on any GPU but quietly collapses on cards without RT cores. An A100 has none. Megh gates placement on RT-core capability and surfaces sim-on-sim contention via per-tenant context-switch monitoring.
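Capability-gated placement of this kind can be sketched as a simple filter. The capability table, node names, and function below are illustrative assumptions rather than Megh's scheduler logic. Unknown GPU models are treated as lacking RT cores, which is the conservative choice.

```python
# Whether each GPU model has hardware ray-tracing (RT) cores.
RT_CORE_CAPABLE = {
    "A100": False,
    "H100": False,
    "L40S": True,
    "RTX 6000 Ada": True,
}


def eligible_nodes(nodes: dict, needs_rt_cores: bool) -> list:
    """Return node names whose GPU model satisfies the job's requirement.

    `nodes` maps a node name to its GPU model string.
    """
    if not needs_rt_cores:
        return sorted(nodes)
    return sorted(name for name, model in nodes.items()
                  if RT_CORE_CAPABLE.get(model, False))


# Hypothetical cluster inventory.
cluster = {"n1": "A100", "n2": "L40S", "n3": "H100", "n4": "RTX 6000 Ada"}
print(eligible_nodes(cluster, needs_rt_cores=True))  # prints "['n2', 'n4']"
```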
Pramaan Score

Governance that's
measured, not claimed.

One number. Ten frameworks. Live infrastructure evidence — not self-assessment checkboxes. Every gap ranked by cross-framework impact with a clear remediation path.

NC
Governance policy document missing. Closes 10 controls across CERT-In CI-5.1, DPDPA 1.2/1.5, EU AI Act AIA-2.1, ISO 42001, NIST AI RMF.
OFI
No CVE scan recorded. Ingest a signed .vulnbundle and run scan. Closes CERT-In CI-2.1, NIST MAN-3.1.
C
System operations monitoring — Conformant. Hash-chain audit log + Prometheus telemetry + vulnerability intake all verified.
38.8
/100
Pramaan Score — Live · SOC 2 Type II · 7 Jun 2026
Score reflects live infrastructure evidence collected from cluster nodes at report generation. Not a self-assessment — not fabricated. A qualified auditor must review before reliance.
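The hash-chain audit log cited in the conformant finding above rests on a general technique: each entry's MAC covers the previous entry's MAC, so editing or removing an earlier entry breaks verification of everything after it. A minimal sketch using HMAC-SHA256 follows. The entry layout, key handling, and function names are assumptions for illustration and do not describe Megh's log format.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # placeholder "previous MAC" for the first entry


def _mac(key: bytes, prev_mac: str, event: dict) -> str:
    payload = json.dumps(event, sort_keys=True)   # stable serialisation
    return hmac.new(key, (prev_mac + payload).encode(), hashlib.sha256).hexdigest()


def append_entry(log: list, key: bytes, event: dict) -> None:
    prev_mac = log[-1]["mac"] if log else GENESIS
    log.append({"prev": prev_mac, "event": event, "mac": _mac(key, prev_mac, event)})


def verify(log: list, key: bytes) -> bool:
    prev_mac = GENESIS
    for entry in log:
        expected = _mac(key, prev_mac, entry["event"])
        if entry["prev"] != prev_mac or not hmac.compare_digest(entry["mac"], expected):
            return False
        prev_mac = entry["mac"]
    return True


key = b"demo-key-do-not-use"   # hypothetical; a real key belongs in a secret store
log = []
append_entry(log, key, {"actor": "admin", "action": "grant-role", "role": "auditor"})
append_entry(log, key, {"actor": "admin", "action": "revoke-role", "role": "finance"})
print(verify(log, key))        # prints "True"

log[0]["event"]["role"] = "admin"   # tamper with history
print(verify(log, key))        # prints "False"
```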
FRAMEWORKS COVERED
ISO 42001
NIST AI RMF
SOC 2 TSC
EU AI Act
DPDPA 2023
CERT-In 2025
GxP / 21 CFR 11
MeitY 2025
CAG IS Audit
STQC AI QA
179 unique fixes · 10 frameworks · ranked by impact
Pricing — V2 · June 2026

~1–3% of the GPU spend it sits on.

One reclaimed H100 pays for roughly 50 GPUs monitored. Pricing is a per-GPU subscription metered on peak concurrent GPUs; the SaaS edition carries a $60 node minimum.

Commercial · SaaS
Observe
Agentless. Reads your existing Prometheus and DCGM.
$30/GPU/mo
$60 node minimum · metered on peak concurrent GPUs
  • All 9 telemetry bands fused
  • fd→GPU kernel attribution
  • MFU vs util · io-wait starvation
  • Per-tenant usage · dashboards & alerts
  • DCGM / vLLM / scheduler correlation
Sovereign / Defence
Sovereign
Data sovereignty + provable per-job audit. Air-gapped. India-built.
$30K–60K/site/yr
+ $2K–3K/node/yr · Proof of Audit $10K (creditable)
  • USB / removable-media forensics
  • Retained 12-month queryable audit trail
  • Forensic reconstruction per job/user
  • Air-gap deployment · geographic attestation
  • Founding partner pilot: free (5 slots, Dec 2026)
64+ GPUs −10%  ·  256+ GPUs −20%  ·  1,000+ custom  ·  Volume tiers on aggregate across clusters
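The arithmetic behind these tiers and the "one reclaimed H100" comparison can be sketched as follows. It assumes the volume discounts apply to the $30 SaaS list price and ignores the node minimum. The H100 hourly value is a hypothetical figure chosen for illustration, not a number from this page.

```python
LIST_PRICE_USD = 30  # per GPU per month, SaaS list price quoted above


def discount_percent(gpus: int) -> int:
    """Volume tier from the page: 64+ is 10% off, 256+ is 20% off."""
    if gpus >= 1000:
        raise ValueError("1,000+ GPUs is custom pricing")
    if gpus >= 256:
        return 20
    if gpus >= 64:
        return 10
    return 0


def monthly_cost_usd(gpus: int) -> float:
    return gpus * LIST_PRICE_USD * (100 - discount_percent(gpus)) / 100


# Assumption: one H100 is worth about $2.00 per hour, over ~730 hours a month.
ASSUMED_H100_USD_PER_HOUR = 2.00
HOURS_PER_MONTH = 730
reclaimed_value = ASSUMED_H100_USD_PER_HOUR * HOURS_PER_MONTH

print(f"50 GPUs monitored: ${monthly_cost_usd(50):,.0f}/mo")
print(f"One reclaimed H100 (assumed rate): ${reclaimed_value:,.0f}/mo")
```

Under that assumed rate, the value of one reclaimed H100 is close to the monthly cost of monitoring 50 GPUs at list price.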
Who We Serve

Built for people who run real clusters.
Not cloud-credit holders.

If you own the hardware, share GPUs across teams, and have ever wondered where the hours actually went — this is for you.

☁️
GPU Clouds & Neoclouds
Per-tenant attribution is directly monetisable. Know which tenant holds which GPU, whether it's idle on I/O, and bill accurately.
→ Reclaim idle · bill precisely · enforce SLAs
🧬
Drug Discovery & Life Sciences
AlphaFold and molecular dynamics pipelines are I/O-starved by design. We prove it. 21 CFR Part 11 GxP audit trails built in.
→ GxP · HIPAA · idle ≠ free
🏛
National Defence & Govt
Sovereignty is a mandatory requirement, not a preference. Air-gap deployment, signed bundles, no foreign-jurisdiction cloud dependency.
→ DRDO · C-DAC · CERT-In · DPDPA
🔬
Research & HPC Centres
Shared infrastructure, per-PI accountability, utilisation reports in 10 minutes instead of 3 weeks. Grant reporting that doesn't require a spreadsheet.
→ IIT · IISc · C-DAC · NSM clusters
Not for you if —
You call a hosted LLM API · you run a single-user box · you use managed serverless GPUs · your infrastructure is 100% hyperscaler-managed. No shame — just not the problem we solve.
About

Built in India.
Sovereign by design.

Megh Communications is an India-based infrastructure software company building the governance and observability layer for on-premise AI compute. We exist because the organisations that need this most — national labs, defence programs, research institutions, regulated enterprises — cannot use foreign-jurisdiction SaaS tools, and no one was building for them.

We are currently in beta with a small number of design partners. We move carefully, we say what we mean, and we build in public where we can.

🇮🇳 Bangalore, India
Beta — pilot stage
On-premise · Air-gap
Independent · non-aligned origin
9
Telemetry layers read — Job to Hardware
17
Pre-built detection patterns — ships ready
10
Compliance frameworks — ISO 42001 to DPDPA
179
Unique compliance fixes across 10 frameworks
Get in touch
info@meghcommunications.com
For pilots, partnerships, and early access conversations.

Run the 2-minute test.
Then decide if you need us.

nvidia-smi dmon -s u -d 1

Watch sm% and mem% simultaneously. If high sm% isn't producing the throughput you expect, your dashboards have been hiding where the money goes.
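If you capture the command's output to a file, a short script can average the two columns per GPU so you can set them against your measured throughput. This sketch assumes the usual dmon layout, with a header line starting "# gpu" followed by one row per GPU per sample. Column sets vary by driver version, so the parser reads column names from the header. The sample text is illustrative, not a real capture.

```python
def parse_dmon(text: str) -> list:
    """Parse `nvidia-smi dmon` output into a list of {column: value} rows."""
    columns, rows = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            names = line.lstrip("#").split()
            if names and names[0] == "gpu":       # header row with column names
                columns = names
            continue
        fields = line.split()
        if columns and len(fields) == len(columns):
            rows.append({c: (None if f == "-" else int(f))
                         for c, f in zip(columns, fields)})
    return rows


def average_sm_mem(rows: list) -> dict:
    """Return {gpu_index: (mean sm%, mean mem%)} over all samples."""
    sums = {}
    for r in rows:
        if r.get("sm") is None or r.get("mem") is None:
            continue
        total = sums.setdefault(r["gpu"], [0, 0, 0])
        total[0] += r["sm"]
        total[1] += r["mem"]
        total[2] += 1
    return {gpu: (s / n, m / n) for gpu, (s, m, n) in sums.items()}


SAMPLE_DMON = """\
# gpu     sm    mem    enc    dec
# Idx      %      %      %      %
    0     97      4      0      0
    1     35     28      0      0
    0     95      6      0      0
    1     45     32      0      0
"""

averages = average_sm_mem(parse_dmon(SAMPLE_DMON))
for gpu, (sm, mem) in sorted(averages.items()):
    print(f"GPU {gpu}: sm {sm:.1f}%  mem {mem:.1f}%")
```

Here mem% is the share of time the memory controller was busy, not the amount of memory in use.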

No demo-bot. No funnel. Reply to a real email — or send a DCGM/Prometheus export. Nothing leaves your box beyond the metrics you choose.