MLflow 3 launched at the Databricks Data + AI Summit this month, and it's a significant release. AI agents are now first-class citizens. OpenTelemetry-powered tracing gives you comprehensive visibility into what your AI systems are doing. LoggedModel creates a proper entity for tracking model lifecycle. LLM judges let you evaluate AI output with AI. Git-like prompt versioning brings version control discipline to prompt engineering.
It's a well-engineered product. And it represents a category of solution that's growing fast — Databricks also announced Lakebase, Agent Bricks for multi-agent systems, and serverless GPU compute at the same summit. The observability and tooling ecosystem around AI is maturing rapidly.
But here's the question that MLflow 3's feature list prompts: if you can see everything your AI is doing, does that mean you can prevent it from doing the wrong thing?
The answer is no. And the distinction between observation and prevention is one of the most consequential gaps in AI infrastructure today.
What MLflow 3 Gets Right
Credit where it's due. MLflow has been a pillar of the ML ecosystem since its open-source launch in 2018, and version 3 is a thoughtful evolution.
The OpenTelemetry integration matters because it connects AI observability with the broader observability ecosystem. Teams already using OpenTelemetry for their application stack can now instrument their AI components with the same tools, dashboards, and alerting systems. That's a meaningful reduction in operational complexity.
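To make that concrete, here is a minimal sketch using the plain OpenTelemetry Python SDK rather than any MLflow-specific API; the function names are illustrative stand-ins. The value of the integration is that AI spans like this land in the same collectors, dashboards, and alerting pipelines as the rest of your stack.

```python
# Minimal sketch: instrumenting an LLM call with the standard OpenTelemetry SDK.
# summarize() and call_model() are illustrative placeholders, not MLflow APIs.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))  # swap for your OTLP exporter
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("ai-service")

def call_model(document: str) -> str:
    return document[:100]  # stand-in for a real model client

def summarize(document: str) -> str:
    # One span per LLM call, carrying the attributes your existing dashboards already filter on.
    with tracer.start_as_current_span("llm.summarize") as span:
        span.set_attribute("llm.input_chars", len(document))
        result = call_model(document)
        span.set_attribute("llm.output_chars", len(result))
        return result
```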
Making agents first-class citizens reflects the industry's shift from single-model inference to multi-step, tool-using AI workflows. When your AI system involves multiple models, external tool calls, and multi-step reasoning chains, the tracing infrastructure needs to understand that structure. MLflow 3 does.
LLM judges — using AI to evaluate AI output — is an interesting capability that addresses the scale problem of AI quality assurance. When your AI generates thousands of outputs per hour, human review of every output isn't feasible. Using a separate AI model to evaluate quality is a pragmatic solution.
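The pattern itself is simple enough to sketch. The rubric and judge_client below are hypothetical placeholders, not MLflow's built-in judge metrics; they just show the shape of the idea: a second model grades the first model's output, and only low-scoring outputs go to humans.

```python
# Sketch of the LLM-as-judge pattern in its simplest form.
import json

RUBRIC = (
    "Score the ANSWER for factual accuracy and relevance to the QUESTION "
    "on a 1-5 scale. Respond as JSON: {\"score\": <int>, \"reason\": \"...\"}"
)

def judge(question: str, answer: str, judge_client) -> dict:
    prompt = f"{RUBRIC}\n\nQUESTION: {question}\nANSWER: {answer}"
    raw = judge_client(prompt)   # any callable that sends a prompt to the judge model
    return json.loads(raw)       # in production: validate the schema, handle malformed JSON

# Usage: route only low-scoring outputs to human review instead of reviewing everything.
# if judge(q, a, judge_client)["score"] < 4:
#     send_to_review_queue(q, a)
```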
And Git-like prompt versioning brings version control discipline to what has historically been a chaotic process. Knowing which version of a prompt produced which results, and being able to roll back when a new version degrades performance, is basic engineering hygiene that the AI space has largely lacked.
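As a rough illustration of what that discipline buys, here is a toy registry (hypothetical, not MLflow 3's prompt registry API) in which every prompt revision gets a stable content hash and rollback is a one-line operation.

```python
# Toy prompt registry: content-hashed versions, ordered history, one-line rollback.
import hashlib

class PromptRegistry:
    def __init__(self):
        self.versions = {}   # hash -> prompt text
        self.history = []    # ordered list of hashes
        self.current = None

    def register(self, prompt_text: str) -> str:
        digest = hashlib.sha256(prompt_text.encode()).hexdigest()[:12]
        self.versions[digest] = prompt_text
        self.history.append(digest)
        self.current = digest
        return digest

    def rollback(self, digest: str) -> None:
        if digest not in self.versions:
            raise KeyError(f"unknown prompt version {digest}")
        self.current = digest

registry = PromptRegistry()
v1 = registry.register("Summarize the ticket in two sentences.")
v2 = registry.register("Summarize the ticket in two sentences. Cite the ticket ID.")
# If v2 degrades output quality, the fix is one line:
registry.rollback(v1)
```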
All of this is good engineering. MLflow 3 makes it easier to see what your AI is doing, track how it's changing, and evaluate whether its output is improving or degrading.
What it doesn't do is stop your AI from doing the wrong thing.
The Observation-Prevention Gap
There's an analogy that clarifies this distinction. Observability in AI is like a security camera system. It records everything. It gives you dashboards showing what's happening. It can alert you when something unusual occurs. And it's absolutely essential infrastructure.
But a security camera doesn't lock the door. It doesn't stop an intruder. It documents what happened after the fact, and — if you're watching in real time — it lets you respond. The prevention mechanism is different: it's the lock, the access control system, the physical barrier.
In AI systems, the observation-prevention gap manifests in several specific ways.
The first is the timing problem. Observability tells you what happened. Prevention stops what shouldn't happen. MLflow 3's tracing can show you that an AI agent took an unexpected path through a multi-step workflow. But by the time you see that trace, the agent has already completed the workflow. If the unexpected path resulted in a bad output — a code change that introduces a subtle bug, a response that contradicts policy, a decision that violates a constraint — the damage is done. You know about it, which is better than not knowing. But knowing isn't the same as preventing.
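The difference is easy to state in code. In the sketch below, apply_change, record_trace, and violates_policy are hypothetical stand-ins for whatever your system actually does; what matters is where the check runs relative to execution.

```python
# Hypothetical stand-ins for real system behavior.
def apply_change(change): print(f"applied: {change['id']}")
def record_trace(change): print(f"traced: {change['id']}")
def violates_policy(change): return ["touches prod config"] if change.get("touches_prod") else []

def trace_then_review(change):
    apply_change(change)     # the workflow completes first...
    record_trace(change)     # ...then observability tells you what happened

def gate_then_apply(change):
    violations = violates_policy(change)   # the constraint check runs before execution
    if violations:
        raise PermissionError(f"blocked before execution: {violations}")
    apply_change(change)
    record_trace(change)     # still record it; prevention complements observation, it doesn't replace it
```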
The second is the evaluation problem. LLM judges are a practical solution to the scale of AI evaluation, but they introduce a philosophical challenge: you're using a system with known reliability limitations to evaluate the reliability of another system with the same limitations. If your AI generates a code change and an LLM judge evaluates it as correct, what's your confidence level? The judge might catch obvious errors, but subtle issues — off-by-one errors, race conditions, spec violations that require understanding the broader system architecture — are exactly the kind of thing LLMs struggle with.
The third is the drift problem. Git-like prompt versioning helps you track changes and roll back when things go wrong. But it assumes you notice when things go wrong. If a prompt change causes a 2% increase in subtle errors — not dramatic failures, just a slight degradation in output quality — you might not catch it in time. Observability tools can show you the metrics, but someone or something needs to define the threshold at which a drift becomes a problem. That "something" is a specification — a machine-readable definition of what correct looks like — and MLflow 3 doesn't provide that layer.
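A sketch of that missing layer can be very small. The spec below is hypothetical; the point is that it turns "the metric moved" into "this is now a violation" without waiting for a human to notice the dashboard.

```python
# A machine-readable threshold: the spec defines when drift becomes a problem.
SPEC = {
    "subtle_error_rate": {"baseline": 0.04, "max_relative_increase": 0.25},  # allow up to +25% over baseline
}

def check_drift(metric_name: str, observed: float) -> bool:
    rule = SPEC[metric_name]
    ceiling = rule["baseline"] * (1 + rule["max_relative_increase"])
    return observed <= ceiling

# A drift from 4% to 6% looks minor on a dashboard, but it is an unambiguous
# violation once the spec names a ceiling.
assert check_drift("subtle_error_rate", 0.045) is True
assert check_drift("subtle_error_rate", 0.06) is False
```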
Where Guardrails AI Fits — and Where It Doesn't
It's worth noting that Guardrails AI — the startup, not the generic concept — hit $1.1 million in revenue this month with just a 10-person team. That's meaningful market validation for the idea that AI systems need more than observability. The market is clearly willing to pay for guardrails.
But the guardrails conversation in the industry is still largely focused on input/output filtering — checking whether the AI's input is appropriate and whether its output meets certain criteria. This is important, especially for content safety and policy compliance. But it's only one dimension of what production AI systems need.
Input/output guardrails catch what goes in and what comes out. They don't address what happens in between — the multi-step reasoning, the tool calls, the intermediate decisions that an AI agent makes during a complex task. MLflow 3's tracing can show you what happened in between. But neither MLflow nor most guardrail implementations can enforce constraints on the intermediate steps in real time.
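Enforcing constraints mid-workflow looks roughly like the sketch below: the agent loop routes every tool call through a checkpoint instead of invoking tools directly. The tool names and rules here are hypothetical.

```python
# Sketch: policy checks applied to every intermediate tool call, not just input and output.
ALLOWED_TOOLS = {"search_docs", "read_file"}
WRITE_TOOLS = {"write_file", "run_shell"}

def guarded_tool_call(tool_name: str, args: dict, execute):
    """Run a tool only if it passes mid-workflow policy checks."""
    if tool_name not in ALLOWED_TOOLS | WRITE_TOOLS:
        raise PermissionError(f"unknown tool: {tool_name}")
    if tool_name in WRITE_TOOLS and not args.get("approved"):
        # The agent can still propose the step; it just can't execute it unreviewed.
        raise PermissionError(f"{tool_name} requires approval before execution")
    return execute(tool_name, args)

# Usage: the agent loop calls guarded_tool_call(...) instead of invoking tools directly,
# so constraints apply at every intermediate step rather than only at the boundaries.
```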
What Production AI Actually Needs: The Three-Layer Stack
After building systems that manage over 1,000 AI session handoffs at 92% automation, here's the model I think the industry needs to converge on. Production AI requires three distinct layers, not one.
The first layer is observability — seeing what happened. This is where MLflow 3, Datadog's AI monitoring, and similar tools live. They provide the telemetry, tracing, and dashboards that tell you what your AI is doing. Essential, but not sufficient.
The second layer is guardrails — enforcing constraints on what can happen. This is where input/output filtering, content safety checks, and basic policy enforcement live. Companies like Guardrails AI operate here. Also essential, but focused on the boundaries of AI interactions rather than the internal quality.
The third layer is verification — proving that what happened was correct. This is where specification-driven checks, automated audits, and evidence-based compliance live. It's the least mature layer in the current ecosystem, and it's where the most consequential failures occur. A model that passes input/output guardrails and produces observable telemetry can still generate subtly wrong code, make decisions that violate business logic, or accumulate technical debt that doesn't surface until months later.
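Composed around a single AI step, the three layers look something like the sketch below, where observe(), guard(), and verify() are deliberately thin placeholders. The point is that they are distinct stages with distinct jobs, not one tool with three dashboards.

```python
# Sketch: the three layers composed around one AI step.
def observe(event: dict) -> None:
    print(f"trace: {event}")                  # layer 1: record what happened

def guard(prompt: str) -> None:
    if "password" in prompt.lower():          # layer 2: boundary checks on inputs/outputs
        raise ValueError("blocked by input guardrail")

def verify(output: str, spec: dict) -> bool:
    return len(output) <= spec["max_length"]  # layer 3: check the result against a spec

def run_step(prompt: str, model, spec: dict) -> str:
    guard(prompt)
    output = model(prompt)
    observe({"prompt": prompt, "output": output})
    if not verify(output, spec):
        raise ValueError("output failed verification; do not ship it")
    return output

# Usage with a stand-in model:
result = run_step("Summarize the incident report.", lambda p: "Two services timed out.", {"max_length": 200})
```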
Most organizations have invested in the first layer. Some have invested in the second. Very few have invested in the third. The result is that the average AI deployment can tell you what happened and prevent the most egregious failures, but it can't prove that the output is correct.
The Databricks Ecosystem Question
Databricks' announcements alongside MLflow 3 — Lakebase, Agent Bricks, serverless GPU compute — add up to a comprehensive platform play. The strategic logic is clear: if Databricks can own the data layer, the compute layer, the orchestration layer, and the observability layer, they become the default platform for enterprise AI.
This creates both opportunity and risk for teams building on the Databricks ecosystem. The opportunity is obvious: a well-integrated stack reduces operational complexity. The risk is platform lock-in — and in the AI governance context, platform-specific observability creates platform-specific compliance evidence.
When your audit trail lives entirely within one vendor's ecosystem, your ability to demonstrate compliance is dependent on that vendor's data export capabilities, retention policies, and continued operation. Provider-independent governance infrastructure — systems that capture, verify, and document AI behavior regardless of which platform or model is being used — becomes more important as platform ecosystems consolidate.
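In practice, provider-independent governance can start as something as plain as an append-only evidence log in storage you control. The record fields below are illustrative, not any standard schema.

```python
# Sketch: a portable, vendor-neutral evidence record written to storage you control.
import json, hashlib, datetime

def evidence_record(platform: str, model: str, prompt: str, output: str, verdict: str) -> dict:
    return {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "platform": platform,          # MLflow, Datadog, or anything else upstream
        "model": model,
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "output_sha256": hashlib.sha256(output.encode()).hexdigest(),
        "verdict": verdict,            # e.g. "passed_verification"
    }

def append_evidence(path: str, record: dict) -> None:
    # Append-only JSONL: exportable, auditable, and independent of any vendor's retention policy.
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```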
The Emerging Maturity Model
Here's how I'd assess the current state of AI infrastructure maturity by layer:
Compute and training infrastructure: Mature. AWS, Azure, GCP, and specialized providers offer robust solutions.
Observability and monitoring: Rapidly maturing. MLflow 3, Datadog, Arize, and others provide comprehensive visibility.
Guardrails and safety: Early but growing. Guardrails AI, NVIDIA NeMo Guardrails, and platform-native solutions address input/output filtering.
Verification and compliance: Nascent. Most organizations are still relying on manual review processes designed for human-written code and human-made decisions.
The maturity gap between the top and bottom of this stack is where the real risk lies. Organizations are deploying AI systems with enterprise-grade compute, good observability, and basic guardrails — but verification and compliance infrastructure that hasn't changed since before AI was part of the picture.
MLflow 3 moves the observability layer forward. What needs to move forward next is the verification layer — the infrastructure that doesn't just show you what happened, but proves that what happened was correct.
Looking Ahead
The observability-versus-prevention distinction isn't unique to AI. It's the same distinction that drove the evolution from application logging to application firewalls, from network monitoring to intrusion prevention systems, from financial reporting to financial controls.
In each case, the industry went through a phase where visibility was considered sufficient — "if we can see it, we can manage it." In each case, that assumption eventually proved inadequate, and enforcement infrastructure was built alongside (not instead of) observability infrastructure.
AI is early in this evolution. MLflow 3 is good observability. The industry now needs equally good enforcement. Not instead of observation, but alongside it. Because seeing the problem isn't the same as stopping it.
