The Most Dangerous AI Systems Are the Ones That Look Healthy
Why successful API calls, acceptable latency, and clean dashboards can still hide broken user outcomes, trust erosion, and silent failure.
Blog
Practical essays on monitoring, drift, trust, routing, cost, and the signals that matter once AI systems move past the demo.
Why many teams fail not because of missing tooling, but because nobody truly owns output quality, routing logic, eval drift, or cost anomalies.
Why prompt edits should be treated with the same discipline as operational changes, including versioning, rollout criteria, expected telemetry, and rollback thinking.
Why token counts, latency charts, and model usage graphs often say very little unless they are connected to user outcomes and product semantics.
Why the hardest AI incidents are rarely outages. They are the systems that keep responding, keep looking healthy, and quietly stop being useful.