Build the Substrate First: Why Platform Quality Is Capped by the Model Underneath It

2026·6 min read

platformreliabilityml

Most platforms that reason about a system are really stacks of queries against a model of that system. The recurring lesson is that the model is the constraint: build it as an authoritative product first, keep every layer above it a traceable query, and the quality of everything above is capped by the coverage and freshness of what is beneath.

flowchart TD
  S["telemetry sources
traces, logs, config, data stores"] -->|"fuse, resolve identity and time"| G["dependency graph
source of truth, streaming, query API"]
  G --> E["explanations
deterministic + LLM agent"]
  G --> N["notifications + DPP
route risk to owners"]
  G --> C["hard / soft classifier
is the failure load-bearing?"]
  C -->|"disable one dependency"| F["fault injection
disable one dep, watch SLOs"]
  F -->|"SLO-verified labels"| C
  classDef key fill:#e8f1fb,stroke:#1e1e1e,color:#1e1e1e
  class G key
  linkStyle 5 stroke:#2383E2,stroke-width:2px,color:#2383E2

The graph is the source of truth every layer reads. Explanations make the alerts actionable and the classifier makes them worth subscribing to; the fault-injection loop (blue) feeds SLO-verified labels back, so the classifier stays accurate as the fleet changes underneath it.

A platform is a stack of queries against a model of the system

Any tool that reasons about a running system - an alerting layer, a dependency analyzer, an impact estimator, an explanation engine - is ultimately answering questions about a model of that system. What depends on what, who owns this, what breaks if that fails. The features users see are the queries; the model is what the queries run against. This is easy to miss because the model is rarely the thing anyone is excited to build. The exciting work is the feature on top, so the model gets assembled opportunistically from whatever signals were handy when each feature was added.

That is the mistake worth naming. When you treat the underlying model as a side effect of the features rather than as a product in its own right, the quality of every feature is silently capped by a model nobody is accountable for. A useful discipline is to invert the order: decide what authoritative model the features will query, build that model first with an owner and a stated quality bar, and only then build features as queries against it. The rest of this note is about why that order matters and what goes wrong when you skip it.

Most organizations have a dozen partial models, not one

The reason the model is hard is that the truth about a system is scattered. Different sources each hold a true but partial view. One source knows what calls what when the code runs. Another knows what was declared to depend on what at build time. A third knows ownership and topology from a registry. A fourth knows what is actually talking to what right now from live telemetry. Each is correct about its own slice and blind outside it. A call site can exist in code but never fire in production; a live edge can have no declared counterpart anywhere.

Build your platform on any single one of these and you inherit its specific blind spots wholesale, usually without realizing which ones. The harder and more valuable path is fusion: reconcile the sources into one authoritative model, and when two of them disagree, resolve the conflict explicitly rather than letting whichever source you happened to query win by accident. Carry provenance so any element of the model can be traced back to the signals that produced it. This is general data-platform work - the same shape applies to a configuration management database, a service inventory, an asset graph, or any canonical representation fused from overlapping feeds. The point is not the specific feed set; it is that one reconciled model with provenance beats a dozen partial ones queried ad hoc, and that reconciliation is the work.

Keep every layer above it a traceable query, not a fresh inference

Once the model exists, the discipline that keeps the platform honest is that each layer above expresses itself as a query against the model rather than re-deriving the world from raw signals. An alerting layer should resolve the affected owner from the model's ownership data and name the specific element at risk, not broadcast a generic degradation to everyone. An impact estimate should walk the model's edges rather than guess. The value of this is precision: a layer that routes a precise, owner-scoped, actionable signal earns attention, while one that fires broadly trains its audience to ignore it, and an ignored signal is worse than none because it costs attention and builds nothing.

The same traceability discipline carries over when a language model enters the picture as an explanation or summarization layer. The durable pattern is to keep the model on a short leash. Put deterministic tools underneath it - queries that walk the model, fetches that pull the actual source, attribution that ties a fact to the change that introduced it - and have those tools return reproducible facts. The language model sits on top as an agent harness that orchestrates explicit tool calls: it decides which tool to call, in what order, reads what comes back, and assembles a human-readable account. The retrieval here is auditable tool calls against a known model and the code, not similarity search over an embedded corpus, which is why every factual claim in the output can be traced back to a tool result rather than to the model's recollection. Lowering the model's sampling temperature makes these runs reproducible and reduces hallucinated paths; it is a reproducibility lever, not a correctness guarantee, so the deterministic tools stay the source of truth and the model stays the narrator.

If the model cannot verify itself, grade it against observed truth

A model of a system invites judgments that are cheap to compute and hard to confirm. Consider classifying, across a large fleet, whether each dependency is hard - its caller goes down when it fails - or soft, degrading through a cache, a fallback, or a feature that quietly turns off. A language model can read a call site and reason fluently to a label, and a prediction like that is cheap relative to running a physical experiment per element, so it can cover the entire fleet from source. But it cannot verify itself: a model can reason its way to a confident answer and still be wrong, because real behavior lives in retry policy, timeouts, and fallbacks that source does not always make legible.

Where a judgment has an observable ground truth, the answer is to measure it rather than trust the inference. That is a topic in its own right - I have written separately about pairing a cheap predictor with an expensive verifier that observes outcomes by breaking one thing at a time and watching the SLOs, and treating their disagreements as the highest-value signal. The point here is only its place in the picture: the verifier grades the predictor against observed reality, and the observed labels flow back to correct the predictor's reference set and thresholds through eval and tuning, with no model weights trained. The reason this is even possible is the substrate again - you can only run a clean per-element experiment, or attribute a result back to a specific part of the system, if you have an authoritative model that says what that part is and what it touches.

Build in dependency order, and keep the substrate fresh

The throughline is an ordering principle, not a fixed architecture. Whatever a given platform's features are, they sit on a model of the system, and that model caps their quality. So build the model first as an owned product with a stated coverage and freshness bar, then build each feature as a traceable query against it, and where a feature makes an unverifiable judgment, grade it against observed truth and feed the corrections back. This is the opposite of the common pattern where detection, dashboards, runbooks, and analyzers are bolted together from whatever signals were on hand and no layer can fully trust the data beneath it.

The part that is easy to underweight is freshness. Systems change continuously - dependencies get added, fallbacks rot, ownership moves, a retry policy shifts - so a model built once and trusted forever decays silently as the system moves underneath it, and every feature querying it decays with it. Treating the model as a product with an SLA, rather than a side effect of a dashboard, is what keeps the decay in check. The same logic applies to any judgment layer graded against reality: corrected on a cadence, it stays good; graded once, it goes stale. For anyone building one of these, the transferable rule is the order itself - substrate first, features as queries, judgments verified against observed truth - because the substrate's coverage and freshness set the ceiling for everything you put above it.

Takeaways

A platform that reasons about a system is really a stack of queries against a model of that system. The model is the constraint, so build it first as an owned product with a stated coverage and freshness bar - not as a side effect of the features that query it.
The truth about a system is scattered across sources that are each true but partial - runtime, build-time, ownership, live telemetry. Fuse them into one authoritative model with provenance and resolve disagreements explicitly, rather than building on a single source and inheriting its blind spots.
Keep each layer above the model a traceable query, not a fresh inference. Route precise, owner-scoped, actionable signals instead of broadcasts, and when a language model narrates, run it as an agent harness over deterministic tools - auditable tool calls, not similarity search over embeddings - so every factual claim traces back to a tool result.
Lowering a model's sampling temperature buys reproducibility and reduces hallucinated paths; it is not a correctness guarantee. The deterministic tools remain the source of truth and the model remains the narrator.
Where a judgment has an observable ground truth, grade it against measured reality and feed the corrections back through eval and tuning, with no weights trained - and remember the substrate is what makes both the experiment and the attribution possible. Build in dependency order, keep the substrate fresh, and its coverage caps the ceiling for everything above it.