SLO-Driven Risk: Turning "Is This Scary?" Into a Number

2026 · 5 min read

slo/sli · reliability · risk

Reliability effort gets spent on whatever feels scary in the room. Here is how to replace that gut feel with a defensible number built from SLOs, error budgets, and a clean hard-versus-soft dependency distinction - so you can prove where the risk is and justify what you do about it.

flowchart TD
  D["dependency fails"] -->|"no fallback"| H["HARD<br/>breaches error budget"]
  D -->|"fallback"| S["SOFT<br/>degrades gracefully"]
  classDef hot fill:#fde7e7,stroke:#1e1e1e,color:#1e1e1e
  classDef good fill:#eaf6ec,stroke:#1e1e1e,color:#1e1e1e
  class H hot
  class S good
  linkStyle 0 stroke:#c92a2a,color:#c92a2a
  linkStyle 1 stroke:#2f9e44,color:#2f9e44

A dependency is hard if its failure breaches the caller's error budget and soft if the caller absorbs it. Use fault injection to measure which it is, and spend reliability effort on the hard ones.

The problem: criticality is usually a vibe

Ask a team which of their dependencies are critical and you will get a confident answer that nobody can defend. It is shaped by the last incident, by which service has a scary name, by who shouted loudest in the last review. The trouble is that this gut-feel ranking drives real spending - retries, replication, failover drills, on-call load - and it is wrong in both directions at once. Teams pour effort into a dependency that would degrade gracefully if it vanished, and leave a quiet single point of failure untouched because it never broke loudly enough to earn attention.

The fix is not better intuition. It is to make criticality a measured property instead of an opinion. That means picking a definition of harm you can observe, expressing it as a number, and ranking dependencies by how much they move that number. When criticality is a measurement, two engineers who disagree can resolve it by looking at the same evidence, and a reliability investment can be justified to someone holding a budget.

The unit of measurement: SLOs, SLIs, and the error budget

You cannot turn risk into a number without first agreeing on what "working" means. That is what a service-level objective gives you. An SLI is the thing you actually measure - success rate, latency below a threshold, freshness of a result. The SLO is the target you hold that indicator to over a window. The gap between the target and a perfect score is the error budget: the amount of failure you have explicitly decided is acceptable.
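As a worked example of that gap, here is a minimal sketch that turns an SLO target into an error budget. The function names, the 99.9% target, the 30-day window, and the request count are illustrative assumptions, not values from any particular system.

```python
def error_budget_fraction(slo_target: float) -> float:
    """Share of events allowed to fail: the gap between the target and 100%."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    return 1.0 - slo_target


def allowed_bad_minutes(slo_target: float, window_days: int) -> float:
    """Error budget expressed as minutes of total failure over the window."""
    window_minutes = window_days * 24 * 60
    return error_budget_fraction(slo_target) * window_minutes


def allowed_bad_requests(slo_target: float, total_requests: int) -> float:
    """Error budget expressed as a count of failed requests over the window."""
    return error_budget_fraction(slo_target) * total_requests


if __name__ == "__main__":
    # Assumed example: a 99.9% success-rate SLO over a 30-day window.
    print(f"{allowed_bad_minutes(0.999, 30):.1f} minutes")            # 43.2 minutes
    print(f"{allowed_bad_requests(0.999, 10_000_000):.0f} requests")  # 10000 requests
```

The same budget can be read as time or as requests. Which reading is more useful depends on whether the SLI is availability-style or request-based.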

The error budget is the key idea, because it converts reliability from a moral question ("should this ever fail?") into an accounting one ("how much failure can this absorb before users notice?"). A dependency that, when it misbehaves, consumes a tiny slice of the budget is cheap to tolerate. One that blows through the whole budget the moment it stutters is expensive. Now risk has a denominator. "Is this scary?" becomes "how much of the caller's error budget does this dependency put at stake, and how often?" - and that is a question with an answer you can compute and revisit as the system changes.
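To make the denominator concrete, the sketch below estimates what share of the budget a single dependency incident consumes. It assumes traffic is spread evenly over the window. The 45-minute outage and the 2% failure fraction are made-up inputs chosen only to contrast a cheap failure with an expensive one.

```python
def budget_share_burned(
    slo_target: float,
    window_minutes: float,
    outage_minutes: float,
    failed_fraction: float,
) -> float:
    """Share of the error budget one incident consumes (1.0 = the whole budget).

    failed_fraction is the share of the caller's requests that actually fail
    while the dependency is down. Assumes uniform traffic over the window.
    """
    budget_minutes = (1.0 - slo_target) * window_minutes
    bad_minutes = outage_minutes * failed_fraction
    return bad_minutes / budget_minutes


if __name__ == "__main__":
    window = 30 * 24 * 60  # 30-day window, in minutes

    # Every request fails for 45 minutes: more than the whole budget.
    print(f"{budget_share_burned(0.999, window, 45, 1.0):.0%}")   # 104%

    # Same outage, but only 2% of requests fail: a sliver of the budget.
    print(f"{budget_share_burned(0.999, window, 45, 0.02):.1%}")  # 2.1%
```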

The distinction that makes the number useful: hard versus soft

The single most clarifying cut is whether a dependency is hard or soft. A hard dependency takes its caller down when it fails - there is no fallback, no cache to serve stale from, no feature to quietly switch off. A soft dependency degrades gracefully: the caller absorbs the loss inside its error budget through a fallback path, a cached value, a default, or a feature that turns off without taking the request with it.

This distinction matters because it is where the budget is won or lost. A hard dependency passes its failures straight through to your SLI: when it breaks, every request that needs it fails, and your indicator drops by that full share. A soft dependency is a shock absorber, and the size of that absorber is exactly the headroom in the error budget. The same external service can be hard for one caller and soft for another depending entirely on what the caller does when it does not get an answer. So the classification is not a property of the dependency in isolation - it is a property of the edge between two services, and that is the right granularity to reason about. Get the hard set right and you know where a single failure becomes an outage. Everything else is a candidate for being left alone.
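A small sketch of why the classification belongs to the edge: the same failing dependency is soft for one caller and hard for another. Every name here (`fetch_recommendations`, the two render functions, `handle`) is hypothetical and stands in for whatever your services actually do.

```python
class DependencyDown(Exception):
    """Raised when the (hypothetical) recommendations service gives no answer."""


def fetch_recommendations(user_id):
    # Simulate the dependency being unavailable.
    raise DependencyDown("recommendations unavailable")


def render_product_page(user_id):
    # Soft edge: the page still renders, just without recommendations.
    try:
        recs = fetch_recommendations(user_id)
    except DependencyDown:
        recs = []  # default value absorbs the failure
    return {"user": user_id, "recommendations": recs}


def render_recommendation_email(user_id):
    # Hard edge: with no fallback, the failure propagates to the caller.
    recs = fetch_recommendations(user_id)
    return {"user": user_id, "recommendations": recs}


def handle(render, user_id):
    """Turn a render call into a status code, as a request handler might."""
    try:
        return 200, render(user_id)
    except DependencyDown:
        return 503, None


if __name__ == "__main__":
    print(handle(render_product_page, "u1")[0])          # 200
    print(handle(render_recommendation_email, "u1")[0])  # 503
```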

Spending the effort where the number says to

Once every dependency edge carries an estimate of how much error budget it puts at risk and how often, prioritization stops being a debate. Rank the edges by expected budget burned - the share of error budget an edge puts at risk times how often it fails - and spend from the top. The high-traffic hard edges, the ones with no fallback, are where retries, circuit breakers, replication, and failover drills actually buy down risk. The soft edges near the bottom are where the same effort buys almost nothing, and the honest move is to leave them alone and write down why.
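The ranking itself is a one-line sort once each edge carries the two estimates. The sketch below assumes a simple `Edge` record. The service names and numbers are invented for illustration, and the per-quarter failure rate is just one possible choice of unit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    caller: str
    dependency: str
    budget_share_at_risk: float  # share of the caller's budget one failure burns (1.0 = all of it)
    failures_per_quarter: float  # how often the dependency is expected to fail

    @property
    def expected_budget_burned(self) -> float:
        """Share of budget at risk times how often the edge fails."""
        return self.budget_share_at_risk * self.failures_per_quarter


def rank_edges(edges):
    """Highest expected budget burn first."""
    return sorted(edges, key=lambda e: e.expected_budget_burned, reverse=True)


if __name__ == "__main__":
    # Illustrative estimates only.
    edges = [
        Edge("home", "recommendations", 0.02, 6.0),
        Edge("checkout", "payments-db", 1.00, 2.0),
        Edge("checkout", "fraud-score", 0.40, 1.0),
    ]
    for e in rank_edges(edges):
        print(f"{e.caller} -> {e.dependency}: {e.expected_budget_burned:.2f}")
```

In this example the frequently failing but soft `recommendations` edge lands at the bottom, below an edge that fails a sixth as often.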

The discipline cuts the other way too. If you are about to add a retry policy or a standby replica, you should be able to say which SLO it protects and how much budget it preserves. If you cannot, you are spending on a feeling. This is also how you defend the work upward: "this edge is hard, it carries a large share of our traffic, and an outage here spends the whole quarter's budget in minutes" is an argument a non-engineer can evaluate. "It seems risky" is not. The number does not replace judgment - it forces the judgment to be explicit and reviewable, which is what makes it possible to say no to plausible-sounding work that protects nothing.

Keeping the number honest over time

A risk number is only as good as the day it was computed. Services change, fallbacks get added and then rot, traffic shifts, and a soft dependency silently becomes hard when the fallback it relied on - a cache, a default - is removed by someone who did not know it was load-bearing. A criticality ranking that is computed once and trusted forever decays into the same gut feel it replaced - just with a spreadsheet for cover.

The defense is to tie the ranking to things you can re-observe rather than to a one-time judgment. Anchor each classification to evidence - the actual call site, the present fallback behavior, the measured budget impact - so a label can be re-derived when the code moves under it. Where you can, verify a classification by injecting a fault that disables exactly that edge in a controlled, non-production setting and watching whether the caller's SLI moves: if disabling it breaches the budget, it is hard, and you have an observed outcome rather than an inference. You do not need to re-verify everything; you need to re-verify enough of the boundary cases to keep the hard-soft line calibrated. That is the difference between a risk model that was right on the day it shipped and one that stays right as the system moves underneath it.
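The verification step can be sketched as a toy harness: swap the real dependency for one that always fails, replay requests through the caller, and compare the resulting SLI to the target. Everything below is an assumption for illustration, including the handler signatures and the in-process fault. Comparing the SLI measured during the fault directly to the SLO target is also a simplification of "breaches the budget", and a real test would run against a non-production environment with its own tooling.

```python
from typing import Callable, Sequence

Dependency = Callable[[str], str]
Handler = Callable[[str, Dependency], str]


def disabled(_request: str) -> str:
    """Injected fault: the dependency behind this one edge always fails."""
    raise ConnectionError("fault injected")


def measure_sli(handler: Handler, dependency: Dependency, requests: Sequence[str]) -> float:
    """Fraction of requests the caller serves without raising."""
    if not requests:
        raise ValueError("need at least one request to measure an SLI")
    good = 0
    for request in requests:
        try:
            handler(request, dependency)
            good += 1
        except Exception:
            pass  # counted as a failed request
    return good / len(requests)


def classify_edge(handler: Handler, requests: Sequence[str], slo_target: float) -> str:
    """Label the edge by what the caller's SLI does with the dependency disabled."""
    sli = measure_sli(handler, disabled, requests)
    return "hard" if sli < slo_target else "soft"


def price_lookup(request: str, dependency: Dependency) -> str:
    return dependency(request)  # no fallback: the failure reaches the caller


def banner_lookup(request: str, dependency: Dependency) -> str:
    try:
        return dependency(request)
    except ConnectionError:
        return "default-banner"  # fallback absorbs the failure


if __name__ == "__main__":
    sample = [f"req-{i}" for i in range(1000)]
    print(classify_edge(price_lookup, sample, 0.999))   # hard
    print(classify_edge(banner_lookup, sample, 0.999))  # soft
```

Re-running a check like this when the call site changes is what catches a fallback that was removed without anyone noticing.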

Takeaways

Replace gut-feel criticality with a measured quantity: pick an SLI, set an SLO, and rank dependencies by how much of the error budget they put at risk and how often. "Is this scary?" becomes a number you can compute and defend.
The error budget is the denominator that makes risk an accounting question instead of a moral one. A dependency that burns a sliver of budget is cheap to tolerate; one that can spend it all in minutes is where your effort belongs.
Hard versus soft is the highest-value distinction, and it lives on the edge between two services, not on a service in isolation. A hard dependency passes its failures straight to your SLI; a soft one absorbs them inside the budget through a fallback. The same dependency can be hard for one caller and soft for another.
Prioritize reliability spending top-down by expected budget burned - the share of error budget an edge puts at risk times how often it fails. Every retry, replica, or failover should name the SLO it protects and the budget it preserves; if it cannot, you are spending on a feeling.
A risk ranking decays as fallbacks rot and traffic shifts. Anchor each classification to re-observable evidence, and verify the boundary cases by injecting a fault that disables exactly that edge in a controlled setting - re-verify enough to keep the hard-soft line calibrated, not everything.