The Service Graph Is a Lower Bound: Finding the Dependencies No RPC Edge Shows

2026·5 min read

distributed systems · observability

The dependency graph you draw from RPC traffic is real but incomplete. The dependencies that take you down are usually the ones no edge in that graph represents - shared config, shared stores, shared infrastructure - and you have to go looking for them on purpose.

flowchart TD
  A["service A"] -->|"RPC"| B["service B"]
  B -->|"RPC"| C2["service C"]
  A -.->|"shared"| S["shared store / config"]
  B -.-> S
  C2 -.-> S
  classDef hot fill:#fde7e7,stroke:#1e1e1e,color:#1e1e1e
  class S hot
  linkStyle 2 stroke:#c92a2a,color:#c92a2a
  linkStyle 3 stroke:#c92a2a,color:#c92a2a
  linkStyle 4 stroke:#c92a2a,color:#c92a2a

The RPC call graph (solid) is a lower bound on coupling. The incidents that surprise you come from fate-sharing on a shared store or config (dashed, red) that no call edge ever shows.

The graph you have is the graph of calls, not the graph of coupling

Most service-dependency graphs are built from one source: who calls whom. You scrape RPC traces or mesh telemetry, draw an edge from caller to callee, and you get a clean directed graph that everyone treats as the dependency map. It is genuinely useful. It is also a partial picture, and the part it leaves out is the part that hurts.

A call edge captures one kind of coupling - synchronous request-response. But two services with no edge between them can still fail together. They read the same config store. They share a database, or a cache, or a feature-flag system. One writes to a queue the other drains. They sit on the same host pool, the same availability zone, the same network path, the same secret-rotation job. None of that shows up as an arrow, because no request crosses between them. The graph says they are unrelated. The next incident says otherwise.

I have come to treat the call graph as a lower bound on coupling, never the whole of it. When someone says two systems are independent because nothing connects them in the service map, the honest answer is that nothing connects them in one view of the map.

The incidents nobody predicted live in the implicit edges

Walk back through the incidents that surprised people - not the ones where a known critical dependency fell over, but the ones where the post-mortem opened with some version of "we did not think those two things were related." Almost all of them are implicit edges.

A config push goes out and a dozen unrelated services degrade at once, because they all read the same key and the push was bad. A cache everyone treated as a soft optimization goes cold after a restart, a mass expiry, or a node loss; the read traffic stampedes the origin, and a system with no edge to the cache owner falls over from the secondary effect. A shared queue backs up, and a consumer you forgot was downstream starts missing its deadlines. A single data store hosts tables for several teams, and one team's runaway query starves everyone else's reads.

The common shape is fate-sharing on a resource that the call graph never names. The blast radius does not follow request paths - it follows the shared thing. And because the shared thing is shared, the failure arrives at several services simultaneously, which is exactly the pattern that makes an incident hard to diagnose: lots of unrelated-looking alerts, no obvious caller-callee chain to walk up. The graph offered no warning because the coupling was never an edge in it.

Surfacing the hidden edges starts with reconciling identity across signals

You cannot model an edge you cannot see, and the reason these edges are invisible is usually mundane: the signals that would reveal them live in different systems that name things differently. The config system knows a config key and the principals that read it. The storage layer knows a database instance and the clients connected to it. The deploy system knows a service name. The trace pipeline knows yet another identifier. Same underlying service, four different names, no join key.

So the first real work is identity reconciliation - building a mapping that says this service, this deploy unit, this database client, and this config consumer are the same thing. It is unglamorous and it is most of the battle. Once you can resolve identity across signals, the implicit edges start to fall out almost mechanically: enumerate everything that reads a given config key and you have the set of services that share its fate; enumerate every client of a data store and you have a shared-fate cluster; enumerate every consumer of a queue and you have the async dependencies the call graph dropped.
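A minimal sketch of that join, assuming a hand-maintained alias table. The signal sources (`config`, `storage`, `deploy`, `trace`), every identifier, and the helpers `canonical` and `shared_fate_clusters` are hypothetical names chosen for illustration, not taken from any real system.

```python
from collections import defaultdict

# Hypothetical alias table: (signal source, local identifier) -> canonical service.
# Building this mapping is the hard part; it is hard-coded here for illustration.
ALIASES = {
    ("config", "principal:checkout-svc"): "checkout",
    ("storage", "db-client-7f3a"): "checkout",
    ("deploy", "checkout-api"): "checkout",
    ("trace", "checkout.v2"): "checkout",
    ("config", "principal:search-svc"): "search",
    ("storage", "db-client-19c2"): "search",
}


def canonical(source, local_id):
    """Resolve a per-system identifier to one canonical name, or None if unmapped."""
    return ALIASES.get((source, local_id))


# Hypothetical raw observations: (signal source, local identifier, shared resource).
OBSERVATIONS = [
    ("config", "principal:checkout-svc", "config:rate-limits"),
    ("config", "principal:search-svc", "config:rate-limits"),
    ("storage", "db-client-7f3a", "db:orders-primary"),
    ("storage", "db-client-19c2", "db:orders-primary"),
    ("storage", "db-client-unknown", "db:orders-primary"),
]


def shared_fate_clusters(observations):
    """Group canonical services by the resource they touch; keep unmapped ids aside."""
    clusters = defaultdict(set)
    unresolved = []
    for source, local_id, resource in observations:
        service = canonical(source, local_id)
        if service is None:
            # An identifier with no join key is a gap in the map, not a non-dependency.
            unresolved.append((source, local_id))
        else:
            clusters[resource].add(service)
    return dict(clusters), unresolved


clusters, unresolved = shared_fate_clusters(OBSERVATIONS)
```

Keeping the unresolved identifiers visible is a deliberate choice in this sketch: a client that cannot be mapped to a service is exactly where a hidden edge may still be sitting.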

The discipline that pays off is to enumerate from the shared resource outward, not from the service inward. Asking "what does this service depend on?" tends to surface only the calls it makes. Asking "who else touches this store, this key, this host pool?" surfaces the siblings that will fail alongside it.
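The difference between the two directions can be sketched as an index inversion. This assumes identities are already reconciled; the inventory `TOUCHES`, the resource names, and the functions `invert` and `siblings` are hypothetical.

```python
from collections import defaultdict

# Hypothetical service-inward inventory: what each service touches.
TOUCHES = {
    "checkout": {"db:orders-primary", "config:rate-limits"},
    "search": {"config:rate-limits", "pool:general-a"},
    "reports": {"db:orders-primary", "pool:general-a"},
    "auth": {"config:auth-keys"},
}


def invert(touches):
    """Flip service -> resources into resource -> services (the outward view)."""
    by_resource = defaultdict(set)
    for service, resources in touches.items():
        for resource in resources:
            by_resource[resource].add(service)
    return by_resource


def siblings(service, touches):
    """For each resource a service touches, list the other services sharing it."""
    by_resource = invert(touches)
    result = {}
    for resource in touches.get(service, ()):
        others = by_resource[resource] - {service}
        if others:
            result[resource] = others
    return result


checkout_siblings = siblings("checkout", TOUCHES)
```

In this toy inventory, `checkout` has no call relationship with `reports` or `search`, yet the outward view pairs it with both.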

Model indirect edges as first-class, and label them by how they propagate failure

Once you have the identities reconciled, resist the urge to flatten everything into the same kind of arrow. A direct call and a shared-database relationship propagate failure differently, and a graph that draws them identically will mislead you in a new way.

I find it useful to type the edges. A direct synchronous edge fails fast and propagates upstream along the request path. A shared-store or shared-config edge propagates sideways - to every sibling at once, with no caller-callee ordering. An async-queue edge propagates with delay and can absorb a burst before it breaks, which changes both the symptom and the time-to-impact. A pure fate-sharing edge, same host pool or same zone, has no application-level call relationship at all; the coupling is at the resource or failure-domain level - shared CPU, memory, network, power, or blast radius - so it should be modeled as fate-sharing rather than forced into the shape of a call.
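One way to carry those four types in a model, as a sketch rather than a prescribed schema. The names `EdgeKind`, `Edge`, `PROPAGATION`, and `edges_of_kind` are hypothetical, and the propagation strings simply paraphrase the descriptions above.

```python
from dataclasses import dataclass
from enum import Enum


class EdgeKind(Enum):
    SYNC_CALL = "sync_call"              # direct request-response
    SHARED_RESOURCE = "shared_resource"  # shared store or shared config
    ASYNC_QUEUE = "async_queue"          # producer and consumer joined by a queue
    FATE_SHARING = "fate_sharing"        # same host pool or zone, no call at all


# How each kind propagates failure, paraphrasing the text above.
PROPAGATION = {
    EdgeKind.SYNC_CALL: "fails fast, upstream along the request path",
    EdgeKind.SHARED_RESOURCE: "sideways, to every sibling at once",
    EdgeKind.ASYNC_QUEUE: "delayed, after the queue absorbs a burst",
    EdgeKind.FATE_SHARING: "via the shared failure domain, not a call",
}


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    kind: EdgeKind

    def propagation(self) -> str:
        """Describe how a failure travels along this edge."""
        return PROPAGATION[self.kind]


# Hypothetical edges echoing the diagram: A calls B, and both share a store.
EDGES = [
    Edge("service A", "service B", EdgeKind.SYNC_CALL),
    Edge("service A", "shared store", EdgeKind.SHARED_RESOURCE),
    Edge("service B", "shared store", EdgeKind.SHARED_RESOURCE),
]


def edges_of_kind(edges, kind):
    """Filter edges by type so each propagation mode can be reasoned about separately."""
    return [e for e in edges if e.kind is kind]
```

Keeping the kind on the edge, rather than in a separate graph per type, lets one query span all coupling while still distinguishing how each edge fails.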

The practical payoff of typed indirect edges is that the graph starts answering the question that actually matters during an incident: if this thing degrades, what set of services moves together, and in what order. That is a different and more honest question than what calls what. It is also worth being clear about confidence - an edge inferred from shared config is a strong signal; an edge inferred from co-location is weaker and more circumstantial. Carry that uncertainty in the model instead of laundering an inference into a fact, because an over-confident hidden edge can send responders down the wrong path just as easily as a missing one.
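A hedged sketch of that incident-time query. The edge list, the numeric confidence scores, and the function `moves_together` are all hypothetical; the only ordering rule borrowed from the text is that async-queue edges are felt later than the rest, and sorting by confidence within each group is an assumption of this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferredEdge:
    resource: str
    service: str
    kind: str          # "shared_store", "shared_config", "async_queue", "co_location"
    confidence: float  # 0.0 to 1.0, a hypothetical score set when the edge was inferred


# Kinds whose impact arrives with delay, per the typing described above.
DELAYED_KINDS = {"async_queue"}

# Hypothetical edges around one database.
EDGES = [
    InferredEdge("db:orders-primary", "checkout", "shared_store", 0.9),
    InferredEdge("db:orders-primary", "search", "shared_store", 0.9),
    InferredEdge("db:orders-primary", "reports", "async_queue", 0.8),
    InferredEdge("db:orders-primary", "batch", "co_location", 0.3),
    InferredEdge("config:rate-limits", "auth", "shared_config", 0.9),
]


def moves_together(resource, edges, min_confidence=0.0):
    """Return edges for a degraded resource: immediate before delayed, strongest first."""
    hits = [
        e for e in edges
        if e.resource == resource and e.confidence >= min_confidence
    ]
    # False sorts before True, so non-delayed kinds come first.
    return sorted(hits, key=lambda e: (e.kind in DELAYED_KINDS, -e.confidence, e.service))


everyone = [e.service for e in moves_together("db:orders-primary", EDGES)]
confident = [e.service for e in moves_together("db:orders-primary", EDGES, min_confidence=0.5)]
```

Returning the full edge, confidence included, leaves it to the responder to decide how much weight a co-location inference deserves instead of hiding that judgment inside the query.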

Takeaways

Treat the RPC call graph as a lower bound on coupling, not the full dependency map. Two services with no edge between them can still fail together through shared config, shared stores, caches, queues, or shared infrastructure.
The incidents that surprise people are usually implicit edges - fate-sharing on a resource the call graph never names. The blast radius follows the shared thing, not the request path, which is why several unrelated-looking systems break at once.
Surfacing hidden edges is mostly an identity-reconciliation problem: the same service is named differently across the config, storage, deploy, and trace systems. Build the join first; the edges fall out once you can resolve identity across signals.
Enumerate from the shared resource outward - who else touches this store, key, or host pool - not from the service inward, which only re-discovers the calls you already had.
Model indirect edges as first-class and type them by how they propagate failure (synchronous-upstream, shared-sideways, async-delayed, pure co-location). Carry the confidence of each inferred edge so a weak signal is not laundered into a fact.