Dependency Service: one graph to trust

case study · 2024 - present · 10 min read

rust · streaming · graphs · reliability

Summary - A Rust streaming system that fuses six telemetry sources (~20 GB/s) into one canonical service-dependency graph. It became the source of truth behind dependency notifications, a protection program adopted by 1,000+ services, and a fault-injection loop. Details below are simplified and shared at a level consistent with public engineering writing.

Context

At hyperscale, "which services depend on which" sounds like a solved problem. It isn't. Six different telemetry systems each held a partial, mutually inconsistent picture: RPC traces, mesh flow logs, network connections, client-library declarations, configuration intent, and routing topology. During incidents - the only time anyone urgently needs this data - engineers got six different answers.

The problem, precisely

Coverage gaps: sampled tracing misses low-QPS but critical calls; the mesh can't see sidecar-bypassing traffic.
Staleness: declared dependencies outlive real ones - nobody deletes old client configs.
Identity mismatch: network-level data sees endpoints, not services.
Trust: with six sources disagreeing, every consumer built their own ad-hoc reconciliation - badly.

Architecture

 rpc-traces ─┐
 mesh-flows ─┤                       ┌─→ notifications
 net-conns ──┤   ┌──────────────┐    │
 client-libs ┼──▶│ stream fusion │───┼─→ protection program
 config-decl─┤   │  (rust, ~20  │    │
 lb-routes ──┘   │   GB/s)      │    └─→ risk classifier
                 └──────┬───────┘
                        ▼
              canonical dependency graph
              (evidence + confidence per edge)

six sources in, one graph out - every edge carries its evidence

The pipeline is a Rust streaming service. Each telemetry source is normalized into Observation events; edges accumulate observations rather than verdicts:

// simplified: edges carry evidence, not assertions
struct Edge {
    src: ServiceId,
    dst: ServiceId,
    evidence: SmallVec<[Observation; 6]>,
    confidence: f32,   // derived continuously, never asserted
    last_seen: Timestamp,
}
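
To make "accumulate observations rather than verdicts" concrete, here is a minimal, std-only sketch of folding an observation into an edge. The `Source` enum, the `observe` method, the integer type aliases, and the one-slot-per-source rule are all assumptions made for illustration; the real service uses `SmallVec` and its own types, and the `confidence` field is left out here because it is derived separately.

```rust
// Hypothetical stand-ins for the types named in the struct above.
type ServiceId = u32;
type Timestamp = u64; // seconds since epoch, for illustration only

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Source {
    RpcTraces,
    MeshFlows,
    NetConns,
    ClientLibs,
    ConfigDecl,
    LbRoutes,
}

#[derive(Clone, Copy, Debug, PartialEq)]
struct Observation {
    source: Source,
    seen_at: Timestamp,
}

#[derive(Debug)]
struct Edge {
    src: ServiceId,
    dst: ServiceId,
    evidence: Vec<Observation>, // the original uses SmallVec<[Observation; 6]>
    last_seen: Timestamp,
}

impl Edge {
    fn new(src: ServiceId, dst: ServiceId) -> Self {
        Edge { src, dst, evidence: Vec::new(), last_seen: 0 }
    }

    /// Record an observation. Assumption: keep only the newest
    /// observation per source, so evidence never exceeds six entries.
    fn observe(&mut self, obs: Observation) {
        match self.evidence.iter_mut().find(|o| o.source == obs.source) {
            Some(existing) => {
                if obs.seen_at > existing.seen_at {
                    *existing = obs;
                }
            }
            None => self.evidence.push(obs),
        }
        self.last_seen = self.last_seen.max(obs.seen_at);
    }
}
```

Nothing in `observe` decides whether the edge is "real"; it only records who saw it and when, which leaves the verdict to whatever derives confidence from the evidence.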

Three design decisions did most of the work:

Model evidence, not facts. Consumers see why an edge exists and how confident the system is - an edge seen by tracing and mesh flows in the last hour is near-certain; one living only in a 90-day-stale client config is flagged "probably dead."
Freshness beats completeness. Confidence decays continuously instead of being recomputed in nightly batches. A 95%-complete graph that's thirty seconds fresh beats a complete one that's six hours stale - incident responders live in the present.
Never silently delete. Edges are demoted and flagged, never dropped without human confirmation. Reliability tooling that silently removes data is how you cause the incident you were built to prevent.
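
A rough sketch of how continuous decay and "demote, never drop" could fit together. The exponential half-life form, the threshold parameter, and the names `decayed_confidence`, `classify`, and `may_remove` are assumptions for illustration; the actual decay function is not described in this write-up.

```rust
/// Hypothetical edge status: edges are demoted, never removed automatically.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Status {
    Active,
    ProbablyDead,
}

/// Assumed decay shape: confidence halves every `half_life_secs`.
/// Evaluated on read, so no nightly batch recomputation is needed.
fn decayed_confidence(base: f32, age_secs: f32, half_life_secs: f32) -> f32 {
    base * 0.5_f32.powf(age_secs / half_life_secs)
}

/// Low confidence demotes and flags an edge; it does not delete it.
fn classify(confidence: f32, demote_below: f32) -> Status {
    if confidence < demote_below {
        Status::ProbablyDead
    } else {
        Status::Active
    }
}

/// Removal requires both a demoted edge and explicit human confirmation.
fn may_remove(status: Status, human_confirmed: bool) -> bool {
    status == Status::ProbablyDead && human_confirmed
}
```

With a one-hour half-life, an edge last seen an hour ago keeps half its base confidence, while one backed only by a 90-day-old declaration decays to effectively zero and is flagged rather than removed.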

Performance notes

Sustaining ~20 GB/s of fused telemetry came down to unglamorous engineering: zero-copy deserialization on the hot path, per-source backpressure so one lagging stream can't stall the rest, and shard-by-service partitioning so edge state updates stay core-local. Rust's ownership model made the zero-copy design safe to maintain across contributors.
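
Two of these ideas can be sketched with the standard library alone. This is an illustration under stated assumptions, not the production design: `shard_for`, `source_queue`, and `offer` are hypothetical names, sharding on the source service is a guess at the partition key, and a real pipeline would likely use an async runtime rather than `std::sync::mpsc`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Route every update for a service to the same shard, so its edge
/// state is only ever touched by one worker. Panics if num_shards is 0.
fn shard_for(service: &str, num_shards: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    service.hash(&mut hasher);
    (hasher.finish() % num_shards as u64) as usize
}

/// One bounded queue per telemetry source: a slow source fills only
/// its own queue instead of stalling the others.
fn source_queue(capacity: usize) -> (SyncSender<Vec<u8>>, Receiver<Vec<u8>>) {
    sync_channel(capacity)
}

/// Non-blocking enqueue. Returns false when the queue is full (or the
/// consumer is gone), which signals the producer to slow down or shed load.
fn offer(tx: &SyncSender<Vec<u8>>, msg: Vec<u8>) -> bool {
    match tx.try_send(msg) {
        Ok(()) => true,
        Err(TrySendError::Full(_)) | Err(TrySendError::Disconnected(_)) => false,
    }
}
```

The bounded queue is what turns a lagging source into a local problem: `offer` fails fast for that source while the other sources keep their own capacity.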

Impact

Became the canonical dependency source - downstream teams deleted their ad-hoc reconciliation pipelines.
Powers dependency notifications and a protection program adopted by 1,000+ services.
Feeds the fault-injection loop that produces ground-truth criticality labels.
Foundation for LLM-agent dependency explanations (covered in its own write-up).

What I'd do differently

Start the confidence model earlier. The first iteration ranked sources by priority order, and unwinding that into evidence-based scoring cost a quarter. And invest in the decay function sooner - it ended up being the most-tuned ten lines in the codebase.
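
The gap between the two approaches is easy to see in a toy comparison. The noisy-OR combination and the per-source weights below are assumptions chosen for illustration; the write-up does not specify the scoring formula.

```rust
/// Priority order: the first source in the ranking that has an opinion
/// wins, and every lower-ranked source is ignored.
fn priority_verdict(opinions_by_rank: &[Option<bool>]) -> Option<bool> {
    opinions_by_rank.iter().copied().flatten().next()
}

/// Evidence-based scoring (assumed noisy-OR form): each source that saw
/// the edge contributes a weight in [0, 1]; agreement raises confidence.
fn noisy_or(weights: &[f32]) -> f32 {
    1.0 - weights.iter().map(|&w| 1.0 - w).product::<f32>()
}
```

Under priority ordering, a single stale high-ranked source can override several fresh lower-ranked ones. Under noisy-OR, two sources weighted 0.9 and 0.8 combine to 0.98, and no single source has the final say.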

Prabhav Nalhe