Dependency Service: one graph to trust
Context
At hyperscale, "which services depend on which" sounds like a solved problem. It isn't. Six different telemetry systems each held a partial, mutually inconsistent picture: RPC traces, mesh flow logs, network connections, client-library declarations, configuration intent, and routing topology. During incidents — the only time anyone urgently needs this data — engineers got six different answers.
The problem, precisely
- Coverage gaps: sampled tracing misses low-QPS but critical calls; the mesh can't see sidecar-bypassing traffic.
- Staleness: declared dependencies outlive real ones — nobody deletes old client configs.
- Identity mismatch: network-level data sees endpoints, not services.
- Trust: with six sources disagreeing, every consumer built their own ad-hoc reconciliation — badly.
Architecture
rpc-traces ──┐
mesh-flows ──┤                         ┌─→ notifications
net-conns ───┤   ┌───────────────┐     │
client-libs ─┼──▶│ stream fusion │─────┼─→ protection program
config-decl ─┤   │ (rust, ~20    │     │
lb-routes ───┘   │  GB/s)        │     └─→ risk classifier
                 └───────┬───────┘
                         ▼
            canonical dependency graph
         (evidence + confidence per edge)
The pipeline is a Rust streaming service. Each telemetry source is normalized into Observation events; edges accumulate observations rather than verdicts:
// simplified: edges carry evidence, not assertions
struct Edge {
    src: ServiceId,
    dst: ServiceId,
    evidence: SmallVec<[Observation; 6]>,
    confidence: f32, // derived continuously, never asserted
    last_seen: Timestamp,
}
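To make "normalize, then accumulate" concrete, here is a minimal sketch using only the standard library. It is not the production code: the `Observation` fields, the `normalize_net_conn` helper, the `owners` lookup table, and the "keep the freshest observation per source" rule are all assumptions made for illustration, `String` stands in for `ServiceId`, and `Vec` stands in for `SmallVec`.

```rust
use std::collections::HashMap;
use std::time::SystemTime;

/// The six telemetry sources named in the case study.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Source {
    RpcTrace,
    MeshFlow,
    NetConn,
    ClientLib,
    ConfigDecl,
    LbRoute,
}

/// Hypothetical normalized event: "`source` saw `src` call `dst` at `seen_at`".
#[derive(Debug, Clone, PartialEq)]
struct Observation {
    source: Source,
    src: String, // stand-in for ServiceId
    dst: String,
    seen_at: SystemTime,
}

/// Hypothetical normalizer for network-connection records. Those records
/// identify endpoints, not services, so both ends are resolved through an
/// endpoint-to-service table. Returns None if either end is unknown.
fn normalize_net_conn(
    src_ip: &str,
    dst_ip: &str,
    seen_at: SystemTime,
    owners: &HashMap<String, String>,
) -> Option<Observation> {
    Some(Observation {
        source: Source::NetConn,
        src: owners.get(src_ip)?.clone(),
        dst: owners.get(dst_ip)?.clone(),
        seen_at,
    })
}

/// Edge state: at most one observation (the freshest) per source.
#[derive(Debug)]
struct Edge {
    src: String,
    dst: String,
    evidence: Vec<Observation>, // stand-in for SmallVec<[Observation; 6]>
    last_seen: SystemTime,
}

impl Edge {
    fn new(first: Observation) -> Self {
        Edge {
            src: first.src.clone(),
            dst: first.dst.clone(),
            last_seen: first.seen_at,
            evidence: vec![first],
        }
    }

    /// Add evidence. A newer observation replaces the older one from the
    /// same source; an older one is ignored; a new source is appended.
    fn observe(&mut self, obs: Observation) {
        debug_assert!(obs.src == self.src && obs.dst == self.dst);
        if obs.seen_at > self.last_seen {
            self.last_seen = obs.seen_at;
        }
        match self.evidence.iter().position(|o| o.source == obs.source) {
            Some(i) if obs.seen_at > self.evidence[i].seen_at => self.evidence[i] = obs,
            Some(_) => {}
            None => self.evidence.push(obs),
        }
    }
}
```

With one slot per source, the evidence list never grows past six entries, which is consistent with the inline capacity of six in the struct above.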
Three design decisions did most of the work:
- Model evidence, not facts. Consumers see why an edge exists and how confident the system is — an edge seen by tracing and mesh flows in the last hour is near-certain; one living only in a 90-day-stale client config is flagged "probably dead."
- Freshness beats completeness. Confidence decays continuously instead of being recomputed in nightly batches. A 95%-complete graph that's thirty seconds fresh beats a complete one that's six hours stale — incident responders live in the present.
- Never silently delete. Edges are demoted and flagged, never dropped without human confirmation. Reliability tooling that silently removes data is how you cause the incident you were built to prevent.
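The three decisions above can be sketched together in a few lines. This is one plausible shape under stated assumptions, not the real scoring code: the per-source weights and half-lives in `tuning`, the noisy-OR combination, the `DEMOTE_BELOW` threshold, and the `Status` variants are all hypothetical. Confidence is computed from observation age on read, so it decays continuously without a batch job.

```rust
/// Subset of the telemetry sources, enough for the example.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Source {
    RpcTrace,
    MeshFlow,
    ClientLib,
}

/// One piece of evidence: which source saw the edge, and how long ago.
struct Observation {
    source: Source,
    age_secs: f32,
}

/// Hypothetical tuning: (weight of a fresh observation, half-life in seconds).
fn tuning(source: Source) -> (f32, f32) {
    match source {
        Source::RpcTrace => (0.9, 3_600.0),
        Source::MeshFlow => (0.8, 3_600.0),
        Source::ClientLib => (0.5, 30.0 * 86_400.0),
    }
}

/// Exponential decay: an observation loses half its weight every half-life.
fn decayed_weight(obs: &Observation) -> f32 {
    let (weight, half_life) = tuning(obs.source);
    weight * 0.5_f32.powf(obs.age_secs / half_life)
}

/// Noisy-OR over all evidence: 1 - product of (1 - weight_i).
/// Independent corroborating sources push confidence toward 1.
fn confidence(evidence: &[Observation]) -> f32 {
    1.0 - evidence.iter().map(|o| 1.0 - decayed_weight(o)).product::<f32>()
}

#[derive(Debug, PartialEq)]
enum Status {
    Live,
    ProbablyDead,                     // demoted and flagged, still in the graph
    Retired { confirmed_by: String }, // set only by an explicit human decision
}

struct Edge {
    evidence: Vec<Observation>,
    status: Status,
}

const DEMOTE_BELOW: f32 = 0.2; // hypothetical threshold

/// Recompute the flag from current evidence. Never removes the edge and
/// never overrides a human retirement.
fn refresh(edge: &mut Edge) {
    if matches!(edge.status, Status::Retired { .. }) {
        return;
    }
    edge.status = if confidence(&edge.evidence) < DEMOTE_BELOW {
        Status::ProbablyDead
    } else {
        Status::Live
    };
}

/// Retirement needs a named confirmer and applies only to demoted edges.
fn retire(edge: &mut Edge, confirmed_by: &str) -> bool {
    if edge.status != Status::ProbablyDead {
        return false;
    }
    edge.status = Status::Retired { confirmed_by: confirmed_by.to_string() };
    true
}
```

With these made-up numbers, an edge whose only evidence is a 90-day-old client config scores 0.5 × 0.5³ = 0.0625 and is flagged `ProbablyDead`, while an edge seen by both tracing and mesh flows ten minutes ago scores above 0.9. In both cases the edge stays in the graph until someone calls `retire`.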
Performance notes
Sustaining ~20 GB/s of fused telemetry came down to unglamorous engineering: zero-copy deserialization on the hot path, per-source backpressure so one lagging stream can't stall the rest, and shard-by-service partitioning so edge state updates stay core-local. Rust's ownership model made the zero-copy design safe to maintain across contributors.
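Two of those techniques are easy to illustrate with the standard library alone. The sketch below is an assumption-laden toy, not the production design: it keys shards on the calling service's name, uses `DefaultHasher` for routing, and models per-source backpressure as one bounded `sync_channel` per telemetry source. The names `shard_for`, `source_queue`, and `offer` are hypothetical, and zero-copy deserialization is not shown.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Route by service name so every update to one service's edges lands on
/// the same shard (and therefore the same worker and core).
fn shard_for(service: &str, shards: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    service.hash(&mut hasher);
    (hasher.finish() % shards as u64) as usize
}

/// One bounded queue per telemetry source. The bound is what creates
/// backpressure: a full queue pushes back on that source only.
fn source_queue(capacity: usize) -> (SyncSender<Vec<u8>>, Receiver<Vec<u8>>) {
    sync_channel(capacity)
}

/// Try to enqueue a raw record without blocking. Returns false when this
/// source's queue is full or its consumer is gone, so the caller can slow
/// down or shed load for this source while the others keep flowing.
fn offer(tx: &SyncSender<Vec<u8>>, record: Vec<u8>) -> bool {
    match tx.try_send(record) {
        Ok(()) => true,
        Err(TrySendError::Full(_)) => false,
        Err(TrySendError::Disconnected(_)) => false,
    }
}
```

Because each source owns its queue, a lagging stream fills only its own buffer. A single shared queue would let the slowest producer-consumer pair stall every other source behind it.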
Impact
- Became the canonical dependency source — downstream teams deleted their ad-hoc reconciliation pipelines.
- Powers dependency notifications and a protection program adopted by 1,000+ services.
- Feeds the fault-injection loop that produces ground-truth criticality labels.
- Foundation for LLM-agent dependency explanations (covered in its own case study).
What I'd do differently
Start the confidence model earlier. The first iteration ranked sources in a fixed priority order, and unwinding that into evidence-based scoring cost a quarter. I'd also invest in the decay function sooner: it ended up being the most-tuned ten lines in the codebase.