Notes on systems, reliability, and AI.
Build the Substrate First: Why Platform Quality Is Capped by the Model Underneath It
Most platforms that reason about a system are really stacks of queries against a model of that system. The recurring lesson is that the model is the constraint: the quality of every layer above it is capped by the coverage and freshness of the model beneath. Build the model as an authoritative product first, and keep every layer above it a traceable query.
2026
Closing the Loop: Fault Injection as a Ground-Truth Engine for an LLM Classifier
A pattern worth reusing: when an LLM makes a judgment at scale, do not let it grade itself. Pair a cheap predictor with an expensive verifier that observes ground truth by breaking things on purpose - and treat their disagreements as the most valuable signal you have.
2026
LLM as a Judge: Quality Gates That Are Not Vibes
A separate model can score outputs you cannot label by hand - but only if you treat it like a measuring instrument: a few binary dimensions, a calibrated reading, an answer key it never gets to see.
2026
Evals Before Features: Benchmarking LLMs for a Production Task
Before you wire an LLM into a real workflow, decide how you will know it is good enough - because the eval is the gate, and the model is just the thing that has to pass it.
2026
Agent Harness vs RAG: When Structured Tool-Calls Beat Vector Search
Two ways to feed a model context: retrieve fuzzy passages by similarity, or call purpose-built tools that return exact records. Here is how I decide which one a problem actually needs - and why, when an answer has to trace back to evidence, the tool-calling agent wins.
2026
Cost-Aware LLM Pipelines: Match Compute to Criticality
Most items in a large workload are easy, and a few are genuinely hard. The cheapest reliable pipeline is the one that spends accordingly: a deterministic fast path for the easy majority, the full agent loop reserved for the long tail, and routing that decides which is which before you pay for the expensive option.
2026
Designing APIs for Agents, Not Humans
When an LLM is the caller, the interface is the prompt. Typed responses, idempotent writes, granular composable tools, and machine-readable errors are not nice-to-haves - they are the difference between an agent that chains calls reliably and one that scrapes a human UI and guesses.
2026
Your Model Is Not the Product: The Delivery Layer Is
A correct model or a sharp analysis is necessary but not sufficient. What gets internal AI actually used is the last mile almost no one designs for - routing each insight to its owner, defaulting it into their workflow, and making it concrete enough to act on without a meeting.
2026
The Service Graph Is a Lower Bound: Finding the Dependencies No RPC Edge Shows
The dependency graph you draw from RPC traffic is real but incomplete. The dependencies that take you down are usually the ones no edge in that graph represents - shared config, shared stores, shared infrastructure - and you have to go looking for them on purpose.
2026
SLO-Driven Risk: Turning "Is This Scary?" Into a Number
Reliability effort gets spent on whatever feels scary in the room. Here is how to replace that gut feel with a defensible number built from SLOs, error budgets, and a clean hard-versus-soft dependency distinction - so you can prove where the risk is and justify what you do about it.
2026
Streaming Ingest in Rust: Reconciling Identity and Time Across Messy Sources
When you fuse many noisy event streams into one model, the hard problems are not throughput - they are identity and time. Here is how I think about normalizing heterogeneous sources, resolving who an event is about, and ordering events that arrive late, out of order, and duplicated.
2026
Scaling a Live Stream to a Billion Viewers
A live broadcast fans one source out to millions of simultaneous viewers in seconds. The hard part is not the video - it is keeping a viral spike from melting your origin. Here are the patterns that make it work.
2026
Natural Language to SQL, Then and Now: A 2018 Research Project Meets the LLM Era
In 2018 I helped build an RNN model that turned English questions into SQL; it was trained on WikiSQL, and the work was later published. Today a general-purpose LLM does the same thing with no task-specific training. Here is what changed, what did not, and what the old problem still teaches.
2018, revisited 2026