Natural Language to SQL, Then and Now: A 2018 Research Project Meets the LLM Era

2026·3 min read

nlpllmresearch

In 2018 I helped build an RNN model that turned English questions into SQL, trained on WikiSQL and later published. Today a general-purpose LLM does the same thing with no task-specific training. Here is what changed, what did not, and what the old problem still teaches.

flowchart LR
  T["2018: train an RNN
on a labeled benchmark"] -->|"the model got free"| N["today: prompt an LLM
zero-shot"]
  classDef key fill:#e8f1fb,stroke:#1e1e1e,color:#1e1e1e
  class N key
  linkStyle 0 stroke:#2383E2,color:#2383E2

The capability moved from a bespoke trained model to a prompt. What stayed hard, and moved up a layer, is making sure the generated SQL is actually right.

The problem we were solving in 2018

Natural language to SQL is a simple idea with a deep tail: let someone ask a database a question in plain English and get back the query that answers it. In 2018, as an undergraduate researcher, I worked on exactly that. We built an RNN-based model that read a question together with a table's schema and generated a SQL query against it, trained and evaluated on WikiSQL, a benchmark of question-and-SQL pairs over single tables. It reached around 94% on that benchmark, and the work was later published in IEEE Transactions on Audio, Speech, and Language Processing.

It is worth being honest about the scope. WikiSQL is a constrained slice of the real problem: single tables, a fixed query shape - a select, an aggregation, a few where clauses - no joins, no nesting. Hitting 94% there was a real result for the time, but it was 94% of a deliberately narrow world.

What it took back then

Everything was task-specific. You needed a labeled dataset in exactly the right format. You needed an architecture built for the job: an encoder for the question and the schema, and a decoder constrained so its output was valid SQL rather than merely plausible text. You spent real effort on those constraints, because a sequence model left to its own devices would happily emit a query that did not parse. And then you trained, on that dataset, for that shape of question.

The intelligence lived in the architecture and the data, and generalization was narrow by construction. Step outside the benchmark's shape - a join, an unusual phrasing, a much bigger schema - and accuracy fell off, because the model had only ever seen the slice you trained it on.

What an LLM changed

Today you do not train anything. You hand a general-purpose model the schema and the question in a prompt and it returns SQL, zero-shot or with a couple of examples. The capability moved from a model you build into a model you ask. Joins, nested queries, multiple dialects, oddly phrased questions - things that in 2018 would each have been their own research effort - are mostly handled out of the box.

The bottleneck moved with it. In 2018 the hard question was whether the model could produce valid SQL at all. Now valid SQL is nearly free, and the hard question is whether it produced the SQL you actually meant.

What did not change

The genuinely hard parts were never about SQL syntax. They were semantic, and they are still here. Schema linking: when a user says revenue, which column in a messy hundred-table schema is that? Ambiguity: the same English question has two reasonable SQL readings. Scale: a real warehouse is not one tidy table. None of that was solved by making the model fluent.

Fluency also brought new failure modes. A confident model will reference a column that does not exist, or write a query that runs cleanly and answers a slightly different question than the one asked. That is more dangerous than a syntax error, because a syntax error fails loudly and a wrong-but-runnable query fails silently. So the evaluation discipline we cared about in 2018 - execution accuracy, did the query return the right rows against a real schema, not did the string match - matters more now, not less.

The throughline

Looking back across those years, the lesson that survived is the one I keep relearning: the model is the easy part, and the truth is the hard part. In 2018 we poured our effort into getting a model to emit valid SQL at all. Today the model is essentially free, and the effort moves to making sure the SQL is right - grounding it in the real schema, testing it by execution, and never trusting a fluent answer on its confidence alone.

Same discipline, one layer up. The tools got dramatically better; the obligation to verify did not go anywhere. If anything, the more capable and fluent the model gets, the more that verification is the whole job.

Takeaways

Natural-language-to-SQL went from a task you trained a bespoke model for (an RNN on WikiSQL, around 94% on a constrained benchmark) to something a general LLM does zero-shot. The capability moved from training-time to prompt-time.
The benchmark was the easy slice: single tables, a fixed query shape. The real problem - large schemas, joins, ambiguity, schema linking - was always the hard part, and still is.
LLMs removed the produce-valid-SQL bottleneck and added a new one: fluent SQL that answers the wrong question, or references columns that do not exist.
Execution-based evaluation - does the query return the right rows against a real schema - mattered in 2018 and matters more now. A fluent answer is not a correct one.
The throughline across the years: the model is the easy part, the ground truth is the hard part. The effort just moved up a layer.