NL2SQL Benchmark: Vanna AI vs WrenAI vs DB-GPT — 50 Queries, One Database, Three Approaches
In my previous post, I dissected the architectures of three leading open-source NL2SQL tools — Vanna AI (RAG-first), WrenAI (semantic model), and DB-GPT (multi-agent). That was a code-level teardown of how each tool works. This post answers the natural follow-up: how well do they actually work?
I built a benchmark with 50 gold-standard SQL queries across 5 categories and 3 tiers of schema complexity, all against the same DuckDB database. Every tool used GPT-4o as the underlying LLM, so the differences come down to prompt engineering, context management, and pipeline design — not model capability.
Methodology note: Vanna AI was run as-is using its Python SDK with ChromaDB. For WrenAI and DB-GPT, I replicated each tool's core methodology (semantic model context and multi-step pipeline, respectively) using direct OpenAI API calls, since both require heavyweight Docker deployments that weren't practical for isolated benchmarking. The prompts and pipeline logic mirror what each tool does internally, but these are not results from the full production systems. Think of them as "approach benchmarks" — testing the strategy, not the full product.
Benchmark Setup
| Parameter | Value |
|---|---|
| Database | DuckDB — 16 tables, 3 tiers of schema complexity |
| Queries | 50 gold-standard (10 per category) |
| Categories | simple, aggregation, multi-join, nested subquery, advanced |
| Tier 1 | 5 core tables — clean, descriptive names (customers, orders, products, etc.) |
| Tier 2 | +5 tables — departments, employees, suppliers, shipments, reviews |
| Tier 3 | +6 tables — abbreviated names (wh, inv, promo, ret, pay_mthd, aud_log) |
| LLM | GPT-4o (same model for all approaches) |
| Evaluation | Execution accuracy — does the generated SQL produce the same result set as gold? |
How Results Are Scored
Each query gets one of four results:
| Result | Meaning |
|---|---|
| PASS | Generated SQL produces the same result set as the gold SQL (exact string matches are tracked separately) |
| ~SQL | SQL executed but returned a near-miss result set — wrong columns, extra columns, or wrong aggregation on the right rows |
| FAIL | SQL executed but the results were fundamentally wrong — different row count or wrong data |
| ERROR | SQL failed to execute entirely — syntax error, missing table, DuckDB incompatibility |
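The scoring logic can be sketched as a small harness. This is my reconstruction, not the benchmark's exact code: sqlite3 stands in for DuckDB (both expose the same `execute(...).fetchall()` shape), and the ~SQL-vs-FAIL boundary is reduced to a row-count heuristic.

```python
import sqlite3


def score(con, gold_sql: str, gen_sql: str) -> str:
    """Classify a generated query against gold by comparing executed result sets."""
    gold = con.execute(gold_sql).fetchall()
    try:
        got = con.execute(gen_sql).fetchall()
    except Exception:
        return "ERROR"  # failed to execute at all: syntax error, missing table, etc.
    if sorted(gold) == sorted(got):
        return "PASS"   # same result set, order-insensitive
    if len(got) == len(gold):
        return "~SQL"   # ran, same row count, but different columns or values
    return "FAIL"       # ran, but fundamentally different results
```

The same function works unchanged against a `duckdb.connect()` connection, since DuckDB's Python connection also supports `execute(...).fetchall()`.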
Three Approaches Tested
| Approach | How It Works |
|---|---|
| Vanna AI (RAG + ChromaDB) | Trains DDL, documentation, and example Q-SQL pairs into a ChromaDB vector store. For each question, retrieves the most relevant context and sends it with the question to the LLM. Run via Vanna's Python SDK. |
| WrenAI-style (Semantic Model) | Sends a rich MDL (Modeling Definition Language) with table descriptions, column display names, and relationship definitions as context in a single LLM call. Replicates WrenAI's core methodology. |
| DB-GPT-style (Multi-Step Agent) | Three sequential LLM calls: (1) schema linking to identify relevant tables, (2) SQL generation with focused schema + sample data, (3) SQL correction if execution fails. Replicates DB-GPT's AWEL pipeline logic. |
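The DB-GPT-style loop can be sketched as below. This is my reconstruction of the three-step flow, not DB-GPT's actual AWEL code: `ask_llm` and `execute_sql` are injected stand-ins for the OpenAI chat call and the DuckDB connection, and the prompt wording is illustrative.

```python
def multi_step_pipeline(question, full_schema, ask_llm, execute_sql, max_fixes=1):
    """Three-step NL2SQL pipeline: schema linking, generation, correction.

    ask_llm(prompt) -> str and execute_sql(sql) -> (ok, result_or_error)
    are injected so any LLM client and database can be plugged in.
    """
    # Step 1: schema linking: identify which tables matter for this question.
    tables = ask_llm(
        f"Schema:\n{full_schema}\n\nQuestion: {question}\n"
        "List only the relevant table names, comma-separated."
    )
    # Step 2: generation against the pruned schema (sample rows would be appended here).
    sql = ask_llm(
        f"Relevant tables: {tables}\nQuestion: {question}\n"
        "Return a single DuckDB SQL query."
    )
    # Step 3: correction: on execution failure, feed the error back and retry.
    ok, result = execute_sql(sql)
    for _ in range(max_fixes):
        if ok:
            break
        sql = ask_llm(
            f"This SQL failed with error: {result}\nSQL: {sql}\n"
            "Return the corrected query."
        )
        ok, result = execute_sql(sql)
    return sql, (result if ok else None)
```

Injecting the two callbacks also makes the pipeline testable with stubs, which is how the correction step's behavior (retry on error, stop on success) can be verified without an API key.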
Headline Results
| Approach | Exec Accuracy | Exact Match | Errors | Avg Latency | Tokens Used |
|---|---|---|---|---|---|
| Vanna AI | 20/50 (40.0%) | 7/50 (14.0%) | 2 | 1,332ms | N/A |
| WrenAI-style | 11/50 (22.0%) | 2/50 (4.0%) | 2 | 1,125ms | 193,969 |
| DB-GPT-style | 10/50 (20.0%) | 2/50 (4.0%) | 0 | 2,916ms | 74,607 |
Vanna AI leads at 40% execution accuracy — nearly double the other two approaches. But let's break this down further.
Accuracy by Category
| Category | Vanna AI | WrenAI-style | DB-GPT-style |
|---|---|---|---|
| Simple | 7/10 (70%) | 3/10 (30%) | 3/10 (30%) |
| Aggregation | 3/10 (30%) | 2/10 (20%) | 2/10 (20%) |
| Multi-join | 4/10 (40%) | 3/10 (30%) | 2/10 (20%) |
| Nested subquery | 4/10 (40%) | 2/10 (20%) | 2/10 (20%) |
| Advanced | 2/10 (20%) | 1/10 (10%) | 1/10 (10%) |
Vanna dominates simple queries (70% vs 30%) and holds a meaningful edge on multi-join and nested subqueries. All three approaches fall to 20% or below on advanced queries (window functions, CTEs, cohort analysis).
Accuracy by Schema Tier
| Tier | Description | Vanna AI | WrenAI-style | DB-GPT-style |
|---|---|---|---|---|
| Tier 1 | Clean names (5 tables) | 13/23 (56.5%) | 7/23 (30.4%) | 6/23 (26.1%) |
| Tier 2 | More tables (10 total) | 4/16 (25.0%) | 1/16 (6.2%) | 1/16 (6.2%) |
| Tier 3 | Abbreviated names (16 total) | 3/11 (27.3%) | 3/11 (27.3%) | 3/11 (27.3%) |
The Tier 2 drop is dramatic — accuracy falls off a cliff when the schema expands from 5 to 10 tables. Interestingly, all three approaches score identically on Tier 3 (3/11, 27.3%), which suggests abbreviated names by themselves aren't the bottleneck; the number of candidate tables is.
Vanna AI: The Training Data Effect
One of Vanna's distinguishing features is its training pipeline. You feed it DDL, documentation, and example question-SQL pairs, and it stores everything in a ChromaDB vector store for retrieval. Here's the impact of adding just 5 example pairs:
| Mode | Queries | Exec Accuracy | Exact Match | Errors | Avg Latency |
|---|---|---|---|---|---|
| Zero-shot (DDL + docs only) | 23 | 4/23 (17.4%) | 0/23 (0.0%) | 1 | 1,662ms |
| 5 example Q-SQL pairs | 23 | 11/23 (47.8%) | 5/23 (21.7%) | 0 | 1,331ms |
| Improvement | — | +30.4pp | +21.7pp | -1 | -331ms |
Five training examples nearly tripled accuracy from 17.4% to 47.8%. Zero-shot Vanna (17.4%) is actually worse than the other approaches on the same Tier 1 queries, which tells us the RAG retrieval of relevant examples is doing most of the heavy lifting — not the architecture itself.
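To make the "retrieval is doing the heavy lifting" point concrete, here is a toy version of the few-shot prompt assembly. Vanna ranks trained pairs by embedding similarity in ChromaDB; word overlap stands in for that here, and all names and prompt wording are illustrative.

```python
def retrieve_examples(question, trained_pairs, k=2):
    """Rank trained Q-SQL pairs by word overlap with the new question
    (a crude stand-in for ChromaDB's embedding search)."""
    q_words = set(question.lower().split())
    return sorted(
        trained_pairs,
        key=lambda p: len(q_words & set(p["question"].lower().split())),
        reverse=True,
    )[:k]


def build_prompt(question, ddl, trained_pairs, k=2):
    """Assemble the DDL plus the k nearest examples into a few-shot prompt."""
    shots = "\n\n".join(
        f"Question: {p['question']}\nSQL: {p['sql']}"
        for p in retrieve_examples(question, trained_pairs, k)
    )
    return f"{ddl}\n\n{shots}\n\nQuestion: {question}\nSQL:"
```

The retrieved pair doesn't need to answer the new question; it just has to show the model the house style — which columns to select, which aliases to use — for a structurally similar one.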
Deep Dive: Vanna AI Results
Approach: RAG + ChromaDB with 5 example Q-SQL training pairs, DDL, and documentation
By Category
| Category | Accuracy | Errors | Avg Latency |
|---|---|---|---|
| Simple | 7/10 (70%) | 0 | 1,030ms |
| Aggregation | 3/10 (30%) | 1 | 1,049ms |
| Multi-join | 4/10 (40%) | 1 | 1,676ms |
| Nested subquery | 4/10 (40%) | 0 | 1,328ms |
| Advanced | 2/10 (20%) | 0 | 1,577ms |
By Tier
| Tier | Accuracy | Errors |
|---|---|---|
| Tier 1 | 13/23 (56.5%) | 0 |
| Tier 2 | 4/16 (25.0%) | 1 |
| Tier 3 | 3/11 (27.3%) | 1 |
Selected Query Results
Expand full per-query results (50 queries)
| # | Category | Tier | Question | Result | Latency |
|---|---|---|---|---|---|
| 1 | simple | T1 | List all customers from California. | PASS | 1,059ms |
| 2 | simple | T1 | Show the 5 most expensive products. | PASS | 1,111ms |
| 3 | simple | T1 | How many orders have been cancelled? | PASS | 1,014ms |
| 4 | simple | T1 | What are all the product categories? | ~SQL | 847ms |
| 5 | simple | T1 | Find all orders placed in January 2024. | PASS | 750ms |
| 6 | simple | T1 | Which products cost less than $50? | PASS | 712ms |
| 7 | simple | T1 | Show all delivered orders sorted by total amount desc. | ~SQL | 792ms |
| 8 | simple | T2 | List all employees in the Engineering department. | ~SQL | 1,561ms |
| 9 | simple | T2 | What are the names and countries of all suppliers? | PASS | 746ms |
| 10 | simple | T2 | Show all 5-star reviews. | PASS | 1,705ms |
| 11 | aggregation | T1 | What is the total revenue from all delivered orders? | PASS | 804ms |
| 12 | aggregation | T1 | How many orders does each customer have? Top 10. | PASS | 900ms |
| 13 | aggregation | T1 | What is the average order value by month in 2024? | ~SQL | 1,127ms |
| 14 | aggregation | T1 | Which product category generates the most revenue? | ~SQL | 1,415ms |
| 15 | aggregation | T1 | What is the profit margin for each product? | ~SQL | 718ms |
| 16 | aggregation | T2 | What is the average product rating by category? | ~SQL | 1,172ms |
| 17 | aggregation | T2 | What is the total salary expense per department? | ~SQL | 1,035ms |
| 18 | aggregation | T2 | Which carrier delivers the most shipments? | ~SQL | 873ms |
| 19 | aggregation | T2 | What is the average delivery time in days by carrier? | FAIL | 1,058ms |
| 20 | aggregation | T3 | How many returns by reason with total refund amount? | PASS | 1,390ms |
| 21 | multi-join | T1 | Show each customer's name and total spend. | PASS | 839ms |
| 22 | multi-join | T1 | What are the best-selling products by quantity? | ~SQL | 1,196ms |
| 23 | multi-join | T1 | Most popular product category per state. | PASS | 2,394ms |
| 24 | multi-join | T2 | Products with avg review rating and review count. | ~SQL | 2,200ms |
| 25 | multi-join | T2 | List employees and their managers' names. | ~SQL | 1,298ms |
| 26 | multi-join | T2 | Customers who ordered from every category. | PASS | 1,314ms |
| 27 | multi-join | T3 | Inventory levels across warehouses for products under 50. | ~SQL | 1,553ms |
| 28 | multi-join | T3 | Most popular payment methods by customer state. | PASS | 1,593ms |
| 29 | multi-join | T3 | Warehouse manager with most total inventory. | FAIL | 2,152ms |
| 30 | multi-join | T3 | Total refund amount by product category for completed returns. | ~SQL | 2,224ms |
| 31 | nested | T1 | Customers who spent more than the average. | PASS | 2,097ms |
| 32 | nested | T1 | Products that have never been ordered. | PASS | 747ms |
| 33 | nested | T1 | Most recent order for each customer. | ~SQL | 941ms |
| 34 | nested | T1 | Categories with products above average price. | PASS | 871ms |
| 35 | nested | T2 | Employees earning more than department average. | PASS | 1,237ms |
| 36 | nested | T2 | Products with above-average review ratings. | ~SQL | 1,271ms |
| 37 | nested | T2 | Suppliers who shipped above-average total orders. | ~SQL | 2,004ms |
| 38 | nested | T3 | Products where inventory is below reorder point. | ~SQL | 1,172ms |
| 39 | nested | T3 | Customers who used more than 2 payment methods. | ~SQL | 1,124ms |
| 40 | nested | T3 | Orders with total above average for that customer state. | ~SQL | 1,815ms |
| 41 | advanced | T1 | Rank customers by total spend with percentile. | PASS | 1,117ms |
| 42 | advanced | T1 | Month-over-month revenue growth rate for 2024. | ~SQL | 1,366ms |
| 43 | advanced | T1 | Product revenue as percentage of total. | ~SQL | 1,954ms |
| 44 | advanced | T1 | Running total of orders by date for 2024. | ~SQL | 1,329ms |
| 45 | advanced | T2 | Rank employees within department by salary. | ~SQL | 2,115ms |
| 46 | advanced | T2 | Top 3 products by review count with avg rating. | ~SQL | 1,119ms |
| 47 | advanced | T2 | Customer cohort analysis by signup month. | ~SQL | 1,217ms |
| 48 | advanced | T3 | Top 5 promo codes by usage with estimated revenue. | ~SQL | 2,503ms |
| 49 | advanced | T3 | Total orders and returns combined via UNION. | ~SQL | 1,783ms |
| 50 | advanced | T3 | Audit log activity by table and action type. | PASS | 1,270ms |
SQL Examples
"How many orders have been cancelled?" — PASS (execution match)
-- Gold
SELECT COUNT(*) AS cancelled_orders FROM orders WHERE status = 'cancelled'
-- Vanna AI
SELECT COUNT(order_id) AS cancelled_orders_count FROM orders WHERE status = 'cancelled'
Different column alias and COUNT(order_id) vs COUNT(*), but same result: correct.
"What are all the product categories?" — ~SQL (close miss)
-- Gold
SELECT name, description FROM categories
-- Vanna AI
SELECT DISTINCT name FROM categories;
Vanna returned only the name column, missing description. Same row count but different columns.
Deep Dive: WrenAI-style Results
Approach: Single LLM call with full MDL (table descriptions, column display names, relationships) as context
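WrenAI's real MDL is a JSON document with its own schema; the fragment below is an illustrative approximation (the field names are mine, not WrenAI's exact spec) of how a semantic model gets flattened into the single-call prompt context.

```python
# Hypothetical MDL-style semantic model: tables, described columns, joins.
mdl = {
    "models": [
        {"name": "customers",
         "description": "One row per registered customer",
         "columns": [
             {"name": "customer_id", "displayName": "Customer ID"},
             {"name": "state", "displayName": "US state (2-letter code)"}]},
        {"name": "orders",
         "description": "One row per order; status is pending, delivered, or cancelled",
         "columns": [
             {"name": "order_id", "displayName": "Order ID"},
             {"name": "customer_id", "displayName": "Customer ID (FK)"}]},
    ],
    "relationships": [
        {"condition": "orders.customer_id = customers.customer_id"},
    ],
}


def mdl_to_prompt(mdl, question):
    """Flatten the semantic model into the context for one LLM call."""
    lines = []
    for m in mdl["models"]:
        cols = ", ".join(f"{c['name']} ({c['displayName']})" for c in m["columns"])
        lines.append(f"Table {m['name']}: {m['description']}. Columns: {cols}")
    for r in mdl["relationships"]:
        lines.append(f"Join: {r['condition']}")
    lines.append(f"Question: {question}\nReturn one DuckDB SQL query.")
    return "\n".join(lines)
```

Every query pays for the full flattened model, which is why this approach's token count (193,969 across 50 queries) is the highest of the three.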
By Category
| Category | Accuracy | Errors | Avg Latency |
|---|---|---|---|
| Simple | 3/10 (30%) | 0 | 920ms |
| Aggregation | 2/10 (20%) | 1 | 1,086ms |
| Multi-join | 3/10 (30%) | 0 | 1,333ms |
| Nested subquery | 2/10 (20%) | 1 | 1,054ms |
| Advanced | 1/10 (10%) | 0 | 1,233ms |
By Tier
| Tier | Accuracy | Errors |
|---|---|---|
| Tier 1 | 7/23 (30.4%) | 0 |
| Tier 2 | 1/16 (6.2%) | 2 |
| Tier 3 | 3/11 (27.3%) | 0 |
Expand full per-query results (50 queries)
| # | Category | Tier | Question | Result | Latency |
|---|---|---|---|---|---|
| 1 | simple | T1 | List all customers from California. | ~SQL | 1,191ms |
| 2 | simple | T1 | Show the 5 most expensive products. | ~SQL | 888ms |
| 3 | simple | T1 | How many orders have been cancelled? | PASS | 383ms |
| 4 | simple | T1 | What are all the product categories? | ~SQL | 278ms |
| 5 | simple | T1 | Find all orders placed in January 2024. | PASS | 1,238ms |
| 6 | simple | T1 | Which products cost less than $50? | ~SQL | 446ms |
| 7 | simple | T1 | Show all delivered orders sorted by total amount desc. | ~SQL | 1,610ms |
| 8 | simple | T2 | List all employees in the Engineering department. | ~SQL | 1,617ms |
| 9 | simple | T2 | What are the names and countries of all suppliers? | PASS | 911ms |
| 10 | simple | T2 | Show all 5-star reviews. | ~SQL | 640ms |
| 11 | aggregation | T1 | What is the total revenue from all delivered orders? | PASS | 1,200ms |
| 12 | aggregation | T1 | How many orders does each customer have? Top 10. | ~SQL | 635ms |
| 13 | aggregation | T1 | What is the average order value by month in 2024? | ~SQL | 615ms |
| 14 | aggregation | T1 | Which product category generates the most revenue? | ~SQL | 2,760ms |
| 15 | aggregation | T1 | What is the profit margin for each product? | ~SQL | 634ms |
| 16 | aggregation | T2 | What is the average product rating by category? | ~SQL | 1,831ms |
| 17 | aggregation | T2 | What is the total salary expense per department? | ~SQL | 1,135ms |
| 18 | aggregation | T2 | Which carrier delivers the most shipments? | ~SQL | 414ms |
| 19 | aggregation | T2 | What is the average delivery time in days by carrier? | FAIL | 827ms |
| 20 | aggregation | T3 | How many returns by reason with total refund amount? | PASS | 810ms |
| 21 | multi-join | T1 | Show each customer's name and total spend. | PASS | 1,590ms |
| 22 | multi-join | T1 | What are the best-selling products by quantity? | ~SQL | 478ms |
| 23 | multi-join | T1 | Most popular product category per state. | PASS | 1,911ms |
| 24 | multi-join | T2 | Products with avg review rating and review count. | ~SQL | 2,865ms |
| 25 | multi-join | T2 | List employees and their managers' names. | ~SQL | 909ms |
| 26 | multi-join | T2 | Customers who ordered from every category. | ~SQL | 688ms |
| 27 | multi-join | T3 | Inventory levels across warehouses for products under 50. | ~SQL | 2,177ms |
| 28 | multi-join | T3 | Most popular payment methods by customer state. | PASS | 957ms |
| 29 | multi-join | T3 | Warehouse manager with most total inventory. | ~SQL | 967ms |
| 30 | multi-join | T3 | Total refund amount by product category for completed returns. | ~SQL | 790ms |
| 31 | nested | T1 | Customers who spent more than the average. | ~SQL | 2,187ms |
| 32 | nested | T1 | Products that have never been ordered. | PASS | 809ms |
| 33 | nested | T1 | Most recent order for each customer. | ~SQL | 1,459ms |
| 34 | nested | T1 | Categories with products above average price. | PASS | 785ms |
| 35 | nested | T2 | Employees earning more than department average. | ~SQL | 888ms |
| 36 | nested | T2 | Products with above-average review ratings. | ~SQL | 814ms |
| 37 | nested | T2 | Suppliers who shipped above-average total orders. | FAIL | 944ms |
| 38 | nested | T3 | Products where inventory is below reorder point. | ~SQL | 952ms |
| 39 | nested | T3 | Customers who used more than 2 payment methods. | ~SQL | 820ms |
| 40 | nested | T3 | Orders with total above average for that customer state. | ~SQL | 886ms |
| 41 | advanced | T1 | Rank customers by total spend with percentile. | ~SQL | 1,014ms |
| 42 | advanced | T1 | Month-over-month revenue growth rate for 2024. | ~SQL | 1,767ms |
| 43 | advanced | T1 | Product revenue as percentage of total. | ~SQL | 1,124ms |
| 44 | advanced | T1 | Running total of orders by date for 2024. | ~SQL | 1,536ms |
| 45 | advanced | T2 | Rank employees within department by salary. | ~SQL | 786ms |
| 46 | advanced | T2 | Top 3 products by review count with avg rating. | ~SQL | 1,729ms |
| 47 | advanced | T2 | Customer cohort analysis by signup month. | ~SQL | 1,200ms |
| 48 | advanced | T3 | Top 5 promo codes by usage with estimated revenue. | ~SQL | 834ms |
| 49 | advanced | T3 | Total orders and returns combined via UNION. | ~SQL | 1,707ms |
| 50 | advanced | T3 | Audit log activity by table and action type. | PASS | 633ms |
SQL Examples
"List all customers from California." — ~SQL (close miss)
-- Gold
SELECT first_name, last_name, email, city FROM customers WHERE state = 'CA'
-- WrenAI-style
SELECT * FROM customers WHERE state = 'CA';
Despite having column display names in the MDL, the model defaulted to SELECT *. Same rows, different columns.
Deep Dive: DB-GPT-style Results
Approach: 3-step pipeline — schema linking, SQL generation with sample data, SQL correction on failure
By Category
| Category | Accuracy | Errors | Avg Latency |
|---|---|---|---|
| Simple | 3/10 (30%) | 0 | 2,220ms |
| Aggregation | 2/10 (20%) | 0 | 2,661ms |
| Multi-join | 2/10 (20%) | 0 | 3,768ms |
| Nested subquery | 2/10 (20%) | 0 | 3,018ms |
| Advanced | 1/10 (10%) | 0 | 2,913ms |
By Tier
| Tier | Accuracy | Errors |
|---|---|---|
| Tier 1 | 6/23 (26.1%) | 0 |
| Tier 2 | 1/16 (6.2%) | 0 |
| Tier 3 | 3/11 (27.3%) | 0 |
The standout stat: 0 execution errors across all 50 queries. The correction step works — when SQL fails to execute, the LLM fixes it. But syntactically valid SQL that returns wrong results still counts as a miss.
Expand full per-query results (50 queries)
| # | Category | Tier | Question | Result | Latency |
|---|---|---|---|---|---|
| 1 | simple | T1 | List all customers from California. | ~SQL | 2,124ms |
| 2 | simple | T1 | Show the 5 most expensive products. | ~SQL | 2,097ms |
| 3 | simple | T1 | How many orders have been cancelled? | PASS | 1,710ms |
| 4 | simple | T1 | What are all the product categories? | ~SQL | 2,280ms |
| 5 | simple | T1 | Find all orders placed in January 2024. | PASS | 2,064ms |
| 6 | simple | T1 | Which products cost less than $50? | ~SQL | 1,920ms |
| 7 | simple | T1 | Show all delivered orders sorted by total amount desc. | ~SQL | 2,441ms |
| 8 | simple | T2 | List all employees in the Engineering department. | ~SQL | 4,232ms |
| 9 | simple | T2 | What are the names and countries of all suppliers? | PASS | 1,640ms |
| 10 | simple | T2 | Show all 5-star reviews. | ~SQL | 1,688ms |
| 11 | aggregation | T1 | What is the total revenue from all delivered orders? | PASS | 2,676ms |
| 12 | aggregation | T1 | How many orders does each customer have? Top 10. | ~SQL | 2,974ms |
| 13 | aggregation | T1 | What is the average order value by month in 2024? | ~SQL | 2,686ms |
| 14 | aggregation | T1 | Which product category generates the most revenue? | ~SQL | 3,367ms |
| 15 | aggregation | T1 | What is the profit margin for each product? | ~SQL | 1,737ms |
| 16 | aggregation | T2 | What is the average product rating by category? | ~SQL | 3,613ms |
| 17 | aggregation | T2 | What is the total salary expense per department? | ~SQL | 2,982ms |
| 18 | aggregation | T2 | Which carrier delivers the most shipments? | ~SQL | 1,791ms |
| 19 | aggregation | T2 | What is the average delivery time in days by carrier? | ~SQL | 2,728ms |
| 20 | aggregation | T3 | How many returns by reason with total refund amount? | PASS | 2,054ms |
| 21 | multi-join | T1 | Show each customer's name and total spend. | ~SQL | 2,854ms |
| 22 | multi-join | T1 | What are the best-selling products by quantity? | ~SQL | 1,907ms |
| 23 | multi-join | T1 | Most popular product category per state. | PASS | 4,462ms |
| 24 | multi-join | T2 | Products with avg review rating and review count. | ~SQL | 3,814ms |
| 25 | multi-join | T2 | List employees and their managers' names. | ~SQL | 2,534ms |
| 26 | multi-join | T2 | Customers who ordered from every category. | ~SQL | 2,865ms |
| 27 | multi-join | T3 | Inventory levels across warehouses for products under 50. | ~SQL | 3,487ms |
| 28 | multi-join | T3 | Most popular payment methods by customer state. | PASS | 2,973ms |
| 29 | multi-join | T3 | Warehouse manager with most total inventory. | ~SQL | 4,169ms |
| 30 | multi-join | T3 | Total refund amount by product category for completed returns. | ~SQL | 8,612ms |
| 31 | nested | T1 | Customers who spent more than the average. | ~SQL | 4,072ms |
| 32 | nested | T1 | Products that have never been ordered. | PASS | 3,126ms |
| 33 | nested | T1 | Most recent order for each customer. | ~SQL | 2,995ms |
| 34 | nested | T1 | Categories with products above average price. | PASS | 2,794ms |
| 35 | nested | T2 | Employees earning more than department average. | ~SQL | 2,819ms |
| 36 | nested | T2 | Products with above-average review ratings. | ~SQL | 2,711ms |
| 37 | nested | T2 | Suppliers who shipped above-average total orders. | ~SQL | 3,288ms |
| 38 | nested | T3 | Products where inventory is below reorder point. | ~SQL | 2,879ms |
| 39 | nested | T3 | Customers who used more than 2 payment methods. | ~SQL | 3,035ms |
| 40 | nested | T3 | Orders with total above average for that customer state. | ~SQL | 2,465ms |
| 41 | advanced | T1 | Rank customers by total spend with percentile. | ~SQL | 3,030ms |
| 42 | advanced | T1 | Month-over-month revenue growth rate for 2024. | ~SQL | 2,867ms |
| 43 | advanced | T1 | Product revenue as percentage of total. | ~SQL | 3,340ms |
| 44 | advanced | T1 | Running total of orders by date for 2024. | ~SQL | 2,316ms |
| 45 | advanced | T2 | Rank employees within department by salary. | ~SQL | 2,840ms |
| 46 | advanced | T2 | Top 3 products by review count with avg rating. | ~SQL | 2,352ms |
| 47 | advanced | T2 | Customer cohort analysis by signup month. | ~SQL | 2,685ms |
| 48 | advanced | T3 | Top 5 promo codes by usage with estimated revenue. | ~SQL | 4,946ms |
| 49 | advanced | T3 | Total orders and returns combined via UNION. | ~SQL | 2,793ms |
| 50 | advanced | T3 | Audit log activity by table and action type. | PASS | 1,960ms |
SQL Examples
"Which products cost less than $50?" — ~SQL (close miss)
-- Gold
SELECT name, price FROM products WHERE price < 50 ORDER BY price
-- DB-GPT-style
SELECT product_id, name, price, cost
FROM products
WHERE cost < 50;
Schema linking correctly identified the products table, but the LLM picked the wrong column (cost instead of price) and added extra columns. The sample data step didn't help disambiguate.
Head-to-Head: Same Query, Three Approaches
Here's how each approach handled the same questions — same LLM, same database, different context strategies.
"Show the 5 most expensive products."
Vanna AI (PASS) — trained on similar examples, nailed the column selection:
SELECT name, price FROM products ORDER BY price DESC LIMIT 5
WrenAI-style (~SQL) — had column display names in MDL but still added product_id:
SELECT product_id, name, price FROM products ORDER BY price DESC LIMIT 5;
DB-GPT-style (~SQL) — schema linking found the right table, but same extra column:
SELECT product_id, name, price FROM products ORDER BY price DESC LIMIT 5;
Gold: SELECT name, price FROM products ORDER BY price DESC LIMIT 5
The pattern is clear: Vanna's example Q-SQL pairs teach the model which columns to select. The other approaches give the model the full schema and let it decide — and it consistently over-selects.
"How many orders does each customer have? Show the top 10."
Vanna AI (PASS):
SELECT c.first_name, c.last_name, COUNT(o.order_id) AS order_count
FROM customers c JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name
ORDER BY order_count DESC LIMIT 10
WrenAI-style (~SQL) — used LEFT JOIN instead of INNER JOIN:
SELECT customers.customer_id, customers.first_name, customers.last_name,
COUNT(orders.order_id) AS order_count
FROM customers
LEFT JOIN orders ON customers.customer_id = orders.customer_id
GROUP BY customers.customer_id, customers.first_name, customers.last_name
ORDER BY order_count DESC LIMIT 10;
DB-GPT-style (~SQL) — also LEFT JOIN:
SELECT c.customer_id, c.first_name, c.last_name, COUNT(o.order_id) AS order_count
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name
ORDER BY order_count DESC LIMIT 10;
Gold: SELECT c.first_name, c.last_name, COUNT(o.order_id) AS order_count FROM customers c JOIN orders o ON c.customer_id = o.customer_id GROUP BY c.customer_id, c.first_name, c.last_name ORDER BY order_count DESC LIMIT 10
Both WrenAI-style and DB-GPT-style used LEFT JOIN (which includes customers with 0 orders), while the gold SQL uses INNER JOIN. This changes the result set.
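The LEFT-vs-INNER difference is easy to demonstrate on a toy table (sqlite3 here for a self-contained example; DuckDB treats these joins the same way):

```python
import sqlite3


def join_counts():
    """Return per-customer order counts under INNER JOIN vs LEFT JOIN."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE customers (customer_id INTEGER, first_name TEXT);
        CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
        INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
        INSERT INTO orders VALUES (10, 1);  -- Grace has no orders
    """)
    q = """SELECT c.first_name, COUNT(o.order_id)
           FROM customers c {join} orders o ON c.customer_id = o.customer_id
           GROUP BY c.customer_id, c.first_name"""
    inner = con.execute(q.format(join="JOIN")).fetchall()
    left = con.execute(q.format(join="LEFT JOIN")).fetchall()
    # INNER JOIN drops customers with zero orders; LEFT JOIN keeps them with count 0.
    return inner, left
```

Under strict execution-accuracy scoring, that extra zero-count row is enough to turn an otherwise reasonable query into a ~SQL miss.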
Key Findings
1. RAG with examples beats semantic models and multi-step pipelines
Vanna's 40% vs WrenAI's 22% and DB-GPT's 20% isn't close. The difference comes down to one thing: Vanna shows the LLM examples of correct SQL for similar questions. The other approaches give it schema information and hope for the best. Few-shot examples remain the most effective way to steer LLM SQL generation.
2. The column selection problem is universal
The single biggest failure mode across all approaches: SELECT * or selecting extra columns when the gold SQL specifies particular ones. This isn't a logic error — the SQL is semantically reasonable — but it fails execution accuracy. A production system needs either strict column selection rules or a more forgiving evaluation metric.
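One sketch of a more forgiving metric: pass when the gold columns appear anywhere in the generated projection with matching values. This depends on reliable column names, which alias drift (`cancelled_orders` vs `cancelled_orders_count`) makes fragile in practice; the function and names here are illustrative, not from the benchmark.

```python
def lenient_match(gold_rows, gold_cols, got_rows, got_cols):
    """Pass if the gold columns are a subset of the generated columns,
    with matching values, ignoring extra columns and row order."""
    try:
        idx = [got_cols.index(c) for c in gold_cols]
    except ValueError:
        return False  # a gold column is missing from the projection entirely
    projected = sorted(tuple(row[i] for i in idx) for row in got_rows)
    return projected == sorted(tuple(r) for r in gold_rows)
```

Under this metric, `SELECT product_id, name, price ...` would count as a pass against a gold `SELECT name, price ...`, which better matches what a human reviewer would call "close enough."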
3. Multi-step pipelines trade accuracy for reliability
DB-GPT's 3-step approach achieved 0 execution errors (vs 2 each for the others) but scored lowest on accuracy. The correction step catches syntax errors but can't fix wrong logic. The extra LLM calls add ~1.5s latency per query. In production, zero errors matters — but you need to pair it with better initial generation.
4. Schema complexity is the real enemy
All three approaches score 26-57% on Tier 1 (5 tables with clean names) and collapse to 6-25% on Tier 2 (10 tables). The problem isn't abbreviated names: every approach scores as well or better on Tier 3 (27.3%) than on Tier 2 despite the cryptic names. It's having more tables for the model to choose from. Schema pruning and table selection are critical unsolved problems.
5. None of these approaches are production-ready at 40%
The best score here is 40%, and that's with the simplest evaluation metric (does the result set match). Real users would consider many of the ~SQL results "close enough." But even with generous grading, NL2SQL tools need significant improvement — or a very controlled schema — before they can reliably replace writing SQL.
Latency and Cost
| Approach | Avg Latency | LLM Calls/Query | Total Tokens | Why |
|---|---|---|---|---|
| WrenAI-style | 1,125ms | 1 | 193,969 | Single call with large MDL context (~3,800 tokens/query) |
| Vanna AI | 1,332ms | 1 | N/A | Single call + ChromaDB vector search overhead |
| DB-GPT-style | 2,916ms | 2-3 | 74,607 | Schema linking + generation + optional correction |
WrenAI-style is fastest but uses the most tokens, since the large MDL context rides along on every call. DB-GPT-style uses the fewest tokens overall (each call carries a small, focused prompt) but makes 2-3 calls, resulting in 2.6x WrenAI's latency. Vanna's token usage wasn't captured, but its single call with only retrieved context makes it the likely best balance of cost, latency, and accuracy for production.
What I'd Do Differently
If I were building a production NL2SQL system based on these results:
- Start with Vanna's approach — RAG with curated Q-SQL examples gives the best accuracy floor. Invest in building a quality training set rather than better prompts.
- Add DB-GPT's correction step — The 0-error rate is valuable. Use a 2-step pipeline: RAG-based generation first, then error correction if execution fails.
- Borrow WrenAI's semantic context — Table and column descriptions help, especially for non-obvious schemas. Combine MDL-style descriptions with example pairs.
- Solve column selection separately — Either constrain the LLM to always use SELECT * (and handle projection downstream) or add a post-processing step that maps columns.
- Invest in schema pruning — The Tier 2 drop shows that giving the LLM the full schema hurts. DB-GPT's schema linking idea is right; it just needs to be more reliable.
The benchmark code, all 50 queries, and raw results are available on GitHub. Built with DuckDB, GPT-4o, and a lot of patience.