NL2SQL Benchmark: Vanna AI vs WrenAI vs DB-GPT — 50 Queries, One Database, Three Approaches

February 21, 2026 · 28 min read · Sudipta Pathak
Text-to-SQL · nl2sql · text-to-sql · benchmark · vanna · wrenai · db-gpt · duckdb · llm


In my previous post, I dissected the architectures of three leading open-source NL2SQL tools — Vanna AI (RAG-first), WrenAI (semantic model), and DB-GPT (multi-agent). That was a code-level teardown of how each tool works. This post answers the natural follow-up: how well do they actually work?

I built a benchmark with 50 gold-standard SQL queries across 5 categories and 3 tiers of schema complexity, all against the same DuckDB database. Every tool used GPT-4o as the underlying LLM, so the differences come down to prompt engineering, context management, and pipeline design — not model capability.

Methodology note: Vanna AI was run as-is using its Python SDK with ChromaDB. For WrenAI and DB-GPT, I replicated each tool's core methodology (semantic model context and multi-step pipeline, respectively) using direct OpenAI API calls, since both require heavyweight Docker deployments that weren't practical for isolated benchmarking. The prompts and pipeline logic mirror what each tool does internally, but these are not results from the full production systems. Think of them as "approach benchmarks" — testing the strategy, not the full product.


Benchmark Setup

Parameter Value
Database DuckDB — 16 tables, 3 tiers of schema complexity
Queries 50 gold-standard (10 per category)
Categories simple, aggregation, multi-join, nested subquery, advanced
Tier 1 5 core tables — clean, descriptive names (customers, orders, products, etc.)
Tier 2 +5 tables — departments, employees, suppliers, shipments, reviews
Tier 3 +6 tables — abbreviated names (wh, inv, promo, ret, pay_mthd, aud_log)
LLM GPT-4o (same model for all approaches)
Evaluation Execution accuracy — does the generated SQL produce the same result set as gold?

How Results Are Scored

Each query gets one of four results:

Result Meaning
PASS Generated SQL produces the same result set as the gold SQL (column aliases may differ; exact string matches count here too)
~SQL SQL executed but returned a close-but-different result — wrong or extra columns, or a slightly different aggregation, with the same row count
FAIL SQL executed but the results were substantively wrong — different row count or wrong data
ERROR SQL failed to execute at all — syntax error, missing table, DuckDB incompatibility
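In code, the scoring logic looks roughly like this — a sketch using Python's built-in sqlite3 in place of DuckDB, with the ~SQL/FAIL boundary drawn at row count as described above (the function name is mine, not from the benchmark repo):

```python
import sqlite3

def score_query(conn, gold_sql, generated_sql):
    """Classify a generated query as PASS, ~SQL, FAIL, or ERROR.

    PASS  -> same rows as gold (order-insensitive comparison)
    ~SQL  -> ran, same row count, but different columns or values
    FAIL  -> ran, but a different row count (substantively wrong)
    ERROR -> failed to execute at all
    """
    gold_rows = conn.execute(gold_sql).fetchall()
    try:
        gen_rows = conn.execute(generated_sql).fetchall()
    except sqlite3.Error:
        return "ERROR"
    # repr() gives a stable sort key even when rows mix types
    if sorted(map(repr, gen_rows)) == sorted(map(repr, gold_rows)):
        return "PASS"
    return "~SQL" if len(gen_rows) == len(gold_rows) else "FAIL"
```

The order-insensitive comparison is deliberate: unless the question asks for ordering, two result sets with the same rows should score as equivalent.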

Three Approaches Tested

Approach How It Works
Vanna AI (RAG + ChromaDB) Trains DDL, documentation, and example Q-SQL pairs into a ChromaDB vector store. For each question, retrieves the most relevant context and sends it with the question to the LLM. Run via Vanna's Python SDK.
WrenAI-style (Semantic Model) Sends a rich MDL (Modeling Definition Language) with table descriptions, column display names, and relationship definitions as context in a single LLM call. Replicates WrenAI's core methodology.
DB-GPT-style (Multi-Step Agent) Three sequential LLM calls: (1) schema linking to identify relevant tables, (2) SQL generation with focused schema + sample data, (3) SQL correction if execution fails. Replicates DB-GPT's AWEL pipeline logic.
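The DB-GPT-style control flow can be sketched like this — `call_llm` and `execute` are stand-ins for the OpenAI call and DuckDB execution, injected so the pipeline shape is visible without either:

```python
def dbgpt_style_pipeline(question, full_schema, call_llm, execute, max_fixes=1):
    """Sketch of the 3-step strategy: schema linking -> generation -> correction.

    call_llm(prompt) -> str and execute(sql) -> rows are injected stand-ins;
    the real tool wires these to GPT-4o and the database.
    """
    # Step 1: schema linking — ask only for the relevant tables.
    tables = call_llm(f"Which tables answer: {question}?\nSchema:\n{full_schema}")
    # Step 2: generation with a focused schema (the real tool adds sample rows).
    sql = call_llm(f"Write SQL for: {question}\nTables: {tables}")
    # Step 3: correction loop — retry only when execution raises.
    for _ in range(max_fixes + 1):
        try:
            return sql, execute(sql)
        except Exception as err:
            sql = call_llm(f"Fix this SQL ({err}): {sql}")
    raise RuntimeError("correction step exhausted")
```

Note the asymmetry this structure creates: the pipeline can only react to *raised* errors, which is exactly why it eliminates ERROR results but cannot rescue semantically wrong SQL.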

Headline Results

Approach Exec Accuracy Exact Match Errors Avg Latency Tokens Used
Vanna AI 20/50 (40.0%) 7/50 (14.0%) 2 1,332ms N/A
WrenAI-style 11/50 (22.0%) 2/50 (4.0%) 2 1,125ms 193,969
DB-GPT-style 10/50 (20.0%) 2/50 (4.0%) 0 2,916ms 74,607

Vanna AI leads at 40% execution accuracy — nearly double the other two approaches. But let's break this down further.

Accuracy by Category

Category Vanna AI WrenAI-style DB-GPT-style
Simple 7/10 (70%) 3/10 (30%) 3/10 (30%)
Aggregation 3/10 (30%) 2/10 (20%) 2/10 (20%)
Multi-join 4/10 (40%) 3/10 (30%) 2/10 (20%)
Nested subquery 4/10 (40%) 2/10 (20%) 2/10 (20%)
Advanced 2/10 (20%) 1/10 (10%) 1/10 (10%)

Vanna dominates simple queries (70% vs 30%) and holds a meaningful edge on multi-join and nested subqueries. All three approaches collapse to 20% or below on advanced queries (window functions, CTEs, cohort analysis).

Accuracy by Schema Tier

Tier Description Vanna AI WrenAI-style DB-GPT-style
Tier 1 Clean names (5 tables) 13/23 (56.5%) 7/23 (30.4%) 6/23 (26.1%)
Tier 2 More tables (10 total) 4/16 (25.0%) 1/16 (6.2%) 1/16 (6.2%)
Tier 3 Abbreviated names (16 total) 3/11 (27.3%) 3/11 (27.3%) 3/11 (27.3%)

The Tier 2 drop is dramatic — accuracy falls off a cliff when the schema expands from 5 to 10 tables. Interestingly, all three approaches score identically on Tier 3 (27.3%): abbreviated names don't add difficulty beyond what the larger table count already causes, and Vanna's retrieval edge disappears once the schema gets cryptic.


Vanna AI: The Training Data Effect

One of Vanna's distinguishing features is its training pipeline. You feed it DDL, documentation, and example question-SQL pairs, and it stores everything in a ChromaDB vector store for retrieval. Here's the impact of adding just 5 example pairs:

Mode Queries Exec Accuracy Exact Match Errors Avg Latency
Zero-shot (DDL + docs only) 23 4/23 (17.4%) 0/23 (0.0%) 1 1,662ms
5 example Q-SQL pairs 23 11/23 (47.8%) 5/23 (21.7%) 0 1,331ms
Improvement +30.4pp +21.7pp -1 -331ms

Five training examples nearly tripled accuracy from 17.4% to 47.8%. Zero-shot Vanna (17.4%) is actually worse than the other approaches on the same Tier 1 queries, which tells us the RAG retrieval of relevant examples is doing most of the heavy lifting — not the architecture itself.
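To make the retrieval effect concrete, here's a toy stand-in for what Vanna does at question time — real Vanna uses ChromaDB embedding similarity, not the word-overlap hack below, and these function names are illustrative:

```python
def retrieve_examples(question, example_pairs, k=2):
    """Pick the k most similar stored (question, sql) pairs.

    Crude stand-in for Vanna's vector search against ChromaDB: here
    similarity is just shared lowercase words, not embeddings.
    """
    q_words = set(question.lower().split())
    def overlap(pair):
        return len(q_words & set(pair[0].lower().split()))
    return sorted(example_pairs, key=overlap, reverse=True)[:k]

def build_prompt(question, ddl, examples):
    """Assemble a few-shot prompt: schema, then retrieved Q-SQL shots."""
    shots = "\n".join(f"Q: {q}\nSQL: {sql}" for q, sql in examples)
    return f"{ddl}\n\n{shots}\n\nQ: {question}\nSQL:"
```

The point is that a question like "How many orders have been cancelled?" retrieves a stored pair like "How many orders were delivered?" — and the model then imitates its shape, including which columns to select. That imitation is the 30-point jump.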


Deep Dive: Vanna AI Results

Approach: RAG + ChromaDB with 5 example Q-SQL training pairs, DDL, and documentation

By Category

Category Accuracy Errors Avg Latency
Simple 7/10 (70%) 0 1,030ms
Aggregation 3/10 (30%) 1 1,049ms
Multi-join 4/10 (40%) 1 1,676ms
Nested subquery 4/10 (40%) 0 1,328ms
Advanced 2/10 (20%) 0 1,577ms

By Tier

Tier Accuracy Errors
Tier 1 13/23 (56.5%) 0
Tier 2 4/16 (25.0%) 1
Tier 3 3/11 (27.3%) 1

Selected Query Results

Full per-query results (50 queries):
# Category Tier Question Result Latency
1 simple T1 List all customers from California. PASS 1,059ms
2 simple T1 Show the 5 most expensive products. PASS 1,111ms
3 simple T1 How many orders have been cancelled? PASS 1,014ms
4 simple T1 What are all the product categories? ~SQL 847ms
5 simple T1 Find all orders placed in January 2024. PASS 750ms
6 simple T1 Which products cost less than $50? PASS 712ms
7 simple T1 Show all delivered orders sorted by total amount desc. ~SQL 792ms
8 simple T2 List all employees in the Engineering department. ~SQL 1,561ms
9 simple T2 What are the names and countries of all suppliers? PASS 746ms
10 simple T2 Show all 5-star reviews. PASS 1,705ms
11 aggregation T1 What is the total revenue from all delivered orders? PASS 804ms
12 aggregation T1 How many orders does each customer have? Top 10. PASS 900ms
13 aggregation T1 What is the average order value by month in 2024? ~SQL 1,127ms
14 aggregation T1 Which product category generates the most revenue? ~SQL 1,415ms
15 aggregation T1 What is the profit margin for each product? ~SQL 718ms
16 aggregation T2 What is the average product rating by category? ~SQL 1,172ms
17 aggregation T2 What is the total salary expense per department? ~SQL 1,035ms
18 aggregation T2 Which carrier delivers the most shipments? ~SQL 873ms
19 aggregation T2 What is the average delivery time in days by carrier? FAIL 1,058ms
20 aggregation T3 How many returns by reason with total refund amount? PASS 1,390ms
21 multi-join T1 Show each customer's name and total spend. PASS 839ms
22 multi-join T1 What are the best-selling products by quantity? ~SQL 1,196ms
23 multi-join T1 Most popular product category per state. PASS 2,394ms
24 multi-join T2 Products with avg review rating and review count. ~SQL 2,200ms
25 multi-join T2 List employees and their managers' names. ~SQL 1,298ms
26 multi-join T2 Customers who ordered from every category. PASS 1,314ms
27 multi-join T3 Inventory levels across warehouses for products under 50. ~SQL 1,553ms
28 multi-join T3 Most popular payment methods by customer state. PASS 1,593ms
29 multi-join T3 Warehouse manager with most total inventory. FAIL 2,152ms
30 multi-join T3 Total refund amount by product category for completed returns. ~SQL 2,224ms
31 nested T1 Customers who spent more than the average. PASS 2,097ms
32 nested T1 Products that have never been ordered. PASS 747ms
33 nested T1 Most recent order for each customer. ~SQL 941ms
34 nested T1 Categories with products above average price. PASS 871ms
35 nested T2 Employees earning more than department average. PASS 1,237ms
36 nested T2 Products with above-average review ratings. ~SQL 1,271ms
37 nested T2 Suppliers who shipped above-average total orders. ~SQL 2,004ms
38 nested T3 Products where inventory is below reorder point. ~SQL 1,172ms
39 nested T3 Customers who used more than 2 payment methods. ~SQL 1,124ms
40 nested T3 Orders with total above average for that customer state. ~SQL 1,815ms
41 advanced T1 Rank customers by total spend with percentile. PASS 1,117ms
42 advanced T1 Month-over-month revenue growth rate for 2024. ~SQL 1,366ms
43 advanced T1 Product revenue as percentage of total. ~SQL 1,954ms
44 advanced T1 Running total of orders by date for 2024. ~SQL 1,329ms
45 advanced T2 Rank employees within department by salary. ~SQL 2,115ms
46 advanced T2 Top 3 products by review count with avg rating. ~SQL 1,119ms
47 advanced T2 Customer cohort analysis by signup month. ~SQL 1,217ms
48 advanced T3 Top 5 promo codes by usage with estimated revenue. ~SQL 2,503ms
49 advanced T3 Total orders and returns combined via UNION. ~SQL 1,783ms
50 advanced T3 Audit log activity by table and action type. PASS 1,270ms

SQL Examples

"How many orders have been cancelled?" — PASS (execution match)

-- Gold
SELECT COUNT(*) AS cancelled_orders FROM orders WHERE status = 'cancelled'

-- Vanna AI
SELECT COUNT(order_id) AS cancelled_orders_count FROM orders WHERE status = 'cancelled'

Different column alias and COUNT(order_id) vs COUNT(*), but same result: correct.

"What are all the product categories?" — ~SQL (close miss)

-- Gold
SELECT name, description FROM categories

-- Vanna AI
SELECT DISTINCT name FROM categories;

Vanna returned only the name column, missing description. Same row count but different columns.


Deep Dive: WrenAI-style Results

Approach: Single LLM call with full MDL (table descriptions, column display names, relationships) as context

By Category

Category Accuracy Errors Avg Latency
Simple 3/10 (30%) 0 920ms
Aggregation 2/10 (20%) 1 1,086ms
Multi-join 3/10 (30%) 0 1,333ms
Nested subquery 2/10 (20%) 1 1,054ms
Advanced 1/10 (10%) 0 1,233ms

By Tier

Tier Accuracy Errors
Tier 1 7/23 (30.4%) 0
Tier 2 1/16 (6.2%) 2
Tier 3 3/11 (27.3%) 0
Full per-query results (50 queries):
# Category Tier Question Result Latency
1 simple T1 List all customers from California. ~SQL 1,191ms
2 simple T1 Show the 5 most expensive products. ~SQL 888ms
3 simple T1 How many orders have been cancelled? PASS 383ms
4 simple T1 What are all the product categories? ~SQL 278ms
5 simple T1 Find all orders placed in January 2024. PASS 1,238ms
6 simple T1 Which products cost less than $50? ~SQL 446ms
7 simple T1 Show all delivered orders sorted by total amount desc. ~SQL 1,610ms
8 simple T2 List all employees in the Engineering department. ~SQL 1,617ms
9 simple T2 What are the names and countries of all suppliers? PASS 911ms
10 simple T2 Show all 5-star reviews. ~SQL 640ms
11 aggregation T1 What is the total revenue from all delivered orders? PASS 1,200ms
12 aggregation T1 How many orders does each customer have? Top 10. ~SQL 635ms
13 aggregation T1 What is the average order value by month in 2024? ~SQL 615ms
14 aggregation T1 Which product category generates the most revenue? ~SQL 2,760ms
15 aggregation T1 What is the profit margin for each product? ~SQL 634ms
16 aggregation T2 What is the average product rating by category? ~SQL 1,831ms
17 aggregation T2 What is the total salary expense per department? ~SQL 1,135ms
18 aggregation T2 Which carrier delivers the most shipments? ~SQL 414ms
19 aggregation T2 What is the average delivery time in days by carrier? FAIL 827ms
20 aggregation T3 How many returns by reason with total refund amount? PASS 810ms
21 multi-join T1 Show each customer's name and total spend. PASS 1,590ms
22 multi-join T1 What are the best-selling products by quantity? ~SQL 478ms
23 multi-join T1 Most popular product category per state. PASS 1,911ms
24 multi-join T2 Products with avg review rating and review count. ~SQL 2,865ms
25 multi-join T2 List employees and their managers' names. ~SQL 909ms
26 multi-join T2 Customers who ordered from every category. ~SQL 688ms
27 multi-join T3 Inventory levels across warehouses for products under 50. ~SQL 2,177ms
28 multi-join T3 Most popular payment methods by customer state. PASS 957ms
29 multi-join T3 Warehouse manager with most total inventory. ~SQL 967ms
30 multi-join T3 Total refund amount by product category for completed returns. ~SQL 790ms
31 nested T1 Customers who spent more than the average. ~SQL 2,187ms
32 nested T1 Products that have never been ordered. PASS 809ms
33 nested T1 Most recent order for each customer. ~SQL 1,459ms
34 nested T1 Categories with products above average price. PASS 785ms
35 nested T2 Employees earning more than department average. ~SQL 888ms
36 nested T2 Products with above-average review ratings. ~SQL 814ms
37 nested T2 Suppliers who shipped above-average total orders. FAIL 944ms
38 nested T3 Products where inventory is below reorder point. ~SQL 952ms
39 nested T3 Customers who used more than 2 payment methods. ~SQL 820ms
40 nested T3 Orders with total above average for that customer state. ~SQL 886ms
41 advanced T1 Rank customers by total spend with percentile. ~SQL 1,014ms
42 advanced T1 Month-over-month revenue growth rate for 2024. ~SQL 1,767ms
43 advanced T1 Product revenue as percentage of total. ~SQL 1,124ms
44 advanced T1 Running total of orders by date for 2024. ~SQL 1,536ms
45 advanced T2 Rank employees within department by salary. ~SQL 786ms
46 advanced T2 Top 3 products by review count with avg rating. ~SQL 1,729ms
47 advanced T2 Customer cohort analysis by signup month. ~SQL 1,200ms
48 advanced T3 Top 5 promo codes by usage with estimated revenue. ~SQL 834ms
49 advanced T3 Total orders and returns combined via UNION. ~SQL 1,707ms
50 advanced T3 Audit log activity by table and action type. PASS 633ms

SQL Examples

"List all customers from California." — ~SQL (close miss)

-- Gold
SELECT first_name, last_name, email, city FROM customers WHERE state = 'CA'

-- WrenAI-style
SELECT * FROM customers WHERE state = 'CA';

Despite having column display names in the MDL, the model defaulted to SELECT *. Same rows, different columns.
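For reference, the semantic-model context fed to the model looked roughly like this — a pared-down illustration with approximated field names, not WrenAI's actual MDL schema:

```python
import json

# Simplified stand-in for one MDL model entry. Field names are
# illustrative approximations of WrenAI's format, not the real spec.
mdl = {
    "models": [{
        "name": "customers",
        "description": "Registered customers, one row per account.",
        "columns": [
            {"name": "state", "displayName": "State (2-letter code)"},
            {"name": "first_name", "displayName": "First Name"},
            {"name": "last_name", "displayName": "Last Name"},
        ],
        "relationships": [
            {"to": "orders", "join": "customers.customer_id = orders.customer_id"},
        ],
    }]
}

def mdl_to_prompt(mdl, question):
    """Flatten the whole semantic model into a single-call prompt."""
    return ("Semantic model:\n" + json.dumps(mdl, indent=2)
            + f"\n\nQuestion: {question}\nSQL:")
```

The full 16-table version of this context is what drove the 193,969-token total: every query pays for the entire model, relevant or not.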


Deep Dive: DB-GPT-style Results

Approach: 3-step pipeline — schema linking, SQL generation with sample data, SQL correction on failure

By Category

Category Accuracy Errors Avg Latency
Simple 3/10 (30%) 0 2,220ms
Aggregation 2/10 (20%) 0 2,661ms
Multi-join 2/10 (20%) 0 3,768ms
Nested subquery 2/10 (20%) 0 3,018ms
Advanced 1/10 (10%) 0 2,913ms

By Tier

Tier Accuracy Errors
Tier 1 6/23 (26.1%) 0
Tier 2 1/16 (6.2%) 0
Tier 3 3/11 (27.3%) 0

The standout stat: 0 execution errors across all 50 queries. The correction step works — when SQL fails to execute, the LLM fixes it. But syntactically valid SQL that returns wrong results still counts as a miss.

Full per-query results (50 queries):
# Category Tier Question Result Latency
1 simple T1 List all customers from California. ~SQL 2,124ms
2 simple T1 Show the 5 most expensive products. ~SQL 2,097ms
3 simple T1 How many orders have been cancelled? PASS 1,710ms
4 simple T1 What are all the product categories? ~SQL 2,280ms
5 simple T1 Find all orders placed in January 2024. PASS 2,064ms
6 simple T1 Which products cost less than $50? ~SQL 1,920ms
7 simple T1 Show all delivered orders sorted by total amount desc. ~SQL 2,441ms
8 simple T2 List all employees in the Engineering department. ~SQL 4,232ms
9 simple T2 What are the names and countries of all suppliers? PASS 1,640ms
10 simple T2 Show all 5-star reviews. ~SQL 1,688ms
11 aggregation T1 What is the total revenue from all delivered orders? PASS 2,676ms
12 aggregation T1 How many orders does each customer have? Top 10. ~SQL 2,974ms
13 aggregation T1 What is the average order value by month in 2024? ~SQL 2,686ms
14 aggregation T1 Which product category generates the most revenue? ~SQL 3,367ms
15 aggregation T1 What is the profit margin for each product? ~SQL 1,737ms
16 aggregation T2 What is the average product rating by category? ~SQL 3,613ms
17 aggregation T2 What is the total salary expense per department? ~SQL 2,982ms
18 aggregation T2 Which carrier delivers the most shipments? ~SQL 1,791ms
19 aggregation T2 What is the average delivery time in days by carrier? ~SQL 2,728ms
20 aggregation T3 How many returns by reason with total refund amount? PASS 2,054ms
21 multi-join T1 Show each customer's name and total spend. ~SQL 2,854ms
22 multi-join T1 What are the best-selling products by quantity? ~SQL 1,907ms
23 multi-join T1 Most popular product category per state. PASS 4,462ms
24 multi-join T2 Products with avg review rating and review count. ~SQL 3,814ms
25 multi-join T2 List employees and their managers' names. ~SQL 2,534ms
26 multi-join T2 Customers who ordered from every category. ~SQL 2,865ms
27 multi-join T3 Inventory levels across warehouses for products under 50. ~SQL 3,487ms
28 multi-join T3 Most popular payment methods by customer state. PASS 2,973ms
29 multi-join T3 Warehouse manager with most total inventory. ~SQL 4,169ms
30 multi-join T3 Total refund amount by product category for completed returns. ~SQL 8,612ms
31 nested T1 Customers who spent more than the average. ~SQL 4,072ms
32 nested T1 Products that have never been ordered. PASS 3,126ms
33 nested T1 Most recent order for each customer. ~SQL 2,995ms
34 nested T1 Categories with products above average price. PASS 2,794ms
35 nested T2 Employees earning more than department average. ~SQL 2,819ms
36 nested T2 Products with above-average review ratings. ~SQL 2,711ms
37 nested T2 Suppliers who shipped above-average total orders. ~SQL 3,288ms
38 nested T3 Products where inventory is below reorder point. ~SQL 2,879ms
39 nested T3 Customers who used more than 2 payment methods. ~SQL 3,035ms
40 nested T3 Orders with total above average for that customer state. ~SQL 2,465ms
41 advanced T1 Rank customers by total spend with percentile. ~SQL 3,030ms
42 advanced T1 Month-over-month revenue growth rate for 2024. ~SQL 2,867ms
43 advanced T1 Product revenue as percentage of total. ~SQL 3,340ms
44 advanced T1 Running total of orders by date for 2024. ~SQL 2,316ms
45 advanced T2 Rank employees within department by salary. ~SQL 2,840ms
46 advanced T2 Top 3 products by review count with avg rating. ~SQL 2,352ms
47 advanced T2 Customer cohort analysis by signup month. ~SQL 2,685ms
48 advanced T3 Top 5 promo codes by usage with estimated revenue. ~SQL 4,946ms
49 advanced T3 Total orders and returns combined via UNION. ~SQL 2,793ms
50 advanced T3 Audit log activity by table and action type. PASS 1,960ms

SQL Examples

"Which products cost less than $50?" — ~SQL (close miss)

-- Gold
SELECT name, price FROM products WHERE price < 50 ORDER BY price

-- DB-GPT-style
SELECT product_id, name, price, cost
FROM products
WHERE cost < 50;

Schema linking correctly identified the products table, but the LLM picked the wrong column (cost instead of price) and added extra columns. The sample data step didn't help disambiguate.


Head-to-Head: Same Query, Three Approaches

Here's how each approach handled the same questions — same LLM, same database, different context strategies.

"Show the 5 most expensive products."

Vanna AI (PASS) — trained on similar examples, nailed the column selection:

SELECT name, price FROM products ORDER BY price DESC LIMIT 5

WrenAI-style (~SQL) — had column display names in MDL but still added product_id:

SELECT product_id, name, price FROM products ORDER BY price DESC LIMIT 5;

DB-GPT-style (~SQL) — schema linking found the right table, but same extra column:

SELECT product_id, name, price FROM products ORDER BY price DESC LIMIT 5;

Gold: SELECT name, price FROM products ORDER BY price DESC LIMIT 5

The pattern is clear: Vanna's example Q-SQL pairs teach the model which columns to select. The other approaches give the model the full schema and let it decide — and it consistently over-selects.

"How many orders does each customer have? Show the top 10."

Vanna AI (PASS):

SELECT c.first_name, c.last_name, COUNT(o.order_id) AS order_count
FROM customers c JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name
ORDER BY order_count DESC LIMIT 10

WrenAI-style (~SQL) — used LEFT JOIN instead of INNER JOIN:

SELECT customers.customer_id, customers.first_name, customers.last_name,
    COUNT(orders.order_id) AS order_count
FROM customers
LEFT JOIN orders ON customers.customer_id = orders.customer_id
GROUP BY customers.customer_id, customers.first_name, customers.last_name
ORDER BY order_count DESC LIMIT 10;

DB-GPT-style (~SQL) — also LEFT JOIN:

SELECT c.customer_id, c.first_name, c.last_name, COUNT(o.order_id) AS order_count
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name
ORDER BY order_count DESC LIMIT 10;

Gold: SELECT c.first_name, c.last_name, COUNT(o.order_id) AS order_count FROM customers c JOIN orders o ON c.customer_id = o.customer_id GROUP BY c.customer_id, c.first_name, c.last_name ORDER BY order_count DESC LIMIT 10

Both WrenAI-style and DB-GPT-style used LEFT JOIN (which keeps customers with zero orders) and added a customer_id column, while the gold SQL uses INNER JOIN and only the name columns. Either difference alone changes the result set.
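The join difference is easy to demonstrate — here sqlite3 stands in for DuckDB:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers(customer_id INTEGER, first_name TEXT);
    CREATE TABLE orders(order_id INTEGER, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');  -- Grace has no orders
    INSERT INTO orders VALUES (10, 1), (11, 1);
""")

inner = conn.execute("""
    SELECT c.first_name, COUNT(o.order_id) AS n
    FROM customers c JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.customer_id, c.first_name
""").fetchall()

left = conn.execute("""
    SELECT c.first_name, COUNT(o.order_id) AS n
    FROM customers c LEFT JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.customer_id, c.first_name
""").fetchall()

# INNER JOIN drops Grace entirely; LEFT JOIN keeps her with a count of 0.
```

Arguably the LEFT JOIN is the more defensible reading of the question — but execution accuracy is graded against the gold SQL, so it counts as a miss.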


Key Findings

1. RAG with examples beats semantic models and multi-step pipelines

Vanna's 40% vs WrenAI's 22% and DB-GPT's 20% isn't close. The difference comes down to one thing: Vanna shows the LLM examples of correct SQL for similar questions. The other approaches give it schema information and hope for the best. Few-shot examples remain the most effective way to steer LLM SQL generation.

2. The column selection problem is universal

The single biggest failure mode across all approaches: SELECT * or selecting extra columns when the gold SQL specifies particular ones. This isn't a logic error — the SQL is semantically reasonable — but it fails execution accuracy. A production system needs either strict column selection rules or a more forgiving evaluation metric.
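A more forgiving metric might accept a result whenever the gold columns appear somewhere in the generated projection and carry the same values. A sketch (the function name and signature are mine):

```python
def superset_match(gold_rows, gold_cols, gen_rows, gen_cols):
    """Forgiving check: pass if the gold columns appear (in any position)
    within the generated result and project to the same rows.

    gold_cols / gen_cols are column-name lists; rows are tuples.
    """
    try:
        idx = [gen_cols.index(c) for c in gold_cols]
    except ValueError:
        return False  # a gold column is missing from the generated result
    projected = sorted(tuple(row[i] for i in idx) for row in gen_rows)
    return projected == sorted(gold_rows)
```

Under this metric, the "extra product_id column" misses above would all flip to passes; the wrong-column and wrong-rows misses would not.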

3. Multi-step pipelines trade accuracy for reliability

DB-GPT's 3-step approach achieved 0 execution errors (vs 2 each for the others) but scored lowest on accuracy. The correction step catches syntax errors but can't fix wrong logic. The extra LLM calls add ~1.5s latency per query. In production, zero errors matters — but you need to pair it with better initial generation.

4. Schema complexity is the real enemy

All three approaches score 26-57% on Tier 1 (5 tables with clean names) and collapse to 6-25% on Tier 2 (10 tables). The problem isn't abbreviated names — Tier 3 scores are no worse than Tier 2 despite the cryptic identifiers — it's having more tables for the model to choose from. Schema pruning and table selection are critical unsolved problems.
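Even a crude lexical pass over table and column names would shrink the schema before the LLM sees it. A sketch of the idea — real systems would use embeddings or an LLM linker, as DB-GPT does:

```python
def prune_schema(question, tables, max_tables=5):
    """Rank tables by word overlap between the question and the table's
    name plus column names, and keep only the top few.

    `tables` maps table name -> list of column names. This is a cheap
    first pass, not a substitute for proper schema linking.
    """
    q_words = set(question.lower().replace("?", "").split())
    def score(item):
        name, cols = item
        vocab = set(name.lower().split("_")) | {
            word for col in cols for word in col.lower().split("_")
        }
        return len(q_words & vocab)
    ranked = sorted(tables.items(), key=score, reverse=True)
    return [name for name, _ in ranked[:max_tables]]
```

Of course, this is exactly where Tier 3 bites: a question about "warehouses" shares no words with a table named `wh`, which is why name-based pruning alone can't be the whole answer.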

5. None of these approaches are production-ready at 40%

The best score here is 40%, and that's with the simplest evaluation metric (does the result set match). Real users would consider many of the ~SQL results "close enough." But even with generous grading, NL2SQL tools need significant improvement — or a very controlled schema — before they can reliably replace writing SQL.


Latency and Cost

Approach Avg Latency LLM Calls/Query Total Tokens Why
WrenAI-style 1,125ms 1 193,969 Single call with large MDL context (~3,800 tokens/query)
Vanna AI 1,332ms 1 N/A Single call + ChromaDB vector search overhead
DB-GPT-style 2,916ms 2-3 74,607 Schema linking + generation + optional correction

WrenAI-style is fastest but uses the most tokens, since every query carries the full MDL context. DB-GPT-style uses fewer tokens in total but makes 2-3 calls, resulting in 2.6x the latency. Vanna's single call with retrieved context offers the best balance of latency and accuracy — though its token usage wasn't captured in this run, so the cost comparison is incomplete.


What I'd Do Differently

If I were building a production NL2SQL system based on these results:

  1. Start with Vanna's approach — RAG with curated Q-SQL examples gives the best accuracy floor. Invest in building a quality training set rather than better prompts.

  2. Add DB-GPT's correction step — The 0-error rate is valuable. Use a 2-step pipeline: RAG-based generation first, then error correction if execution fails.

  3. Borrow WrenAI's semantic context — Table and column descriptions help, especially for non-obvious schemas. Combine MDL-style descriptions with example pairs.

  4. Solve column selection separately — Either constrain the LLM to always use SELECT * (and handle projection downstream) or add a post-processing step that maps columns.

  5. Invest in schema pruning — The Tier 2 drop shows that giving the LLM the full schema hurts. DB-GPT's schema linking idea is right; it just needs to be more reliable.
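Putting recommendations 1 and 2 together, the pipeline I'd start with looks like this — `retrieve`, `call_llm`, and `execute` are stand-ins for the RAG store, the OpenAI call, and DuckDB:

```python
def generate_with_repair(question, retrieve, call_llm, execute, max_fixes=1):
    """Recommendations 1 + 2 in one sketch: few-shot RAG generation,
    then a correction pass only when execution raises.

    retrieve(question) -> few-shot context string; call_llm and execute
    are injected stand-ins for the LLM and the database.
    """
    sql = call_llm(f"{retrieve(question)}\nQ: {question}\nSQL:")
    for _ in range(max_fixes + 1):
        try:
            return sql, execute(sql)
        except Exception as err:
            sql = call_llm(f"Fix ({err}): {sql}")
    raise RuntimeError("could not produce executable SQL")
```

On these numbers, that combination would target Vanna's 40% accuracy floor with DB-GPT's 0% error rate, at roughly one extra LLM call only on the queries that fail to execute.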


The benchmark code, all 50 queries, and raw results are available on GitHub. Built with DuckDB, GPT-4o, and a lot of patience.