NL2SQL Benchmark: Vanna AI vs WrenAI vs DB-GPT — 50 Queries, One Database, Three Approaches
In my previous post, I dissected the architectures of three leading open-source NL2SQL tools — Vanna AI (RAG-first), WrenAI (semantic model), and DB-GPT (multi-agent). That was a code-level teardown of how each tool works. This post answers the natural follow-up: how well do they actually work?
I built a benchmark with 50 gold-standard SQL queries across 5 categories and 3 tiers of schema complexity, all against the same DuckDB database. Every tool used GPT-4o as the underlying LLM, so the differences come down to prompt engineering, context management, and pipeline design — not model capability.
Methodology note: Vanna AI was run as-is using its Python SDK with ChromaDB. For WrenAI and DB-GPT, I replicated each tool's core methodology (semantic model context and multi-step pipeline, respectively) using direct OpenAI API calls, since both require heavyweight Docker deployments that weren't practical for isolated benchmarking. The prompts and pipeline logic mirror what each tool does internally, but these are not results from the full production systems. Think of them as "approach benchmarks" — testing the strategy, not the full product.
Benchmark Setup
| Parameter | Value |
|---|---|
| Database | DuckDB — 16 tables, 3 tiers of schema complexity |
| Queries | 50 gold-standard (10 per category) |
| Categories | simple, aggregation, multi-join, nested subquery, advanced |
| Tier 1 | 5 core tables — clean, descriptive names (customers, orders, products, etc.) |
| Tier 2 | +5 tables — departments, employees, suppliers, shipments, reviews |
| Tier 3 | +6 tables — abbreviated names (wh, inv, promo, ret, pay_mthd, aud_log) |
| LLM | GPT-4o (same model for all approaches) |
| Evaluation | Execution accuracy — does the generated SQL produce the same result set as gold? |
How Results Are Scored
Each query gets one of four results:
| Result | Meaning |
|---|---|
| PASS | Generated SQL produces the same result set as the gold SQL (exact string matches are tracked separately) |
| ~SQL | SQL executed but returned a near-miss result set — wrong columns, extra columns, or wrong aggregation on the right rows |
| FAIL | SQL executed but the results were fundamentally wrong — different row count or wrong data |
| ERROR | SQL failed to execute entirely — syntax error, missing table, DuckDB incompatibility |
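The scoring logic can be sketched as a small harness. This is my reconstruction, not the benchmark's exact code: sqlite3 stands in for DuckDB (both expose the same `execute(...).fetchall()` shape), and the ~SQL-vs-FAIL boundary is reduced to a row-count heuristic.

```python
import sqlite3


def score(con, gold_sql: str, gen_sql: str) -> str:
    """Classify a generated query against gold by comparing executed result sets."""
    gold = con.execute(gold_sql).fetchall()
    try:
        got = con.execute(gen_sql).fetchall()
    except Exception:
        return "ERROR"  # failed to execute at all: syntax error, missing table, etc.
    if sorted(gold) == sorted(got):
        return "PASS"   # same result set, order-insensitive
    if len(got) == len(gold):
        return "~SQL"   # ran, same row count, but different columns or values
    return "FAIL"       # ran, but fundamentally different results
```

The same function works unchanged against a `duckdb.connect()` connection, since DuckDB's Python connection also supports `execute(...).fetchall()`.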
Three Approaches Tested
| Approach | How It Works |
|---|---|
| Vanna AI (RAG + ChromaDB) | Trains DDL, documentation, and example Q-SQL pairs into a ChromaDB vector store. For each question, retrieves the most relevant context and sends it with the question to the LLM. Run via Vanna's Python SDK. |
| WrenAI-style (Semantic Model) | Sends a rich MDL (Modeling Definition Language) with table descriptions, column display names, and relationship definitions as context in a single LLM call. Replicates WrenAI's core methodology. |
| DB-GPT-style (Multi-Step Agent) | Three sequential LLM calls: (1) schema linking to identify relevant tables, (2) SQL generation with focused schema + sample data, (3) SQL correction if execution fails. Replicates DB-GPT's AWEL pipeline logic. |
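The DB-GPT-style loop can be sketched as below. This is my reconstruction of the three-step flow, not DB-GPT's actual AWEL code: `ask_llm` and `execute_sql` are injected stand-ins for the OpenAI chat call and the DuckDB connection, and the prompt wording is illustrative.

```python
def multi_step_pipeline(question, full_schema, ask_llm, execute_sql, max_fixes=1):
    """Three-step NL2SQL pipeline: schema linking, generation, correction.

    ask_llm(prompt) -> str and execute_sql(sql) -> (ok, result_or_error)
    are injected so any LLM client and database can be plugged in.
    """
    # Step 1: schema linking: identify which tables matter for this question.
    tables = ask_llm(
        f"Schema:\n{full_schema}\n\nQuestion: {question}\n"
        "List only the relevant table names, comma-separated."
    )
    # Step 2: generation against the pruned schema (sample rows would be appended here).
    sql = ask_llm(
        f"Relevant tables: {tables}\nQuestion: {question}\n"
        "Return a single DuckDB SQL query."
    )
    # Step 3: correction: on execution failure, feed the error back and retry.
    ok, result = execute_sql(sql)
    for _ in range(max_fixes):
        if ok:
            break
        sql = ask_llm(
            f"This SQL failed with error: {result}\nSQL: {sql}\n"
            "Return the corrected query."
        )
        ok, result = execute_sql(sql)
    return sql, (result if ok else None)
```

Injecting the two callbacks also makes the pipeline testable with stubs, which is how the correction step's behavior (retry on error, stop on success) can be verified without an API key.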
Headline Results
| Approach | Exec Accuracy | Exact Match | Errors | Avg Latency | Tokens Used |
|---|---|---|---|---|---|
| Vanna AI | 20/50 (40.0%) | 7/50 (14.0%) | 2 | 1,332ms | N/A |
| WrenAI-style | 11/50 (22.0%) | 2/50 (4.0%) | 2 | 1,125ms | 193,969 |
| DB-GPT-style | 10/50 (20.0%) | 2/50 (4.0%) | 0 | 2,916ms | 74,607 |
Vanna AI leads at 40% execution accuracy — nearly double the other two approaches. But let's break this down further.
Accuracy by Category
| Category | Vanna AI | WrenAI-style | DB-GPT-style |
|---|---|---|---|
| Simple | 7/10 (70%) | 3/10 (30%) | 3/10 (30%) |
| Aggregation | 3/10 (30%) | 2/10 (20%) | 2/10 (20%) |
| Multi-join | 4/10 (40%) | 3/10 (30%) | 2/10 (20%) |
| Nested subquery | 4/10 (40%) | 2/10 (20%) | 2/10 (20%) |
| Advanced | 2/10 (20%) | 1/10 (10%) | 1/10 (10%) |
Vanna dominates simple queries (70% vs 30%) and holds a meaningful edge on multi-join and nested subqueries. All three approaches fall to 20% or below on advanced queries (window functions, CTEs, cohort analysis).
Accuracy by Schema Tier
| Tier | Description | Vanna AI | WrenAI-style | DB-GPT-style |
|---|---|---|---|---|
| Tier 1 | Clean names (5 tables) | 13/23 (56.5%) | 7/23 (30.4%) | 6/23 (26.1%) |
| Tier 2 | More tables (10 total) | 4/16 (25.0%) | 1/16 (6.2%) | 1/16 (6.2%) |
| Tier 3 | Abbreviated names (16 total) | 3/11 (27.3%) | 3/11 (27.3%) | 3/11 (27.3%) |
The Tier 2 drop is dramatic — accuracy falls off a cliff when the schema expands from 5 to 10 tables. Interestingly, all three approaches score identically on Tier 3 (3/11, 27.3%), which suggests abbreviated names by themselves aren't the bottleneck; the number of candidate tables is.
Vanna AI: The Training Data Effect
One of Vanna's distinguishing features is its training pipeline. You feed it DDL, documentation, and example question-SQL pairs, and it stores everything in a ChromaDB vector store for retrieval. Here's the impact of adding just 5 example pairs:
| Mode | Queries | Exec Accuracy | Exact Match | Errors | Avg Latency |
|---|---|---|---|---|---|
| Zero-shot (DDL + docs only) | 23 | 4/23 (17.4%) | 0/23 (0.0%) | 1 | 1,662ms |
| 5 example Q-SQL pairs | 23 | 11/23 (47.8%) | 5/23 (21.7%) | 0 | 1,331ms |
| Improvement | — | +30.4pp | +21.7pp | -1 | -331ms |
Five training examples nearly tripled accuracy from 17.4% to 47.8%. Zero-shot Vanna (17.4%) is actually worse than the other approaches on the same Tier 1 queries, which tells us the RAG retrieval of relevant examples is doing most of the heavy lifting — not the architecture itself.
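To make the "retrieval is doing the heavy lifting" point concrete, here is a toy version of the few-shot prompt assembly. Vanna ranks trained pairs by embedding similarity in ChromaDB; word overlap stands in for that here, and all names and prompt wording are illustrative.

```python
def retrieve_examples(question, trained_pairs, k=2):
    """Rank trained Q-SQL pairs by word overlap with the new question
    (a crude stand-in for ChromaDB's embedding search)."""
    q_words = set(question.lower().split())
    return sorted(
        trained_pairs,
        key=lambda p: len(q_words & set(p["question"].lower().split())),
        reverse=True,
    )[:k]


def build_prompt(question, ddl, trained_pairs, k=2):
    """Assemble the DDL plus the k nearest examples into a few-shot prompt."""
    shots = "\n\n".join(
        f"Question: {p['question']}\nSQL: {p['sql']}"
        for p in retrieve_examples(question, trained_pairs, k)
    )
    return f"{ddl}\n\n{shots}\n\nQuestion: {question}\nSQL:"
```

The retrieved pair doesn't need to answer the new question; it just has to show the model the house style — which columns to select, which aliases to use — for a structurally similar one.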
Deep Dive: Vanna AI Results
Approach: RAG + ChromaDB with 5 example Q-SQL training pairs, DDL, and documentation
By Category
| Category | Accuracy | Errors | Avg Latency |
|---|---|---|---|
| Simple | 7/10 (70%) | 0 | 1,030ms |
| Aggregation | 3/10 (30%) | 1 | 1,049ms |
| Multi-join | 4/10 (40%) | 1 | 1,676ms |
| Nested subquery | 4/10 (40%) | 0 | 1,328ms |
| Advanced | 2/10 (20%) | 0 | 1,577ms |
By Tier
| Tier | Accuracy | Errors |
|---|---|---|
| Tier 1 | 13/23 (56.5%) | 0 |
| Tier 2 | 4/16 (25.0%) | 1 |
| Tier 3 | 3/11 (27.3%) | 1 |
Selected Query Results
Expand full per-query results (50 queries)
| # | Category | Tier | Question | Result | Latency |
|---|---|---|---|---|---|
| 1 | simple | T1 | List all customers from California. | PASS | 1,059ms |
| 2 | simple | T1 | Show the 5 most expensive products. | PASS | 1,111ms |
| 3 | simple | T1 | How many orders have been cancelled? | PASS | 1,014ms |
| 4 | simple | T1 | What are all the product categories? | ~SQL | 847ms |
| 5 | simple | T1 | Find all orders placed in January 2024. | PASS | 750ms |
| 6 | simple | T1 | Which products cost less than $50? | PASS | 712ms |
| 7 | simple | T1 | Show all delivered orders sorted by total amount desc. | ~SQL | 792ms |
| 8 | simple | T2 | List all employees in the Engineering department. | ~SQL | 1,561ms |
| 9 | simple | T2 | What are the names and countries of all suppliers? | PASS | 746ms |
| 10 | simple | T2 | Show all 5-star reviews. | PASS | 1,705ms |
| 11 | aggregation | T1 | What is the total revenue from all delivered orders? | PASS | 804ms |
| 12 | aggregation | T1 | How many orders does each customer have? Top 10. | PASS | 900ms |
| 13 | aggregation | T1 | What is the average order value by month in 2024? | ~SQL | 1,127ms |
| 14 | aggregation | T1 | Which product category generates the most revenue? | ~SQL | 1,415ms |
| 15 | aggregation | T1 | What is the profit margin for each product? | ~SQL | 718ms |
| 16 | aggregation | T2 | What is the average product rating by category? | ~SQL | 1,172ms |
| 17 | aggregation | T2 | What is the total salary expense per department? | ~SQL | 1,035ms |
| 18 | aggregation | T2 | Which carrier delivers the most shipments? | ~SQL | 873ms |
| 19 | aggregation | T2 | What is the average delivery time in days by carrier? | FAIL | 1,058ms |
| 20 | aggregation | T3 | How many returns by reason with total refund amount? | PASS | 1,390ms |
| 21 | multi-join | T1 | Show each customer's name and total spend. | PASS | 839ms |
| 22 | multi-join | T1 | What are the best-selling products by quantity? | ~SQL | 1,196ms |
| 23 | multi-join | T1 | Most popular product category per state. | PASS | 2,394ms |
| 24 | multi-join | T2 | Products with avg review rating and review count. | ~SQL | 2,200ms |
| 25 | multi-join | T2 | List employees and their managers' names. | ~SQL | 1,298ms |
| 26 | multi-join | T2 | Customers who ordered from every category. | PASS | 1,314ms |
| 27 | multi-join | T3 | Inventory levels across warehouses for products under 50. | ~SQL | 1,553ms |
| 28 | multi-join | T3 | Most popular payment methods by customer state. | PASS | 1,593ms |
| 29 | multi-join | T3 | Warehouse manager with most total inventory. | FAIL | 2,152ms |
| 30 | multi-join | T3 | Total refund amount by product category for completed returns. | ~SQL | 2,224ms |
| 31 | nested | T1 | Customers who spent more than the average. | PASS | 2,097ms |
| 32 | nested | T1 | Products that have never been ordered. | PASS | 747ms |
| 33 | nested | T1 | Most recent order for each customer. | ~SQL | 941ms |
| 34 | nested | T1 | Categories with products above average price. | PASS | 871ms |
| 35 | nested | T2 | Employees earning more than department average. | PASS | 1,237ms |
| 36 | nested | T2 | Products with above-average review ratings. | ~SQL | 1,271ms |
| 37 | nested | T2 | Suppliers who shipped above-average total orders. | ~SQL | 2,004ms |
| 38 | nested | T3 | Products where inventory is below reorder point. | ~SQL | 1,172ms |
| 39 | nested | T3 | Customers who used more than 2 payment methods. | ~SQL | 1,124ms |
| 40 | nested | T3 | Orders with total above average for that customer state. | ~SQL | 1,815ms |
| 41 | advanced | T1 | Rank customers by total spend with percentile. | PASS | 1,117ms |
| 42 | advanced | T1 | Month-over-month revenue growth rate for 2024. | ~SQL | 1,366ms |
| 43 | advanced | T1 | Product revenue as percentage of total. | ~SQL | 1,954ms |
| 44 | advanced | T1 | Running total of orders by date for 2024. | ~SQL | 1,329ms |
| 45 | advanced | T2 | Rank employees within department by salary. | ~SQL | 2,115ms |
| 46 | advanced | T2 | Top 3 products by review count with avg rating. | ~SQL | 1,119ms |
| 47 | advanced | T2 | Customer cohort analysis by signup month. | ~SQL | 1,217ms |
| 48 | advanced | T3 | Top 5 promo codes by usage with estimated revenue. | ~SQL | 2,503ms |
| 49 | advanced | T3 | Total orders and returns combined via UNION. | ~SQL | 1,783ms |
| 50 | advanced | T3 | Audit log activity by table and action type. | PASS | 1,270ms |
SQL Examples
"How many orders have been cancelled?" — PASS (execution match)
-- Gold
SELECT COUNT(*) AS cancelled_orders FROM orders WHERE status = 'cancelled'
-- Vanna AI
SELECT COUNT(order_id) AS cancelled_orders_count FROM orders WHERE status = 'cancelled'
Different column alias and COUNT(order_id) vs COUNT(*), but same result: correct.
"What are all the product categories?" — ~SQL (close miss)
-- Gold
SELECT name, description FROM categories
-- Vanna AI
SELECT DISTINCT name FROM categories;
Vanna returned only the name column, missing description. Same row count but different columns.
Deep Dive: WrenAI-style Results
Approach: Single LLM call with full MDL (table descriptions, column display names, relationships) as context
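WrenAI's real MDL is a JSON document with its own schema; the fragment below is an illustrative approximation (the field names are mine, not WrenAI's exact spec) of how a semantic model gets flattened into the single-call prompt context.

```python
# Hypothetical MDL-style semantic model: tables, described columns, joins.
mdl = {
    "models": [
        {"name": "customers",
         "description": "One row per registered customer",
         "columns": [
             {"name": "customer_id", "displayName": "Customer ID"},
             {"name": "state", "displayName": "US state (2-letter code)"}]},
        {"name": "orders",
         "description": "One row per order; status is pending, delivered, or cancelled",
         "columns": [
             {"name": "order_id", "displayName": "Order ID"},
             {"name": "customer_id", "displayName": "Customer ID (FK)"}]},
    ],
    "relationships": [
        {"condition": "orders.customer_id = customers.customer_id"},
    ],
}


def mdl_to_prompt(mdl, question):
    """Flatten the semantic model into the context for one LLM call."""
    lines = []
    for m in mdl["models"]:
        cols = ", ".join(f"{c['name']} ({c['displayName']})" for c in m["columns"])
        lines.append(f"Table {m['name']}: {m['description']}. Columns: {cols}")
    for r in mdl["relationships"]:
        lines.append(f"Join: {r['condition']}")
    lines.append(f"Question: {question}\nReturn one DuckDB SQL query.")
    return "\n".join(lines)
```

Every query pays for the full flattened model, which is why this approach's token count (193,969 across 50 queries) is the highest of the three.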
By Category
| Category | Accuracy | Errors | Avg Latency |
|---|---|---|---|
| Simple | 3/10 (30%) | 0 | 920ms |
| Aggregation | 2/10 (20%) | 1 | 1,086ms |
| Multi-join | 3/10 (30%) | 0 | 1,333ms |
| Nested subquery | 2/10 (20%) | 1 | 1,054ms |
| Advanced | 1/10 (10%) | 0 | 1,233ms |
By Tier
| Tier | Accuracy | Errors |
|---|---|---|
| Tier 1 | 7/23 (30.4%) | 0 |
| Tier 2 | 1/16 (6.2%) | 2 |
| Tier 3 | 3/11 (27.3%) | 0 |
Expand full per-query results (50 queries)
| # | Category | Tier | Question | Result | Latency |
|---|---|---|---|---|---|
| 1 | simple | T1 | List all customers from California. | ~SQL | 1,191ms |
| 2 | simple | T1 | Show the 5 most expensive products. | ~SQL | 888ms |
| 3 | simple | T1 | How many orders have been cancelled? | PASS | 383ms |
| 4 | simple | T1 | What are all the product categories? | ~SQL | 278ms |
| 5 | simple | T1 | Find all orders placed in January 2024. | PASS | 1,238ms |
| 6 | simple | T1 | Which products cost less than $50? | ~SQL | 446ms |
| 7 | simple | T1 | Show all delivered orders sorted by total amount desc. | ~SQL | 1,610ms |
| 8 | simple | T2 | List all employees in the Engineering department. | ~SQL | 1,617ms |
| 9 | simple | T2 | What are the names and countries of all suppliers? | PASS | 911ms |
| 10 | simple | T2 | Show all 5-star reviews. | ~SQL | 640ms |
| 11 | aggregation | T1 | What is the total revenue from all delivered orders? | PASS | 1,200ms |
| 12 | aggregation | T1 | How many orders does each customer have? Top 10. | ~SQL | 635ms |
| 13 | aggregation | T1 | What is the average order value by month in 2024? | ~SQL | 615ms |
| 14 | aggregation | T1 | Which product category generates the most revenue? | ~SQL | 2,760ms |
| 15 | aggregation | T1 | What is the profit margin for each product? | ~SQL | 634ms |
| 16 | aggregation | T2 | What is the average product rating by category? | ~SQL | 1,831ms |
| 17 | aggregation | T2 | What is the total salary expense per department? | ~SQL | 1,135ms |
| 18 | aggregation | T2 | Which carrier delivers the most shipments? | ~SQL | 414ms |
| 19 | aggregation | T2 | What is the average delivery time in days by carrier? | FAIL | 827ms |
| 20 | aggregation | T3 | How many returns by reason with total refund amount? | PASS | 810ms |
| 21 | multi-join | T1 | Show each customer's name and total spend. | PASS | 1,590ms |
| 22 | multi-join | T1 | What are the best-selling products by quantity? | ~SQL | 478ms |
| 23 | multi-join | T1 | Most popular product category per state. | PASS | 1,911ms |
| 24 | multi-join | T2 | Products with avg review rating and review count. | ~SQL | 2,865ms |
| 25 | multi-join | T2 | List employees and their managers' names. | ~SQL | 909ms |
| 26 | multi-join | T2 | Customers who ordered from every category. | ~SQL | 688ms |
| 27 | multi-join | T3 | Inventory levels across warehouses for products under 50. | ~SQL | 2,177ms |
| 28 | multi-join | T3 | Most popular payment methods by customer state. | PASS | 957ms |
| 29 | multi-join | T3 | Warehouse manager with most total inventory. | ~SQL | 967ms |
| 30 | multi-join | T3 | Total refund amount by product category for completed returns. | ~SQL | 790ms |
| 31 | nested | T1 | Customers who spent more than the average. | ~SQL | 2,187ms |
| 32 | nested | T1 | Products that have never been ordered. | PASS | 809ms |
| 33 | nested | T1 | Most recent order for each customer. | ~SQL | 1,459ms |
| 34 | nested | T1 | Categories with products above average price. | PASS | 785ms |
| 35 | nested | T2 | Employees earning more than department average. | ~SQL | 888ms |
| 36 | nested | T2 | Products with above-average review ratings. | ~SQL | 814ms |
| 37 | nested | T2 | Suppliers who shipped above-average total orders. | FAIL | 944ms |
| 38 | nested | T3 | Products where inventory is below reorder point. | ~SQL | 952ms |
| 39 | nested | T3 | Customers who used more than 2 payment methods. | ~SQL | 820ms |
| 40 | nested | T3 | Orders with total above average for that customer state. | ~SQL | 886ms |
| 41 | advanced | T1 | Rank customers by total spend with percentile. | ~SQL | 1,014ms |
| 42 | advanced | T1 | Month-over-month revenue growth rate for 2024. | ~SQL | 1,767ms |
| 43 | advanced | T1 | Product revenue as percentage of total. | ~SQL | 1,124ms |
| 44 | advanced | T1 | Running total of orders by date for 2024. | ~SQL | 1,536ms |
| 45 | advanced | T2 | Rank employees within department by salary. | ~SQL | 786ms |
| 46 | advanced | T2 | Top 3 products by review count with avg rating. | ~SQL | 1,729ms |
| 47 | advanced | T2 | Customer cohort analysis by signup month. | ~SQL | 1,200ms |
| 48 | advanced | T3 | Top 5 promo codes by usage with estimated revenue. | ~SQL | 834ms |
| 49 | advanced | T3 | Total orders and returns combined via UNION. | ~SQL | 1,707ms |
| 50 | advanced | T3 | Audit log activity by table and action type. | PASS | 633ms |
SQL Examples
"List all customers from California." — ~SQL (close miss)
-- Gold
SELECT first_name, last_name, email, city FROM customers WHERE state = 'CA'
-- WrenAI-style
SELECT * FROM customers WHERE state = 'CA';
Despite having column display names in the MDL, the model defaulted to SELECT *. Same rows, different columns.
Deep Dive: DB-GPT-style Results
Approach: 3-step pipeline — schema linking, SQL generation with sample data, SQL correction on failure
By Category
| Category | Accuracy | Errors | Avg Latency |
|---|---|---|---|
| Simple | 3/10 (30%) | 0 | 2,220ms |
| Aggregation | 2/10 (20%) | 0 | 2,661ms |
| Multi-join | 2/10 (20%) | 0 | 3,768ms |
| Nested subquery | 2/10 (20%) | 0 | 3,018ms |
| Advanced | 1/10 (10%) | 0 | 2,913ms |
By Tier
| Tier | Accuracy | Errors |
|---|---|---|
| Tier 1 | 6/23 (26.1%) | 0 |
| Tier 2 | 1/16 (6.2%) | 0 |
| Tier 3 | 3/11 (27.3%) | 0 |
The standout stat: 0 execution errors across all 50 queries. The correction step works — when SQL fails to execute, the LLM fixes it. But syntactically valid SQL that returns wrong results still counts as a miss.
Expand full per-query results (50 queries)
| # | Category | Tier | Question | Result | Latency |
|---|---|---|---|---|---|
| 1 | simple | T1 | List all customers from California. | ~SQL | 2,124ms |
| 2 | simple | T1 | Show the 5 most expensive products. | ~SQL | 2,097ms |
| 3 | simple | T1 | How many orders have been cancelled? | PASS | 1,710ms |
| 4 | simple | T1 | What are all the product categories? | ~SQL | 2,280ms |
| 5 | simple | T1 | Find all orders placed in January 2024. | PASS | 2,064ms |
| 6 | simple | T1 | Which products cost less than $50? | ~SQL | 1,920ms |
| 7 | simple | T1 | Show all delivered orders sorted by total amount desc. | ~SQL | 2,441ms |
| 8 | simple | T2 | List all employees in the Engineering department. | ~SQL | 4,232ms |
| 9 | simple | T2 | What are the names and countries of all suppliers? | PASS | 1,640ms |
| 10 | simple | T2 | Show all 5-star reviews. | ~SQL | 1,688ms |
| 11 | aggregation | T1 | What is the total revenue from all delivered orders? | PASS | 2,676ms |
| 12 | aggregation | T1 | How many orders does each customer have? Top 10. | ~SQL | 2,974ms |
| 13 | aggregation | T1 | What is the average order value by month in 2024? | ~SQL | 2,686ms |
| 14 | aggregation | T1 | Which product category generates the most revenue? | ~SQL | 3,367ms |
| 15 | aggregation | T1 | What is the profit margin for each product? | ~SQL | 1,737ms |
| 16 | aggregation | T2 | What is the average product rating by category? | ~SQL | 3,613ms |
| 17 | aggregation | T2 | What is the total salary expense per department? | ~SQL | 2,982ms |
| 18 | aggregation | T2 | Which carrier delivers the most shipments? | ~SQL | 1,791ms |
| 19 | aggregation | T2 | What is the average delivery time in days by carrier? | ~SQL | 2,728ms |
| 20 | aggregation | T3 | How many returns by reason with total refund amount? | PASS | 2,054ms |
| 21 | multi-join | T1 | Show each customer's name and total spend. | ~SQL | 2,854ms |
| 22 | multi-join | T1 | What are the best-selling products by quantity? | ~SQL | 1,907ms |
| 23 | multi-join | T1 | Most popular product category per state. | PASS | 4,462ms |
| 24 | multi-join | T2 | Products with avg review rating and review count. | ~SQL | 3,814ms |
| 25 | multi-join | T2 | List employees and their managers' names. | ~SQL | 2,534ms |
| 26 | multi-join | T2 | Customers who ordered from every category. | ~SQL | 2,865ms |
| 27 | multi-join | T3 | Inventory levels across warehouses for products under 50. | ~SQL | 3,487ms |
| 28 | multi-join | T3 | Most popular payment methods by customer state. | PASS | 2,973ms |
| 29 | multi-join | T3 | Warehouse manager with most total inventory. | ~SQL | 4,169ms |
| 30 | multi-join | T3 | Total refund amount by product category for completed returns. | ~SQL | 8,612ms |
| 31 | nested | T1 | Customers who spent more than the average. | ~SQL | 4,072ms |
| 32 | nested | T1 | Products that have never been ordered. | PASS | 3,126ms |
| 33 | nested | T1 | Most recent order for each customer. | ~SQL | 2,995ms |
| 34 | nested | T1 | Categories with products above average price. | PASS | 2,794ms |
| 35 | nested | T2 | Employees earning more than department average. | ~SQL | 2,819ms |
| 36 | nested | T2 | Products with above-average review ratings. | ~SQL | 2,711ms |
| 37 | nested | T2 | Suppliers who shipped above-average total orders. | ~SQL | 3,288ms |
| 38 | nested | T3 | Products where inventory is below reorder point. | ~SQL | 2,879ms |
| 39 | nested | T3 | Customers who used more than 2 payment methods. | ~SQL | 3,035ms |
| 40 | nested | T3 | Orders with total above average for that customer state. | ~SQL | 2,465ms |
| 41 | advanced | T1 | Rank customers by total spend with percentile. | ~SQL | 3,030ms |
| 42 | advanced | T1 | Month-over-month revenue growth rate for 2024. | ~SQL | 2,867ms |
| 43 | advanced | T1 | Product revenue as percentage of total. | ~SQL | 3,340ms |
| 44 | advanced | T1 | Running total of orders by date for 2024. | ~SQL | 2,316ms |
| 45 | advanced | T2 | Rank employees within department by salary. | ~SQL | 2,840ms |
| 46 | advanced | T2 | Top 3 products by review count with avg rating. | ~SQL | 2,352ms |
| 47 | advanced | T2 | Customer cohort analysis by signup month. | ~SQL | 2,685ms |
| 48 | advanced | T3 | Top 5 promo codes by usage with estimated revenue. | ~SQL | 4,946ms |
| 49 | advanced | T3 | Total orders and returns combined via UNION. | ~SQL | 2,793ms |
| 50 | advanced | T3 | Audit log activity by table and action type. | PASS | 1,960ms |
SQL Examples
"Which products cost less than $50?" — ~SQL (close miss)
-- Gold
SELECT name, price FROM products WHERE price < 50 ORDER BY price
-- DB-GPT-style
SELECT product_id, name, price, cost
FROM products
WHERE cost < 50;
Schema linking correctly identified the products table, but the LLM picked the wrong column (cost instead of price) and added extra columns. The sample data step didn't help disambiguate.
Head-to-Head: Same Query, Three Approaches
Here's how each approach handled the same questions — same LLM, same database, different context strategies.
"Show the 5 most expensive products."
Vanna AI (PASS) — trained on similar examples, nailed the column selection:
SELECT name, price FROM products ORDER BY price DESC LIMIT 5
WrenAI-style (~SQL) — had column display names in MDL but still added product_id:
SELECT product_id, name, price FROM products ORDER BY price DESC LIMIT 5;
DB-GPT-style (~SQL) — schema linking found the right table, but same extra column:
SELECT product_id, name, price FROM products ORDER BY price DESC LIMIT 5;
Gold: SELECT name, price FROM products ORDER BY price DESC LIMIT 5
The pattern is clear: Vanna's example Q-SQL pairs teach the model which columns to select. The other approaches give the model the full schema and let it decide — and it consistently over-selects.
"How many orders does each customer have? Show the top 10."
Vanna AI (PASS):
SELECT c.first_name, c.last_name, COUNT(o.order_id) AS order_count
FROM customers c JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name
ORDER BY order_count DESC LIMIT 10
WrenAI-style (~SQL) — used LEFT JOIN instead of INNER JOIN:
SELECT customers.customer_id, customers.first_name, customers.last_name,
COUNT(orders.order_id) AS order_count
FROM customers
LEFT JOIN orders ON customers.customer_id = orders.customer_id
GROUP BY customers.customer_id, customers.first_name, customers.last_name
ORDER BY order_count DESC LIMIT 10;
DB-GPT-style (~SQL) — also LEFT JOIN:
SELECT c.customer_id, c.first_name, c.last_name, COUNT(o.order_id) AS order_count
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name
ORDER BY order_count DESC LIMIT 10;
Gold: SELECT c.first_name, c.last_name, COUNT(o.order_id) AS order_count FROM customers c JOIN orders o ON c.customer_id = o.customer_id GROUP BY c.customer_id, c.first_name, c.last_name ORDER BY order_count DESC LIMIT 10
Both WrenAI-style and DB-GPT-style used LEFT JOIN (which includes customers with 0 orders), while the gold SQL uses INNER JOIN. This changes the result set.
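The LEFT-vs-INNER difference is easy to demonstrate on a toy table (sqlite3 here for a self-contained example; DuckDB treats these joins the same way):

```python
import sqlite3


def join_counts():
    """Return per-customer order counts under INNER JOIN vs LEFT JOIN."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE customers (customer_id INTEGER, first_name TEXT);
        CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
        INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
        INSERT INTO orders VALUES (10, 1);  -- Grace has no orders
    """)
    q = """SELECT c.first_name, COUNT(o.order_id)
           FROM customers c {join} orders o ON c.customer_id = o.customer_id
           GROUP BY c.customer_id, c.first_name"""
    inner = con.execute(q.format(join="JOIN")).fetchall()
    left = con.execute(q.format(join="LEFT JOIN")).fetchall()
    # INNER JOIN drops customers with zero orders; LEFT JOIN keeps them with count 0.
    return inner, left
```

Under strict execution-accuracy scoring, that extra zero-count row is enough to turn an otherwise reasonable query into a ~SQL miss.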
Key Findings
1. RAG with examples beats semantic models and multi-step pipelines
Vanna's 40% vs WrenAI's 22% and DB-GPT's 20% isn't close. The difference comes down to one thing: Vanna shows the LLM examples of correct SQL for similar questions. The other approaches give it schema information and hope for the best. Few-shot examples remain the most effective way to steer LLM SQL generation.
2. The column selection problem is universal
The single biggest failure mode across all approaches: SELECT * or selecting extra columns when the gold SQL specifies particular ones. This isn't a logic error — the SQL is semantically reasonable — but it fails execution accuracy. A production system needs either strict column selection rules or a more forgiving evaluation metric.
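One sketch of a more forgiving metric: pass when the gold columns appear anywhere in the generated projection with matching values. This depends on reliable column names, which alias drift (`cancelled_orders` vs `cancelled_orders_count`) makes fragile in practice; the function and names here are illustrative, not from the benchmark.

```python
def lenient_match(gold_rows, gold_cols, got_rows, got_cols):
    """Pass if the gold columns are a subset of the generated columns,
    with matching values, ignoring extra columns and row order."""
    try:
        idx = [got_cols.index(c) for c in gold_cols]
    except ValueError:
        return False  # a gold column is missing from the projection entirely
    projected = sorted(tuple(row[i] for i in idx) for row in got_rows)
    return projected == sorted(tuple(r) for r in gold_rows)
```

Under this metric, `SELECT product_id, name, price ...` would count as a pass against a gold `SELECT name, price ...`, which better matches what a human reviewer would call "close enough."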
3. Multi-step pipelines trade accuracy for reliability
DB-GPT's 3-step approach achieved 0 execution errors (vs 2 each for the others) but scored lowest on accuracy. The correction step catches syntax errors but can't fix wrong logic. The extra LLM calls add ~1.5s latency per query. In production, zero errors matters — but you need to pair it with better initial generation.
4. Schema complexity is the real enemy
All three approaches score 26-57% on Tier 1 (5 tables with clean names) and collapse to 6-25% on Tier 2 (10 tables). The problem isn't abbreviated names: every approach scores as well or better on Tier 3 (27.3%) than on Tier 2 despite the cryptic names. It's having more tables for the model to choose from. Schema pruning and table selection are critical unsolved problems.
5. None of these approaches are production-ready at 40%
The best score here is 40%, and that's with the simplest evaluation metric (does the result set match). Real users would consider many of the ~SQL results "close enough." But even with generous grading, NL2SQL tools need significant improvement — or a very controlled schema — before they can reliably replace writing SQL.
Latency and Cost
| Approach | Avg Latency | LLM Calls/Query | Total Tokens | Why |
|---|---|---|---|---|
| WrenAI-style | 1,125ms | 1 | 193,969 | Single call with large MDL context (~3,800 tokens/query) |
| Vanna AI | 1,332ms | 1 | N/A | Single call + ChromaDB vector search overhead |
| DB-GPT-style | 2,916ms | 2-3 | 74,607 | Schema linking + generation + optional correction |
WrenAI-style is fastest but uses the most tokens, since the large MDL context rides along on every call. DB-GPT-style uses the fewest tokens overall (each call carries a small, focused prompt) but makes 2-3 calls, resulting in 2.6x WrenAI's latency. Vanna's token usage wasn't captured, but its single call with only retrieved context makes it the likely best balance of cost, latency, and accuracy for production.
What I'd Do Differently
If I were building a production NL2SQL system based on these results:
- Start with Vanna's approach — RAG with curated Q-SQL examples gives the best accuracy floor. Invest in building a quality training set rather than better prompts.
- Add DB-GPT's correction step — The 0-error rate is valuable. Use a 2-step pipeline: RAG-based generation first, then error correction if execution fails.
- Borrow WrenAI's semantic context — Table and column descriptions help, especially for non-obvious schemas. Combine MDL-style descriptions with example pairs.
- Solve column selection separately — Either constrain the LLM to always use SELECT * (and handle projection downstream) or add a post-processing step that maps columns.
- Invest in schema pruning — The Tier 2 drop shows that giving the LLM the full schema hurts. DB-GPT's schema linking idea is right; it just needs to be more reliable.
The benchmark code, all 50 queries, and raw results are available on GitHub. Built with DuckDB, GPT-4o, and a lot of patience.