# Dissecting Open-Source NL2SQL: How Vanna, WrenAI, and DB-GPT Approach Text-to-SQL
If you've worked with natural language to SQL, you know the pitch: "Just ask your database a question in plain English." Snowflake has Cortex Analyst. Databricks has Genie. But what if you need something open-source, self-hosted, or database-agnostic?
The open-source NL2SQL landscape has matured significantly. Three projects stand out -- not because they're the only options, but because they represent three fundamentally different architectural philosophies:
- Vanna AI (~22k GitHub stars): RAG-first, self-learning
- WrenAI (~14k stars): Semantic model-first, inspired by Cortex Analyst
- DB-GPT (~17k stars): Multi-agent, workflow-driven
I spent time digging into the source code of each project, tracing queries end-to-end, and identifying where each approach excels and where it breaks down. This post is a technical teardown for engineers evaluating these tools or building their own NL2SQL systems.
## The Three Paradigms
Before diving into each project, here's the core architectural bet each one makes:
| | Vanna AI | WrenAI | DB-GPT |
|---|---|---|---|
| Core Idea | "Give the LLM relevant examples and it'll figure out the SQL" | "Give the LLM a semantic model and it'll understand your data" | "Let multiple agents collaborate to generate and validate SQL" |
| Architecture | RAG pipeline + tool-calling agent | Semantic engine + RAG retrieval + SQL transpiler | AWEL workflow engine + multi-agent orchestration |
| Key Insight | Context is everything -- retrieval strategy matters more than LLM choice | Business logic must be encoded explicitly, not inferred | Complex queries need decomposition and iterative refinement |
| License | MIT | AGPL-3.0 | MIT |
## 1. Vanna AI: The RAG Purist

### How It Works
Vanna's thesis is simple: if you give an LLM the right examples, it can generate accurate SQL. No semantic layer, no multi-agent orchestration -- just well-curated context retrieved via vector similarity search.
You "train" Vanna by feeding it three types of data:
```python
# Schema definitions
vn.train(ddl="CREATE TABLE customers (id INT, name TEXT, revenue DECIMAL)")

# Business documentation
vn.train(documentation="Revenue represents gross revenue in USD before tax")

# Question-SQL pairs (most impactful)
vn.train(
    question="Top 10 customers by revenue",
    sql="SELECT name, SUM(revenue) FROM customers GROUP BY name ORDER BY 2 DESC LIMIT 10",
)
```
Each type is embedded and stored in a separate vector collection. At query time, Vanna retrieves the most relevant context from all three collections and assembles a prompt.
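Mechanically, the query-time lookup across the three collections can be sketched as follows. This is a toy stand-in that scores by keyword overlap instead of embedding similarity; every name and structure here is illustrative, not Vanna's actual API:

```python
# Toy sketch of Vanna-style retrieval: three collections, keyword-overlap
# scoring standing in for real vector similarity. Illustrative only.

def score(question: str, text: str) -> float:
    """Crude relevance: fraction of question words present in the text."""
    q_words = set(question.lower().split())
    t_words = set(text.lower().split())
    return len(q_words & t_words) / max(len(q_words), 1)

def retrieve(question: str, collection: list, k: int = 2) -> list:
    return sorted(collection, key=lambda t: score(question, t), reverse=True)[:k]

ddl = ["CREATE TABLE customers (id INT, name TEXT, revenue DECIMAL)",
       "CREATE TABLE products (id INT, sku TEXT)"]
docs = ["Revenue represents gross revenue in USD before tax"]
pairs = ["Q: Top 10 customers by revenue\nSQL: SELECT name, SUM(revenue) ..."]

question = "Which customers generated the most revenue"
prompt = "\n\n".join([
    "You are a SQL expert.",
    "Schema:\n" + "\n".join(retrieve(question, ddl)),
    "Docs:\n" + "\n".join(retrieve(question, docs)),
    "Examples:\n" + "\n".join(retrieve(question, pairs)),
    "Question: " + question,
])
print(prompt)
```

The real system swaps `score` for embedding cosine similarity against a vector store, but the assembly shape -- schema, docs, few-shot pairs, then the question -- is the same.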
### End-to-End Query Trace

Here's what actually happens when you ask Vanna a question:

```
"What were our top 5 products by revenue last quarter?"
        |
        v
[1] Vector Similarity Search (3 parallel lookups):
    - SQL collection: find similar question-SQL pairs
    - DDL collection: find relevant schema definitions
    - Docs collection: find related business documentation
        |
        v
[2] Prompt Assembly:
    - System: "You are a {dialect} expert..."
    - Inject DDL definitions (token-budget-aware)
    - Inject documentation snippets
    - Append retrieved Q&A pairs as few-shot examples
    - Append the user's question
        |
        v
[3] LLM Call -> SQL response
        |
        v
[4] Optional: Two-Pass Generation
    - If LLM returns "intermediate_sql", execute it
    - Re-prompt with DataFrame results for data-aware refinement
        |
        v
[5] SQL Extraction (regex priority: CREATE TABLE AS > WITH > SELECT > fenced sql block)
        |
        v
[6] Execute against database -> DataFrame
```
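The extraction priority in step [5] is easy to sketch. The patterns below are illustrative rather than Vanna's actual regexes, but they show the idea: prefer the most specific statement shape first, and fall back to a fenced block last:

```python
import re

# Illustrative extraction priority: CREATE TABLE AS, then a CTE, then a
# bare SELECT, then a fenced sql block. Not Vanna's actual regexes.
def extract_sql(llm_response):
    patterns = [
        r"CREATE\s+TABLE\s+.*?\bAS\b\s+SELECT.*?(?:;|$)",  # CREATE TABLE AS
        r"\bWITH\b.*?(?:;|$)",                              # CTE
        r"\bSELECT\b.*?(?:;|$)",                            # plain SELECT
        r"```sql\n(.*?)```",                                # fenced block
    ]
    for pat in patterns:
        m = re.search(pat, llm_response, re.IGNORECASE | re.DOTALL)
        if m:
            # fenced pattern captures a group; the others match whole spans
            return (m.group(1) if m.lastindex else m.group(0)).strip()
    return None

print(extract_sql("Sure, here you go: SELECT id FROM t;"))  # SELECT id FROM t;
```

Regex extraction is fragile by nature (a stray semicolon in a string literal breaks it), which is part of why the "weak validation" criticism below matters.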
The two-pass generation is clever -- the LLM can issue an exploratory query first (e.g., SELECT DISTINCT status FROM orders) to discover actual column values before generating the final query. This handles cases where the LLM needs to know that statuses are 'ACTIVE', 'INACTIVE', not 'active', 'inactive'.
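The control flow of the two-pass pattern, with a stubbed LLM, looks roughly like this. The `intermediate_sql:` marker and the stub responses are assumptions for illustration, not Vanna's exact protocol:

```python
import sqlite3

# In-memory fixture database with uppercase status values.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INT, status TEXT)")
db.executemany("INSERT INTO orders VALUES (?, ?)", [(1, "ACTIVE"), (2, "INACTIVE")])

def llm(prompt):
    # Stub: first explore the data, then use the observed values.
    if "Observed values" not in prompt:
        return "intermediate_sql: SELECT DISTINCT status FROM orders"
    return "SELECT COUNT(*) FROM orders WHERE status = 'ACTIVE'"

def generate(question):
    reply = llm(question)
    if reply.startswith("intermediate_sql:"):
        probe = reply.split(":", 1)[1].strip()
        rows = db.execute(probe).fetchall()                   # pass 1: explore
        reply = llm(f"{question}\nObserved values: {rows}")   # pass 2: refine
    return reply

sql = generate("How many active orders?")
print(db.execute(sql).fetchone()[0])  # 1
```

Without the exploratory pass, a single-shot LLM would plausibly guess `status = 'active'` and silently return zero rows.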
### The Self-Learning Loop
When auto_train=True (the default), any query that executes successfully and returns results is automatically added back to the training set:
```python
df = self.run_sql(sql)
if len(df) > 0 and auto_train:
    self.add_question_sql(question=question, sql=sql)
```
This creates a flywheel: more usage generates more examples, which improves future retrieval accuracy. Vanna 2.0 extends this further with AgentMemory -- the agent searches for previously successful tool usage patterns before executing any new tool call.
### Vanna 2.0: From Library to Agent
Vanna 2.0 is a complete rewrite that transforms it from a method-based library into a user-aware agent framework. The key additions:
- User identity flowing through every layer (JWT/OAuth resolution)
- Tool-calling loop with configurable iteration limits
- 7 extensibility points: lifecycle hooks, LLM middlewares, context enrichers, conversation filters, error recovery strategies, observability, and audit logging
- Row-level security and group-based tool access
- Streaming UI components via a pre-built `<vanna-chat>` web component
The legacy v0.x system is preserved behind a LegacyVannaAdapter for backward compatibility.
### Where Vanna Excels
- Simplest mental model: train with examples, ask questions, get SQL. No schema modeling or agent choreography needed.
- Self-learning: the flywheel effect genuinely improves accuracy over time.
- Data sovereignty: only metadata (schemas, column names) is sent to LLMs, not actual data.
- LLM agnostic: swap between OpenAI, Anthropic, Ollama, Gemini with minimal changes. Same for vector stores (ChromaDB, Pinecone, Qdrant, FAISS, and 6 others).
- Two-pass generation: the intermediate SQL pattern handles data-dependent queries that single-shot approaches miss.
### Where Vanna Breaks Down
- No semantic layer: the same business term ("revenue") may be interpreted differently across queries depending on which examples the vector search retrieves. There's no formal governance.
- Complex joins: without explicit relationship definitions, the LLM must infer join conditions from DDL alone. Multi-table joins with ambiguous foreign keys produce incorrect results.
- Inconsistent results: identical questions can produce different SQL at different times, depending on retrieval randomness.
- Weak validation: `is_sql_valid()` only checks whether the statement starts with SELECT via `sqlparse`. No AST validation, no dry-run, no schema-aware checking.
- Manual curation: the training data needs ongoing human attention. Stale or incorrect examples silently degrade accuracy.
### The Key Research Finding
Vanna's own accuracy research tested across GPT-3.5 Turbo, GPT-4, and PaLM 2:
- Schema only: ~3% accuracy (2/60 correct)
- Static SQL examples: 20-60% depending on LLM
- Contextually relevant examples via RAG: ~80% accuracy
The takeaway: context strategy matters more than LLM choice. A weaker model with good examples outperforms a stronger model with bad ones.
## 2. WrenAI: The Semantic Modeler

### How It Works
WrenAI's thesis is that the fundamental problem with NL2SQL isn't retrieval -- it's that LLMs don't understand your business. Raw schemas don't capture that tbl_ord_2024.amt_net means "Net Order Amount after discounts and returns" or that customers joins to orders on customer_id.
WrenAI introduces MDL (Modeling Definition Language) -- a JSON-based semantic model that explicitly encodes business concepts, relationships, calculated fields, and metrics. This is architecturally the closest open-source analog to Snowflake Cortex Analyst's YAML-based semantic model.
### The Three Components

```
User -> Wren UI (Next.js, port 3000)
          |
          | GraphQL API
          v
        Wren AI Service (Python/FastAPI, port 5555)
          |
          |--- Retrieves context from Qdrant vector DB
          |--- Calls LLM (OpenAI, Claude, Gemini, Ollama, etc.)
          |--- Sends SQL for validation
          v
        Wren Engine (Rust/Apache DataFusion, port 8080)
          |
          |--- Validates via dry-run
          |--- Rewrites SQL using MDL semantics
          |--- Transpiles to target database dialect
          v
        Ibis Server (Python, port 8000)
          |
          v
        Target Database (PostgreSQL, BigQuery, Snowflake, etc.)
```
Wren AI Service handles the LLM interaction. It uses Hamilton (async execution) and Haystack (component-based pipelines) to build three pipeline categories: indexing (vectorize schema), retrieval (fetch relevant context), and generation (produce SQL).
Wren Engine is the semantic backbone, written in Rust on top of Apache DataFusion. It parses generated SQL into a LogicalPlan, applies MDL-aware analyzer rules that inject relationship semantics and calculated field definitions, validates the query via dry-run, and transpiles to the target dialect.
Ibis Server provides database-agnostic connectivity using the ibis-project, supporting 12+ databases.
### The Semantic Model (MDL)

Here's what an MDL definition looks like conceptually:

```json
{
  "models": [{
    "name": "Sales Orders",
    "tableReference": { "table": "tbl_ord_2024" },
    "columns": [{
      "name": "Net Amount",
      "expression": "amt_net",
      "description": "Total order amount after discounts and returns"
    }]
  }],
  "relationships": [{
    "name": "orders_to_customers",
    "joinType": "MANY_TO_ONE",
    "condition": "orders.customer_id = customers.id"
  }],
  "metrics": [{
    "name": "Total Revenue",
    "expression": "SUM(\"Net Amount\")",
    "dimensions": ["Region", "Quarter"]
  }]
}
```
The MDL supports models (virtual tables), columns with semantic names and descriptions, relationships with explicit join types, calculated fields that reference columns across related models, metrics with dimensional breakdowns, views, and macros.
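The payoff of the MDL is that the LLM writes SQL against business names and a later pass resolves them to physical schema. The Wren Engine does this through DataFusion LogicalPlan rules; the toy below shows only the name-resolution idea via naive string substitution, with an MDL fragment mirroring the JSON above:

```python
# Toy resolution of MDL semantic names to physical SQL. A naive string
# rewriter, nowhere near Wren Engine's analyzer rules -- illustration only.

mdl = {
    "Sales Orders": {
        "table": "tbl_ord_2024",
        "columns": {"Net Amount": "amt_net", "Customer Name": "cust_nm"},
    },
}

def resolve(wren_sql):
    for model, spec in mdl.items():
        for semantic, physical in spec["columns"].items():
            wren_sql = wren_sql.replace(f'"{semantic}"', physical)
        wren_sql = wren_sql.replace(f'"{model}"', spec["table"])
    return wren_sql

wren_sql = ('SELECT "Customer Name", SUM("Net Amount") '
            'FROM "Sales Orders" GROUP BY "Customer Name"')
print(resolve(wren_sql))
# SELECT cust_nm, SUM(amt_net) FROM tbl_ord_2024 GROUP BY cust_nm
```

Because every query goes through the same mapping, "Net Amount" can never silently mean two different physical expressions -- that is the governance property pure RAG lacks.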
### End-to-End Query Trace

```
"What were our top 10 customers by revenue last quarter?"
        |
        v
[1] Intent Classification:
    LLM determines this is a text-to-SQL query (vs. general info request)
        |
        v
[2] Context Retrieval (RAG):
    - Qdrant similarity search over vectorized MDL schema
    - Column pruning: LLM selects only relevant columns (60-80% context reduction)
    - Retrieve historical question-SQL pairs
    - Gather custom instructions and SQL function docs
        |
        v
[3] SQL Generation:
    LLM generates "WrenSQL" -- ANSI SQL using MDL model/column names
    (e.g., SELECT "Customer Name", SUM("Net Amount") FROM "Sales Orders"...)
        |
        v
[4] Validation (Dry Run):
    Wren Engine parses WrenSQL into DataFusion LogicalPlan
    - ModelAnalyzeRule: resolve model references to physical tables
    - ModelGenerationRule: inject calculated field expressions
    - Type checking, column existence verification
    - If invalid: fallback correction pipeline attempts auto-fix
        |
        v
[5] SQL Transpilation:
    LogicalPlan "unparsed" into target dialect (BigQuery SQL, Snowflake SQL, etc.)
    - Handles timestamp timezone syntax, type coercion, function name differences
        |
        v
[6] Execution:
    Transpiled SQL pushed to target database via Ibis Server
    No data is moved or copied -- query-in-place
        |
        v
[7] Optional: Generate chart (Vega-Lite v5) and/or natural language summary
```
The critical step is #4: WrenAI doesn't just pass generated SQL to the database. The Wren Engine validates it against the semantic model, resolves business-level references to physical schema, and transpiles to the correct dialect. This catches errors that would otherwise hit the database.
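The dry-run idea in step [4] can be miniaturized with SQLite: ask the planner to compile the query before ever executing it, and surface planner errors as validation failures. Wren Engine does this against a DataFusion LogicalPlan rather than a live connection, but the principle is the same:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INT, name TEXT, revenue REAL)")

def dry_run(sql):
    """Return None if the query compiles, else the planner's error message."""
    try:
        # EXPLAIN compiles the statement into opcodes without running the
        # query proper, so schema errors surface before execution.
        db.execute("EXPLAIN " + sql)
        return None
    except sqlite3.Error as e:
        return str(e)

print(dry_run("SELECT name FROM customers"))   # None -- valid
print(dry_run("SELECT nme FROM customers"))    # no such column: nme
```

Catching `nme` at compile time rather than at execution time is exactly the class of error the fallback correction pipeline can then feed back to the LLM for an auto-fix attempt.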
### Where WrenAI Excels
- Semantic correctness: explicit relationship definitions eliminate join ambiguity. "Revenue" always means the same thing because it's defined once in the MDL.
- Database-agnostic: 12+ databases supported via SQL transpilation. Cortex Analyst only works with Snowflake.
- LLM-agnostic: OpenAI, Claude, Gemini, Ollama, DeepSeek, Bedrock, Groq.
- Dry-run validation: SQL is validated before execution, catching errors early.
- Full GenBI pipeline: Text-to-SQL + Text-to-Chart + natural language summaries.
- MCP integration: the Wren Engine can serve as an MCP server, enabling any MCP-compatible AI agent to query governed data.
### Where WrenAI Breaks Down
- Setup complexity: self-hosting requires Docker Compose, LLM API keys, database connections, and building the semantic model. The gap between "docker compose up" and "production-ready" is significant.
- LLM dependency: accuracy degrades markedly with less capable models. The system is only as good as its LLM.
- Iteration required: even with the semantic layer, complex queries often need multiple iterations. Simple questions sometimes require 2-3 attempts before producing executable SQL.
- Limited data context: since no actual data is sent to the LLM (only schema metadata), the model can't reason about data distributions, edge cases, or specific values.
- AGPL-3.0: the copyleft license can be a blocker for commercial use cases.
### WrenAI vs Cortex Analyst
Both validate the same thesis -- semantic models improve NL2SQL accuracy -- but represent opposite architectural choices:
| | WrenAI | Cortex Analyst |
|---|---|---|
| Model format | JSON (MDL) | YAML |
| Deployment | Self-hosted, open-source | Snowflake-native, managed |
| Database support | 12+ databases | Snowflake only |
| LLM choice | User-configurable | Snowflake-hosted models |
| Learning | Accumulates Q-SQL pairs over time | Manually curated "verified queries" |
| Visualization | Built-in (Vega-Lite) | None (SQL output only) |
Cortex Analyst is vertically integrated -- everything lives inside Snowflake's governance boundary. WrenAI is horizontally decoupled -- you choose your LLM, database, and deployment model. The tradeoff is governance simplicity vs. flexibility.
## 3. DB-GPT: The Multi-Agent Orchestrator

### How It Works
DB-GPT's thesis is that complex SQL generation is too hard for a single prompt -- it needs decomposition, specialized roles, and iterative refinement. The project is the most ambitious of the three, positioning itself as an "AI-native data application development framework" rather than just a NL2SQL tool.
The core architectural innovation is AWEL (Agentic Workflow Expression Language) -- a declarative, DAG-based workflow engine for orchestrating LLM application logic. Think Apache Airflow, but for LLM pipelines.
### AWEL: The Workflow Engine
AWEL has three layers:
Operator Layer: atomic operations like retrieval, vectorization, model interaction, and prompt processing. Core operators include MapOperator, JoinOperator, ReduceOperator, BranchOperator, FilterOperator, and stream operators.
AgentFrame Layer: encapsulates operators into chain computations with distributed computing support (functional operations: filter, join, map, reduce).
DSL Layer: structured representation for deterministic workflow definition:

```
DATA declarations -> RECEIVE -> TRANSFORM -> RETRIEVE -> APPLY (LLM) -> RESPOND
ON ERROR: FAIL | RETRY | LOG
```
Operators connect using Python's `>>` syntax:

```python
trigger_task >> req_parse_task >> prompt_task >> history_storage_task \
    >> history_task >> embedding_task >> chat_task >> model_task >> output_parser_task
```
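To make the chaining concrete, here's a minimal hypothetical reimplementation of the idea: each operator overloads `__rshift__` to wire up its successor, so `a >> b >> c` builds a linear pipeline. This is a toy, not DB-GPT's actual `Operator` classes:

```python
# Toy AWEL-style operators: ">>" records the successor, run() walks the chain.
class Op:
    def __init__(self, name, fn):
        self.name, self.fn, self.next = name, fn, None

    def __rshift__(self, other):
        self.next = other      # a >> b wires a's output into b
        return other           # returning b lets chains keep extending

    def run(self, value):
        value = self.fn(value)
        return self.next.run(value) if self.next else value

parse  = Op("parse",  str.strip)
prompt = Op("prompt", lambda q: f"Generate SQL for: {q}")
model  = Op("model",  lambda p: "SELECT 1  -- " + p)

head = parse
parse >> prompt >> model
print(head.run("  how many users?  "))
# SELECT 1  -- Generate SQL for: how many users?
```

The real operator set adds branching, joins, and streaming, which is what turns these chains into full DAGs rather than simple lines.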
### The Text2SQL Pipeline

DB-GPT's GBI (Generative Business Intelligence) module implements a 5-step NL2SQL pipeline:

```
"What are the top 5 customers by total order amount?"
        |
        v
[1] Query Understanding:
    Natural language parsing, intent detection
        |
        v
[2] Schema Linking:
    Entity-to-database mapping using metadata
    (table names, column names, foreign key relationships)
        |
        v
[3] SQL Generation:
    LLM-powered generation with Text Representation Prompt (TRP):
    - Schema descriptions
    - Instructions
    - Few-shot exemplars (retrieved adaptively)
        |
        v
[4] Query Execution:
    Connection pooling, error handling
    If error: retry with error context fed back to LLM
        |
        v
[5] Visualization:
    Chart generation via GPT-Vis protocol
```
### Multi-Agent Collaboration
For complex queries, DB-GPT deploys multiple agents with specialized roles:
- Data Analyst: interprets user queries, formulates analytical strategies
- Database Architect: manages schema understanding, optimizes SQL
- Software Engineer: generates code, handles technical implementation
- Plugin Agents: extend functionality via Auto-GPT-style plugins
Agents communicate through a shared state bus and follow structured Standard Operating Procedures (SOPs) for task handoff. Each agent is modeled as an AWEL operator, so multi-agent workflows compose into DAGs with parallel execution, conditional routing (BranchOperator), and result aggregation (JoinOperator).
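The handoff pattern can be sketched as a shared state dict that each agent reads and writes, with the SOP fixing the order of execution. The role names follow the list above; everything else here is purely illustrative:

```python
# Toy shared-state handoff: each agent mutates `state`, the SOP list is
# the handoff order. Illustrative only -- not DB-GPT's agent framework.

def data_analyst(state):
    # Interprets the question and records an analytical strategy.
    state["strategy"] = f"rank customers by {state['question'].split()[-1]}"

def database_architect(state):
    # Uses schema knowledge to draft SQL for the recorded strategy.
    state["sql"] = ("SELECT name, SUM(amount) FROM orders "
                    "GROUP BY name ORDER BY 2 DESC LIMIT 5")

def software_engineer(state):
    # Would execute and package results; stubbed here.
    state["result"] = f"executed: {state['sql']}"

sop = [data_analyst, database_architect, software_engineer]  # the SOP order
state = {"question": "top 5 customers by amount"}
for agent in sop:
    agent(state)
print(state["result"])
```

In DB-GPT proper each of these roles is an AWEL operator, so the linear SOP above generalizes to DAGs with branches and joins.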
### SMMF: Running Your Own Models
DB-GPT includes SMMF (Service-oriented Multi-Model Management Framework) for managing 50+ LLMs locally. It supports four inference backends:
- vLLM: high-throughput serving with PagedAttention
- llama.cpp: CPU and Metal inference for GGUF models
- MLX: Apple Silicon native
- HuggingFace Transformers: broadest compatibility
The model deployment layer includes a registration center, load balancing, and multi-instance horizontal scaling -- production-grade infrastructure for running local models.
### DB-GPT-Hub: Fine-Tuning for Text2SQL

This is where DB-GPT gets interesting. DB-GPT-Hub provides a standardized fine-tuning pipeline using LoRA/QLoRA across models like CodeLlama, Llama 2, Qwen, and Baichuan.
The results on the Spider benchmark are notable:
| Model | Method | Execution Accuracy |
|---|---|---|
| CodeLlama-13B-Instruct | QLoRA | 82.5% |
| GPT-4 | zero-shot | 76.2% |
| ChatGPT-3.5 | zero-shot | 72.8% |
| Qwen-14B-Chat | QLoRA | 70.1% |
| CodeLlama-7B-Instruct | QLoRA | 62.3% |
A fine-tuned 13B model beating GPT-4 on Text2SQL is a meaningful result. The key insight: fine-tuning provides the largest gains for smaller models. The gap narrows as model scale increases -- Qwen-72B shows only a 3.5% improvement from tuning vs. 31% for the 7B variant.
Hardware requirements are practical: QLoRA on a 13B model needs ~13.4 GB GPU RAM.
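A back-of-envelope check makes the ~13.4 GB figure plausible: the frozen base weights at 4-bit precision account for roughly half of it. The overhead split below is my rough assumption, not a published breakdown:

```python
# Rough memory estimate for QLoRA on a 13B model (assumed split, not a
# published breakdown).
params = 13e9
weights_gb = params * 0.5 / 1024**3   # 4-bit quantization = 0.5 bytes/param
print(round(weights_gb, 1))           # ~6.1 GB for the frozen base weights
# LoRA adapter weights, optimizer state, activations, and CUDA overhead
# make up the remainder, roughly doubling the footprint toward ~13.4 GB.
```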
### Where DB-GPT Excels
- Most comprehensive platform: RAG + agents + workflows + Text2SQL + visualization + local model serving in a single framework.
- Strong fine-tuning results: 82.5% on Spider with a 13B model beating GPT-4 zero-shot.
- Local LLM support: full privacy with no data leaving the execution environment.
- AWEL workflow engine: principled way to compose agents, models, and data sources into DAGs.
- Active research community: VLDB 2024 demo paper, 17k+ stars, MIT licensed.
### Where DB-GPT Breaks Down
- Massive complexity: four layers, seven packages, three AWEL sub-layers, a custom DSL, and multiple agent types. For teams that just need NL2SQL, it's severely over-engineered.
- Steep learning curve: getting productive requires understanding AWEL, SMMF, the agent framework, AND the Text2SQL pipeline. The conceptual overhead is high.
- Schema linking fragility: the pipeline depends on accurate metadata. Complex databases with ambiguous column names degrade performance significantly.
- Hard queries are still hard: 82.5% on Spider sounds good, but accuracy on "Extra Hard" tier queries drops to ~40%.
- Documentation gaps: AWEL DSL and some operator implementations lack thorough docs. The project's documentation domain has expired. Significant documentation is in Chinese.
- Dependency sprawl: the optional extras system can create version conflicts in larger deployments.
## Comparison: What Actually Matters

### Query Accuracy
None of these tools claim (or achieve) near-perfect accuracy on arbitrary queries. Here's the realistic picture:
| Query Type | Vanna | WrenAI | DB-GPT |
|---|---|---|---|
| Simple SELECT with filters | High | High | High |
| Aggregations (GROUP BY, HAVING) | Medium-High | High | High |
| Multi-table JOINs | Medium | High (explicit relationships) | Medium-High |
| Subqueries / CTEs | Medium | Medium-High | Medium-High |
| Window functions | Low-Medium | Medium | Medium |
| Data-dependent queries | Medium (two-pass) | Low (no data access) | Medium |
| Business logic ("revenue", "churn") | Low (no governance) | High (semantic model) | Medium |
The pattern is clear: WrenAI wins on business semantics, Vanna wins on data-aware queries, and DB-GPT wins on raw SQL complexity (especially with fine-tuned models).
### Architectural Complexity

```
Simplicity                                                  Complexity
    |                                                            |
  Vanna ---------> WrenAI --------------------------> DB-GPT
(one pipeline)   (3 services + engine)       (4 layers, 7 packages,
                                              multi-agent DAGs)
```
### Production Readiness
| Concern | Vanna | WrenAI | DB-GPT |
|---|---|---|---|
| SQL Validation | Basic (sqlparse) | Strong (dry-run + auto-fix) | Error-retry loop |
| Multi-tenancy | v2.0 user awareness | Role-based access | Limited |
| Observability | v2.0 spans/metrics | Langfuse integration | Basic logging |
| Audit | v2.0 audit logger | Limited | Limited |
| Streaming | SSE via web component | GraphQL subscriptions | FastAPI streaming |
## If I Were Building This From Scratch
Having traced queries through all three systems, here's what I'd take from each:
From Vanna: the self-learning loop. Automatically adding successful query-SQL pairs back to the training set is the simplest, most effective way to improve accuracy over time. The two-pass generation pattern is also underrated -- letting the LLM inspect actual data before generating the final query handles a class of problems that schema-only approaches miss entirely.
From WrenAI: the semantic layer and dry-run validation. Explicit relationship definitions and business term governance solve the consistency problem that pure RAG approaches struggle with. The Wren Engine's approach of parsing SQL into a LogicalPlan, applying semantic rules, and transpiling to target dialects is architecturally sound. And validating SQL before execution (with automatic correction) should be table stakes, not a differentiator.
From DB-GPT: the fine-tuning pipeline. A fine-tuned 13B model beating GPT-4 zero-shot on Text2SQL is a compelling result. For teams running at scale, the cost and latency savings of a specialized smaller model (13.4 GB GPU RAM for QLoRA) versus repeated GPT-4 API calls are significant. The AWEL workflow engine is interesting conceptually but may be over-engineered for most use cases.
What nobody does well yet: combining all three -- a semantic layer for business governance, RAG for dynamic context retrieval, fine-tuned models for cost-efficient accuracy, and a self-learning loop for continuous improvement. The closest open-source project to Cortex Analyst's integrated approach is WrenAI, but it still lacks Vanna's self-learning and DB-GPT's fine-tuning capabilities.
Notably, "OpenCortex" and "OpenGenie" don't exist as projects -- the namespace is wide open for someone to combine these approaches into a unified, database-agnostic NL2SQL platform.
## Quick Reference
| | Vanna AI | WrenAI | DB-GPT |
|---|---|---|---|
| GitHub | vanna-ai/vanna | Canner/WrenAI | eosphoros-ai/DB-GPT |
| Stars | ~22k | ~14k | ~17k |
| License | MIT | AGPL-3.0 | MIT |
| Language | Python | TypeScript/Python/Rust | Python |
| Approach | RAG + agent | Semantic model + RAG + transpiler | Multi-agent + AWEL workflows |
| LLMs | OpenAI, Anthropic, Ollama, Gemini, + 6 more | OpenAI, Claude, Gemini, Ollama, DeepSeek, Bedrock, Groq | 50+ via SMMF (local + API) |
| Databases | 11+ | 12+ | MySQL, PostgreSQL, SQLite, DuckDB, ClickHouse, + more |
| Vector Store | ChromaDB, Pinecone, Qdrant, FAISS, + 6 more | Qdrant | ChromaDB, Milvus, Elasticsearch, PGVector, Weaviate |
| Self-Learning | Yes (auto-train + agent memory) | Partial (Q-SQL pair accumulation) | No |
| Fine-Tuning | No | No | Yes (DB-GPT-Hub, LoRA/QLoRA) |
| Visualization | Plotly | Vega-Lite v5 | GPT-Vis |
| Best For | Teams wanting fast setup with growing accuracy | Teams needing semantic governance across multiple databases | Teams wanting full-stack AI data apps with local model control |
## Other Projects Worth Watching
- Chat2DB (~25-30k stars): AI-powered SQL client GUI. Most similar to a self-hosted Databricks Genie with a desktop client.
- Dataherald (~3.6k stars): Enterprise API-first NL2SQL. Broadest database support, designed to be embedded into SaaS products.
- DBHub (~2k stars): MCP server that bridges AI assistants to databases. Infrastructure layer, not a NL2SQL tool itself, but enables any MCP-compatible agent to query databases.
- Defog SQLCoder (~4k stars): Fine-tuned 15B model for Text2SQL. Not a framework -- use it as the LLM backbone inside Vanna, LangChain, or your own pipeline.
Update: I built a benchmark to test these three approaches head-to-head — 50 queries, same database, same LLM. Read the results in NL2SQL Benchmark: Vanna AI vs WrenAI vs DB-GPT.
Have experience with any of these tools in production? I'd like to hear what worked and what didn't -- reach out on X or LinkedIn.