# Dissecting Open-Source NL2SQL: How Vanna, WrenAI, and DB-GPT Approach Text-to-SQL
If you've worked with natural language to SQL, you know the pitch: "Just ask your database a question in plain English." Snowflake has Cortex Analyst. Databricks has Genie. But what if you need something open-source, self-hosted, or database-agnostic?
The open-source NL2SQL landscape has matured significantly. Three projects stand out -- not because they're the only options, but because they represent three fundamentally different architectural philosophies:
- Vanna AI (~22k GitHub stars): RAG-first, self-learning
- WrenAI (~14k stars): Semantic model-first, inspired by Cortex Analyst
- DB-GPT (~17k stars): Multi-agent, workflow-driven
I spent time digging into the source code of each project, tracing queries end-to-end, and identifying where each approach excels and where it breaks down. This post is a technical teardown for engineers evaluating these tools or building their own NL2SQL systems.
## The Three Paradigms
Before diving into each project, here's the core architectural bet each one makes:
| | Vanna AI | WrenAI | DB-GPT |
|---|---|---|---|
| Core Idea | "Give the LLM relevant examples and it'll figure out the SQL" | "Give the LLM a semantic model and it'll understand your data" | "Let multiple agents collaborate to generate and validate SQL" |
| Architecture | RAG pipeline + tool-calling agent | Semantic engine + RAG retrieval + SQL transpiler | AWEL workflow engine + multi-agent orchestration |
| Key Insight | Context is everything -- retrieval strategy matters more than LLM choice | Business logic must be encoded explicitly, not inferred | Complex queries need decomposition and iterative refinement |
| License | MIT | AGPL-3.0 | MIT |
## 1. Vanna AI: The RAG Purist

### How It Works
Vanna's thesis is simple: if you give an LLM the right examples, it can generate accurate SQL. No semantic layer, no multi-agent orchestration -- just well-curated context retrieved via vector similarity search.
You "train" Vanna by feeding it three types of data:
```python
# Schema definitions
vn.train(ddl="CREATE TABLE customers (id INT, name TEXT, revenue DECIMAL)")

# Business documentation
vn.train(documentation="Revenue represents gross revenue in USD before tax")

# Question-SQL pairs (most impactful)
vn.train(
    question="Top 10 customers by revenue",
    sql="SELECT name, SUM(revenue) FROM customers GROUP BY name ORDER BY 2 DESC LIMIT 10",
)
```
Each type is embedded and stored in a separate vector collection. At query time, Vanna retrieves the most relevant context from all three collections and assembles a prompt.
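Mechanically, the query-time lookup across the three collections can be sketched as follows. This is a toy stand-in that scores by keyword overlap instead of embedding similarity; every name and structure here is illustrative, not Vanna's actual API:

```python
# Toy sketch of Vanna-style retrieval: three collections, keyword-overlap
# scoring standing in for real vector similarity. Illustrative only.

def score(question: str, text: str) -> float:
    """Crude relevance: fraction of question words present in the text."""
    q_words = set(question.lower().split())
    t_words = set(text.lower().split())
    return len(q_words & t_words) / max(len(q_words), 1)

def retrieve(question: str, collection: list, k: int = 2) -> list:
    return sorted(collection, key=lambda t: score(question, t), reverse=True)[:k]

ddl = ["CREATE TABLE customers (id INT, name TEXT, revenue DECIMAL)",
       "CREATE TABLE products (id INT, sku TEXT)"]
docs = ["Revenue represents gross revenue in USD before tax"]
pairs = ["Q: Top 10 customers by revenue\nSQL: SELECT name, SUM(revenue) ..."]

question = "Which customers generated the most revenue"
prompt = "\n\n".join([
    "You are a SQL expert.",
    "Schema:\n" + "\n".join(retrieve(question, ddl)),
    "Docs:\n" + "\n".join(retrieve(question, docs)),
    "Examples:\n" + "\n".join(retrieve(question, pairs)),
    "Question: " + question,
])
print(prompt)
```

The real system swaps `score` for embedding cosine similarity against a vector store, but the assembly shape -- schema, docs, few-shot pairs, then the question -- is the same.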
### End-to-End Query Trace

Here's what actually happens when you ask Vanna a question:

```
"What were our top 5 products by revenue last quarter?"
        |
        v
[1] Vector Similarity Search (3 parallel lookups):
    - SQL collection: find similar question-SQL pairs
    - DDL collection: find relevant schema definitions
    - Docs collection: find related business documentation
        |
        v
[2] Prompt Assembly:
    - System: "You are a {dialect} expert..."
    - Inject DDL definitions (token-budget-aware)
    - Inject documentation snippets
    - Append retrieved Q&A pairs as few-shot examples
    - Append the user's question
        |
        v
[3] LLM Call -> SQL response
        |
        v
[4] Optional: Two-Pass Generation
    - If LLM returns "intermediate_sql", execute it
    - Re-prompt with DataFrame results for data-aware refinement
        |
        v
[5] SQL Extraction (regex priority: CREATE TABLE AS > WITH > SELECT > fenced sql block)
        |
        v
[6] Execute against database -> DataFrame
```
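The extraction priority in step [5] is easy to sketch. The patterns below are illustrative rather than Vanna's actual regexes, but they show the idea: prefer the most specific statement shape first, and fall back to a fenced block last:

```python
import re

# Illustrative extraction priority: CREATE TABLE AS, then a CTE, then a
# bare SELECT, then a fenced sql block. Not Vanna's actual regexes.
def extract_sql(llm_response):
    patterns = [
        r"CREATE\s+TABLE\s+.*?\bAS\b\s+SELECT.*?(?:;|$)",  # CREATE TABLE AS
        r"\bWITH\b.*?(?:;|$)",                              # CTE
        r"\bSELECT\b.*?(?:;|$)",                            # plain SELECT
        r"```sql\n(.*?)```",                                # fenced block
    ]
    for pat in patterns:
        m = re.search(pat, llm_response, re.IGNORECASE | re.DOTALL)
        if m:
            # fenced pattern captures a group; the others match whole spans
            return (m.group(1) if m.lastindex else m.group(0)).strip()
    return None

print(extract_sql("Sure, here you go: SELECT id FROM t;"))  # SELECT id FROM t;
```

Regex extraction is fragile by nature (a stray semicolon in a string literal breaks it), which is part of why the "weak validation" criticism below matters.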
The two-pass generation is clever -- the LLM can issue an exploratory query first (e.g., SELECT DISTINCT status FROM orders) to discover actual column values before generating the final query. This handles cases where the LLM needs to know that statuses are 'ACTIVE', 'INACTIVE', not 'active', 'inactive'.
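The control flow of the two-pass pattern, with a stubbed LLM, looks roughly like this. The `intermediate_sql:` marker and the stub responses are assumptions for illustration, not Vanna's exact protocol:

```python
import sqlite3

# In-memory fixture database with uppercase status values.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INT, status TEXT)")
db.executemany("INSERT INTO orders VALUES (?, ?)", [(1, "ACTIVE"), (2, "INACTIVE")])

def llm(prompt):
    # Stub: first explore the data, then use the observed values.
    if "Observed values" not in prompt:
        return "intermediate_sql: SELECT DISTINCT status FROM orders"
    return "SELECT COUNT(*) FROM orders WHERE status = 'ACTIVE'"

def generate(question):
    reply = llm(question)
    if reply.startswith("intermediate_sql:"):
        probe = reply.split(":", 1)[1].strip()
        rows = db.execute(probe).fetchall()                   # pass 1: explore
        reply = llm(f"{question}\nObserved values: {rows}")   # pass 2: refine
    return reply

sql = generate("How many active orders?")
print(db.execute(sql).fetchone()[0])  # 1
```

Without the exploratory pass, a single-shot LLM would plausibly guess `status = 'active'` and silently return zero rows.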
### The Self-Learning Loop
When auto_train=True (the default), any query that executes successfully and returns results is automatically added back to the training set:
```python
df = self.run_sql(sql)
if len(df) > 0 and auto_train:
    self.add_question_sql(question=question, sql=sql)
```
This creates a flywheel: more usage generates more examples, which improves future retrieval accuracy. Vanna 2.0 extends this further with AgentMemory -- the agent searches for previously successful tool usage patterns before executing any new tool call.
### Vanna 2.0: From Library to Agent
Vanna 2.0 is a complete rewrite that transforms it from a method-based library into a user-aware agent framework. The key additions:
- User identity flowing through every layer (JWT/OAuth resolution)
- Tool-calling loop with configurable iteration limits
- 7 extensibility points: lifecycle hooks, LLM middlewares, context enrichers, conversation filters, error recovery strategies, observability, and audit logging
- Row-level security and group-based tool access
- Streaming UI components via a pre-built `<vanna-chat>` web component
The legacy v0.x system is preserved behind a LegacyVannaAdapter for backward compatibility.
### Where Vanna Excels
- Simplest mental model: train with examples, ask questions, get SQL. No schema modeling or agent choreography needed.
- Self-learning: the flywheel effect genuinely improves accuracy over time.
- Data sovereignty: only metadata (schemas, column names) is sent to LLMs, not actual data.
- LLM agnostic: swap between OpenAI, Anthropic, Ollama, Gemini with minimal changes. Same for vector stores (ChromaDB, Pinecone, Qdrant, FAISS, and 6 others).
- Two-pass generation: the intermediate SQL pattern handles data-dependent queries that single-shot approaches miss.
### Where Vanna Breaks Down
- No semantic layer: the same business term ("revenue") may be interpreted differently across queries depending on which examples the vector search retrieves. There's no formal governance.
- Complex joins: without explicit relationship definitions, the LLM must infer join conditions from DDL alone. Multi-table joins with ambiguous foreign keys produce incorrect results.
- Inconsistent results: identical questions can produce different SQL at different times, depending on retrieval randomness.
- Weak validation: `is_sql_valid()` only checks whether the statement starts with SELECT via `sqlparse`. No AST validation, no dry-run, no schema-aware checking.
- Manual curation: the training data needs ongoing human attention. Stale or incorrect examples silently degrade accuracy.
### The Key Research Finding
Vanna's own accuracy research tested across GPT-3.5 Turbo, GPT-4, and PaLM 2:
- Schema only: ~3% accuracy (2/60 correct)
- Static SQL examples: 20-60% depending on LLM
- Contextually relevant examples via RAG: ~80% accuracy
The takeaway: context strategy matters more than LLM choice. A weaker model with good examples outperforms a stronger model with bad ones.
## 2. WrenAI: The Semantic Modeler

### How It Works
WrenAI's thesis is that the fundamental problem with NL2SQL isn't retrieval -- it's that LLMs don't understand your business. Raw schemas don't capture that tbl_ord_2024.amt_net means "Net Order Amount after discounts and returns" or that customers joins to orders on customer_id.
WrenAI introduces MDL (Modeling Definition Language) -- a JSON-based semantic model that explicitly encodes business concepts, relationships, calculated fields, and metrics. This is architecturally the closest open-source analog to Snowflake Cortex Analyst's YAML-based semantic model.
### The Three Components

```
User -> Wren UI (Next.js, port 3000)
          |
          | GraphQL API
          v
        Wren AI Service (Python/FastAPI, port 5555)
          |
          |--- Retrieves context from Qdrant vector DB
          |--- Calls LLM (OpenAI, Claude, Gemini, Ollama, etc.)
          |--- Sends SQL for validation
          v
        Wren Engine (Rust/Apache DataFusion, port 8080)
          |
          |--- Validates via dry-run
          |--- Rewrites SQL using MDL semantics
          |--- Transpiles to target database dialect
          v
        Ibis Server (Python, port 8000)
          |
          v
        Target Database (PostgreSQL, BigQuery, Snowflake, etc.)
```
Wren AI Service handles the LLM interaction. It uses Hamilton (async execution) and Haystack (component-based pipelines) to build three pipeline categories: indexing (vectorize schema), retrieval (fetch relevant context), and generation (produce SQL).
Wren Engine is the semantic backbone, written in Rust on top of Apache DataFusion. It parses generated SQL into a LogicalPlan, applies MDL-aware analyzer rules that inject relationship semantics and calculated field definitions, validates the query via dry-run, and transpiles to the target dialect.
Ibis Server provides database-agnostic connectivity using the ibis-project, supporting 12+ databases.
### The Semantic Model (MDL)

Here's what an MDL definition looks like conceptually:

```json
{
  "models": [{
    "name": "Sales Orders",
    "tableReference": { "table": "tbl_ord_2024" },
    "columns": [{
      "name": "Net Amount",
      "expression": "amt_net",
      "description": "Total order amount after discounts and returns"
    }]
  }],
  "relationships": [{
    "name": "orders_to_customers",
    "joinType": "MANY_TO_ONE",
    "condition": "orders.customer_id = customers.id"
  }],
  "metrics": [{
    "name": "Total Revenue",
    "expression": "SUM(\"Net Amount\")",
    "dimensions": ["Region", "Quarter"]
  }]
}
```
The MDL supports models (virtual tables), columns with semantic names and descriptions, relationships with explicit join types, calculated fields that reference columns across related models, metrics with dimensional breakdowns, views, and macros.
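The payoff of the MDL is that the LLM writes SQL against business names and a later pass resolves them to physical schema. The Wren Engine does this through DataFusion LogicalPlan rules; the toy below shows only the name-resolution idea via naive string substitution, with an MDL fragment mirroring the JSON above:

```python
# Toy resolution of MDL semantic names to physical SQL. A naive string
# rewriter, nowhere near Wren Engine's analyzer rules -- illustration only.

mdl = {
    "Sales Orders": {
        "table": "tbl_ord_2024",
        "columns": {"Net Amount": "amt_net", "Customer Name": "cust_nm"},
    },
}

def resolve(wren_sql):
    for model, spec in mdl.items():
        for semantic, physical in spec["columns"].items():
            wren_sql = wren_sql.replace(f'"{semantic}"', physical)
        wren_sql = wren_sql.replace(f'"{model}"', spec["table"])
    return wren_sql

wren_sql = ('SELECT "Customer Name", SUM("Net Amount") '
            'FROM "Sales Orders" GROUP BY "Customer Name"')
print(resolve(wren_sql))
# SELECT cust_nm, SUM(amt_net) FROM tbl_ord_2024 GROUP BY cust_nm
```

Because every query goes through the same mapping, "Net Amount" can never silently mean two different physical expressions -- that is the governance property pure RAG lacks.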
### End-to-End Query Trace

```
"What were our top 10 customers by revenue last quarter?"
        |
        v
[1] Intent Classification:
    LLM determines this is a text-to-SQL query (vs. general info request)
        |
        v
[2] Context Retrieval (RAG):
    - Qdrant similarity search over vectorized MDL schema
    - Column pruning: LLM selects only relevant columns (60-80% context reduction)
    - Retrieve historical question-SQL pairs
    - Gather custom instructions and SQL function docs
        |
        v
[3] SQL Generation:
    LLM generates "WrenSQL" -- ANSI SQL using MDL model/column names
    (e.g., SELECT "Customer Name", SUM("Net Amount") FROM "Sales Orders"...)
        |
        v
[4] Validation (Dry Run):
    Wren Engine parses WrenSQL into DataFusion LogicalPlan
    - ModelAnalyzeRule: resolve model references to physical tables
    - ModelGenerationRule: inject calculated field expressions
    - Type checking, column existence verification
    - If invalid: fallback correction pipeline attempts auto-fix
        |
        v
[5] SQL Transpilation:
    LogicalPlan "unparsed" into target dialect (BigQuery SQL, Snowflake SQL, etc.)
    - Handles timestamp timezone syntax, type coercion, function name differences
        |
        v
[6] Execution:
    Transpiled SQL pushed to target database via Ibis Server
    No data is moved or copied -- query-in-place
        |
        v
[7] Optional: Generate chart (Vega-Lite v5) and/or natural language summary
```
The critical step is #4: WrenAI doesn't just pass generated SQL to the database. The Wren Engine validates it against the semantic model, resolves business-level references to physical schema, and transpiles to the correct dialect. This catches errors that would otherwise hit the database.
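The dry-run idea in step [4] can be miniaturized with SQLite: ask the planner to compile the query before ever executing it, and surface planner errors as validation failures. Wren Engine does this against a DataFusion LogicalPlan rather than a live connection, but the principle is the same:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INT, name TEXT, revenue REAL)")

def dry_run(sql):
    """Return None if the query compiles, else the planner's error message."""
    try:
        # EXPLAIN compiles the statement into opcodes without running the
        # query proper, so schema errors surface before execution.
        db.execute("EXPLAIN " + sql)
        return None
    except sqlite3.Error as e:
        return str(e)

print(dry_run("SELECT name FROM customers"))   # None -- valid
print(dry_run("SELECT nme FROM customers"))    # no such column: nme
```

Catching `nme` at compile time rather than at execution time is exactly the class of error the fallback correction pipeline can then feed back to the LLM for an auto-fix attempt.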
### Where WrenAI Excels
- Semantic correctness: explicit relationship definitions eliminate join ambiguity. "Revenue" always means the same thing because it's defined once in the MDL.
- Database-agnostic: 12+ databases supported via SQL transpilation. Cortex Analyst only works with Snowflake.
- LLM-agnostic: OpenAI, Claude, Gemini, Ollama, DeepSeek, Bedrock, Groq.
- Dry-run validation: SQL is validated before execution, catching errors early.
- Full GenBI pipeline: Text-to-SQL + Text-to-Chart + natural language summaries.
- MCP integration: the Wren Engine can serve as an MCP server, enabling any MCP-compatible AI agent to query governed data.
### Where WrenAI Breaks Down
- Setup complexity: self-hosting requires Docker Compose, LLM API keys, database connections, and building the semantic model. The gap between "docker compose up" and "production-ready" is significant.
- LLM dependency: accuracy degrades markedly with less capable models. The system is only as good as its LLM.
- Iteration required: even with the semantic layer, complex queries often need multiple iterations. Simple questions sometimes require 2-3 attempts before producing executable SQL.
- Limited data context: since no actual data is sent to the LLM (only schema metadata), the model can't reason about data distributions, edge cases, or specific values.
- AGPL-3.0: the copyleft license can be a blocker for commercial use cases.
### WrenAI vs Cortex Analyst
Both validate the same thesis -- semantic models improve NL2SQL accuracy -- but represent opposite architectural choices:
| | WrenAI | Cortex Analyst |
|---|---|---|
| Model format | JSON (MDL) | YAML |
| Deployment | Self-hosted, open-source | Snowflake-native, managed |
| Database support | 12+ databases | Snowflake only |
| LLM choice | User-configurable | Snowflake-hosted models |
| Learning | Accumulates Q-SQL pairs over time | Manually curated "verified queries" |
| Visualization | Built-in (Vega-Lite) | None (SQL output only) |
Cortex Analyst is vertically integrated -- everything lives inside Snowflake's governance boundary. WrenAI is horizontally decoupled -- you choose your LLM, database, and deployment model. The tradeoff is governance simplicity vs. flexibility.
## 3. DB-GPT: The Multi-Agent Orchestrator

### How It Works
DB-GPT's thesis is that complex SQL generation is too hard for a single prompt -- it needs decomposition, specialized roles, and iterative refinement. The project is the most ambitious of the three, positioning itself as an "AI-native data application development framework" rather than just a NL2SQL tool.
The core architectural innovation is AWEL (Agentic Workflow Expression Language) -- a declarative, DAG-based workflow engine for orchestrating LLM application logic. Think Apache Airflow, but for LLM pipelines.
### AWEL: The Workflow Engine
AWEL has three layers:
Operator Layer: atomic operations like retrieval, vectorization, model interaction, and prompt processing. Core operators include MapOperator, JoinOperator, ReduceOperator, BranchOperator, FilterOperator, and stream operators.
AgentFrame Layer: encapsulates operators into chain computations with distributed computing support (functional operations: filter, join, map, reduce).
DSL Layer: structured representation for deterministic workflow definition:

```
DATA declarations -> RECEIVE -> TRANSFORM -> RETRIEVE -> APPLY (LLM) -> RESPOND
ON ERROR: FAIL | RETRY | LOG
```
Operators connect using Python's `>>` syntax:

```python
trigger_task >> req_parse_task >> prompt_task >> history_storage_task \
    >> history_task >> embedding_task >> chat_task >> model_task >> output_parser_task
```
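To make the chaining concrete, here's a minimal hypothetical reimplementation of the idea: each operator overloads `__rshift__` to wire up its successor, so `a >> b >> c` builds a linear pipeline. This is a toy, not DB-GPT's actual `Operator` classes:

```python
# Toy AWEL-style operators: ">>" records the successor, run() walks the chain.
class Op:
    def __init__(self, name, fn):
        self.name, self.fn, self.next = name, fn, None

    def __rshift__(self, other):
        self.next = other      # a >> b wires a's output into b
        return other           # returning b lets chains keep extending

    def run(self, value):
        value = self.fn(value)
        return self.next.run(value) if self.next else value

parse  = Op("parse",  str.strip)
prompt = Op("prompt", lambda q: f"Generate SQL for: {q}")
model  = Op("model",  lambda p: "SELECT 1  -- " + p)

head = parse
parse >> prompt >> model
print(head.run("  how many users?  "))
# SELECT 1  -- Generate SQL for: how many users?
```

The real operator set adds branching, joins, and streaming, which is what turns these chains into full DAGs rather than simple lines.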
### The Text2SQL Pipeline

DB-GPT's GBI (Generative Business Intelligence) module implements a 5-step NL2SQL pipeline:

```
"What are the top 5 customers by total order amount?"
        |
        v
[1] Query Understanding:
    Natural language parsing, intent detection
        |
        v
[2] Schema Linking:
    Entity-to-database mapping using metadata
    (table names, column names, foreign key relationships)
        |
        v
[3] SQL Generation:
    LLM-powered generation with Text Representation Prompt (TRP):
    - Schema descriptions
    - Instructions
    - Few-shot exemplars (retrieved adaptively)
        |
        v
[4] Query Execution:
    Connection pooling, error handling
    If error: retry with error context fed back to LLM
        |
        v
[5] Visualization:
    Chart generation via GPT-Vis protocol
```
### Multi-Agent Collaboration
For complex queries, DB-GPT deploys multiple agents with specialized roles:
- Data Analyst: interprets user queries, formulates analytical strategies
- Database Architect: manages schema understanding, optimizes SQL
- Software Engineer: generates code, handles technical implementation
- Plugin Agents: extend functionality via Auto-GPT-style plugins
Agents communicate through a shared state bus and follow structured Standard Operating Procedures (SOPs) for task handoff. Each agent is modeled as an AWEL operator, so multi-agent workflows compose into DAGs with parallel execution, conditional routing (BranchOperator), and result aggregation (JoinOperator).
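The handoff pattern can be sketched as a shared state dict that each agent reads and writes, with the SOP fixing the order of execution. The role names follow the list above; everything else here is purely illustrative:

```python
# Toy shared-state handoff: each agent mutates `state`, the SOP list is
# the handoff order. Illustrative only -- not DB-GPT's agent framework.

def data_analyst(state):
    # Interprets the question and records an analytical strategy.
    state["strategy"] = f"rank customers by {state['question'].split()[-1]}"

def database_architect(state):
    # Uses schema knowledge to draft SQL for the recorded strategy.
    state["sql"] = ("SELECT name, SUM(amount) FROM orders "
                    "GROUP BY name ORDER BY 2 DESC LIMIT 5")

def software_engineer(state):
    # Would execute and package results; stubbed here.
    state["result"] = f"executed: {state['sql']}"

sop = [data_analyst, database_architect, software_engineer]  # the SOP order
state = {"question": "top 5 customers by amount"}
for agent in sop:
    agent(state)
print(state["result"])
```

In DB-GPT proper each of these roles is an AWEL operator, so the linear SOP above generalizes to DAGs with branches and joins.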
### SMMF: Running Your Own Models
DB-GPT includes SMMF (Service-oriented Multi-Model Management Framework) for managing 50+ LLMs locally. It supports four inference backends:
- vLLM: high-throughput serving with PagedAttention
- llama.cpp: CPU and Metal inference for GGUF models
- MLX: Apple Silicon native
- HuggingFace Transformers: broadest compatibility
The model deployment layer includes a registration center, load balancing, and multi-instance horizontal scaling -- production-grade infrastructure for running local models.
### DB-GPT-Hub: Fine-Tuning for Text2SQL

This is where DB-GPT gets interesting. DB-GPT-Hub provides a standardized fine-tuning pipeline using LoRA/QLoRA across models like CodeLlama, Llama 2, Qwen, and Baichuan.
The results on the Spider benchmark are notable:
| Model | Method | Execution Accuracy |
|---|---|---|
| CodeLlama-13B-Instruct | QLoRA | 82.5% |
| GPT-4 | zero-shot | 76.2% |
| ChatGPT-3.5 | zero-shot | 72.8% |
| Qwen-14B-Chat | QLoRA | 70.1% |
| CodeLlama-7B-Instruct | QLoRA | 62.3% |
A fine-tuned 13B model beating GPT-4 on Text2SQL is a meaningful result. The key insight: fine-tuning provides the largest gains for smaller models. The gap narrows as model scale increases -- Qwen-72B shows only a 3.5% improvement from tuning vs. 31% for the 7B variant.
Hardware requirements are practical: QLoRA on a 13B model needs ~13.4 GB GPU RAM.
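A back-of-envelope check makes the ~13.4 GB figure plausible: the frozen base weights at 4-bit precision account for roughly half of it. The overhead split below is my rough assumption, not a published breakdown:

```python
# Rough memory estimate for QLoRA on a 13B model (assumed split, not a
# published breakdown).
params = 13e9
weights_gb = params * 0.5 / 1024**3   # 4-bit quantization = 0.5 bytes/param
print(round(weights_gb, 1))           # ~6.1 GB for the frozen base weights
# LoRA adapter weights, optimizer state, activations, and CUDA overhead
# make up the remainder, roughly doubling the footprint toward ~13.4 GB.
```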
### Where DB-GPT Excels
- Most comprehensive platform: RAG + agents + workflows + Text2SQL + visualization + local model serving in a single framework.
- Strong fine-tuning results: 82.5% on Spider with a 13B model beating GPT-4 zero-shot.
- Local LLM support: full privacy with no data leaving the execution environment.
- AWEL workflow engine: principled way to compose agents, models, and data sources into DAGs.
- Active research community: VLDB 2024 demo paper, 17k+ stars, MIT licensed.
### Where DB-GPT Breaks Down
- Massive complexity: four layers, seven packages, three AWEL sub-layers, a custom DSL, and multiple agent types. For teams that just need NL2SQL, it's severely over-engineered.
- Steep learning curve: getting productive requires understanding AWEL, SMMF, the agent framework, AND the Text2SQL pipeline. The conceptual overhead is high.
- Schema linking fragility: the pipeline depends on accurate metadata. Complex databases with ambiguous column names degrade performance significantly.
- Hard queries are still hard: 82.5% on Spider sounds good, but accuracy on "Extra Hard" tier queries drops to ~40%.
- Documentation gaps: AWEL DSL and some operator implementations lack thorough docs. The project's documentation domain has expired. Significant documentation is in Chinese.
- Dependency sprawl: the optional extras system can create version conflicts in larger deployments.
## Comparison: What Actually Matters

### Query Accuracy
None of these tools claim (or achieve) near-perfect accuracy on arbitrary queries. Here's the realistic picture:
| Query Type | Vanna | WrenAI | DB-GPT |
|---|---|---|---|
| Simple SELECT with filters | High | High | High |
| Aggregations (GROUP BY, HAVING) | Medium-High | High | High |
| Multi-table JOINs | Medium | High (explicit relationships) | Medium-High |
| Subqueries / CTEs | Medium | Medium-High | Medium-High |
| Window functions | Low-Medium | Medium | Medium |
| Data-dependent queries | Medium (two-pass) | Low (no data access) | Medium |
| Business logic ("revenue", "churn") | Low (no governance) | High (semantic model) | Medium |
The pattern is clear: WrenAI wins on business semantics, Vanna wins on data-aware queries, and DB-GPT wins on raw SQL complexity (especially with fine-tuned models).
### Architectural Complexity

```
Simplicity                                                  Complexity
    |                                                            |
  Vanna ---------> WrenAI --------------------------> DB-GPT
(one pipeline)   (3 services + engine)       (4 layers, 7 packages,
                                              multi-agent DAGs)
```
### Production Readiness
| Concern | Vanna | WrenAI | DB-GPT |
|---|---|---|---|
| SQL Validation | Basic (sqlparse) | Strong (dry-run + auto-fix) | Error-retry loop |
| Multi-tenancy | v2.0 user awareness | Role-based access | Limited |
| Observability | v2.0 spans/metrics | Langfuse integration | Basic logging |
| Audit | v2.0 audit logger | Limited | Limited |
| Streaming | SSE via web component | GraphQL subscriptions | FastAPI streaming |
## If I Were Building This From Scratch
Having traced queries through all three systems, here's what I'd take from each:
From Vanna: the self-learning loop. Automatically adding successful query-SQL pairs back to the training set is the simplest, most effective way to improve accuracy over time. The two-pass generation pattern is also underrated -- letting the LLM inspect actual data before generating the final query handles a class of problems that schema-only approaches miss entirely.
From WrenAI: the semantic layer and dry-run validation. Explicit relationship definitions and business term governance solve the consistency problem that pure RAG approaches struggle with. The Wren Engine's approach of parsing SQL into a LogicalPlan, applying semantic rules, and transpiling to target dialects is architecturally sound. And validating SQL before execution (with automatic correction) should be table stakes, not a differentiator.
From DB-GPT: the fine-tuning pipeline. A fine-tuned 13B model beating GPT-4 zero-shot on Text2SQL is a compelling result. For teams running at scale, the cost and latency savings of a specialized smaller model (13.4 GB GPU RAM for QLoRA) versus repeated GPT-4 API calls are significant. The AWEL workflow engine is interesting conceptually but may be over-engineered for most use cases.
What nobody does well yet: combining all three -- a semantic layer for business governance, RAG for dynamic context retrieval, fine-tuned models for cost-efficient accuracy, and a self-learning loop for continuous improvement. The closest open-source project to Cortex Analyst's integrated approach is WrenAI, but it still lacks Vanna's self-learning and DB-GPT's fine-tuning capabilities.
Notably, "OpenCortex" and "OpenGenie" don't exist as projects -- the namespace is wide open for someone to combine these approaches into a unified, database-agnostic NL2SQL platform.
## Quick Reference
| | Vanna AI | WrenAI | DB-GPT |
|---|---|---|---|
| GitHub | vanna-ai/vanna | Canner/WrenAI | eosphoros-ai/DB-GPT |
| Stars | ~22k | ~14k | ~17k |
| License | MIT | AGPL-3.0 | MIT |
| Language | Python | TypeScript/Python/Rust | Python |
| Approach | RAG + agent | Semantic model + RAG + transpiler | Multi-agent + AWEL workflows |
| LLMs | OpenAI, Anthropic, Ollama, Gemini, + 6 more | OpenAI, Claude, Gemini, Ollama, DeepSeek, Bedrock, Groq | 50+ via SMMF (local + API) |
| Databases | 11+ | 12+ | MySQL, PostgreSQL, SQLite, DuckDB, ClickHouse, + more |
| Vector Store | ChromaDB, Pinecone, Qdrant, FAISS, + 6 more | Qdrant | ChromaDB, Milvus, Elasticsearch, PGVector, Weaviate |
| Self-Learning | Yes (auto-train + agent memory) | Partial (Q-SQL pair accumulation) | No |
| Fine-Tuning | No | No | Yes (DB-GPT-Hub, LoRA/QLoRA) |
| Visualization | Plotly | Vega-Lite v5 | GPT-Vis |
| Best For | Teams wanting fast setup with growing accuracy | Teams needing semantic governance across multiple databases | Teams wanting full-stack AI data apps with local model control |
## Other Projects Worth Watching
- Chat2DB (~25-30k stars): AI-powered SQL client GUI. Most similar to a self-hosted Databricks Genie with a desktop client.
- Dataherald (~3.6k stars): Enterprise API-first NL2SQL. Broadest database support, designed to be embedded into SaaS products.
- DBHub (~2k stars): MCP server that bridges AI assistants to databases. Infrastructure layer, not a NL2SQL tool itself, but enables any MCP-compatible agent to query databases.
- Defog SQLCoder (~4k stars): Fine-tuned 15B model for Text2SQL. Not a framework -- use it as the LLM backbone inside Vanna, LangChain, or your own pipeline.
Update: I built a benchmark to test these three approaches head-to-head — 50 queries, same database, same LLM. Read the results in NL2SQL Benchmark: Vanna AI vs WrenAI vs DB-GPT.
Have experience with any of these tools in production? I'd like to hear what worked and what didn't -- reach out on X or LinkedIn.