A pattern for narrowing the gap on structurally-sound baselines through candidate voting, demonstrated on SQL generation across healthcare, finance, retail, and logistics schemas. The same shape applies to any task where single-pass generation is already close to ceiling and the remaining errors are stochastic, not systematic.
Text-to-SQL with a good prompt and few-shot examples already gets most queries right. The remaining failures share a property:
- Stochastic errors, not systematic ones. The model knows the right SQL pattern but occasionally picks the wrong join column, misreads a filter condition, or chooses a subtly wrong aggregation. More prompt engineering doesn't fix errors that aren't consistent.
The instance here: 100 natural language questions over database schemas ranging from single-table lookups to multi-table joins with subqueries and window functions, scored by execution accuracy (run both queries, compare result sets). The baseline agent -- five fixed few-shot examples, detailed SQLite-specific instructions, temperature 0 -- scored 0.650.
Paper Lantern surveyed 96 papers and surfaced three that address stochastic errors through candidate generation and voting: SQL-PaLM showed +4.9% from execution-based consistency decoding, MCS-SQL reached 89.6% on Spider via multi-prompt voting, and PET-SQL hit 87.6% via cross-consistency across low-temperature runs. The PL agent tested five approaches before finding that self-consistency voting -- 7 candidates at temp=0.8, majority vote after SQL normalization -- moved accuracy to 0.690.
prompt = build_prompt(
question=question,
db_schema=db_schema, # CREATE TABLE statements
sample_rows=sample_rows, # INSERT statements
few_shot=five_curated_examples, # diverse SQL patterns
)
answer = gemini.generate(prompt, temperature=0)
sql = extract_sql(answer)One call per question, temperature 0, five hand-picked examples covering WHERE filters, GROUP BY, JOINs, percentage calculations, and correlated subqueries. Detailed system prompt with SQLite-specific rules. This is a solid baseline -- it handles 65 of 100 questions correctly. The failures aren't clustered around a single SQL pattern; they're scattered across question types. That's the signal that the problem is variance, not a missing capability. (baseline/solve.py)
The PL agent didn't find self-consistency on the first try. It tested five approaches before landing on the winner -- and understanding why each failed is the real takeaway.
| Approach | Score | Why it failed |
|---|---|---|
| PreSQL + category few-shot + 3 candidates + majority vote | 0.600 | Multiple prompt variants fragmented the vote -- different styles produce different SQL that splits the majority |
| 2 candidates + LLM judge | 0.660 | The judge adds noise on disagreements. When two candidates differ, the judge doesn't have execution feedback to break the tie |
| PreSQL + single enriched pass | 0.610 | Dynamic few-shot selection alone is weaker than fixed curated examples -- retrieval noise outweighs relevance gains |
| Self-refinement (generate, then verify/fix) | 0.640 | Without execution feedback, the model "fixes" correct queries into broken ones as often as it fixes broken ones |
| Self-consistency: 7 candidates, temp=0.8, majority vote | 0.690 | Same prompt, varied sampling -- correct SQL converges, errors scatter |
The key insight from SQL-PaLM and PET-SQL: when errors are stochastic, the correct answer appears more often than any single incorrect answer across multiple samples. Self-consistency exploits this directly. Multi-prompt diversity (MCS-SQL) works at scale but fragments votes with only 7 candidates.
The winning approach generates 7 candidates per question at temperature 0.8 using the same prompt (5 fixed + 3 TF-IDF-retrieved few-shot examples), normalizes each SQL (uppercase, collapse whitespace, unify quotes), and picks the most common form. The correct SQL tends to converge; incorrect SQL scatters across different mistakes. (with_pl/solve.py)
-
When errors are stochastic, vote instead of refining. If your failures aren't clustered around a single pattern, self-consistency (same prompt, varied temperature) outperforms self-refinement (generate then fix). Refinement without execution feedback introduces as many errors as it corrects.
-
Same prompt, varied sampling beats multiple prompts. Different prompt styles produce structurally different SQL that splits the vote. Temperature variation with one prompt keeps the style consistent while varying the logic -- the majority signal stays clean.
-
Know when to stop searching for a better approach. The PL agent tried five techniques. Four scored between 0.600 and 0.660 -- all worse than the baseline. The winner was a 7-candidate vote, the simplest of the five ideas. When the baseline is structurally sound, marginal gains come from reducing variance, not from architectural shifts.
-
Normalize before you vote. SQL has many equivalent surface forms (
SELECT a, bvsSELECT a,b, single vs double quotes). Without normalization, identical queries split across formatting variants and the majority signal dissolves.
| Baseline | With PL | Delta | |
|---|---|---|---|
| Execution Accuracy | 0.650 | 0.690 | +6% |
| API calls | 100 | 700 | 7x |
| Tokens | 108K | 1.16M | 11x |
| Runtime | 161s | 417s | 2.6x |
The baseline was already structurally sound -- good prompt, good few-shot examples, deterministic generation. The +6% comes from converting stochastic single-pass errors into a voting signal across 7 candidates. The 7x cost increase is modest in absolute terms (both runs cost well under $1), but the gain-per-dollar is thin. This is the right tradeoff when accuracy matters more than cost, and the wrong one when it doesn't.
Systematic errors -- if the model consistently mishandles a SQL pattern (e.g., always gets window functions wrong), voting 7 times produces 7 copies of the same mistake. Schema ambiguity -- if the question is genuinely ambiguous given the schema, no amount of sampling resolves it. Tight cost budgets at scale -- 7x calls per question adds up when you're generating thousands of queries per hour.
The only difference between the two agents was a single section of their prompt -- "use your own knowledge" vs. "use Paper Lantern MCP tools to research first." Everything else downstream was the agent's doing.
uv run python run.py setup text_to_sql
uv run python run.py run text_to_sql baseline
uv run python run.py run text_to_sql with_pl
uv run python run.py compare text_to_sql
diff prompts/baseline.md prompts/with_pl.md