Building a SQL Tokenizer and Formatter From Scratch — Supporting 6 Dialects
SQL formatting seems simple until you try to build it. Keyword capitalization? Easy. Proper indentation of subqueries, CASE expressions, and JOINs across PostgreSQL, MySQL, SQL Server, Oracle, SQLite, and BigQuery? That's a compiler problem.
The full implementation is ~800 lines of TypeScript at devprix.dev/tools/sql-formatter.
Architecture: Tokenizer + State Machine
I chose a two-stage approach: tokenize the SQL into a stream of typed tokens, then format by iterating through tokens with a state machine. No AST (Abstract Syntax Tree) needed — SQL formatting doesn't require understanding query semantics, just structure.
Stage 1: The Tokenizer
The tokenizer is a single-pass, character-by-character scanner. It produces an array of typed tokens:
type TokenType =
  | "keyword" | "identifier" | "string" | "number"
  | "operator" | "punctuation" | "comma"
  | "open_paren" | "close_paren"
  | "comment_single" | "comment_multi"
  | "whitespace" | "dot" | "semicolon"
  | "wildcard" | "unknown";

interface Token {
  type: TokenType;
  value: string;
}
String Literals: Four Quoting Styles
Different SQL dialects use different quoting:
// Single-quoted strings (standard SQL)
if (ch === "'") {
  let str = "'";
  i++;
  while (i < sql.length) {
    if (sql[i] === "'" && sql[i + 1] === "'") {
      str += "''"; // escaped quote
      i += 2;
    } else if (sql[i] === "'") {
      str += "'";
      i++;
      break;
    } else {
      str += sql[i];
      i++;
    }
  }
  tokens.push({ type: "string", value: str });
}
The tokenizer also handles:
- Double-quoted identifiers: "column_name" (PostgreSQL, standard SQL)
- Backtick identifiers: `table_name` (MySQL)
- Square bracket identifiers: [column] (SQL Server)
Each has its own escape rules — SQL uses doubled quotes ('', "") rather than backslashes.
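The doubled-quote escape handling is the same for every quoting style, so it can be factored into one generic scanner. Here is a sketch; `scanQuoted` is a hypothetical helper name, not taken from the implementation, and it assumes the opening and closing delimiters are the same character (SQL Server's [brackets] would need a variant that escapes with ]]):

```typescript
// Hypothetical generic scanner: consume a quoted region starting at `start`,
// where the closing quote is escaped by doubling it ('' or "" or ``).
function scanQuoted(
  sql: string,
  start: number,
  quote: string
): { value: string; end: number } {
  let i = start + 1;      // skip the opening quote
  let value = quote;
  while (i < sql.length) {
    if (sql[i] === quote && sql[i + 1] === quote) {
      value += quote + quote; // doubled quote = escaped, keep scanning
      i += 2;
    } else if (sql[i] === quote) {
      value += quote;         // real closing quote
      i++;
      break;
    } else {
      value += sql[i];
      i++;
    }
  }
  return { value, end: i }; // `end` is the index just past the literal
}
```

The same function then serves strings, double-quoted identifiers, and backtick identifiers by passing a different `quote` character.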
The Wildcard Problem
Is * a wildcard or a multiplication operator? It depends on context:
SELECT * FROM users -- wildcard
SELECT price * quantity -- multiplication
SELECT COUNT(*) FROM users -- wildcard inside function
The tokenizer disambiguates by looking at the previous non-whitespace token:
if (sql[i] === "*") {
  // look back at the previous meaningful token (Array.findLast, ES2023)
  const lastNonWs = tokens.findLast(t => t.type !== "whitespace");
  const isWildcard =
    !lastNonWs ||
    lastNonWs.type === "keyword" ||    // SELECT *
    lastNonWs.type === "comma" ||      // SELECT a, *
    lastNonWs.type === "open_paren";   // COUNT(*)
  tokens.push({ type: isWildcard ? "wildcard" : "operator", value: "*" });
  i++;
}
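The decision rule itself is small enough to pull out and test in isolation. This is a standalone sketch of the same rule (the function name is illustrative):

```typescript
type Tok = { type: string; value: string };

// Hypothetical standalone version of the disambiguation rule:
// "*" is a wildcard when nothing meaningful precedes it, or when the
// previous token is a keyword, a comma, or an opening paren.
function classifyStar(prev: Tok | undefined): "wildcard" | "operator" {
  if (
    !prev ||
    prev.type === "keyword" ||
    prev.type === "comma" ||
    prev.type === "open_paren"
  ) {
    return "wildcard";
  }
  return "operator";
}
```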
Compound Keywords
SQL has multi-word keywords: ORDER BY, GROUP BY, INNER JOIN, INSERT INTO, UNION ALL. The tokenizer uses lookahead to detect these:
const remaining = sql.slice(i + word.length);
const compoundMatch = remaining.match(
  /^\s+(BY|INTO|JOIN|ALL|TABLE|FROM|INDEX|KEY|EXISTS)\b/i
);
if (compoundMatch) {
  const compound = word.toUpperCase() + " " + compoundMatch[1].toUpperCase();
  if (MAJOR_CLAUSE_KEYWORDS.has(compound)) {
    token.value = compound;
    i += word.length + compoundMatch[0].length;
  }
}
This is preferable to treating them as separate tokens because ORDER alone might be an identifier in some contexts, but ORDER BY is always a keyword.
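The lookahead can be exercised on its own. A sketch, using a deliberately abbreviated clause set and a hypothetical `matchCompound` name:

```typescript
// Abbreviated subset of the clause keywords for illustration.
const CLAUSE_KEYWORDS = new Set(["ORDER BY", "GROUP BY", "INSERT INTO", "UNION ALL"]);

// Hypothetical helper: given the scan position `i` and the word just read,
// return the compound keyword if the lookahead completes one, else null.
function matchCompound(sql: string, i: number, word: string): string | null {
  const remaining = sql.slice(i + word.length);
  const m = remaining.match(/^\s+(BY|INTO|JOIN|ALL|TABLE|FROM|INDEX|KEY|EXISTS)\b/i);
  if (!m) return null;
  const compound = word.toUpperCase() + " " + m[1].toUpperCase();
  return CLAUSE_KEYWORDS.has(compound) ? compound : null;
}
```

Note the `\b` boundary: it stops `ORDER byline` from matching, since `byline` is not the word `by`.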
Dialect-Specific Keywords
Each dialect extends the base keyword set:
const dialectKeywords: Record<string, Set<string>> = {
  mysql: new Set(["ENGINE", "INNODB", "SHOW", "DESCRIBE", "ENUM", "JSON", ...]),
  postgresql: new Set(["RETURNING", "ILIKE", "JSONB", "LATERAL", "LISTEN", ...]),
  sqlserver: new Set(["TOP", "NOLOCK", "NVARCHAR", "PIVOT", "UNPIVOT", ...]),
  oracle: new Set(["ROWNUM", "SYSDATE", "DECODE", "NVL", "CONNECT", ...]),
  sqlite: new Set(["AUTOINCREMENT", "PRAGMA", "ATTACH", "GLOB", ...]),
  bigquery: new Set(["STRUCT", "UNNEST", "QUALIFY", ...]),
};
When tokenizing, a word is checked against both the base keywords and the dialect-specific set.
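That two-set lookup is a one-liner. A minimal sketch with abbreviated keyword sets (`isKeyword` and the set names are illustrative, not from the implementation):

```typescript
// Abbreviated keyword sets for illustration.
const BASE_KEYWORDS = new Set(["SELECT", "FROM", "WHERE", "JOIN"]);
const DIALECT_KEYWORDS: Record<string, Set<string>> = {
  postgresql: new Set(["RETURNING", "ILIKE"]),
  mysql: new Set(["SHOW", "DESCRIBE"]),
};

// A word is a keyword if it is in the base set or the active dialect's set.
function isKeyword(word: string, dialect: string): boolean {
  const upper = word.toUpperCase();
  return BASE_KEYWORDS.has(upper) || (DIALECT_KEYWORDS[dialect]?.has(upper) ?? false);
}
```

The practical consequence: `RETURNING` highlights as a keyword under PostgreSQL but stays a plain identifier under MySQL.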
Stage 2: The Formatter
The formatter is a state machine that tracks context as it iterates through tokens:
function formatSql(sql: string, options: FormatOptions): string {
  const tokens = tokenize(sql, options.dialect);
  let depth = 0;                       // indentation level
  let lineStart = true;                // at beginning of line?
  let inSelect = false;                // inside SELECT clause?
  let inWhere = false;                 // inside WHERE clause?
  let afterClause = false;             // just saw a clause keyword?
  const subqueryStack: number[] = [];  // paren depth for subqueries
The Core Loop
Each token type has formatting rules:
for (let i = 0; i < tokens.length; i++) {
  const token = tokens[i];
  if (token.type === "keyword") {
    const upper = token.value.toUpperCase();
    if (MAJOR_CLAUSE_KEYWORDS.has(upper)) {
      // SELECT, FROM, WHERE, JOIN, etc.
      newLine();
      addIndent();
      result += options.uppercaseKeywords ? upper : token.value.toLowerCase();
      afterClause = true;
      if (upper === "SELECT") inSelect = true;
      if (upper === "FROM") inSelect = false;
      if (upper === "WHERE") inWhere = true;
    }
  }
}
Subquery Detection
The trickiest part is detecting subqueries — a SELECT inside parentheses gets extra indentation:
if (token.type === "open_paren") {
  const next = tokens.slice(i + 1).find(t => t.type !== "whitespace");
  if (next && next.value.toUpperCase() === "SELECT") {
    result += "(";       // emit the paren, then indent the subquery body
    depth++;
    subqueryStack.push(depth);
    newLine();
    addIndent();
  } else {
    result += "(";
  }
}
When the matching close paren arrives, we check subqueryStack to know whether to dedent.
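The close-paren side can be sketched as a standalone check. This assumes a paren-nesting counter incremented on every "(", with the stack recording the nesting level at which each subquery opened; `onCloseParen` is an illustrative name:

```typescript
// Hypothetical sketch: decide whether a ")" closes a subquery.
// `parenLevel` is the current paren nesting level; the stack holds the
// levels at which subqueries were opened.
function onCloseParen(parenLevel: number, subqueryStack: number[]): { dedent: boolean } {
  if (
    subqueryStack.length > 0 &&
    subqueryStack[subqueryStack.length - 1] === parenLevel
  ) {
    subqueryStack.pop();
    return { dedent: true };  // this ")" closes a subquery: drop one indent level
  }
  return { dedent: false };   // ordinary grouping paren: no indent change
}
```

Matching on the recorded level, rather than on a boolean flag, is what keeps nested subqueries and plain grouping parens from interfering with each other.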
Comma Formatting
Two styles — trailing commas (traditional) and leading commas (some teams prefer this):
if (token.type === "comma") {
  if (options.trailingCommas) {
    result += ",";
    if (inSelect) {
      newLine();
      addIndent();
      result += " "; // extra indent for continuation
    }
  } else {
    newLine();
    addIndent();
    result += ", ";
  }
}
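Reduced to a column list, the two styles look like this. A self-contained sketch (`layoutColumns` is a hypothetical illustration, not the formatter's actual code path):

```typescript
// Hypothetical illustration of trailing- vs leading-comma layout
// for a SELECT column list.
function layoutColumns(cols: string[], trailing: boolean, indent = "  "): string {
  return cols
    .map((col, i) => {
      if (trailing) {
        // comma at the end of each line except the last
        return indent + col + (i < cols.length - 1 ? "," : "");
      }
      // comma at the start of each continuation line
      return i === 0 ? indent + col : indent + ", " + col;
    })
    .join("\n");
}
```

Leading commas make it easier to comment out or reorder any line but the first, which is why some teams prefer them.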
AND/OR in WHERE Clauses
WHERE conditions can be compact or expanded:
if ((upper === "AND" || upper === "OR") && inWhere) {
  if (options.compactWhere) {
    addSpace();
    result += upper;
  } else {
    newLine();
    addIndent();
    result += " " + upper;
  }
}
CASE Expression Formatting
CASE/WHEN/THEN/ELSE/END requires careful indentation tracking:
if (upper === "CASE") {
  addSpace();
  result += upper;
  depth++;
}
if (upper === "WHEN") {
  newLine();
  addIndent();
  result += upper;
}
if (upper === "END") {
  depth--;
  newLine();
  addIndent();
  result += upper;
}
The Minifier
The reverse operation — collapse SQL to a single line while preserving semantics:
function minifySql(sql: string, dialect: string): string {
  const tokens = tokenize(sql, dialect);
  let result = "";
  let lastChar = "";
  for (let i = 0; i < tokens.length; i++) {
    const token = tokens[i];
    if (token.type === "whitespace") {
      // keep a single space only where dropping it would merge two tokens
      const next = tokens[i + 1];
      if (/[a-zA-Z0-9_]/.test(lastChar) &&
          next && /^[a-zA-Z0-9_'"@#`\[]/.test(next.value[0])) {
        result += " ";
      }
    } else if (token.type === "comment_single") {
      // "--" comments can't survive on one line; convert to block form
      result += "/* " + token.value.slice(2).trim() + " */";
    } else {
      result += token.value;
    }
    lastChar = result[result.length - 1] || "";
  }
  return result;
}
The single-line comment conversion is a subtle but critical detail. -- comment depends on a newline to terminate. In minified SQL on one line, a -- would comment out everything after it. Converting to /* */ preserves the comment without the line-break dependency.
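In isolation the conversion is a one-liner. A sketch (`toBlockComment` is a hypothetical helper name; it assumes the comment body does not itself contain `*/`, which would terminate the block comment early):

```typescript
// Hypothetical helper: convert a "--" comment token to block form.
// Assumes the body contains no "*/" sequence.
function toBlockComment(comment: string): string {
  // "-- note" -> "/* note */"
  return "/* " + comment.slice(2).trim() + " */";
}
```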
Syntax Highlighting
The highlighter re-tokenizes the formatted SQL and wraps each token in a <span> with a CSS class:
function highlightSql(sql: string, dialect: string): string {
  const tokens = tokenize(sql, dialect);
  return tokens.map(token => {
    const escaped = token.value
      .replace(/&/g, "&amp;")
      .replace(/</g, "&lt;")
      .replace(/>/g, "&gt;");
    switch (token.type) {
      case "keyword":
        return SQL_FUNCTIONS.has(token.value.toUpperCase())
          ? `<span class="sql-function">${escaped}</span>`
          : `<span class="sql-keyword">${escaped}</span>`;
      case "string":
        return `<span class="sql-string">${escaped}</span>`;
      case "number":
        return `<span class="sql-number">${escaped}</span>`;
      case "comment_single":
      case "comment_multi":
        return `<span class="sql-comment">${escaped}</span>`;
      default:
        return escaped;
    }
  }).join("");
}
Functions (COUNT, SUM, AVG, etc.) get a different color than keywords (SELECT, FROM, WHERE) even though both are tokenized as "keyword" type. A secondary lookup distinguishes them.
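That secondary lookup is just a set membership test. A minimal sketch with an abbreviated function set (`cssClassFor` is an illustrative name):

```typescript
// Abbreviated set of SQL aggregate/scalar function names for illustration.
const SQL_FUNCTIONS = new Set(["COUNT", "SUM", "AVG", "MIN", "MAX"]);

// Both functions and clause keywords arrive as "keyword" tokens;
// the set lookup decides which CSS class they get.
function cssClassFor(keyword: string): string {
  return SQL_FUNCTIONS.has(keyword.toUpperCase()) ? "sql-function" : "sql-keyword";
}
```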
What I Learned
Lookahead is essential. Context determines meaning. * isn't always multiplication. ORDER isn't always a keyword.
Dialect differences matter more than expected. BigQuery supports STRUCT, Oracle supports CONNECT BY, PostgreSQL has RETURNING. A generic approach fails immediately without dialect awareness.
State machines scale better than recursive descent for formatting. We track depth, clause context, and subquery stacks rather than building an AST.
Minification is the hard part. It looks like simple whitespace removal, but it must collapse structure without changing semantics. Converting -- to /* */ feels like a small detail but breaks silently if missed.
This is part of DevPrix — 63 free developer tools that run entirely in your browser. No sign-up, no tracking, no server calls.