Text-to-SQL
Generate SQL queries from natural language questions using T5-based text-to-SQL models.
Quick example
try (var sqlGen = T5SqlGenerator.t5SmallAwesome().build()) {
    String sql = sqlGen.generateSql(
        "How many employees are in the engineering department?",
        "CREATE TABLE employees (id INT, name VARCHAR, department VARCHAR, salary INT)");
    System.out.println(sql);
    // SELECT COUNT(*) FROM employees AS T1 JOIN departments AS T2
    // ON T1.department = T2.id WHERE T2.name = 'Engineer'
}
Full example
import io.github.inference4j.generation.GenerationResult;
import io.github.inference4j.nlp.T5SqlGenerator;

public class TextToSql {
    public static void main(String[] args) {
        String schema = "CREATE TABLE employees (id INT, name VARCHAR, department VARCHAR, salary INT); " +
                "CREATE TABLE departments (id INT, name VARCHAR, location VARCHAR)";

        try (var sqlGen = T5SqlGenerator.t5SmallAwesome()
                .maxNewTokens(200)
                .build()) {
            String[] questions = {
                "What is the average salary by department?",
                "List all employees in New York",
                "Which department has the most employees?"
            };

            for (String question : questions) {
                // Print each token as it is handed to the callback.
                GenerationResult result = sqlGen.generateSql(question, schema,
                        token -> System.out.print(token));
                System.out.println();
                System.out.printf(" → %d tokens in %,d ms%n",
                        result.generatedTokens(), result.duration().toMillis());
            }
        }
    }
}
Model presets
| Preset | Model | Parameters | Size | Schema format |
|---|---|---|---|---|
| `T5SqlGenerator.t5SmallAwesome()` | T5-small-awesome-text-to-sql | 60M | ~240 MB | CREATE TABLE DDL |
| `T5SqlGenerator.t5LargeSpider()` | T5-LM-Large-text2sql-spider | 0.8B | ~4.6 GB | Spider format with `[SEP]` |
Choosing a preset
T5-small-awesome is recommended for most use cases. It is fast, lightweight, and handles
standard SQL patterns well. The schema is provided as familiar CREATE TABLE statements.
T5-large-spider is more accurate on complex queries (JOINs, subqueries,
GROUP BY with HAVING). It uses a specialized schema format designed for the
Spider benchmark, with quoted identifiers and [SEP] table delimiters, which makes it
a good fit for schemas built from JDBC metadata.
Schema formats
T5-small-awesome (CREATE TABLE DDL)
String schema = "CREATE TABLE employees (id INT, name VARCHAR, salary INT); "
+ "CREATE TABLE departments (id INT, name VARCHAR)";
sqlGen.generateSql("What is the average salary?", schema);
T5-large-spider (Spider format)
String schema = "\"employees\" \"id\" int, \"name\" varchar, \"salary\" int, "
+ "foreign_key: primary key: \"id\" "
+ "[SEP] "
+ "\"departments\" \"id\" int, \"name\" varchar, "
+ "foreign_key: primary key: \"id\"";
sqlGen.generateSql("What is the average salary?", schema);
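The Spider-format string above can also be assembled from table metadata instead of being written by hand. The following is a minimal sketch that reproduces the layout of the example literal and nothing more. `SpiderSchemaFormatter`, `Column`, and `Table` are hypothetical names, not part of inference4j, and the sketch assumes single-column primary keys and no foreign keys, so the `foreign_key:` section stays empty as it does in the example.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of inference4j.
final class SpiderSchemaFormatter {

    /** One column: its name plus the lowercase type token used in the prompt. */
    record Column(String name, String type) {}

    /** One table with its columns and a single-column primary key. */
    record Table(String name, List<Column> columns, String primaryKey) {}

    /** Joins the per-table segments with the [SEP] delimiter. */
    static String format(List<Table> tables) {
        return tables.stream()
                .map(SpiderSchemaFormatter::formatTable)
                .collect(Collectors.joining(" [SEP] "));
    }

    private static String formatTable(Table table) {
        String columns = table.columns().stream()
                .map(c -> quote(c.name()) + " " + c.type())
                .collect(Collectors.joining(", "));
        // Assumption: no foreign keys, so the foreign_key section is left empty,
        // mirroring the hand-written example.
        return quote(table.name()) + " " + columns
                + ", foreign_key: primary key: " + quote(table.primaryKey());
    }

    private static String quote(String identifier) {
        return "\"" + identifier + "\"";
    }

    public static void main(String[] args) {
        List<Table> tables = List.of(
                new Table("employees",
                        List.of(new Column("id", "int"),
                                new Column("name", "varchar"),
                                new Column("salary", "int")),
                        "id"),
                new Table("departments",
                        List.of(new Column("id", "int"),
                                new Column("name", "varchar")),
                        "id"));
        System.out.println(format(tables));
    }
}
```

In a real application the `Table` values would typically be filled from the standard JDBC calls `DatabaseMetaData.getColumns(...)` and `DatabaseMetaData.getPrimaryKeys(...)`. How foreign keys should be rendered is not shown on this page, so check the model card before extending the sketch.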
Builder options
| Method | Type | Default | Description |
|---|---|---|---|
| `.modelId(String)` | `String` | Preset-dependent | Hugging Face model ID |
| `.modelSource(ModelSource)` | `ModelSource` | `HuggingFaceModelSource` | Model resolution strategy |
| `.sessionOptions(SessionConfigurer)` | `SessionConfigurer` | default | ONNX Runtime session config |
| `.tokenizerProvider(TokenizerProvider)` | `TokenizerProvider` | `UnigramTokenizer` | Tokenizer construction strategy |
| `.promptFormatter(BiFunction)` | `BiFunction<String, String, String>` | Preset-dependent | Combines (query, schema) into model prompt |
| `.maxNewTokens(int)` | `int` | `256` | Maximum tokens to generate |
| `.temperature(float)` | `float` | `0.0` | Sampling temperature |
| `.topK(int)` | `int` | `0` (disabled) | Top-K sampling |
| `.topP(float)` | `float` | `0.0` (disabled) | Nucleus sampling |
| `.eosTokenId(int)` | `int` | Auto-detected | End-of-sequence token ID |
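To make the `.promptFormatter(BiFunction)` row more concrete, here is a sketch of a function with the documented shape: it takes (query, schema) and returns the prompt string. The `tables:` / `query for:` layout is only an assumed illustration and is not confirmed by this page; the preset's built-in formatter may differ, so check the model card before overriding it. `PromptFormatterSketch` is a hypothetical class name.

```java
import java.util.function.BiFunction;

// Hypothetical example, not part of inference4j.
final class PromptFormatterSketch {

    // Argument order follows the builder table: (query, schema).
    // The prompt layout itself is an assumption made for illustration.
    static final BiFunction<String, String, String> TABLES_THEN_QUERY =
            (query, schema) ->
                    "tables:\n" + schema.strip() + "\nquery for: " + query.strip();

    public static void main(String[] args) {
        String prompt = TABLES_THEN_QUERY.apply(
                "How many employees are there?",
                "CREATE TABLE employees (id INT, name VARCHAR)");
        System.out.println(prompt);
    }
}
```

A function like this could then be passed to the builder as `.promptFormatter(PromptFormatterSketch.TABLES_THEN_QUERY)`.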
Result type
GenerationResult is a record with:
| Field | Type | Description |
|---|---|---|
| `text()` | `String` | The generated SQL query |
| `promptTokens()` | `int` | Number of tokens in the input |
| `generatedTokens()` | `int` | Number of tokens generated |
| `duration()` | `Duration` | Wall-clock generation time |
The convenience method generateSql(query, schema) returns the SQL as a plain String.
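The `generatedTokens()` and `duration()` values are enough to derive a throughput figure. The helper below is a small sketch of that arithmetic. `GenerationStats` is a hypothetical class that takes the two values directly, so it compiles without the library on the classpath.

```java
import java.time.Duration;

// Hypothetical helper, not part of inference4j.
final class GenerationStats {

    /** Tokens generated per second of wall-clock time; 0.0 for an empty duration. */
    static double tokensPerSecond(int generatedTokens, Duration duration) {
        long nanos = duration.toNanos();
        if (nanos <= 0) {
            return 0.0; // avoid dividing by zero on very fast or empty runs
        }
        return generatedTokens * 1_000_000_000.0 / nanos;
    }

    public static void main(String[] args) {
        System.out.printf("%.1f tokens/s%n", tokensPerSecond(50, Duration.ofMillis(500)));
    }
}
```

With a real result this would be called as `GenerationStats.tokensPerSecond(result.generatedTokens(), result.duration())`.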
Tips
- Use greedy decoding (the default, temperature=0) for SQL generation: deterministic output is what you want.
- Always validate and sanitize generated SQL before executing it against a real database.
- Include all relevant tables in the schema, even if the query only touches one. The model uses the full schema to resolve column references.
- For the T5-large-spider model, the schema format can be generated programmatically from JDBC DatabaseMetaData.
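As one way to act on the validation tip, the sketch below accepts generated SQL only if it looks like a single read-only SELECT statement. It is a coarse string-level check under stated assumptions, not a SQL parser: it rejects any embedded semicolon (even inside a string literal), any statement that does not start with SELECT (including CTEs that begin with WITH), and any occurrence of a write keyword as a whole word. `GeneratedSqlGuard` is a hypothetical class, not part of inference4j.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, not part of inference4j.
final class GeneratedSqlGuard {

    private static final Pattern STARTS_WITH_SELECT = Pattern.compile("^select\\b");

    // Whole-word match, so a column such as created_at is not flagged.
    private static final Pattern WRITE_KEYWORDS = Pattern.compile(
            "\\b(insert|update|delete|drop|alter|create|truncate|grant|revoke|merge)\\b");

    /** Returns true only for input that looks like one read-only SELECT statement. */
    static boolean isSingleSelect(String sql) {
        if (sql == null) {
            return false;
        }
        String statement = sql.strip();
        // Allow one trailing semicolon, then reject any that remain.
        if (statement.endsWith(";")) {
            statement = statement.substring(0, statement.length() - 1).strip();
        }
        if (statement.isEmpty() || statement.contains(";")) {
            return false;
        }
        String lower = statement.toLowerCase(Locale.ROOT);
        return STARTS_WITH_SELECT.matcher(lower).find()
                && !WRITE_KEYWORDS.matcher(lower).find();
    }

    public static void main(String[] args) {
        System.out.println(isSingleSelect("SELECT COUNT(*) FROM employees"));
        System.out.println(isSingleSelect("SELECT 1; DROP TABLE employees"));
    }
}
```

A check like this is best treated as a first filter. Running generated queries through a database account with read-only privileges gives a stronger guarantee than any string inspection.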