Text-to-SQL¶

Generate SQL queries from natural language questions using T5-based text-to-SQL models.

Quick example¶

try (var sqlGen = T5SqlGenerator.t5SmallAwesome().build()) {
    String sql = sqlGen.generateSql(
            "How many employees are in the engineering department?",
            "CREATE TABLE employees (id INT, name VARCHAR, department VARCHAR, salary INT)");
    System.out.println(sql);
    // SELECT COUNT(*) FROM employees AS T1 JOIN departments AS T2
    //   ON T1.department = T2.id WHERE T2.name = 'Engineer'
}

Full example¶

import io.github.inference4j.generation.GenerationResult;
import io.github.inference4j.nlp.T5SqlGenerator;

public class TextToSql {
    public static void main(String[] args) {
        String schema = "CREATE TABLE employees (id INT, name VARCHAR, department VARCHAR, salary INT); " +
                         "CREATE TABLE departments (id INT, name VARCHAR, location VARCHAR)";

        try (var sqlGen = T5SqlGenerator.t5SmallAwesome()
                .maxNewTokens(200)
                .build()) {

            String[] questions = {
                "What is the average salary by department?",
                "List all employees in New York",
                "Which department has the most employees?"
            };

            for (String question : questions) {
                GenerationResult result = sqlGen.generateSql(question, schema,
                        token -> System.out.print(token));
                System.out.println();
                System.out.printf("  → %d tokens in %,d ms%n",
                        result.generatedTokens(), result.duration().toMillis());
            }
        }
    }
}

Model presets¶

Preset	Model	Parameters	Size	Schema format
`T5SqlGenerator.t5SmallAwesome()`	T5-small-awesome-text-to-sql	60M	~240 MB	`CREATE TABLE` DDL
`T5SqlGenerator.t5LargeSpider()`	T5-LM-Large-text2sql-spider	0.8B	~4.6 GB	Spider format with `[SEP]`

Choosing a preset¶

T5-small-awesome is recommended for most use cases. It's fast, lightweight, and handles standard SQL patterns well. Schema is provided as familiar CREATE TABLE statements.

T5-large-spider produces higher accuracy on complex queries (JOINs, subqueries, GROUP BY with HAVING). It uses a specialized schema format designed for the Spider benchmark with quoted identifiers and [SEP] table delimiters — ideal for integration with JDBC metadata.

Schema formats¶

T5-small-awesome (CREATE TABLE DDL)¶

String schema = "CREATE TABLE employees (id INT, name VARCHAR, salary INT); "
              + "CREATE TABLE departments (id INT, name VARCHAR)";
sqlGen.generateSql("What is the average salary?", schema);

T5-large-spider (Spider format)¶

String schema = "\"employees\" \"id\" int, \"name\" varchar, \"salary\" int, "
              + "foreign_key: primary key: \"id\" "
              + "[SEP] "
              + "\"departments\" \"id\" int, \"name\" varchar, "
              + "foreign_key: primary key: \"id\"";
sqlGen.generateSql("What is the average salary?", schema);

Builder options¶

Method	Type	Default	Description
`.modelId(String)`	`String`	Preset-dependent	HuggingFace model ID
`.modelSource(ModelSource)`	`ModelSource`	`HuggingFaceModelSource`	Model resolution strategy
`.sessionOptions(SessionConfigurer)`	`SessionConfigurer`	default	ONNX Runtime session config
`.tokenizerProvider(TokenizerProvider)`	`TokenizerProvider`	`UnigramTokenizer`	Tokenizer construction strategy
`.promptFormatter(BiFunction)`	`BiFunction<String, String, String>`	Preset-dependent	Combines (query, schema) into model prompt
`.maxNewTokens(int)`	`int`	`256`	Maximum tokens to generate
`.temperature(float)`	`float`	`0.0`	Sampling temperature
`.topK(int)`	`int`	`0` (disabled)	Top-K sampling
`.topP(float)`	`float`	`0.0` (disabled)	Nucleus sampling
`.eosTokenId(int)`	`int`	Auto-detected	End-of-sequence token ID

Result type¶

GenerationResult is a record with:

Field	Type	Description
`text()`	`String`	The generated SQL query
`promptTokens()`	`int`	Number of tokens in the input
`generatedTokens()`	`int`	Number of tokens generated
`duration()`	`Duration`	Wall-clock generation time

The convenience method generateSql(query, schema) returns the SQL as a plain String.

Tips¶

Use greedy decoding (default temperature=0) for SQL generation — deterministic output is what you want.
Always validate and sanitize generated SQL before executing it against a real database.
Include all relevant tables in the schema, even if the query only touches one — the model uses the full schema to resolve column references.
For the T5-large-spider model, the schema format can be generated programmatically from JDBC DatabaseMetaData.