Two papers at the AITD workshop at EurIPS 2025!

Two papers to be presented at the AI for Tabular Data workshop at EurIPS 2025!


Congrats to Daniel and Cornelius on these very interesting contributions!

Daniel led an interesting reflection on the characteristics of natural language queries in open-domain insight extraction, i.e., settings where relevant data needs to be retrieved and processed first in order to provide the desired insights. What kind of queries should we expect in practice, and what queries are we evaluating these systems with? Basically: are we asking the right questions?

Paper: Are We Asking the Right Questions? On Ambiguity in Natural Language Queries for Tabular Data Analysis
Authors: Daniel Gomm, Cornelius Wolff, Madelon Hulsebos
TLDR: Ambiguity in natural-language queries over tables shouldn’t be viewed as a flaw but as part of natural human-data interaction. We propose a framework that revolves around a shared responsibility between users and systems for specifying queries. Applying criteria distilled from the framework to 15 popular tabular QA datasets, we find that current benchmarks mix query types in ways that undermine both execution and interpretation evaluation. We call for new design and evaluation practices that explicitly account for cooperative ambiguity in natural-language interfaces to tabular data.
Link: https://arxiv.org/abs/2511.04584
Blogpost: https://www.daniel-gomm.com/blog/2025/Have-you-Queries-Already-Seen-the-Data

Cornelius led the development of a large-scale semi-synthetic text-to-SQL dataset (500K triples) grounded in real-world schemas and questions. The dataset is intended to facilitate training and tuning smaller, specialized text-to-SQL models.

Paper: SQaLe: A Large Text-to-SQL Corpus Grounded in Real Schemas
Authors: Cornelius Wolff, Daniel Gomm, Madelon Hulsebos
TLDR: SQaLe is a large-scale text-to-SQL dataset built from over 139,000 database schemas and more than 500,000 validated triples of schema, question, and query. It was created to address the limits of existing resources in scale, diversity, and realism, providing a foundation for training and evaluating models that translate natural language into SQL. The dataset reflects real schema complexity and can be loaded directly from the Hugging Face Hub for research or fine-tuning.
Link: https://openreview.net/pdf?id=6PsKDjgoEy
Dataset (HuggingFace): https://huggingface.co/datasets/trl-lab/SQaLe-text-to-SQL-dataset
Blogpost (HuggingFace): https://huggingface.co/blog/cwolff/sqale
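
For readers who want to try the dataset, here is a minimal sketch of loading it from the Hugging Face Hub with the `datasets` library. The repository ID is taken from the dataset link above. The split name `"train"` and the column names `"schema"` and `"question"` are assumptions based on the triple structure described in the TLDR, and `stream_examples` and `format_prompt` are hypothetical helpers, not part of the release. Please check the dataset card for the actual fields.

```python
from typing import Iterator

# Repository ID taken from the dataset link above.
DATASET_ID = "trl-lab/SQaLe-text-to-SQL-dataset"


def stream_examples(split: str = "train", limit: int = 3) -> Iterator[dict]:
    """Yield the first `limit` rows without downloading the full corpus.

    Assumption: the dataset exposes a split called "train".
    """
    # Imported lazily so format_prompt() stays usable without `datasets` installed.
    from datasets import load_dataset

    # streaming=True returns an iterable dataset that fetches rows on demand.
    rows = load_dataset(DATASET_ID, split=split, streaming=True)
    for i, row in enumerate(rows):
        if i >= limit:
            break
        yield row


def format_prompt(
    row: dict,
    schema_key: str = "schema",      # assumed column name
    question_key: str = "question",  # assumed column name
) -> str:
    """Turn one row into a plain-text prompt for a text-to-SQL model."""
    return (
        f"-- Schema:\n{row[schema_key]}\n"
        f"-- Question: {row[question_key]}\n"
        f"-- SQL:"
    )
```

A quick way to check the real column names is to inspect one streamed row, e.g. `print(sorted(next(stream_examples(limit=1)).keys()))`, and then pass the matching keys to `format_prompt`.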