TRL Seminar
Info
The Table Representation Learning (TRL) Seminar hosts talks on recent research in representation learning and generative models for structured data. Topics include fundamental mechanisms for modeling structured data, retrieval from tabular data sources, multi-modal tabular learning, and applications ranging from data management to reasoning and prediction over tabular data.
Organization
The TRL Seminar is an initiative from the TRL Lab under the affiliated Table Representation Learning Research Theme in the ELLIS unit Amsterdam, and is organized by Madelon Hulsebos (CWI), supported by Cornelius Wolff.
Logistics
- When: monthly on a Friday, 4-5pm, with drinks afterwards.
- Where: room L302 CWI, Science Park 123, Amsterdam (gather 5 minutes before in the lobby).
- How: talks are in-person, streamed and recorded through Zoom.
Upcoming talks
Past talks
Daniel Gomm, CWI & University of Amsterdam
23rd January, 4:00pm, room L302, CWI, Amsterdam Science Park, in-person talk and streamed through Zoom
Unfold Bio
Daniel Gomm is a PhD student at the Table Representation Learning Lab within the Database Architectures Group at Centrum Wiskunde & Informatica (CWI) and the Information Retrieval Lab at the University of Amsterdam. His research lies at the intersection of generative AI, information retrieval, and structured data, with the goal of democratizing access to insights from tables, relational databases, and spreadsheets by contextualizing generative methods in structured data settings. He brings an interdisciplinary background spanning engineering, economics, and computer science, with experience across industry, academia, and policy.
“Are We Asking the Right Questions? On Ambiguity in Natural Language Queries for Tabular Data Analysis”
Natural language interfaces for tabular data analysis must contend with ambiguity in user queries. Rather than treating ambiguity as a flaw to be eliminated, this talk argues that ambiguity is often an intentional and productive aspect of user–system interaction.
Daniel presents a principled framework that conceptualizes analytical queries through the lens of cooperative interaction, distinguishing between unambiguous queries, cooperative but underspecified queries that systems can reasonably resolve, and uncooperative queries that lack sufficient information for any actionable interpretation. The framework is grounded in linguistic theory and formalizes how responsibility for query specification is shared between user and system.
Applying this framework, the talk analyzes queries from 15 widely used benchmarks for tabular question answering, text-to-SQL, and data analysis. The results reveal that current datasets conflate different query types, undermining meaningful evaluation of both execution accuracy and interpretation capabilities. The talk concludes with implications for designing more realistic benchmarks and for building tabular data systems that explicitly support cooperative grounding, selective inference, and iterative clarification.
Join the seminar remotely via Zoom: https://cwi-nl-zoom.zoom.us/j/86289891036?pwd=5DGnyc3jpaucipnIjpEVa9wdGEISM1.1
Cornelius Wolff, CWI & University of Amsterdam
23rd January, 4:15pm, room L302, CWI, Amsterdam Science Park, in-person talk and streamed through Zoom
Unfold Bio
Cornelius Wolff is a PhD researcher at the TRL Lab at Centrum Wiskunde & Informatica (CWI) and the University of Amsterdam, supervised by Madelon Hulsebos and Maarten de Rijke. His main focus is the autonomous retrieval of relevant insights from structured data in realistic settings, with an emphasis on scalable and interpretable AI systems for databases and tabular data. He is also interested in text-to-SQL, small and efficient language models, and in-context learning beyond natural language.
“SQALE: Scaling Text-to-SQL with Realistic Database Schemas”
Natural language interfaces for databases rely on text-to-SQL models that can translate user questions into executable SQL queries. While recent advances in large language models have significantly improved performance, progress remains constrained by the limited scale, diversity, and realism of available training data.
In this talk, Cornelius presents SQALE, a large-scale semi-synthetic text-to-SQL dataset grounded in real-world database schemas. SQALE comprises over 517,000 validated (question, schema, query) triples built on 135,000+ relational schemas derived from SchemaPile. The dataset is constructed using a principled generation pipeline that combines schema extension, natural language question synthesis, and SQL generation with execution-based validation.
The talk will discuss the design criteria behind SQALE, its statistical properties in comparison to existing benchmarks such as Spider 2.0 and BIRD, and how SQALE enables more realistic training and evaluation of text-to-SQL models.
Join the seminar remotely via Zoom: https://cwi-nl-zoom.zoom.us/j/86289891036?pwd=5DGnyc3jpaucipnIjpEVa9wdGEISM1.1
Erkan Karabulut, University of Amsterdam
23rd January, 4:30pm, room L302, CWI, Amsterdam Science Park, in-person talk and streamed through Zoom
Unfold Bio
Erkan is a PhD student at the INtelligent Data Engineering Lab (INDElab), University of Amsterdam. His research focuses on knowledge discovery and interpretable inference via rule learning on tabular data, including sensor data, with and without structured background knowledge (knowledge graphs) or prior knowledge from foundation models.
“Scalable Knowledge Discovery from Tabular Data”
Discovering patterns from data in human-understandable forms is a valuable task for both knowledge discovery and interpretable inference. A prominent method is Association Rule Mining (ARM), which identifies patterns in the form of logical rules describing relationships between data attributes. Popular ARM methods, however, rely on algorithmic or optimization-based solutions that struggle to scale to high-dimensional datasets (i.e., tables with many columns) without effective search space reduction.
This talk introduces Aerial+, a novel ARM method that leverages neural networks’ ability to handle high-dimensional data to learn a concise set of prominent patterns from tabular datasets. Aerial+ has been evaluated on both digital twin datasets (sensor data enriched with semantics) and on generic tabular datasets, demonstrating its versatility across domains. In addition, Aerial+ can incorporate prior knowledge to enhance discovery: either from knowledge graphs (structured semantic information about a domain) or from tabular foundation models, large pre-trained neural networks that capture table semantics and support diverse downstream tasks.
Join the seminar remotely via Zoom: https://cwi-nl-zoom.zoom.us/j/86289891036?pwd=5DGnyc3jpaucipnIjpEVa9wdGEISM1.1
Irina Saparina, University of Edinburgh
Friday 28 November 4:00-5:00pm, room L302, Science Park 123 Amsterdam, in-person talk and streamed through Zoom
Unfold Bio
Irina Saparina is a final-year Ph.D. student in Informatics at the University of Edinburgh, advised by Mirella Lapata. Her research focuses on building AI systems that can reason about user intents: recognizing ambiguity, offering diverse responses, and adapting to language variation. She is particularly interested in tasks that require complex multi-step reasoning, especially in text-to-SQL semantic parsing.
“From Implicit Bias to Explicit Choice: Handling Ambiguity in Text-to-SQL Parsing”
Practical text-to-SQL parsers are expected to understand user requests and map them to executable SQL queries, even when these requests are ambiguous. I begin by introducing AMBROSIA, a benchmark for text-to-SQL with ambiguous questions. Our findings reveal that even state-of-the-art LLMs struggle to recognize and interpret ambiguity, showing strong biases toward preferred interpretations. To address this challenge, I present a modular "disambiguate first, parse later" approach that generates natural language interpretations before mapping to logical forms. This approach constructively exploits LLM biases to generate an initial set of preferred disambiguations and then applies a specialized infilling model to identify and generate missing interpretations. Finally, I demonstrate how reinforcement learning with customized reward functions enables models to generate multiple interpretation-answer pairs in a single stage. This shifts focus from how to answer to what to answer, encouraging models to reason about user intent and consider different interpretations before responding. Together, these works move toward more robust, user-aligned LLMs that embrace rather than obscure ambiguity.
Want to join the seminar in-person? Please gather in the CWI Lobby by 3:50pm, someone will pick you up there!
Join the seminar remotely via Zoom: https://cwi-nl-zoom.zoom.us/j/86992868846?pwd=riEYApqkuahvoFvJpFuqglPWMIYlWk.1
Marine Le Morvan, Inria
Friday 11 April 4-5pm, L3.36 at LAB42, Amsterdam Science Park, in-person talk and streamed through Zoom
Unfold Bio
Marine Le Morvan is an Inria research scientist in the SODA team in Paris-Saclay. Her research lies at the intersection of statistical learning and trustworthy AI, with a focus on:
- Tabular foundation models, which unlock new possibilities through large-scale pretraining.
- Model auditing, to enhance the trustworthiness and reliability of machine learning systems.
- Learning from incomplete data, a challenge pervasive in fields like healthcare and social sciences.
“TabICL: A Tabular Foundation Model for In-Context Learning on Large Data”
The long-standing dominance of gradient-boosted decision trees on tabular data is currently challenged by tabular foundation models using In-Context Learning (ICL): setting the training data as context for the test data and predicting in a single forward pass without parameter updates. While the very recent TabPFNv2 foundation model (2025) excels on tables with up to 10K samples, its alternating column- and row-wise attentions make handling large training sets computationally prohibitive. So, can ICL be effectively scaled and deliver a benefit for larger tables? We introduce TabICL, a tabular foundation model pre-trained on datasets with up to 60K samples and handling 500K samples on affordable resources. This is enabled by a novel two-stage architecture: a column-then-row attention mechanism to build fixed-dimensional embeddings of rows, followed by a transformer for efficient ICL. On the TALENT benchmark with 200 datasets, TabICL is on par with TabPFNv2 while being systematically faster (up to 10 times), and significantly outperforms all other approaches. On the 56 datasets with over 10K samples, TabICL surpasses both TabPFNv2 and CatBoost, demonstrating the potential of ICL for large data.
Join the seminar remotely via Zoom: https://cwi-nl-zoom.zoom.us/j/86928893058?pwd=0tFURmzfFWXtWyN4xqkx15urhoui7b.1
Vaishali Pal, University of Amsterdam
Thursday 22 May 4-5pm, L3.33 at LAB42, Amsterdam Science Park, live through Zoom
Unfold Bio
Vaishali is a final-year PhD candidate at the Information Retrieval Lab at the University of Amsterdam. Her research interests are in natural language processing and information retrieval, with a focus on semi-structured tables.
“Table Question Answering”
In this talk, I discuss my research on question answering over semi-structured tables. Semi-structured tables are fact-heavy and pose significant challenges to language models aiming to effectively meet a user's information needs. To understand these challenges, I discuss various tasks such as question answering and summarization over multiple tabular contexts and low-resource table question answering. Finally, I briefly discuss information retrieval over tables.
Join the seminar remotely via Zoom: https://cwi-nl-zoom.zoom.us/j/86928893058?pwd=0tFURmzfFWXtWyN4xqkx15urhoui7b.1
Margherita Martorana, VU Amsterdam
Friday 24 October 4:00-4:30pm, room L302 at CWI, Science Park 123 Amsterdam, in-person talk and streamed through Zoom
Unfold Bio
Margherita is a researcher in knowledge engineering and AI at Vrije Universiteit Amsterdam. Her research interests focus on neurosymbolic approaches that combine symbolic reasoning and neural models to build interpretable, interoperable and semantically aware AI solutions. Her PhD explored how metadata-driven and semantic methods can make confidential tabular data more findable and reusable while preserving privacy. Most recently, she worked on applying knowledge graphs and multimodal models to improve the adaptability and interoperability of personal service robots.
“How can we work with data that we cannot see? Metadata-driven approaches for confidential tabular data”
Many datasets that could benefit society, from the medical domain to social science, contain personal or confidential information and therefore cannot be openly shared. This raises a key question: how can we do data-driven research when the data itself is not available? In this talk, I will present the main focus of my PhD research, which explores how metadata can be used to enable the discovery and reuse of restricted-access tabular data without exposing sensitive information. I will introduce the concept of dataless tables, where the actual data is not accessible, but the metadata is used to describe meaningful aspects of the dataset - such as its structure, variables, and relationships - allowing it to be understood and connected without revealing the underlying content. I will then show how knowledge repositories, ontologies, and large language models can support the enrichment and integration of metadata to improve data FAIRness. The work shows that metadata-driven methods offer a practical way to balance data utility and data protection, contributing to a more transparent and secure data-sharing ecosystem.
Join the seminar remotely via Zoom: https://cwi-nl-zoom.zoom.us/j/89379959994?pwd=auygjgwlePCaaFR4SeBkqWHy7lTuyZ.1
Elias Dubbeldam, University of Amsterdam
Friday 24 October 4:30-5:00pm, room L302 at CWI, Science Park 123 Amsterdam, in-person talk and streamed through Zoom
Unfold Bio
Elias Dubbeldam is a second-year PhD candidate in the Business Analytics Section at the University of Amsterdam. His research focuses on tabular deep learning, specifically on modeling feature (i.e., column) interactions in single tables. This work lies at the intersection of probabilistic graphical models and graph neural networks. Elias is also interested in healthcare applications of tabular machine learning.
“Graph-based Tabular Deep Learning Should Learn Feature Interactions, Not Just Make Predictions”
In this talk, I discuss my research on feature interactions in tabular deep learning (TDL). A key challenge of single-table prediction lies in modeling the complex, dataset-specific feature interactions that are central to tabular data. I argue that TDL should move beyond prediction-centric objectives and prioritize the explicit learning and evaluation of feature interactions. I discuss my recent research on graph-based TDL (GTDL) methods, which represent features as nodes and their interactions as edges in graph neural networks. Using synthetic datasets with known ground-truth graph structures, we show that existing GTDL methods fail to recover meaningful feature interactions. Moreover, enforcing the true interaction structure improves predictive performance. This highlights the need for GTDL methods to prioritize quantitative evaluation and accurate structural learning. Finally, I highlight opportunities and challenges for explicit feature interaction modeling within tabular foundation models.
Join the seminar remotely via Zoom: https://cwi-nl-zoom.zoom.us/j/89379959994?pwd=auygjgwlePCaaFR4SeBkqWHy7lTuyZ.1