ALOT - Amsterdam Lunch on Table

Description

Amsterdam Lunch on Table (ALOT) is a reading group focused on table representation learning and, more generally, neural models for structured data. It takes place every first and third Wednesday of the month over lunch (at 12:00). Our objective is to foster a collaborative environment where researchers from the Amsterdam region can discuss and explore the intersection of AI and structured data. Each session is designed to be interactive, encouraging participants to engage in discussions that deepen their understanding of the latest research and methodologies. Through these sessions, we aim to inspire research ideas, support growth as researchers, and facilitate networking opportunities within the community.


Where & when? Please respond to our message on Discord if you are joining so we can pick you up at the CWI entrance. In the future, we plan to move the location to the UvA campus at Science Park to make it more accessible for everyone. We will keep you updated on this!

How it works We discuss one paper in each session. The paper is selected by the group and is announced at least a week in advance. One person is responsible for chairing the session and preparing a short introduction to the paper. The session chair is also responsible for facilitating the discussion and ensuring that everyone has a chance to contribute. We expect participants to read the paper in advance and send some questions or discussion points to the session chair to enable a more comprehensive and engaging discussion.
We meet over lunch, and we encourage people to eat while we discuss the paper. Lunch is catered, so you don't have to bring your own. Please respond to the session announcement on Discord if you are coming so we can order enough food for everyone!

Want to join the reading group? Then join the ALOT Discord channel. We manage the reading group via Discord and will announce the papers and sessions there.


Next Session

The next session of the ALOT reading group will take place on Wednesday, June 18, 2025 at 12:00 and we will discuss:

Paper: AOP: Automated and Interactive LLM Pipeline Orchestration for Answering Complex Queries
Authors: J. Wang, G. Li
Venue: CIDR (2025)
Session Chair: Daniel Gomm

Abstract: Current data lakes are limited to basic put/get functions on unstructured data and analytical queries on structured data. They fall short in handling complex queries that require multi-hop semantic retrieval and linking, multi-step logical reasoning, and multi-stage semantic analytics across unstructured, semi-structured, and structured data in data lakes. The introduction of large language models (LLMs) has significantly transformed the landscape of traditional data search and analytics across different fields due to their semantic comprehension and reasoning skills. Utilizing LLMs opens up new opportunities to efficiently handle these complex queries for data search and analytics, spanning structured, semi-structured, and unstructured data types in data lakes. However, LLMs struggle with complex queries that require complex task decomposition, pipeline orchestration, pipeline optimization, interactive execution, and self-reflection. In this work, we propose AOP, the first systematic system for automated pipeline orchestration in LLMs for answering complex queries on data lakes. AOP pre-defines standard semantic operators crucial for building execution workflows, such as semantic retrieval, filtering, aggregation, and validation. Then given an online query, AOP extracts relevant operators and uses these operators to automatically and interactively compose optimized pipelines with the assistance of LLMs. This enables AOP to adaptively and accurately address diverse and complex queries on data lakes. To further improve efficiency, we introduce query optimization techniques, including prefetching and parallel execution, to enhance overall efficiency without sacrificing accuracy. Through extensive experiments on real-world datasets, we demonstrate that AOP significantly improves the accuracy for answering complex queries. For instance, on a challenging test set, AOP increases answer accuracy by 45%.

Previous Sessions

2025
Wednesday, April 16, 2025 - Large Language Models (LLMs) on Tabular Data: Prediction, Generation, and Understanding - A Survey

Abstract: Recent breakthroughs in large language modeling have facilitated rigorous exploration of their application in diverse tasks related to tabular data modeling, such as prediction, tabular data synthesis, question answering, and table understanding. Each task presents unique challenges and opportunities. However, there is currently a lack of comprehensive review that summarizes and compares the key techniques, metrics, datasets, models, and optimization approaches in this research domain. This survey aims to address this gap by consolidating recent progress in these areas, offering a thorough survey and taxonomy of the datasets, metrics, and methodologies utilized. It identifies strengths, limitations, unexplored territories, and gaps in the existing literature, while providing some insights for future research directions in this vital and rapidly evolving field. It also provides relevant code and datasets references. Through this comprehensive review, we hope to provide interested readers with pertinent references and insightful perspectives, empowering them with the necessary tools and knowledge to effectively navigate and address the prevailing challenges in the field.

Notes: We read this survey paper in our inaugural session to get a good overview of the field.

Synopsis of Reading Group Session

Wednesday, May 21, 2025 - TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second

Abstract: We present TabPFN, a trained Transformer that can do supervised classification for small tabular datasets in less than a second, needs no hyperparameter tuning and is competitive with state-of-the-art classification methods. TabPFN performs in-context learning (ICL), it learns to make predictions using sequences of labeled examples (x, f(x)) given in the input, without requiring further parameter updates. TabPFN is fully entailed in the weights of our network, which accepts training and test samples as a set-valued input and yields predictions for the entire test set in a single forward pass. TabPFN is a Prior-Data Fitted Network (PFN) and is trained offline once, to approximate Bayesian inference on synthetic datasets drawn from our prior. This prior incorporates ideas from causal reasoning: It entails a large space of structural causal models with a preference for simple structures. On the 18 datasets in the OpenML-CC18 suite that contain up to 1 000 training data points, up to 100 purely numerical features without missing values, and up to 10 classes, we show that our method clearly outperforms boosted trees and performs on par with complex state-of-the-art AutoML systems with up to 230× speedup. This increases to a 5 700× speedup when using a GPU. We also validate these results on an additional 67 small numerical datasets from OpenML. We provide all our code, the trained TabPFN, an interactive browser demo and a Colab notebook at [this https URL](https://github.com/PriorLabs/TabPFN).

Notes: We used the [Nature publication](https://www.nature.com/articles/s41586-024-08328-6) as supplementary material.

Wednesday, June 04, 2025 - TableGPT2: A Large Multimodal Model with Tabular Data Integration

Abstract: The emergence of models like GPTs, Claude, LLaMA, and Qwen has reshaped AI applications, presenting vast new opportunities across industries. Yet, the integration of tabular data remains notably underdeveloped, despite its foundational role in numerous real-world domains. This gap is critical for three main reasons. First, database or data warehouse data integration is essential for advanced applications; second, the vast and largely untapped resource of tabular data offers immense potential for analysis; and third, the business intelligence domain specifically demands adaptable, precise solutions that many current LLMs may struggle to provide. In response, we introduce TableGPT2, a model rigorously pre-trained and fine-tuned with over 593.8K tables and 2.36M high-quality query-table-output tuples, a scale of table-related data unprecedented in prior research. This extensive training enables TableGPT2 to excel in table-centric tasks while maintaining strong general language and coding abilities. One of TableGPT2's key innovations is its novel table encoder, specifically designed to capture schema-level and cell-level information. This encoder strengthens the model's ability to handle ambiguous queries, missing column names, and irregular tables commonly encountered in real-world applications. Similar to visual language models, this pioneering approach integrates with the decoder to form a robust large multimodal model. We believe the results are compelling: over 23 benchmarking metrics, TableGPT2 achieves an average performance improvement of 35.20% in the 7B model and 49.32% in the 72B model over prior benchmark-neutral LLMs, with robust general-purpose capabilities intact.