Bringing Foundation Models to Small Data
- Juan Manuel Ortiz de Zarate
- 4 days ago
In the world of machine learning, deep neural networks have revolutionized computer vision and natural language processing. But when it comes to tabular data, tree-based models like XGBoost [3], LightGBM [4], and CatBoost [2] have remained dominant. That status quo may be shifting, thanks to a recent breakthrough published in Nature: the Tabular Prior-data Fitted Network (TabPFN) [1]. TabPFN is a transformer-based foundation model designed specifically for small to medium-sized tabular datasets. Despite its unconventional approach, TabPFN not only rivals traditional methods but significantly outperforms them across a variety of benchmarks [8], all while fitting and predicting in seconds.
This article explores the architecture, training philosophy, and empirical results behind TabPFN, and discusses why this foundation model could reshape how data scientists approach tabular learning.
The Challenge of Tabular Data
Unlike images or text, tabular datasets are heterogeneous: each column may encode a different data type, unit, or semantic meaning. Moreover, most real-world tabular datasets are relatively small. On OpenML, for example, over 75% of datasets contain fewer than 10,000 rows. This diversity and data scarcity make it hard to pretrain general-purpose deep models for tabular tasks.
While deep learning thrives on large-scale data and consistent structure, tabular problems demand robustness to missing values, categorical variables, feature scaling, and non-linear dependencies. For these reasons, gradient-boosted decision trees (GBDTs) have become the default choice for practitioners.
Enter TabPFN: A Learned Learning Algorithm

TabPFN introduces a fundamentally different way to tackle tabular prediction tasks. Rather than training a model anew on each dataset, TabPFN is trained once to learn a universal prediction algorithm. It does so by being exposed to a massive variety of synthetic tabular datasets during pretraining. Each of these datasets has its own structure, feature types, and causal relationships. By observing so many different scenarios, TabPFN internalizes a flexible, general-purpose learning strategy.
This is made possible through the concept of in-context learning (ICL)[7]. ICL allows a model to perform learning at inference time, by conditioning on a set of training examples without updating weights. In other words, the model doesn’t require gradient descent to adapt to new tasks—it already knows how to generalize from data patterns it sees in the input.
During inference, TabPFN takes a full dataset as input—some rows are labeled (training set), others are not (test set). In a single forward pass, it predicts the labels of the test rows, using the context provided by the training rows. No backpropagation or iterative optimization is involved at this stage.
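To make the idea of prediction without weight updates concrete, here is a minimal sketch in plain Python. It is not TabPFN: the function `in_context_predict`, its `temperature` parameter, and the distance-based attention are illustrative assumptions that stand in for the pretrained transformer. What it shares with the description above is the interface: labeled rows go in as context, test rows go in as queries, and the labels come out of one pass with nothing fitted.

```python
import math

def in_context_predict(train_X, train_y, test_X, temperature=1.0):
    """Predict class probabilities for test rows from labeled context rows.

    Hand-written stand-in for a pretrained network: each test row attends to
    the training rows (softmax over negative squared distance) and averages
    their labels. No parameter is fitted or updated anywhere.
    """
    classes = sorted(set(train_y))
    predictions = []
    for query in test_X:
        # Attention scores: closer training rows receive larger scores.
        scores = [-sum((q - x) ** 2 for q, x in zip(query, row)) / temperature
                  for row in train_X]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Attention-weighted vote over the labels of the context rows.
        probs = {c: 0.0 for c in classes}
        for w, label in zip(weights, train_y):
            probs[label] += w / total
        predictions.append(probs)
    return predictions

# Labeled rows form the context, unlabeled rows are the queries.
train_X = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]]
train_y = [0, 0, 1, 1]
test_X = [[0.1, 0.1], [1.0, 1.0]]
probs = in_context_predict(train_X, train_y, test_X)
labels = [max(p, key=p.get) for p in probs]
```

In TabPFN the fixed distance rule is replaced by attention layers whose weights were learned during pretraining, which is what lets the model adapt its behavior to each new dataset.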
The key innovation is that TabPFN learns this inference process itself. It is not just solving individual prediction tasks, but learning a learning algorithm that generalizes across many different problems. This meta-learning approach makes TabPFN a learned learning algorithm for tabular data: a neural network whose weights encode how to learn from a dataset.
This strategy pays off in practice. Because the learning behavior is embedded in the model weights, TabPFN can generalize across tasks instantly, yielding strong performance even on previously unseen datasets. And because it is trained on synthetic data generated from causal models, it encounters a wide spectrum of feature-target relationships during training, making it robust to real-world complexity.
Architecture Innovations
The TabPFN architecture adapts the transformer encoder to the two-dimensional structure of tabular data. Traditional transformers [6] treat inputs as sequences, which works well for text but fails to exploit the rich row-column semantics of tables. TabPFN overcomes this by treating each table cell as a distinct input token and introducing a custom two-dimensional attention mechanism.

This two-way attention scheme operates in two directions:
Row-wise attention: Each cell attends to other features in the same row, capturing inter-feature relationships within individual samples.
Column-wise attention: Each cell attends to the same feature across all rows, allowing it to learn how a single feature behaves across different samples.
Together, these attention patterns enable the model to learn context-aware representations that reflect both the per-sample structure and global feature dynamics. This design also ensures permutation invariance with respect to the order of rows and columns—a crucial property for tabular data.
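The two attention directions can be sketched in a few lines of plain Python. This is a simplified illustration, not the published architecture: it uses identity projections instead of learned query, key, and value matrices, and it omits residual connections, feed-forward blocks, and the train/test masking. The names `self_attention` and `two_way_attention` are made up for this example.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Dot-product self-attention with identity projections (no learned weights)."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                           for k in tokens])
        out.append([sum(w * v[i] for w, v in zip(weights, tokens))
                    for i in range(dim)])
    return out

def two_way_attention(table):
    """table[r][c] is the embedding (a list of floats) of one cell."""
    n_rows, n_cols = len(table), len(table[0])
    # 1) Row-wise: each cell attends to the other features of its own sample.
    table = [self_attention(row) for row in table]
    # 2) Column-wise: each cell attends to the same feature in all samples.
    columns = [self_attention([table[r][c] for r in range(n_rows)])
               for c in range(n_cols)]
    return [[columns[c][r] for c in range(n_cols)] for r in range(n_rows)]

# Three samples, two features, two-dimensional cell embeddings.
table = [
    [[0.1, 0.2], [0.3, -0.1]],
    [[0.5, 0.0], [-0.2, 0.4]],
    [[-0.3, 0.6], [0.2, 0.2]],
]
out = two_way_attention(table)
# Feeding the rows in a different order only reorders the output rows.
swapped = two_way_attention([table[2], table[0], table[1]])
```

Because neither step depends on where a row or column sits in the table, reordering the input rows simply reorders the outputs, which is the order-independence property described above.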
To make the model practical at scale, several optimizations were implemented:
Memory-efficient attention: Flash attention is used to reduce the memory footprint during training and inference, enabling training on large synthetic datasets.
Half-precision computation: The model uses mixed precision (e.g., FP16) to further cut down memory usage without sacrificing accuracy.
Inference caching: When test samples are presented after training samples, TabPFN can cache the internal state computed for the training data, avoiding redundant computation and speeding up inference by orders of magnitude.
Output distributions: For regression tasks, the model predicts a piecewise constant distribution over possible target values, capturing uncertainty and supporting multi-modal outputs.
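As an illustration of the last item, the sketch below turns a vector of per-bin scores into a piecewise constant density. The bin edges and logits are hypothetical values chosen to produce two modes, and the function name is an assumption for this example; how TabPFN actually places its bins is not covered here.

```python
import math

def piecewise_constant_distribution(logits, bin_edges):
    """Convert per-bin logits into bin probabilities, a density, and a mean."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]                      # probability per bin
    bins = list(zip(bin_edges[:-1], bin_edges[1:]))
    density = [p / (hi - lo) for p, (lo, hi) in zip(probs, bins)]  # constant inside each bin
    mean = sum(p * (lo + hi) / 2 for p, (lo, hi) in zip(probs, bins))
    return probs, density, mean

bin_edges = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
logits = [2.0, 0.0, -2.0, 0.0, 2.0]   # hypothetical model output with two modes
probs, density, mean = piecewise_constant_distribution(logits, bin_edges)
```

Here the mean lands at 2.5, in the middle bin, which is the least likely region. A point estimate alone would hide that the mass sits near both ends of the range, which is why a full distribution is useful for multi-modal targets.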
Moreover, TabPFN uses a fixed-size transformer with 12 custom 2D layers, enabling it to scale to datasets with up to 50 million cells (e.g., 5 million rows by 10 features) on a single high-memory GPU. Each cell is represented by a compact embedding, and feature identities are preserved through learned positional encodings that distinguish between different columns.
Another critical innovation is the ability to separate training and test processing. While typical ICL requires a joint forward pass over training and test examples, TabPFN can compute its representation of the training set once and reuse it to make fast predictions on multiple test rows—mimicking the behavior of traditional fitted models but within a transformer architecture.
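The fit-once, predict-many pattern can be sketched as follows. This toy class is an assumption-laden illustration, not TabPFN's cache: `CachedContextRegressor`, its `sharpness` parameter, and the row normalization are invented for the example, and the `encode_calls` counter exists only to show that the training rows are processed a single time.

```python
import math

def _unit(row):
    """Scale a row to unit length (guarding against all-zero rows)."""
    norm = math.sqrt(sum(v * v for v in row)) or 1.0
    return [v / norm for v in row]

class CachedContextRegressor:
    """Toy model that encodes the training context once and reuses it."""

    def __init__(self, sharpness=10.0):
        self.sharpness = sharpness
        self.encode_calls = 0      # how many training rows have been encoded
        self._keys = None
        self._targets = None

    def _encode_training_row(self, row):
        # Stand-in for the expensive pass over the training context.
        self.encode_calls += 1
        return _unit(row)

    def fit(self, train_X, train_y):
        # Compute and cache the training representation.
        self._keys = [self._encode_training_row(r) for r in train_X]
        self._targets = list(train_y)
        return self

    def predict(self, test_X):
        if self._keys is None:
            raise RuntimeError("call fit() before predict()")
        preds = []
        for row in test_X:
            query = _unit(row)
            # Test rows attend to the cached keys; nothing is re-encoded.
            scores = [self.sharpness * sum(q * k for q, k in zip(query, key))
                      for key in self._keys]
            top = max(scores)
            weights = [math.exp(s - top) for s in scores]
            total = sum(weights)
            preds.append(sum(w * t for w, t in zip(weights, self._targets)) / total)
        return preds

model = CachedContextRegressor().fit([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                                     [0.0, 10.0, 5.0])
first = model.predict([[1.0, 0.1]])
second = model.predict([[0.1, 1.0]])
```

After two separate `predict` calls the counter still equals the number of training rows, which is the behavior that makes repeated queries cheap.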
Together, these architectural advancements enable TabPFN to combine the generalization power of transformers with the efficiency and structure-awareness needed for real-world tabular data.
Synthetic Pretraining with Causal Models
One of the most innovative aspects of TabPFN is its training procedure, which completely bypasses the need for large real-world tabular datasets. Instead, the model is pretrained entirely on synthetic data generated from structural causal models (SCMs). This approach is both principled and pragmatic—it provides an almost unlimited supply of diverse, realistic data while avoiding issues related to privacy, licensing, or data scarcity.

Each synthetic dataset is built from a randomly sampled causal graph: a directed acyclic graph (DAG) where nodes represent features or target variables and edges encode causal or statistical dependencies. To simulate real-world complexity, these graphs vary in size, structure, and complexity. The relationships between variables are generated using a diverse mix of computational primitives including:
Small neural networks with nonlinear activations
Arithmetic and trigonometric functions (e.g., sine, modulo, log)
Discretization for categorical variable generation
Tree-based logic mimicking decision rules
Gaussian noise for stochastic variation
To further simulate realistic challenges, the synthetic datasets include missing values, outliers, and irrelevant (uninformative) features. Feature types vary between continuous, categorical, and ordinal. Each dataset also comes with a configurable level of difficulty, controlling the signal-to-noise ratio and degree of non-linearity in the target function.
The generation process consists of four key steps:
Hyperparameter Sampling: Defines the size, feature dimensionality, and complexity of the dataset.
Causal Graph Construction: Creates the SCM and assigns each node a transformation function.
Data Generation: Samples inputs and propagates them through the graph to produce feature-target pairs.
Postprocessing: Applies data warping, discretization, and other transformations to increase heterogeneity.
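The four steps above can be sketched as a small generator. This is a heavily simplified illustration under stated assumptions: the function `sample_dataset`, the parameter ranges, the choice of node functions, and the postprocessing rules are all invented for the example and are far less varied than the prior described in the paper.

```python
import math
import random

def sample_dataset(rng, n_rows=50, n_nodes=6, n_features=3):
    """Sample one synthetic classification dataset from a random causal graph."""
    # 1) Hyperparameter sampling: noise level and edge density (hypothetical ranges).
    noise_std = rng.uniform(0.01, 0.3)
    edge_prob = rng.uniform(0.3, 0.8)

    # 2) Causal graph construction: node j may only depend on nodes i < j,
    #    which guarantees a DAG. Each node gets a random transformation.
    functions = [math.tanh, math.sin, abs, lambda v: v]
    parents = {j: [(i, rng.gauss(0, 1)) for i in range(j) if rng.random() < edge_prob]
               for j in range(n_nodes)}
    node_fn = {j: rng.choice(functions) for j in range(n_nodes)}

    # 3) Data generation: propagate values through the graph in topological order.
    rows = []
    for _ in range(n_rows):
        values = []
        for j in range(n_nodes):
            if parents[j]:
                signal = sum(w * values[i] for i, w in parents[j])
                values.append(node_fn[j](signal) + rng.gauss(0, noise_std))
            else:
                values.append(rng.gauss(0, 1))   # root node: pure noise
        rows.append(values)

    # Pick which graph nodes are observed as features and which is the target.
    chosen = rng.sample(range(n_nodes), n_features + 1)
    feature_ids, target_id = chosen[:-1], chosen[-1]
    X = [[row[i] for i in feature_ids] for row in rows]
    target = [row[target_id] for row in rows]

    # 4) Postprocessing: bin the first feature into three categories and
    #    threshold the target at its median to get binary labels.
    col = sorted(x[0] for x in X)
    low, high = col[n_rows // 3], col[2 * n_rows // 3]
    for x in X:
        x[0] = 0 if x[0] < low else (1 if x[0] < high else 2)
    median = sorted(target)[n_rows // 2]
    y = [int(t >= median) for t in target]
    return X, y

X, y = sample_dataset(random.Random(0))
```

Calling `sample_dataset` with a different seed yields a different graph, different functions, and therefore a different learning problem, which is the property the pretraining procedure relies on.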
The model is then trained to predict masked target values within these synthetic datasets, given the remaining labeled rows as context. This teaches it to perform a wide variety of learning tasks across domains.
By repeating this process over 100 million times, TabPFN learns a highly generalized, task-agnostic inductive bias. Rather than memorizing solutions to specific tasks, it internalizes a flexible learning mechanism that can be applied to new, unseen problems without retraining.
This synthetic pretraining strategy is not only scalable and reproducible but also remarkably effective. Because the training data spans a vast space of causal mechanisms, it prepares TabPFN for the diversity found in real-world applications—without requiring access to sensitive or proprietary datasets.
Empirical Performance
To rigorously assess TabPFN's effectiveness, the authors evaluated it on two large benchmark suites: the AutoML Benchmark and OpenML-CTR23. These collections span diverse application domains including finance, medicine, physics, marketing, and public datasets from Kaggle competitions. In total, 29 classification and 28 regression datasets were used, all limited to at most 10,000 samples and 500 features, which is typical of most real-world tabular tasks.
In the classification setting, TabPFN achieved a normalized ROC AUC score of 0.952 when tuned, and 0.939 in its default configuration. This is significantly better than the best-performing traditional model (CatBoost), which reached 0.822 with 4 hours of hyperparameter tuning. In other words, even the untuned default TabPFN (0.939) beat the tuned CatBoost (0.822).
For regression, the results were similarly impressive. TabPFN delivered a normalized negative RMSE of 0.968 (tuned) compared to CatBoost’s 0.875. Notably, this performance comes with an extraordinary efficiency gain: TabPFN required just 2.8 seconds on average for classification and 4.8 seconds for regression per dataset, while traditional models were given up to 4 hours of CPU time for tuning.
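The scores above are normalized, and the article does not spell out how. One common convention, assumed here, is to rescale each dataset's scores so that the worst compared method gets 0 and the best gets 1, then average across datasets. The sketch below shows that computation on hypothetical numbers with made-up method and dataset names; for an error metric such as RMSE, the values would be negated first so that higher is better.

```python
def normalize_per_dataset(scores):
    """scores: {dataset: {method: raw_score}}, where higher is better."""
    normalized = {}
    for dataset, by_method in scores.items():
        lo, hi = min(by_method.values()), max(by_method.values())
        span = hi - lo
        # Worst method maps to 0.0, best to 1.0; ties everywhere map to 1.0.
        normalized[dataset] = {m: (s - lo) / span if span > 0 else 1.0
                               for m, s in by_method.items()}
    return normalized

def mean_normalized(scores):
    """Average each method's normalized score over all datasets."""
    normalized = normalize_per_dataset(scores)
    methods = next(iter(scores.values())).keys()
    return {m: sum(normalized[d][m] for d in normalized) / len(normalized)
            for m in methods}

# Hypothetical raw ROC AUC values, for illustration only.
raw_auc = {
    "dataset_a": {"model_x": 0.90, "model_y": 0.85, "model_z": 0.80},
    "dataset_b": {"model_x": 0.70, "model_y": 0.75, "model_z": 0.65},
}
summary = mean_normalized(raw_auc)
```

A consequence of this scheme is that a normalized score depends on which other methods are in the comparison, so a gap such as 0.952 versus 0.822 reflects relative standing across datasets rather than a raw difference in ROC AUC.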
Beyond raw metrics, TabPFN exhibited greater consistency across datasets. While traditional models occasionally outperformed it on specific tasks, TabPFN had the highest win-rate overall and was particularly dominant on noisy datasets, those with small sample sizes, and those containing irrelevant features. This stability is vital for practitioners operating in uncertain or variable data conditions.

The model was also tested on modified datasets to assess robustness. Even after injecting outliers or dropping half the samples or features, TabPFN retained performance levels close to or exceeding tuned GBDTs. This level of resilience is rare among neural network models and highlights the strength of TabPFN’s learned inductive bias.
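Perturbations of this kind are easy to reproduce on your own data. The helpers below are a sketch under assumptions: the function names, the `fraction` of corrupted cells, and the `scale` of the outliers are illustrative choices, not the exact protocol used by the authors.

```python
import random

def drop_half_rows(X, y, rng):
    """Keep a random half of the samples."""
    keep = sorted(rng.sample(range(len(X)), len(X) // 2))
    return [X[i] for i in keep], [y[i] for i in keep]

def drop_half_features(X, rng):
    """Keep a random half of the columns."""
    n_features = len(X[0])
    keep = sorted(rng.sample(range(n_features), n_features // 2))
    return [[row[j] for j in keep] for row in X]

def inject_outliers(X, rng, fraction=0.02, scale=100.0):
    """Multiply a random fraction of cells by a factor between 1 and `scale`."""
    noisy = [list(row) for row in X]
    for row in noisy:
        for j in range(len(row)):
            if rng.random() < fraction:
                row[j] *= rng.uniform(1.0, scale)
    return noisy

rng = random.Random(0)
X = [[float(i + j) for j in range(4)] for i in range(10)]
y = [i % 2 for i in range(10)]
X_half, y_half = drop_half_rows(X, y, rng)
X_narrow = drop_half_features(X, rng)
X_noisy = inject_outliers(X, rng, fraction=0.5)
```

Evaluating a model on the original and the perturbed versions of the same dataset gives a direct measure of how much performance each kind of corruption costs.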
Finally, in a head-to-head comparison with AutoGluon[5]—a leading AutoML ensemble framework—TabPFN came out ahead in both speed and performance. AutoGluon required substantial compute time to run multiple models and tune hyperparameters, whereas TabPFN achieved superior accuracy in seconds using a single transformer forward pass.
These results position TabPFN as not only an accurate predictor but also an efficient and generalizable solution that can streamline the entire tabular modeling workflow.
Foundation Model Capabilities
As a foundation model, TabPFN is more than just a predictor. It exhibits multiple capabilities that make it versatile and applicable beyond standard supervised learning tasks. This multifunctionality stems from the model's design and pretraining process, which allow it to handle diverse challenges inherent to tabular data.
Density Estimation: TabPFN can estimate the full probability distribution of the target variable, not just point predictions. This makes it suitable for applications where uncertainty quantification is critical, such as medical diagnosis or financial risk analysis. For example, in regression tasks, it outputs a piecewise constant distribution, capturing both unimodal and multimodal outcomes. This feature enables better decision-making in scenarios with inherent ambiguity or noise.
Data Generation: Trained to understand complex feature-target relationships, TabPFN can also generate synthetic data that reflects the statistical properties of the training data. This is especially useful for data augmentation, bootstrapping, or privacy-preserving analytics. When trained on a real-world dataset, it can sample new rows that are statistically consistent with the originals, aiding model robustness and generalization.
Embeddings for Downstream Tasks: TabPFN produces meaningful internal representations of tabular data. These embeddings can be extracted and reused in downstream tasks such as clustering, visualization, or anomaly detection. In experiments with digit classification datasets, the learned embeddings formed better-separated clusters than the raw features when projected with dimensionality reduction techniques such as PCA [9].
Few-shot Fine-tuning: Unlike most tree-based models, which lack fine-tuning mechanisms, TabPFN supports neural fine-tuning. This is particularly advantageous when adapting the model to related but non-identical datasets. For instance, when exposed to multiple sine wave regression tasks with varying offsets, fine-tuning improved its performance over zero-shot predictions. This capability opens the door to domain adaptation and transfer learning in tabular settings, where labeled data is often scarce.
Modular Integration: Thanks to its transformer-based architecture, TabPFN can be incorporated into larger machine learning pipelines. It can serve as a tabular feature encoder in multi-modal architectures or as a preprocessor in ensemble workflows. This positions it as a flexible component for future AI systems requiring structured data inputs.
These capabilities reflect the broader vision of foundation models: reusable, general-purpose systems that support multiple tasks with minimal supervision. TabPFN’s performance across this spectrum demonstrates the feasibility of bringing the foundation model paradigm to structured data domains, long thought resistant to such approaches.
Practical Strengths and Limitations
A key advantage of TabPFN lies in its combination of robustness, interpretability, and practical usability—an uncommon trio for deep learning models operating on tabular data.
Robustness and Generalization
Neural networks are often sensitive to typical challenges in tabular datasets, such as uninformative features, outliers, missing values, and data heterogeneity. Yet TabPFN shows remarkable resilience in these settings. In controlled tests, the model maintained strong performance even when half of the samples or features were dropped, or when outliers were injected.
This robustness is not accidental—it stems from its pretraining regime on millions of synthetic datasets that deliberately incorporate these challenges. During training, TabPFN sees examples with noise, sparsity, irrelevant features, class imbalance, and heteroscedasticity. As a result, the model generalizes gracefully to real-world datasets, even those with high cardinality categorical features, a mix of datatypes, or small sample sizes.
Unlike conventional models that may overfit to specific signal patterns, TabPFN learns meta-patterns of predictive structure. It doesn't just learn a dataset—it learns how to learn from datasets.
Interpretability via SHAP
Despite being a transformer-based model, TabPFN integrates well with SHAP (SHapley Additive exPlanations) for feature attribution. This is essential for real-world deployments where model transparency is non-negotiable, such as in finance, healthcare, or policy-making.
In comparative analyses, TabPFN delivered both higher predictive accuracy and meaningful feature attributions. While logistic regression offered clean interpretability and CatBoost achieved strong performance with less interpretability, TabPFN effectively bridged the gap—offering near-state-of-the-art accuracy while retaining interpretability at the feature level.
This makes it one of the few deep learning models for tabular data that is both high-performing and trustworthy in high-stakes environments.
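For readers unfamiliar with what SHAP computes, the sketch below calculates exact Shapley values by brute force for a tiny black-box function. It does not use the SHAP library or TabPFN: `toy_model` is a hypothetical predictor, and replacing absent features with a fixed `baseline` is one of several possible conventions, assumed here for simplicity.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, instance, baseline):
    """Exact Shapley values; features outside a coalition take baseline values."""
    n = len(instance)

    def value(coalition):
        mixed = [instance[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of feature i when added to coalition s.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

def toy_model(x):
    # Hypothetical predictor: one additive term and one interaction term.
    return 2.0 * x[0] + x[1] * x[2]

instance = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, instance, baseline)
```

The attributions sum to the difference between the prediction for the instance and the prediction for the baseline, and the interaction term is split evenly between the two features involved. The loop visits every subset of features, so the cost doubles with each added feature, which is why practical tools rely on sampling or model-specific shortcuts.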
Limitations and Considerations
While powerful, TabPFN is not a universal solution. Its current architecture and memory optimizations make it ideal for datasets with up to ~10,000 samples and 500 features. Its performance on significantly larger datasets has not been thoroughly evaluated, and memory use grows linearly with data size, which can become limiting on commodity hardware.
Additionally, while TabPFN is extremely fast in training and tuning, its inference speed for a single prediction is slightly slower than highly optimized GBDT models like CatBoost. This trade-off may matter for real-time or latency-critical applications, though inference can still be GPU-accelerated and benefits from caching mechanisms for repeated queries.
Nonetheless, for most tabular ML workloads (business intelligence, scientific data analysis, healthcare studies, or structured ML competitions), TabPFN offers a highly attractive balance of speed, accuracy, robustness, and usability. It drastically reduces time spent on feature engineering and model tuning, freeing data scientists to focus on problem framing and interpretation rather than hyperparameter search.
Conclusion: A New Paradigm for Tabular Learning
TabPFN represents a significant shift in how we approach tabular modeling. By leveraging synthetic data, causal priors, and in-context learning, it bypasses many limitations of traditional models. Its ability to generalize across tasks, handle messy data, and deliver predictions in seconds positions it as a foundational tool for modern data science.
For practitioners, TabPFN provides an opportunity to rethink the default use of GBDTs. As foundation models continue to reshape the AI landscape, TabPFN stands as a promising example of how these principles can transform even the most conventional corners of machine learning.
References
[1] Hollmann, N., Müller, S., Purucker, L., Krishnakumar, A., Körfer, M., Hoo, S. B., ... & Hutter, F. (2025). Accurate predictions on small data with a tabular foundation model. Nature, 637(8045), 319-326.
[2] Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., & Gulin, A. (2018). CatBoost: Unbiased boosting with categorical features. Advances in Neural Information Processing Systems, 31.
[3] Chen, T., & Guestrin, C. (2016, August). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785-794).
[4] Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., ... & Liu, T. Y. (2017). LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems, 30.
[5] Erickson, N., Mueller, J., Shirkov, A., Zhang, H., Larroy, P., Li, M., & Smola, A. (2020). AutoGluon-Tabular: Robust and accurate AutoML for structured data. arXiv preprint arXiv:2003.06505.
[6] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
[7] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
[8] Measuring Intelligence: Key Benchmarks and Metrics for LLMs, Transcendent AI
[9] Dimensionality Reduction: Linear methods, Transcendent AI