Tabular Benchmark
Dataset Description
This dataset is a curated collection of datasets from OpenML, assembled to benchmark the performance of machine learning algorithms.
- Repository: https://github.com/LeoGrin/tabular-benchmark/community
- Paper: https://hal.archives-ouvertes.fr/hal-03723551v2/document
Dataset Summary
The benchmark is a curated set of tabular learning tasks, including:
- Regression from Numerical and Categorical Features
- Regression from Numerical Features
- Classification from Numerical and Categorical Features
- Classification from Numerical Features
Supported Tasks and Leaderboards
- tabular-regression
- tabular-classification
Dataset Structure
Data Splits
This dataset consists of four splits (folders), one per task, each containing the datasets included in that task.
- reg_num: Task identifier for regression on numerical features.
- reg_cat: Task identifier for regression on numerical and categorical features.
- clf_num: Task identifier for classification on numerical features.
- clf_cat: Task identifier for classification on numerical and categorical features.
Depending on the dataset you want to load, pass task_name/dataset_name to the data_files argument of load_dataset, as below:
from datasets import load_dataset
dataset = load_dataset("inria_soda/tabular-benchmark", data_files="reg_cat/house_sales.csv")
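The task_name/dataset_name pattern can be wrapped in a small helper. The sketch below is illustrative only: the names `data_file_for` and `load_benchmark_table` are hypothetical, and the `.csv` extension is assumed from the example above.

```python
# Task folders listed under "Data Splits".
TASKS = ("reg_num", "reg_cat", "clf_num", "clf_cat")


def data_file_for(task_name: str, dataset_name: str) -> str:
    """Build the data_files value "task_name/dataset_name.csv" (hypothetical helper)."""
    if task_name not in TASKS:
        raise ValueError(f"unknown task {task_name!r}; expected one of {TASKS}")
    return f"{task_name}/{dataset_name}.csv"


def load_benchmark_table(task_name: str, dataset_name: str):
    """Load one table from the Hub; needs the `datasets` package and network access."""
    # Imported lazily so the path helper above also works offline.
    from datasets import load_dataset

    return load_dataset(
        "inria_soda/tabular-benchmark",  # repository id as written in the example above
        data_files=data_file_for(task_name, dataset_name),
    )
```

Calling `load_benchmark_table("reg_cat", "house_sales")` issues the same `load_dataset` call as the example above, while a misspelled task name fails early with a `ValueError` rather than a missing-file error.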
Dataset Creation
Curation Rationale
This dataset is curated to benchmark the performance of tree-based models against neural networks. The paper describes the criteria used to select datasets as follows:
- Heterogeneous columns. Columns should correspond to features of different nature. This excludes images or signal datasets where each column corresponds to the same signal on different sensors.
- Not high dimensional. We only keep datasets with a d/n ratio below 1/10.
- Undocumented datasets. We remove datasets where too little information is available. We did keep datasets with hidden column names if it was clear that the features were heterogeneous.
- I.I.D. data. We remove stream-like datasets or time series.
- Real-world data. We remove artificial datasets but keep some simulated datasets. The difference is subtle, but we try to keep simulated datasets if learning these datasets is of practical importance (like the Higgs dataset), and not just a toy example to test specific model capabilities.
- Not too small. We remove datasets with too few features (< 4) and too few samples (< 3 000). For benchmarks on numerical features only, we remove categorical features before checking if enough features and samples are remaining.
- Not too easy. We remove datasets which are too easy. Specifically, we remove a dataset if a default Logistic Regression (or Linear Regression for regression) reaches a score whose relative difference with the score of both a default Resnet (from Gorishniy et al. [2021]) and a default HistGradientBoosting model (from scikit-learn) is below 5%. Other benchmarks use different metrics to remove too easy datasets, like removing datasets which can be learnt perfectly by a single decision classifier [Bischl et al., 2021], but this does not account for the different Bayes rates of different datasets. As tree-based methods have been shown to be superior to Logistic Regression [Fernández-Delgado et al., 2014] in our setting, a close score for these two types of models indicates that we might already be close to the best achievable score.
- Not deterministic. We remove datasets where the target is a deterministic function of the data. This mostly means removing datasets on games like poker and chess. Indeed, we believe that these datasets are very different from most real-world tabular datasets, and should be studied separately.
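The quantitative criteria above (d/n ratio, minimum size, and the 5% "too easy" rule) can be sketched as a single filter. This is a minimal illustration, not the authors' code: the function name `keep_dataset` is hypothetical, the formula used for "relative difference" is an assumption, scores are assumed positive with higher meaning better, and the size rule is read as requiring both at least 4 features and at least 3 000 samples.

```python
def keep_dataset(
    n_samples: int,
    n_features: int,
    linear_score: float,
    resnet_score: float,
    hgbt_score: float,
) -> bool:
    """Return True if a dataset passes the size and difficulty filters (sketch)."""
    # Not too small: read here as requiring at least 4 features and 3000 samples.
    if n_features < 4 or n_samples < 3000:
        return False
    # Not high dimensional: the d/n ratio must be below 1/10.
    if n_features / n_samples >= 1 / 10:
        return False

    def relative_gap(strong_score: float) -> float:
        # Assumed definition of "relative difference"; the text does not spell it out.
        return abs(strong_score - linear_score) / abs(strong_score)

    # Not too easy: drop when the linear baseline is within 5% of BOTH stronger models.
    too_easy = relative_gap(resnet_score) < 0.05 and relative_gap(hgbt_score) < 0.05
    return not too_easy
```

With made-up scores, a dataset of 38474 samples and 8 features is kept when the linear baseline trails both stronger models clearly (for example 0.70 against 0.80 and 0.85) and dropped when it is within 5% of both (for example 0.84 against 0.85 and 0.86). The remaining criteria (heterogeneous columns, documentation, I.I.D. data, real-world data, non-determinism) are qualitative judgments and are not captured by this sketch.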
Source Data
Numerical Classification
Categorical Classification
dataset_name | n_samples | n_features | original_link | new_link |
---|---|---|---|---|
electricity | 38474 | 8 | https://openml.org/d/151 | https://www.openml.org/d/44156 |
eye_movements | 7608 | 23 | https://openml.org/d/1044 | https://www.openml.org/d/44157 |
covertype | 423680 | 54 | https://openml.org/d/1114 | https://www.openml.org/d/44159 |
rl | 4970 | 12 | https://openml.org/d/1596 | https://www.openml.org/d/44160 |
road-safety | 111762 | 32 | https://openml.org/d/41160 | https://www.openml.org/d/44161 |
compass | 16644 | 17 | https://openml.org/d/42803 | https://www.openml.org/d/44162 |
KDDCup09_upselling | 5128 | 49 | https://www.kaggle.com/datasets/danofer/compass?select=cox-violent-parsed.csv | https://www.openml.org/d/44186 |
Numerical Regression
Categorical Regression
Dataset Curators
Léo Grinsztajn, Edouard Oyallon, Gaël Varoquaux.
Licensing Information
[More Information Needed]
Citation Information
Léo Grinsztajn, Edouard Oyallon, Gaël Varoquaux. Why do tree-based models still outperform deep learning on typical tabular data? NeurIPS 2022 Datasets and Benchmarks Track, Nov 2022, New Orleans, United States. ⟨hal-03723551v2⟩