load_data
- transtab.load_data(dataname, dataset_config=None, encode_cat=False, data_cut=None, seed=123)
Load datasets from the local device or from openml.datasets.
- Parameters
dataname (str or int) – the name or index of the dataset to load from openml, or the path to a local dataset directory. The examples below also pass a list of directories to load several datasets.
dataset_config (dict) – the dataset configuration to use when loading. Note that it overrides the configuration loaded from local files or from openml.datasets.
encode_cat (bool) – whether to encode the categorical/binary columns as discrete indices; keep False for TransTab models.
data_cut (int) – the number of equal partitions to split the raw tables into; None disables partitioning.
seed (int) – the random seed that fixes the train/val/test split.
- Returns
all_list (list or tuple) – the complete dataset, either (x,y) or [(x1,y1),(x2,y2),…].
train_list (list or tuple) – the training dataset, either (x,y) or [(x1,y1),(x2,y2),…].
val_list (list or tuple) – the validation dataset, either (x,y) or [(x1,y1),(x2,y2),…].
test_list (list or tuple) – the test dataset, either (x,y) or [(x1,y1),(x2,y2),…].
cat_col_list (list) – the list of categorical column names.
num_col_list (list) – the list of numerical column names.
bin_col_list (list) – the list of binary column names.
transtab provides a flexible data loading function. It can load any dataset from openml that the openml.datasets API supports.
# specify the dataname
allset, trainset, valset, testset, cat_cols, num_cols, bin_cols \
= transtab.load_data('credit-g')
# or specify the dataset index (in openml)
allset, trainset, valset, testset, cat_cols, num_cols, bin_cols \
= transtab.load_data(31)
It can also be used to load datasets from the local device.
# specify the dataset dir
allset, trainset, valset, testset, cat_cols, num_cols, bin_cols \
= transtab.load_data('./data/credit-g')
Another important feature is that this function can load multiple datasets at once.
# specify the dataset dirs
allset, trainset, valset, testset, cat_cols, num_cols, bin_cols \
= transtab.load_data(['./data/credit-g','./data/credit-approval'])
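Since each returned split is either a single (x,y) pair or a list of such pairs, downstream code may want one uniform shape. The sketch below is a minimal illustration of that idea; as_pairs is a hypothetical helper, not part of transtab, and it assumes that x is never itself a tuple or list (for example, that it is a table object).

```python
from typing import Any, List, Tuple

Pair = Tuple[Any, Any]


def as_pairs(split: Any) -> List[Pair]:
    """Normalize a split to a list of (x, y) pairs (hypothetical helper)."""
    # Assumption: a single dataset is a 2-tuple (x, y) whose first element
    # is a table, not another tuple or list.
    is_single = (
        isinstance(split, tuple)
        and len(split) == 2
        and not isinstance(split[0], (tuple, list))
    )
    if is_single:
        return [split]
    # Otherwise treat it as a sequence of (x, y) pairs.
    return list(split)


# Stand-in values so the sketch runs without transtab installed.
single_split = ("x", "y")
multi_split = [("x1", "y1"), ("x2", "y2")]

for x, y in as_pairs(single_split) + as_pairs(multi_split):
    print(x, y)
```

With this in place, the same loop can iterate over trainset whether one dataset or several were loaded.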
One can also pass dataset_config to the load_data function to manipulate the input table directly.
# customize dataset configuration
dataset_config = {
    'credit-g': {
        'columns': ['a','b','c'],  # specify the new columns for the table; should keep the same dimension as the original table
        'cat': ['a'],  # specify all the categorical columns
        'bin': ['b'],  # specify all the binary columns
        'num': ['c'],  # specify all the numerical columns
    }
}
allset, trainset, valset, testset, cat_cols, num_cols, bin_cols \
= transtab.load_data('credit-g', dataset_config=dataset_config)
However, this operation is not recommended. To avoid errors, it is better to store these configurations locally, following the guidance for a custom dataset.
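If dataset_config is nevertheless passed inline, a quick sanity check before calling load_data may catch the most common mistakes. The sketch below is a hypothetical helper, not part of transtab; it only checks what the example above states (the 'columns' list keeps the original table's dimension) plus two assumptions: that no column belongs to more than one type, and that every typed column appears in 'columns'.

```python
from typing import Dict, List, Optional


def check_table_config(
    table_cfg: Dict[str, List[str]],
    n_original_columns: Optional[int] = None,
) -> List[str]:
    """Return a list of problems found in one table's configuration."""
    problems = []
    columns = table_cfg.get("columns", [])
    typed = (
        table_cfg.get("cat", [])
        + table_cfg.get("bin", [])
        + table_cfg.get("num", [])
    )

    # The new column list should keep the original table's dimension.
    if n_original_columns is not None and len(columns) != n_original_columns:
        problems.append(
            f"'columns' has {len(columns)} names, expected {n_original_columns}"
        )

    # Assumption: a column belongs to at most one of cat/bin/num.
    if len(set(typed)) != len(typed):
        problems.append("a column is listed under more than one type")

    # Assumption: typed columns must be among the declared columns.
    unknown = sorted(set(typed) - set(columns))
    if columns and unknown:
        problems.append(f"typed columns not in 'columns': {unknown}")

    return problems


dataset_config = {
    "credit-g": {
        "columns": ["a", "b", "c"],
        "cat": ["a"],
        "bin": ["b"],
        "num": ["c"],
    }
}

for name, table_cfg in dataset_config.items():
    print(name, check_table_config(table_cfg, n_original_columns=3))
```

An empty list means none of these checks failed; it does not guarantee that load_data will accept the configuration.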