load_data

transtab.load_data(dataname, dataset_config=None, encode_cat=False, data_cut=None, seed=123)

Load datasets from the local device or from openml.datasets.

Parameters
  • dataname (str or int) – the name or index of the dataset to load from openml, or the path to a local dataset directory.

  • dataset_config (dict) – the dataset configuration to apply when loading. Note that this overrides the configuration loaded from local files or from openml.datasets.

  • encode_cat (bool) – whether to encode the categorical/binary columns as discrete indices; keep False for TransTab models.

  • data_cut (int) – the number of equal partitions to split the raw tables into; if None, no partitioning is performed.

  • seed (int) – the random seed used to fix the train/val/test split.
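
To illustrate why a fixed seed matters, here is a minimal sketch of a seeded train/val/test split using only the Python standard library. The function split_indices and its ratios are hypothetical and are not transtab's implementation; the sketch only shows that the same seed reproduces the same split.

```python
import random

def split_indices(n_rows, seed=123, val_ratio=0.1, test_ratio=0.2):
    """Hypothetical helper: shuffle row indices with a fixed seed, then split them."""
    rng = random.Random(seed)  # local generator; the global random state is untouched
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = int(n_rows * test_ratio)
    n_val = int(n_rows * val_ratio)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx

first = split_indices(100, seed=123)
second = split_indices(100, seed=123)
print(first == second)  # same seed, same split
```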

Returns

  • all_list (list or tuple) – the complete dataset, either (x, y) or [(x1, y1), (x2, y2), …].

  • train_list (list or tuple) – the training dataset, either (x, y) or [(x1, y1), (x2, y2), …].

  • val_list (list or tuple) – the validation dataset, either (x, y) or [(x1, y1), (x2, y2), …].

  • test_list (list) – the test dataset, either (x, y) or [(x1, y1), (x2, y2), …].

  • cat_col_list (list) – the list of categorical column names.

  • num_col_list (list) – the list of numerical column names.

  • bin_col_list (list) – the list of binary column names.
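
Because each returned split may be either a single (x, y) pair or a list of such pairs, downstream code may need to handle both shapes. The following is a minimal sketch of a normalizing helper. The name as_pair_list is hypothetical and not part of transtab, and the sketch assumes that a single dataset comes back as a 2-tuple while multiple datasets come back as a list.

```python
def as_pair_list(split):
    """Hypothetical helper: return a list of (x, y) pairs for either return shape."""
    # Assumption: a 2-tuple is one (x, y) pair, not a tuple holding two pairs.
    if isinstance(split, tuple) and len(split) == 2:
        return [split]
    return list(split)  # already a sequence of (x, y) pairs

single = ("x", "y")
multiple = [("x1", "y1"), ("x2", "y2")]
print(as_pair_list(single))
print(as_pair_list(multiple))
```

With this in place, a loop such as `for x, y in as_pair_list(trainset): ...` works whether one dataset or several were loaded.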

transtab provides a flexible data-loading function. It can load any dataset from openml that is supported by the openml.datasets API.

import transtab

# specify the dataname
allset, trainset, valset, testset, cat_cols, num_cols, bin_cols \
    = transtab.load_data('credit-g')

# or specify the dataset index (in openml)
allset, trainset, valset, testset, cat_cols, num_cols, bin_cols \
    = transtab.load_data(31)

It can also be used to load datasets from the local device.

# specify the dataset dir
allset, trainset, valset, testset, cat_cols, num_cols, bin_cols \
    = transtab.load_data('./data/credit-g')

Another important feature is that this function can load multiple datasets at once.

# specify a list of dataset dirs
allset, trainset, valset, testset, cat_cols, num_cols, bin_cols \
    = transtab.load_data(['./data/credit-g','./data/credit-approval'])

One can also pass dataset_config to the load_data function to manipulate the input table directly.

# customize dataset configuration
dataset_config = {
    'credit-g': {
        'columns': ['a', 'b', 'c'],  # new column names for the table; must keep the same dimension as the original table
        'cat': ['a'],  # all the categorical columns
        'bin': ['b'],  # all the binary columns
        'num': ['c'],  # all the numerical columns
    }
}

allset, trainset, valset, testset, cat_cols, num_cols, bin_cols \
    = transtab.load_data('credit-g', dataset_config=dataset_config)

However, this operation is not recommended. To avoid errors, it is better to store these configurations locally, following the guidance for custom datasets.
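
If a dataset_config is passed anyway, one way to catch mistakes early is to validate each entry before calling load_data. The sketch below is a hypothetical check, not part of transtab. It assumes that 'columns' must match the original column count and that 'cat', 'bin', and 'num' should only name columns listed in 'columns', each under a single type.

```python
def check_dataset_config(entry, n_original_columns):
    """Hypothetical validator for one dataset's configuration dict."""
    problems = []
    columns = entry.get('columns')
    # The renamed columns must keep the same dimension as the original table.
    if columns is not None and len(columns) != n_original_columns:
        problems.append(
            f"'columns' has {len(columns)} names, expected {n_original_columns}"
        )
    seen = {}  # column name -> first type it was listed under
    for kind in ('cat', 'bin', 'num'):
        for col in entry.get(kind, []):
            if columns is not None and col not in columns:
                problems.append(f"{kind!r} column {col!r} is not in 'columns'")
            if col in seen:
                problems.append(
                    f"column {col!r} is listed as both {seen[col]!r} and {kind!r}"
                )
            seen.setdefault(col, kind)
    return problems

entry = {'columns': ['a', 'b', 'c'], 'cat': ['a'], 'bin': ['b'], 'num': ['c']}
print(check_dataset_config(entry, n_original_columns=3))
```

An empty list means no problems were found under these assumptions; any returned strings describe what to fix before loading.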