Custom Dataset
Here is the best practice for building your own datasets for transtab.
project
|
├── run_your_model.py
|
└── data
    |
    ├── dataset1
    |   ├── data_processed.csv
    |   ├── binary_feature.txt
    |   └── numerical_feature.txt
    |
    ├── dataset2
    |
    ...
where run_your_model.py is the script in which you load the dataset and train your models.
Put the preprocessed table into data_processed.csv, which should follow these protocols (a preprocessing sketch follows the list):
- All column names should be represented by meaningful natural language.
- All categorical features should be represented by meaningful natural language.
- All binary features should be represented by 0 or 1.
- All numerical features should be represented by continuous values.
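For illustration, a minimal preprocessing sketch with pandas is shown below; the file raw.csv and column names such as annual income, gender, and owns house are hypothetical placeholders for your own data.

import pandas as pd

# hypothetical raw table; replace the path and column names with your own
raw = pd.read_csv('./data/dataset1/raw.csv')

# give every column a meaningful natural-language name
raw = raw.rename(columns={'col0': 'annual income', 'col1': 'gender', 'col2': 'owns house'})

# binary features become 0 or 1
raw['owns house'] = raw['owns house'].map({'yes': 1, 'no': 0})

# numerical features stay as continuous values
raw['annual income'] = raw['annual income'].astype(float)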
- Store the processed table into data_processed.csv.
- Store the binary column names into binary_feature.txt. No need to create this file if there is no binary feature.
- Store the numerical column names into numerical_feature.txt. No need to create this file if there is no numerical feature.
- All the other columns will be treated as categorical or textual.
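Continuing the hypothetical example above, storing the processed table and the feature-name files could look like this sketch:

# save the processed table
raw.to_csv('./data/dataset1/data_processed.csv', index=False)

# one column name per line in each feature file
with open('./data/dataset1/binary_feature.txt', 'w') as f:
    f.write('\n'.join(['owns house']))

with open('./data/dataset1/numerical_feature.txt', 'w') as f:
    f.write('\n'.join(['annual income']))

# the remaining column, gender, will be treated as categorical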
After that, you can load the dataset with transtab.load_data('./data/dataset1').
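A minimal loading sketch is given below; the unpacked return values follow the pattern used in the transtab README and may differ across versions.

import transtab

# load the custom dataset prepared under ./data/dataset1
allset, trainset, valset, testset, cat_cols, num_cols, bin_cols \
    = transtab.load_data('./data/dataset1')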
As for dataset_config, an example is provided below:
EXAMPLE_DATACONFIG = {
    "example": {  # dataset name
        "bin": ["bin1", "bin2"],  # binary column names
        "cat": ["cat1", "cat2"],  # categorical column names
        "num": ["num1", "num2"],  # numerical column names
        "cols": ["bin1", "bin2", "cat1", "cat2", "num1", "num2"],  # all column names
        "binary_indicator": ["1", "yes", "true", "positive", "t", "y"],  # binary indicators in the binary columns, which will be converted to 1
        "data_split_idx": {
            "train": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],  # row indices for the training set
            "val": [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],  # row indices for the validation set
            "test": [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],  # row indices for the test set
        }
    }
}
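The sketch below shows how such a config could be passed when loading data. It assumes load_data accepts a dataset_config keyword argument, so please verify the argument name against the load_data signature in your installed transtab version; the top-level key and the binary indicators here are hypothetical.

import transtab

# hypothetical config for the custom dataset; the top-level key is assumed
# to match the dataset folder name
dataset_config = {
    "dataset1": {
        "binary_indicator": ["1", "yes", "true", "positive", "t", "y"],
    }
}

# assumption: the config is passed through a dataset_config keyword
allset, trainset, valset, testset, cat_cols, num_cols, bin_cols \
    = transtab.load_data('./data/dataset1', dataset_config=dataset_config)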