Custom Dataset

The layout below is the recommended way to organize your own datasets for transtab.

project
|
├── run_your_model.py
|
└─── data
     |
     ├── dataset1
     |   ├── data_processed.csv
     |   ├── binary_feature.txt
     |   └── numerical_feature.txt
     |
     ├── dataset2
     |
    ...

Here, run_your_model.py is the script in which you load the dataset and train your models.

Put the preprocessed table into data_processed.csv. The table should preferably follow these conventions:

  • Represent all column names in meaningful natural language.

  • Represent all categorical features in meaningful natural language.

  • Represent all binary features as 0 or 1.

  • Represent all numerical features as continuous values.

  • Store the processed table in data_processed.csv.

  • Store the binary column names in binary_feature.txt. This file is not needed if there are no binary features.

  • Store the numerical column names in numerical_feature.txt. This file is not needed if there are no numerical features.

  • All other columns are treated as categorical or textual.
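
The conventions above can be sketched as a small script that writes one dataset folder. This is a minimal illustration using only the Python standard library. The helper name write_dataset, the sample rows, and the column names are hypothetical, and writing one column name per line in the two .txt files is an assumption about the expected format rather than something stated above.

```python
import csv
from pathlib import Path


def write_dataset(root, rows, binary_cols, numerical_cols):
    """Write a table and its feature lists into one dataset folder.

    Hypothetical helper: `rows` is a list of dicts sharing the same keys.
    """
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)

    # The processed table goes into data_processed.csv.
    columns = list(rows[0].keys())
    with open(root / "data_processed.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=columns)
        writer.writeheader()
        writer.writerows(rows)

    # Assumption: one column name per line. Skip a file when its list is empty.
    for filename, cols in [("binary_feature.txt", binary_cols),
                           ("numerical_feature.txt", numerical_cols)]:
        if cols:
            (root / filename).write_text("\n".join(cols) + "\n")
    return root


# Illustrative rows: natural-language column names, a 0/1 binary column,
# a continuous numerical column, and a natural-language categorical column.
rows = [
    {"age in years": 42.0, "is smoker": 1, "occupation": "school teacher"},
    {"age in years": 35.5, "is smoker": 0, "occupation": "software engineer"},
]
dataset_dir = write_dataset("./data/dataset1", rows,
                            binary_cols=["is smoker"],
                            numerical_cols=["age in years"])
```

The column "occupation" is listed in neither file, so under the last convention it would be treated as categorical or textual.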

After that, load the dataset with

transtab.load_data('./data/dataset1')
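
Before loading, it can help to confirm that the folder matches the layout described earlier. The sketch below is a hypothetical standard-library check (check_dataset_dir is not part of transtab). It assumes the feature files list one column name per line, and it verifies that every listed column appears in the header of data_processed.csv.

```python
import csv
from pathlib import Path


def check_dataset_dir(path):
    """Return a list of problems found in a dataset folder (empty if none)."""
    path = Path(path)
    table = path / "data_processed.csv"
    if not table.exists():
        return [f"missing {table.name}"]

    # Read only the header row of the processed table.
    with open(table, newline="") as f:
        header = next(csv.reader(f), [])

    problems = []
    # Both feature files are optional, so check them only when present.
    for name in ("binary_feature.txt", "numerical_feature.txt"):
        feature_file = path / name
        if not feature_file.exists():
            continue
        listed = [line.strip() for line in feature_file.read_text().splitlines()]
        for col in listed:
            if col and col not in header:
                problems.append(f"{name}: column '{col}' not in {table.name}")
    return problems
```

Calling check_dataset_dir('./data/dataset1') returns an empty list when the feature files and the table agree.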

For dataset_config, an example is:

EXAMPLE_DATACONFIG = {
    "example": { # dataset name
        "bin": ["bin1", "bin2"], # binary column names
        "cat": ["cat1", "cat2"], # categorical column names
        "num": ["num1", "num2"], # numerical column names
        "cols": ["bin1", "bin2", "cat1", "cat2", "num1", "num2"], # all column names
        "binary_indicator": ["1", "yes", "true", "positive", "t", "y"], # binary indicators in the binary columns, which will be converted to 1
        "data_split_idx": {
            "train":[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], # row indices for training set
            "val":[10, 11, 12, 13, 14, 15, 16, 17, 18, 19], # row indices for validation set
            "test":[20, 21, 22, 23, 24, 25, 26, 27, 28, 29], # row indices for test set
            }
        }
    }
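
The dictionary above can also be assembled programmatically. The following is a minimal sketch, not transtab's own code: make_dataset_config and its parameters are hypothetical names, and the shuffled split with the given fractions is only one reasonable choice. It fills in the same keys as EXAMPLE_DATACONFIG.

```python
import random


def make_dataset_config(name, bin_cols, cat_cols, num_cols, n_rows,
                        val_frac=0.1, test_frac=0.2, seed=0):
    """Build a config dict with the same keys as EXAMPLE_DATACONFIG."""
    # Shuffle row indices reproducibly, then cut them into three disjoint parts.
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(n_rows * test_frac)
    n_val = int(n_rows * val_frac)

    return {
        name: {
            "bin": list(bin_cols),
            "cat": list(cat_cols),
            "num": list(num_cols),
            # "cols" is the concatenation of the three column groups.
            "cols": list(bin_cols) + list(cat_cols) + list(num_cols),
            "binary_indicator": ["1", "yes", "true", "positive", "t", "y"],
            "data_split_idx": {
                "test": sorted(indices[:n_test]),
                "val": sorted(indices[n_test:n_test + n_val]),
                "train": sorted(indices[n_test + n_val:]),
            },
        }
    }


config = make_dataset_config("example", ["bin1", "bin2"], ["cat1", "cat2"],
                             ["num1", "num2"], n_rows=30)
```

Every row index lands in exactly one of the three splits. How the resulting dictionary is handed to transtab is not shown in this section; passing it to load_data as a dataset_config argument is an assumption to verify against the library.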