Encode Tables
transtab is able to take pd.DataFrame as inputs and outputs the encoded sample-level embeddings. The full code is available at Notebook Example.
import transtab
# load a dataset and start vanilla supervised training
allset, trainset, valset, testset, cat_cols, num_cols, bin_cols \
= transtab.load_data('credit-g')
# build transtab classifier model
model, collate_fn = transtab.build_contrastive_learner(cat_cols, num_cols, bin_cols)
# start training
training_arguments = {
'num_epoch':50,
'batch_size':64,
'lr':1e-4,
'eval_metric':'val_loss',
'eval_less_is_better':True,
'output_dir':'./checkpoint'
}
transtab.train(model, trainset, valset, collate_fn=collate_fn, **training_arguments)
Now we have obtained the pretrained model saved in ‘./checkpoint’, we can load the model from this path and use it to encode tables.
# load the pretrained model
enc = transtab.build_encoder(
binary_columns=bin_cols,
checkpoint = './checkpoint'
)
Then we can take the whole pretrained model and output the cls token embedding at the last layer’s outputs
# encode tables to sample-level embeddings
df = trainset[0]
output = enc(df)