build_encoder

transtab.build_encoder(categorical_columns=None, numerical_columns=None, binary_columns=None, hidden_dim=128, num_layer=2, num_attention_head=8, hidden_dropout_prob=0, ffn_dim=256, activation='relu', device='cuda:0', checkpoint=None, **kwargs)

Build a feature encoder that maps input tabular samples to embeddings.

Parameters
  • categorical_columns (list) – a list of categorical feature names.

  • numerical_columns (list) – a list of numerical feature names.

  • binary_columns (list) – a list of binary feature names; accepts binary indicators such as (yes, no), (true, false), and (0, 1).

  • hidden_dim (int) – the dimension of hidden embeddings.

  • num_layer (int) – the number of transformer layers used in the encoder. If set to zero, only the embedding layer is used and token-level embeddings are returned.

  • num_attention_head (int) – the number of heads in the multi-head self-attention layers of the transformer. Ignored if num_layer is zero.

  • hidden_dropout_prob (float) – the dropout ratio in the transformer encoder. Ignored if num_layer is zero.

  • ffn_dim (int) – the dimension of the feed-forward layer in each transformer layer. Ignored if num_layer is zero.

  • activation (str) – the name of the activation function; supported values are "relu", "gelu", "selu", and "leakyrelu". Ignored if num_layer is zero.

  • device (str) – the device, "cpu" or "cuda:0".

  • checkpoint (str) – the directory from which to load a pretrained TransTab model.

The returned feature encoder takes a pd.DataFrame as input and outputs the encoded sample-level embeddings.

# imports needed to run this example
import pandas as pd
import transtab

# build the feature encoder
enc = transtab.build_encoder(categorical_columns=['gender'], numerical_columns=['age'])

# build a table for inputs
df = pd.DataFrame({'age':[1,2], 'gender':['male','female']})

# extract the outputs
outputs = enc(df)

print(outputs.shape)

'''
torch.Size([2, 128])
'''
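
A second, hedged sketch exercising the remaining parameters from the list above: the binary column name 'smoker' and its values are illustrative assumptions (any yes/no, true/false, or 0/1 indicators are accepted), and num_layer=0 restricts the encoder to its embedding layer, so it returns token-level rather than sample-level embeddings.

import pandas as pd
import transtab

# build an embedding-only encoder; 'smoker' is a hypothetical binary column name
enc = transtab.build_encoder(
    categorical_columns=['gender'],
    numerical_columns=['age'],
    binary_columns=['smoker'],   # accepts (yes,no), (true,false), or (0,1) indicators
    hidden_dim=128,
    num_layer=0,                 # skip the transformer layers; embedding layer only
    device='cpu',
)

df = pd.DataFrame({'age': [23, 45], 'gender': ['male', 'female'], 'smoker': ['yes', 'no']})

# with num_layer=0 the encoder returns token-level embeddings
# instead of the sample-level [2, 128] output shown above
outputs = enc(df)

The checkpoint argument works the same way in either case: point it at the directory of a pretrained TransTab model to initialize the encoder from that checkpoint.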