AutoNLPWrap

class lightautoml.transformers.text.AutoNLPWrap(model_name, embedding_model=None, cache_dir='./cache_NLP', bert_model=None, transformer_params=None, subs=None, multigpu=False, random_state=42, train_fasttext=False, fasttext_params=None, fasttext_epochs=2, sent_scaler=None, verbose=False, device='0', **kwargs)

Bases: lightautoml.transformers.base.LAMLTransformer

Calculate text embeddings.

property features

Features list.

Return type

List[str]

__init__(model_name, embedding_model=None, cache_dir='./cache_NLP', bert_model=None, transformer_params=None, subs=None, multigpu=False, random_state=42, train_fasttext=False, fasttext_params=None, fasttext_epochs=2, sent_scaler=None, verbose=False, device='0', **kwargs)
Parameters
  • model_name (str) – Method for aggregating word embeddings into a sentence embedding.

  • transformer_params (Optional[Dict]) – Aggregating model parameters.

  • embedding_model (Optional[str]) – Word-level embedding model with a dict interface, or a path to a gensim fasttext model.

  • cache_dir (str) – Directory for caching transformed datasets. If None, transformed datasets are not cached.

  • bert_model (Optional[str]) – Name of HuggingFace transformer model.

  • subs (Optional[int]) – Size of the subsample used to calculate frequencies. If None, the full data is used.

  • multigpu (bool) – Use Data Parallel.

  • random_state (int) – Random state to take subsample.

  • train_fasttext (bool) – Whether to train a fasttext model.

  • fasttext_params (Optional[Dict]) – Fasttext init params.

  • fasttext_epochs (int) – Number of epochs to train fasttext.

  • verbose (bool) – Verbosity.

  • device (Any) – Torch device or str.

  • **kwargs – Unused params.
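
To make the roles of embedding_model and model_name more concrete, the following is a minimal sketch of one possible aggregation, plain mean pooling, over a word-level embedding with a dict interface. It is not the library's implementation: the toy vocabulary, the vector values, and the helper mean_pool_sentence are assumptions made purely for illustration.

```python
from typing import Dict, List

# Hypothetical word-level embedding with a dict interface: token -> vector.
# The tokens and values are made up for illustration.
toy_embedding: Dict[str, List[float]] = {
    "good": [1.0, 0.0],
    "movie": [0.0, 1.0],
    "bad": [-1.0, 0.0],
}


def mean_pool_sentence(
    tokens: List[str], embedding: Dict[str, List[float]], dim: int
) -> List[float]:
    """Average the vectors of known tokens; return zeros if none are known."""
    vectors = [embedding[t] for t in tokens if t in embedding]
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# "unknown" is not in the vocabulary, so it is skipped.
sentence_vector = mean_pool_sentence(["good", "movie", "unknown"], toy_embedding, dim=2)
print(sentence_vector)
```

The aggregation methods actually selected by model_name may be considerably more involved than this average; the sketch only shows how a dict-style lookup and a pooling step fit together.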

fit(dataset)

Fit chosen transformer and create feature names.

Parameters

dataset (Union[NumpyDataset, PandasDataset]) – Pandas or Numpy dataset of text features.

transform(dataset)

Transform tokenized dataset to text embeddings.

Parameters

dataset (Union[NumpyDataset, PandasDataset]) – Pandas or Numpy dataset of text features.

Return type

Union[NumpyDataset, PandasDataset]

Returns

Numpy dataset with text embeddings.
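
The fit/transform contract described above (fit creates feature names, transform returns one embedding per row) can be illustrated with a toy stand-in. This is a sketch under stated assumptions, not AutoNLPWrap itself: the class ToyTextEmbedder, the feature naming scheme, the whitespace tokenization, and the plain dict-of-lists input are all hypothetical and chosen only to keep the example self-contained.

```python
from typing import Dict, List


class ToyTextEmbedder:
    """Hypothetical stand-in that mimics a fit/transform contract."""

    def __init__(self, embedding: Dict[str, List[float]], dim: int) -> None:
        self.embedding = embedding
        self.dim = dim
        self.features: List[str] = []

    def fit(self, columns: Dict[str, List[str]]) -> "ToyTextEmbedder":
        # One feature name per embedding dimension and text column
        # (the naming scheme here is an assumption).
        self.features = [
            f"emb_{i}__{name}" for name in columns for i in range(self.dim)
        ]
        return self

    def transform(self, columns: Dict[str, List[str]]) -> List[List[float]]:
        # Concatenate the embeddings of every text column, row by row.
        n_rows = len(next(iter(columns.values())))
        rows: List[List[float]] = [[] for _ in range(n_rows)]
        for texts in columns.values():
            for row, text in zip(rows, texts):
                row.extend(self._embed(text))
        return rows

    def _embed(self, text: str) -> List[float]:
        # Mean of the known token vectors; zeros if no token is known.
        vectors = [
            self.embedding[t] for t in text.lower().split() if t in self.embedding
        ]
        if not vectors:
            return [0.0] * self.dim
        return [sum(v[i] for v in vectors) / len(vectors) for i in range(self.dim)]


columns = {"review": ["good movie", "bad movie"]}
toy_vectors = {"good": [1.0, 0.0], "movie": [0.0, 1.0], "bad": [-1.0, 0.0]}

embedder = ToyTextEmbedder(toy_vectors, dim=2).fit(columns)
embedded = embedder.transform(columns)
print(embedder.features)
print(embedded)
```

The point of the sketch is the shape of the result: the number of feature names created in fit matches the width of every row produced by transform.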