lightautoml.text

Provides an internal interface for working with text features.

Sentence Embedders

DLTransformer

Deep-learning-based sentence embeddings.

BOREP

Class to compute Bag of Random Embedding Projections (BOREP) sentence embeddings from word embeddings.
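
The idea behind a bag of random embedding projections can be sketched in plain Python: each word vector is multiplied by a fixed, untrained random matrix, and the projected vectors are pooled over the sentence. This is an illustrative sketch, not LightAutoML's implementation; the helper name `borep_embed`, its parameters, the uniform initialisation range, and the pooling options are all assumptions.

```python
import random


def borep_embed(word_vectors, proj_dim=4, pooling="mean", seed=42):
    """Project each word vector with a fixed random matrix, then pool over words.

    Hypothetical helper for illustration; not part of lightautoml.text.
    """
    if not word_vectors:
        # An empty sentence maps to a zero vector in this sketch.
        return [0.0] * proj_dim

    in_dim = len(word_vectors[0])
    rng = random.Random(seed)  # fixed seed, so the projection is reproducible
    bound = 1.0 / in_dim ** 0.5
    # Random projection matrix of shape (proj_dim, in_dim); it is never trained.
    weights = [
        [rng.uniform(-bound, bound) for _ in range(in_dim)]
        for _ in range(proj_dim)
    ]

    # Project every word vector into the proj_dim-dimensional space.
    projected = [
        [sum(w * x for w, x in zip(row, vec)) for row in weights]
        for vec in word_vectors
    ]

    # Pool across words, one output dimension at a time.
    columns = list(zip(*projected))
    if pooling == "max":
        return [max(col) for col in columns]
    return [sum(col) / len(col) for col in columns]


# Two words, each with a 3-dimensional embedding.
sentence = [[0.1, 0.3, -0.2], [0.5, -0.1, 0.0]]
embedding = borep_embed(sentence, proj_dim=4)
```

Because the projection matrix depends only on the seed, the same sentence always yields the same embedding.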

RandomLSTM

Class to compute Random LSTM sentence embeddings from word embeddings.

BertEmbedder

Class to compute word or sentence embeddings with HuggingFace transformers.

WeightedAverageTransformer

Weighted average of word embeddings.
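
A weighted average of word embeddings can be sketched as follows. This is a generic illustration, not the class's actual code: the function name `weighted_average`, the dictionary-based lookup, and the rule that out-of-vocabulary tokens are skipped are assumptions. The per-token weights are supplied by the caller (for example, IDF-style weights).

```python
def weighted_average(tokens, embeddings, weights, dim):
    """Average the vectors of known tokens, weighting each by weights[token].

    Hypothetical helper for illustration; tokens missing from `embeddings`
    are skipped, and tokens missing from `weights` get weight 1.0.
    """
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        vec = embeddings.get(tok)
        if vec is None:
            continue  # out-of-vocabulary token
        w = weights.get(tok, 1.0)
        weight_sum += w
        for i, x in enumerate(vec):
            total[i] += w * x
    if weight_sum == 0.0:
        return total  # no known tokens: return the zero vector
    return [t / weight_sum for t in total]


embeddings = {"cat": [1.0, 0.0], "sat": [0.0, 1.0]}
weights = {"cat": 3.0, "sat": 1.0}
sentence_vec = weighted_average(["cat", "sat", "mat"], embeddings, weights, dim=2)
# sentence_vec == [0.75, 0.25]; "mat" is ignored because it has no embedding
```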

Torch Datasets for Text

BertDataset

Dataset class with transformers tokenization.

EmbedDataset

Dataset class for extracting word embeddings.

Tokenizers

BaseTokenizer

Base class for tokenizers.

SimpleRuTokenizer

Russian tokenizer.

SimpleEnTokenizer

English tokenizer.
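
What a simple language-specific tokenizer typically does can be sketched with a regular expression. The function `simple_tokenize`, its character classes, and its stopword handling are assumptions made for illustration; the classes listed above may apply different or additional steps.

```python
import re

# Runs of Latin letters, Cyrillic letters (including "ё"), or digits.
TOKEN_RE = re.compile(r"[a-zа-яё0-9]+")


def simple_tokenize(text, stopwords=frozenset()):
    """Lowercase the text, split it into word tokens, and drop stopwords.

    Hypothetical helper for illustration; not part of lightautoml.text.
    """
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in stopwords]


en_tokens = simple_tokenize("The quick, brown fox!", stopwords={"the"})
# en_tokens == ["quick", "brown", "fox"]

ru_tokens = simple_tokenize("Привет, мир!")
# ru_tokens == ["привет", "мир"]
```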

Utils

seed_everything

Set the random seed and cuDNN parameters.
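
A seeding helper of this kind usually touches every random number generator in use. The sketch below shows one common pattern; it is not the library's function, and the name `seed_everything_sketch` and the `deterministic` flag are assumptions. NumPy and PyTorch are seeded only if they are installed.

```python
import os
import random


def seed_everything_sketch(seed=42, deterministic=True):
    """Seed the common RNGs; hypothetical helper for illustration."""
    random.seed(seed)
    # Only affects Python processes started after this point.
    os.environ["PYTHONHASHSEED"] = str(seed)

    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed

    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # safe to call without CUDA
        if deterministic:
            # Trade some speed for reproducible cuDNN kernels.
            torch.backends.cudnn.deterministic = True
            torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # PyTorch not installed


seed_everything_sketch(123)
first = random.random()
seed_everything_sketch(123)
second = random.random()
# first == second: reseeding restarts the same pseudo-random sequence
```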

parse_devices

Parse devices and convert the first one to a torch device.

custom_collate

Puts each data field into a tensor with outer dimension batch size.
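
The collation step can be illustrated without PyTorch: a list of per-sample dictionaries becomes one dictionary whose values have the batch as their outer dimension. The sketch below uses plain lists where the real function would produce tensors; the name `collate_fields` and the sample field names are assumptions.

```python
def collate_fields(batch):
    """Turn a list of per-sample dicts into one dict of per-field lists.

    Hypothetical helper for illustration; lists stand in for tensors.
    """
    if not batch:
        return {}
    keys = batch[0].keys()
    # The outer list of every field has length len(batch).
    return {key: [sample[key] for sample in batch] for key in keys}


batch = [
    {"input_ids": [1, 2], "label": 0},
    {"input_ids": [3, 4], "label": 1},
]
collated = collate_fields(batch)
# collated == {"input_ids": [[1, 2], [3, 4]], "label": [0, 1]}
```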

single_text_hash

Get text hash.

get_textarr_hash

Get the hash of an array of texts.
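
Hashing a single text and an array of texts can be sketched with the standard library. This sketch uses SHA-256 from `hashlib`, which is a stand-in chosen for illustration and may differ from the hash function the library uses; the names `text_hash` and `text_array_hash` are hypothetical.

```python
import hashlib


def text_hash(text):
    """Return a hex digest for one string (SHA-256 in this sketch)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def text_array_hash(texts):
    """Combine per-text digests, in order, into one digest for the array."""
    digest = hashlib.sha256()
    for text in texts:
        # Hashing fixed-length per-text digests keeps ["ab", "c"] and
        # ["a", "bc"] from colliding through plain concatenation.
        digest.update(text_hash(text).encode("ascii"))
    return digest.hexdigest()


h1 = text_array_hash(["first text", "second text"])
h2 = text_array_hash(["second text", "first text"])
# h1 != h2: the array hash depends on the order of the texts
```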