SimpleRuTokenizer

class lightautoml.text.tokenizer.SimpleRuTokenizer(n_jobs=4, to_string=True, stopwords=False, is_stemmer=True, **kwargs)[source]

Bases: lightautoml.text.tokenizer.BaseTokenizer

Russian tokenizer.

__init__(n_jobs=4, to_string=True, stopwords=False, is_stemmer=True, **kwargs)[source]

Tokenizer for Russian language.

Includes numeric, punctuation, and short-word filtering. By default, applies a stemmer and lowercases the input. A construction sketch follows the parameter list below.

Parameters
  • n_jobs (int) – Number of threads for multiprocessing.

  • to_string (bool) – If True, return the processed text as a single string; otherwise return a list of tokens.

  • stopwords (Union[bool, Sequence[str], None]) – Whether to filter stopwords; a custom collection of stopwords may be passed instead of a flag.

  • is_stemmer (bool) – Whether to apply stemming.
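
A minimal construction sketch (the parameter values below are illustrative, not required defaults):

>>> from lightautoml.text.tokenizer import SimpleRuTokenizer
>>> # single-threaded, list-of-tokens output, no stopword filtering, stemming on
>>> tok = SimpleRuTokenizer(n_jobs=1, to_string=False, stopwords=False, is_stemmer=True)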

preprocess_sentence(snt)[source]

Preprocess sentence string (lowercase, etc.).

Parameters

snt (str) – Sentence string.

Return type

str

Returns

Resulting string.
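
An illustrative call; the result in the comment assumes preprocessing amounts to lowercasing here, though the implementation may normalize more than that:

>>> tok.preprocess_sentence("Привет, Мир!")  # assumed output: 'привет, мир!'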

tokenize_sentence(snt)[source]

Convert sentence string to a list of tokens.

Parameters

snt (str) – Sentence string.

Return type

List[str]

Returns

Resulting list of tokens.
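
An illustrative call; exact token boundaries depend on the tokenizer's splitting rules, so the commented result is an assumption:

>>> tok.tokenize_sentence("привет мир")  # assumed output: ['привет', 'мир']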

filter_tokens(snt)[source]

Clean list of sentence tokens.

Parameters

snt (List[str]) – List of tokens.

Return type

List[str]

Returns

Resulting list of filtered tokens.
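
An illustrative call. Per the class description, numeric tokens, punctuation, and short words are filtered; the exact thresholds are implementation details, so the commented result is an assumption:

>>> tok.filter_tokens(['привет', ',', '123', 'мир'])  # assumed output: ['привет', 'мир']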

postprocess_tokens(snt)[source]

Additional processing steps: lemmatization, POS tagging, etc.

Parameters

snt (List[str]) – List of tokens.

Return type

List[str]

Returns

Resulting list of processed tokens.
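
An illustrative call; with is_stemmer=True this step should apply stemming, but the exact stemmed forms below are assumptions:

>>> tok.postprocess_tokens(['хорошего', 'настроения'])  # assumed stems, e.g. ['хорош', 'настроен']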

postprocess_sentence(snt)[source]

Postprocess sentence string (merge words).

Parameters

snt (str) – Sentence string.

Return type

str

Returns

Resulting string.
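
Taken together, the documented methods compose into the per-sentence pipeline. A hedged sketch of that chaining (the order is inferred from the method names and signatures above; to_string controls whether the final merge to a string happens):

>>> snt = "Привет, большой Мир!"
>>> s = tok.preprocess_sentence(snt)
>>> tokens = tok.postprocess_tokens(tok.filter_tokens(tok.tokenize_sentence(s)))
>>> tok.postprocess_sentence(" ".join(tokens))  # final merged string when to_string=True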