TabularNLPAutoML

class lightautoml.automl.presets.text_presets.TabularNLPAutoML(task, timeout=3600, memory_limit=16, cpu_limit=4, gpu_ids='all', debug=False, timing_params=None, config_path=None, general_params=None, reader_params=None, read_csv_params=None, nested_cv_params=None, tuning_params=None, selection_params=None, nn_params=None, lgb_params=None, cb_params=None, rf_params=None, linear_l2_params=None, nn_pipeline_params=None, gbm_pipeline_params=None, linear_pipeline_params=None, text_params=None, tfidf_params=None, autonlp_params=None)

Bases: TabularAutoML

Classic preset - works with tabular and text data.

Supported data roles - numbers, dates, categories, text.

Limitations - no memory management.

GPU is supported for catboost/lightgbm (if installed with GPU support) and for training NN models.

Most _params kwargs (e.g. timing_params) are usually set via a config file (the config_path argument). If you need to change only a few params, you can pass them as a dict of dicts, like JSON. For the available params and their descriptions, see the default config template. To generate a config template, call TabularNLPAutoML.get_config('config_path.yml').
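As a rough illustration of the dict-of-dicts style, the sketch below builds a couple of partial overrides and passes them to the constructor. The specific keys (use_algos, lang) and their values are assumptions that should be checked against the generated config template, and the helper names are hypothetical. The import is deferred so that the override dict can be inspected without LightAutoML installed.

```python
def build_overrides():
    """Return partial overrides; anything omitted keeps its config default."""
    return {
        # Assumed key: which algorithms to train, one inner list per level.
        "general_params": {"use_algos": [["linear_l2", "lgb"]]},
        # Assumed key: language of the text columns.
        "text_params": {"lang": "en"},
    }


def make_automl(task_name="binary", timeout=600):
    """Create the preset with the overrides above (requires lightautoml)."""
    from lightautoml.automl.presets.text_presets import TabularNLPAutoML
    from lightautoml.tasks import Task

    return TabularNLPAutoML(
        task=Task(task_name),
        timeout=timeout,
        **build_overrides(),
    )
```

Each top-level key matches one of the _params arguments, and each inner dict lists only the settings to change.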

Parameters:

task (Task) – Task to solve.
timeout (int) – Timeout in seconds.
memory_limit (int) – Memory limit that is passed to each automl.
cpu_limit (int) – CPU limit that is passed to each automl.
gpu_ids (Optional[str]) – GPU IDs that are passed to each automl.
debug (bool) – Whether or not to catch running model exceptions.
timing_params (Optional[dict]) – Timing param dict.
config_path (Optional[str]) – Path to config file.
general_params (Optional[dict]) – General param dict.
reader_params (Optional[dict]) – Reader param dict.
read_csv_params (Optional[dict]) – Params to pass to pandas.read_csv (when training/predicting from a file).
nested_cv_params (Optional[dict]) – Param dict for nested cross-validation.
tuning_params (Optional[dict]) – Params of Optuna tuner.
selection_params (Optional[dict]) – Params of feature selection.
nn_params (Optional[dict]) – Params of neural network model.
lgb_params (Optional[dict]) – Params of lightgbm model.
cb_params (Optional[dict]) – Params of catboost model.
rf_params (Optional[dict]) – Params of random forest model.
linear_l2_params (Optional[dict]) – Params of linear model.
nn_pipeline_params (Optional[dict]) – Params of feature generation for neural network models.
gbm_pipeline_params (Optional[dict]) – Params of feature generation for boosting models.
linear_pipeline_params (Optional[dict]) – Params of feature generation for linear models.
text_params (Optional[dict]) – General params of text features.
tfidf_params (Optional[dict]) – Params of tfidf features.
autonlp_params (Optional[dict]) – Params of text embeddings features.

create_automl(**fit_args)

Create basic automl instance.

Parameters:

**fit_args – Contains all information needed for creating automl.

predict(data, features_names=None, batch_size=None, n_jobs=1)

Get dataset with predictions.

Almost the same as lightautoml.automl.base.AutoML.predict on a new dataset, with additional features.

Additional features - support for different data formats. Currently supported:

Path to .csv, .parquet, .feather files.

ndarray, or dict of ndarray. For example, {'data': X...}. In this case roles are optional, but train_features and valid_features are required.

pandas.DataFrame.

Parallel inference - you can pass n_jobs to speed up prediction (requires more RAM).

Batch inference - you can pass batch_size to decrease RAM usage (may take longer).
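The trade-off between the two options can be sketched as a pair of small helpers that forward them to predict. The helper names and default values are hypothetical; only the data, batch_size and n_jobs arguments come from the signature above.

```python
def predict_low_memory(automl, data, batch_size=1000):
    """Favor lower peak RAM: process rows in batches with a single job."""
    return automl.predict(data, batch_size=batch_size, n_jobs=1)


def predict_fast(automl, data, n_jobs=4):
    """Favor speed: run inference in parallel jobs at the cost of more RAM."""
    return automl.predict(data, n_jobs=n_jobs)
```

Here automl is assumed to be a fitted TabularNLPAutoML instance and data is any of the supported formats, for example a path to a .csv file.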

Parameters:

data (Union[str, ndarray, DataFrame, Dict[str, ndarray], Batch]) – Dataset to perform inference.
features_names (Optional[Sequence[str]]) – Optional feature names, used if they cannot be inferred from train_data.
batch_size (Optional[int]) – Batch size or None.
n_jobs (Optional[int]) – Number of jobs.

Return type:

NumpyDataset

Returns:

Dataset with predictions.