
Introduction

With the full modeling approach, developers implement their model training routines in a Python script and directly use Python functions to define and use PyTorch datasets, TensorFlow datasets, and Dask dataframes (for scikit-learn and XGBoost modeling) based on Lucd virtual datasets (defined in the Unity client). Developers must also call functions to upload trained models and metadata (e.g., model training performance metrics) to the Lucd backend. The advantage of the full modeling approach is that developers are free to carry over modeling and customized and/or experimental performance analysis techniques from previously written code. Full model examples are contained in The Lucd Model Shop.

Full Model Format

Full models are implemented as Python scripts. Rather than a main function, the script's entrypoint must be a function named start. The arguments passed to start are described in the sections below.

Note that in full model scripts, except blocks (for handling exceptions) MUST end with a raise statement rather than another terminating statement such as return. This ensures that the model's status is accurately captured in the Unity client.
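The entrypoint and exception-handling requirements above can be sketched as a minimal script skeleton. This is a hedged illustration, not a complete training script: everything inside the try block is a placeholder, and only the start(args) signature and the re-raise requirement come from this document.

```python
def start(args):
    """Full model entrypoint called by the Lucd backend (sketch)."""
    try:
        # Placeholder: define datasets from the virtual dataset, train the
        # model, and upload the trained model and metrics to the Lucd backend.
        model_id = args['model']
        return model_id
    except Exception:
        # Per the requirement above: except blocks must end with `raise`
        # (not `return`) so the Unity client captures the failure status.
        raise
```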

TensorFlow, PyTorch, and scikit-learn

Table 1 describes the Python arguments (defined in the Unity client when starting model training) which are always passed to the start function for TensorFlow, PyTorch, and scikit-learn models.

Argument (type): Description

args['model'] (string): Model ID, used for storing checkpoints and models to the Lucd backend
args['train_id'] (string): Model "training" ID, used for storing the trained model asset to the Lucd backend
args['vds'] (string): Lucd virtual dataset ID, used for retrieving training/validation/testing data for model training
args['asset'] (string): Asset (word embedding) ID, used for retrieving word embeddings for text classification model training
args['parameters']['steps'] (int): Number of steps for model training
args['parameters']['lr'] (float): Learning rate for model training
args['parameters']['regularization_value'] (float): Regularization value for model training
args['parameters']['eval_percent'] (float): Percentage of the virtual dataset to use for validation
args['parameters']['test_percent'] (float): Percentage of the virtual dataset to use for testing
args['parameters']['classification_mode'] (string): Type of classification (binary, multiclass, tf_premade_multiclass) as selected in the GUI (only applies to classification models)
args['parameters']['prediction_threshold'] (float): For binary classification models, minimum threshold for designating a positive decision
args['parameters']['max_document_length'] (int): Maximum number of tokens to be used for free text input into the model for training (for text classification)
args['exportdir'] (string): Directory used for storing the trained model (for upload purposes)
args['graphversion'] (string): Version of the graph being trained

Table 1. Full model Python script arguments for TensorFlow, PyTorch, and scikit-learn models.

XGBoost Dask

Table 2 describes the Python arguments passed to the start function for XGBoost Dask models.

Argument (type): Description

args['parameters']['lr'] (float): Learning rate for model training
args['parameters']['steps'] (int): Number of steps for model training
args['parameters']['classification_mode'] (string): Type of classification (binary, multiclass, tf_premade_multiclass) as selected in the GUI (only applies to classification models)
args['parameters']['eval_percent'] (float): Percentage of the virtual dataset to use for validation
args['parameters']['test_percent'] (float): Percentage of the virtual dataset to use for testing
args['parameters']['max_document_length'] (int): Maximum number of tokens to be used for free text input into the model for training (for text classification)
args['parameters']['prediction_threshold'] (float): For binary classification models, minimum threshold for designating a positive decision
args['booster'] (string): XGBoost booster type
args['algorithm'] (string): One of "booster", "xgbclassifier", "xgbregressor"
args['objective'] (string): Learning task and the corresponding learning objective
args['base_score'] (float): The initial prediction score of all instances (global bias)
args['eval_metric'] (string): Evaluation metric for validation data; a default metric is assigned according to 'objective'
args['seed'] (int): Random number seed

Table 2. Full model Python script arguments for XGBoost Dask models.
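Several of the Table 2 arguments map naturally onto an XGBoost parameter dictionary. The sketch below shows one way this mapping might look; it is an assumption, not the Lucd implementation. In particular, mapping args['parameters']['lr'] to XGBoost's 'eta' parameter is the standard XGBoost name for learning rate but is not stated in this document.

```python
def build_xgb_params(args):
    """Assemble an XGBoost parameter dict from Table 2 arguments (sketch)."""
    return {
        'booster': args['booster'],               # XGBoost booster type
        'objective': args['objective'],           # learning task/objective
        'base_score': float(args['base_score']),  # global bias
        'eval_metric': args['eval_metric'],       # validation metric
        'seed': int(args['seed']),                # random number seed
        'eta': float(args['parameters']['lr']),   # assumed: lr -> XGBoost 'eta'
    }
```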