Lucd Modeling Framework | User Guide | 6.3.0 RC1

Introduction

With the full modeling approach, developers implement their model training routines in a Python script and directly call Python functions to define and use PyTorch datasets, TensorFlow datasets, and Dask dataframes (for XGBoost modeling) based on Lucd virtual datasets (defined in the Unity client). A developer must also call functions that upload trained models and metadata (e.g., model training performance metrics) to the Lucd backend. The advantage of the full model approach is that developers are free to carry over modeling code and customized or experimental performance analysis techniques from previously written code. Full model examples are available in the Lucd Model Shop.

Full Model Format

Full models are implemented as Python scripts. Instead of a main function, the script’s entrypoint function must be named start. The arguments passed to start are described in the sections below.

Note that in full model scripts, every except block (for handling exceptions) MUST end with a raise statement rather than another terminating statement such as return. This ensures that the status of the model is accurately reported in the Unity client.
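The two rules above (an entrypoint named start, and except blocks that end with raise) can be sketched as follows. This is a minimal illustration, not the framework's reference implementation: train_model is a hypothetical placeholder for the actual training routine, and the logging setup is an assumption.

```python
import logging

logger = logging.getLogger(__name__)


def train_model(args):
    """Hypothetical placeholder for the real training routine."""
    if "vds" not in args:
        raise KeyError("args['vds'] is required")
    # Real code would build datasets, train, and upload the model here.


def start(args):
    """Entrypoint: the function must be named start, not main."""
    try:
        train_model(args)
    except Exception:
        logger.exception("Model training failed")
        # End the except block with raise (not return) so the failure
        # propagates to the caller instead of being swallowed.
        raise
```

Ending the handler with a bare raise re-raises the original exception with its traceback intact, so any logging or cleanup done in the handler does not hide the failure.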

TensorFlow and PyTorch

Table 1 describes the Python arguments (defined in the Unity client when starting model training) that are always passed to the start function for TensorFlow and PyTorch models.

Argument Description
args['model'] (string) Model ID, used for storing checkpoints and models to Lucd backend
args['train_id'] (string) Model “training” ID, to be used for storing trained model asset to Lucd backend
args['vds'] (string) Lucd virtual dataset ID, used for retrieving training/validation/testing data for model training
args['asset'] (string) Asset (word embedding) ID, used for retrieving word embeddings for text classification model training
args['parameters']['steps'] (int) Number of steps for model training
args['parameters']['lr'] (float) Learning rate for model training
args['parameters']['regularization_value'] (float) Regularization value for model training
args['parameters']['eval_percent'] (float) Percentage of the virtual dataset to use for validation
args['parameters']['test_percent'] (float) Percentage of the virtual dataset to use for testing
args['parameters']['classification_mode'] (string) Type of classification (binary, multiclass, tf_premade_multiclass) as selected in the GUI (only applies to classification models)
args['parameters']['prediction_threshold'] (float) For binary classification models, minimum threshold for designating a positive decision
args['parameters']['max_document_length'] (int) Maximum number of tokens to be used for free text input into the model for training (for text classification).
args['exportdir'] (string) Directory used for storing trained model (for upload purposes)
args['graphversion'] (string) Version of the graph being trained

Table 1. Full model Python script arguments for TensorFlow and PyTorch models.
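As an illustration of how a script might read the Table 1 arguments, the sketch below copies a few of them into a typed container. The key names come from Table 1; the TrainingConfig class, the parse_args helper, and all values in example_args are hypothetical and used only for demonstration.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class TrainingConfig:
    """Hypothetical container for a subset of the Table 1 arguments."""
    model_id: str
    train_id: str
    vds_id: str
    export_dir: str
    steps: int
    learning_rate: float
    eval_percent: float
    test_percent: float
    prediction_threshold: Optional[float]


def parse_args(args: Dict[str, Any]) -> TrainingConfig:
    """Copy selected Table 1 values out of the args dictionary."""
    params = args["parameters"]
    # Read defensively: the threshold only matters for binary classification.
    threshold = params.get("prediction_threshold")
    return TrainingConfig(
        model_id=args["model"],
        train_id=args["train_id"],
        vds_id=args["vds"],
        export_dir=args["exportdir"],
        steps=int(params["steps"]),
        learning_rate=float(params["lr"]),
        eval_percent=float(params["eval_percent"]),
        test_percent=float(params["test_percent"]),
        prediction_threshold=None if threshold is None else float(threshold),
    )


# Placeholder values for illustration only; real values come from the Unity client.
example_args = {
    "model": "example-model-id",
    "train_id": "example-train-id",
    "vds": "example-vds-id",
    "exportdir": "/tmp/example-export",
    "parameters": {
        "steps": 100,
        "lr": 0.001,
        "eval_percent": 0.2,
        "test_percent": 0.1,
        "prediction_threshold": 0.5,
    },
}
config = parse_args(example_args)
```

Collecting the values once at the top of start keeps the rest of the training code independent of the dictionary layout.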

Dask XGBoost

Table 2 describes the Python arguments passed to the start function for Dask XGBoost models.

Argument Description
args['booster'] (string) XGBoost booster type
args['objective'] (string) Learning task and the corresponding learning objective
args['base_score'] (float) The initial prediction score of all instances, global bias
args['eval_metric'] (string) Evaluation metrics for validation data; a default metric is assigned according to ‘objective’
args['seed'] (int) Random number seed
args['eta'] (int) Step size shrinkage used in updates to prevent overfitting
args['gamma'] (float) Minimum loss reduction required to make a further partition on a leaf node of the tree
args['max_depth'] (int) Maximum depth of a tree. Increasing this value will make the model more complex and more likely to overfit; beware that XGBoost aggressively consumes memory when training a deep tree
args['min_child_weight'] (float) Minimum sum of instance weight (Hessian) needed in a child
args['max_delta_step'] (int) Maximum delta step we allow each tree’s weight estimation to be
args['subsample'] (float) Subsample ratio of the training instance
args['colsample_bytree'] (float) Subsample ratio of columns when constructing each tree
args['colsample_bylevel'] (float) Subsample ratio of columns for each level
args['colsample_bynode'] (float) Subsample ratio of columns for each split
args['xgboost_lambda'] (float) L2 regularization term on weights; increasing this value will make model more conservative
args['alpha'] (float) L1 regularization term on weights; increasing this value will make model more conservative
args['tree_method'] (string) The tree construction algorithm used in XGBoost
args['scale_pos_weight'] (float) Balancing of positive and negative weights
args['refresh_leaf'] (int) This is a parameter of the refresh updater plugin; when this flag is 1, tree leaves as well as tree nodes’ stats are updated; when it is 0, only node stats are updated
args['process_type'] (string) A type of boosting process to run
args['num_parallel_tree'] (int) Number of parallel trees constructed during each iteration; this option is used to support boosted random forest
args['sample_type'] (string) Type of sampling algorithm
args['normalize_type'] (string) Type of normalization algorithm
args['rate_drop'] (float) Dropout rate (a fraction of previous trees to drop during the dropout)
args['one_drop'] (string) When this flag is enabled, at least one tree is always dropped during the dropout (allows Binomial-plus-one or epsilon-dropout from the original DART paper)
args['skip_drop'] (float) Probability of skipping the dropout procedure during a boosting iteration
args['feature_selector'] (string) Feature selection and ordering method
args['top_k'] (int) The number of top features to select in greedy and thrifty feature selector; the value of 0 means using all the features
args['updater'] (string) A comma-separated string defining the sequence of tree updaters to run, providing a modular way to construct and modify the trees. This is an advanced parameter that is usually set automatically, depending on other parameters. However, it can also be set explicitly by the user.

Table 2. Full model Python script arguments for Dask-XGBoost models.
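A script will typically need to turn the Table 2 arguments into an XGBoost parameter dictionary. The sketch below shows one way to do that. The key names are taken from Table 2, while the helper build_xgboost_params is hypothetical. Mapping xgboost_lambda to XGBoost's lambda parameter is an assumption based on the description in the table (L2 regularization term on weights), not something this guide states.

```python
from typing import Any, Dict

# Argument names listed in Table 2.
TABLE_2_KEYS = (
    "booster", "objective", "base_score", "eval_metric", "seed", "eta",
    "gamma", "max_depth", "min_child_weight", "max_delta_step", "subsample",
    "colsample_bytree", "colsample_bylevel", "colsample_bynode",
    "xgboost_lambda", "alpha", "tree_method", "scale_pos_weight",
    "refresh_leaf", "process_type", "num_parallel_tree", "sample_type",
    "normalize_type", "rate_drop", "one_drop", "skip_drop",
    "feature_selector", "top_k", "updater",
)

# Assumption: xgboost_lambda corresponds to XGBoost's "lambda" parameter.
RENAMED_KEYS = {"xgboost_lambda": "lambda"}


def build_xgboost_params(args: Dict[str, Any]) -> Dict[str, Any]:
    """Collect the Table 2 arguments that are present and not None."""
    params = {}
    for key in TABLE_2_KEYS:
        value = args.get(key)
        if value is None:
            continue  # leave unset so XGBoost applies its own default
        params[RENAMED_KEYS.get(key, key)] = value
    return params
```

The resulting dictionary is in the form XGBoost training functions accept as their params argument (for example, xgboost.dask.train). How the Dask client and training data are obtained is outside the scope of this sketch.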