
Introduction to Developing Models

This document describes how to prepare models for training using the Python-based Lucd Modeling Framework. The guide is separated into the following sections:

  • accessing training arguments;

  • importing training data;

  • evaluating TensorFlow models;

  • submitting model evaluation results;

  • submitting model training status.

A developer must implement her model as a Python script with a main function that takes a dict of arguments. Numerous examples of models in Python script form are presented in the documentation for Example Models.

Accessing Training Arguments

Recall that starting a model training session in the Lucd GUI requires defining a set of arguments. Table 1 describes the arguments (defined in the Lucd GUI) which are always passed to the main function of a TensorFlow or PyTorch model script.

Table 1. General Modeling Arguments

General argument: Description
args['train_id']: Model "training run" ID, used for storing the trained model asset back to Lucd
args['vds']: ID of the Lucd virtual dataset used for training the model
args['parameters']['steps']: Number of epochs for which to train the model
args['parameters']['lr']: Learning rate for training the model
args['parameters']['eval_percent']: Percentage of the virtual dataset to use for evaluation
args['parameters']['test_percent']: Percentage of the virtual dataset to use for testing
args['parameters']['classification_mode']: Type of classification (binary, multiclass, tf_premade_multiclass) as selected in the GUI (applies only to classification models)
args['exportdir']: Directory used for storing the trained model (for upload purposes)
args['graphversion']: Version of the graph being trained
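For orientation, below is a minimal sketch of a model script's main function showing how these arguments are typically unpacked. The training logic itself is elided; only the argument handling, taken from Table 1, is shown.

def main(args):
    train_id = args['train_id']                        # ID for storing the trained model back to Lucd
    vds = args['vds']                                  # virtual dataset to train on
    steps = args['parameters']['steps']                # number of epochs
    lr = args['parameters']['lr']                      # learning rate
    eval_percent = args['parameters']['eval_percent']  # evaluation split
    test_percent = args['parameters']['test_percent']  # test split
    export_dir = args['exportdir']                     # where the trained model is written for upload
    graph_version = args['graphversion']               # version of the graph being trained
    # ... build, train, evaluate, and upload the model here ...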

Table 2. Dask-XGBoost Model Script Arguments

Dask-XGBoost argument: Description
args['booster']: XGBoost booster type
args['objective']: Learning task and the corresponding learning objective
args['base_score']: The initial prediction score of all instances (global bias)
args['eval_metric']: Evaluation metrics for validation data; a default metric is assigned according to 'objective'
args['seed']: Random number seed
args['eta']: Step size shrinkage used in updates to prevent overfitting
args['gamma']: Minimum loss reduction required to make a further partition on a leaf node of the tree
args['max_depth']: Maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit; beware that XGBoost aggressively consumes memory when training a deep tree
args['min_child_weight']: Minimum sum of instance weight (hessian) needed in a child
args['max_delta_step']: Maximum delta step allowed for each tree's weight estimation
args['subsample']: Subsample ratio of the training instances
args['colsample_bytree']: Subsample ratio of columns when constructing each tree
args['colsample_bylevel']: Subsample ratio of columns for each level
args['colsample_bynode']: Subsample ratio of columns for each split
args['xgboost_lambda']: L2 regularization term on weights; increasing this value makes the model more conservative
args['alpha']: L1 regularization term on weights; increasing this value makes the model more conservative
args['tree_method']: The tree construction algorithm used in XGBoost
args['scale_pos_weight']: Balancing of positive and negative weights
args['refresh_leaf']: A parameter of the refresh updater plugin; when this flag is 1, tree leaves as well as tree nodes' stats are updated; when it is 0, only node stats are updated
args['process_type']: Type of boosting process to run
args['num_parallel_tree']: Number of parallel trees constructed during each iteration; this option is used to support boosted random forests
args['sample_type']: Type of sampling algorithm
args['normalize_type']: Type of normalization algorithm
args['rate_drop']: Dropout rate (the fraction of previous trees to drop during dropout)
args['one_drop']: When this flag is enabled, at least one tree is always dropped during dropout (allows Binomial-plus-one or epsilon-dropout from the original DART paper)
args['skip_drop']: Probability of skipping the dropout procedure during a boosting iteration
args['feature_selector']: Feature selection and ordering method
args['top_k']: The number of top features to select in the greedy and thrifty feature selectors; a value of 0 means using all features
args['updater']: A comma-separated string defining the sequence of tree updaters to run, providing a modular way to construct and modify trees. This is an advanced parameter that is usually set automatically depending on other parameters, but it can also be set explicitly by the user
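As a hedged illustration, a Dask-XGBoost script typically forwards these values into an XGBoost parameter dict. Note that Lucd passes 'xgboost_lambda' where XGBoost itself expects the key 'lambda' (presumably because lambda is a reserved word in Python). The mapping below is a sketch, not Lucd's authoritative code.

# Sketch: forwarding Lucd-provided arguments into an XGBoost parameter dict.
params = {
    'booster': args['booster'],
    'objective': args['objective'],
    'eta': args['eta'],
    'max_depth': args['max_depth'],
    'lambda': args['xgboost_lambda'],  # renamed key: 'lambda' is a Python keyword
    'alpha': args['alpha'],
    # ... the remaining Table 2 keys map through under their own names ...
}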

As a word of caution, the developer should not change any of these parameter values within the model script; they are selected when configuring a training session in the Lucd GUI, and altering them can reduce the efficacy of training results.

Importing Training Data

Data can be imported into a model using the Lucd UDS API (eda.lib.lucd_uds). This library implements functions for creating datasets of various formats (TensorFlow, PyTorch, Dask dataframe) based on Lucd virtual datasets defined in the GUI. It also provides the capability to retrieve assets related to previously trained word embeddings.

The Lucd UDS functions providing data retrieval are listed below; a usage sketch follows the list. Refer to the API documentation for full descriptions.

  • train_eval_test_split_tensorflow
  • get_tf_dataset
  • get_tf_dataset_text
  • get_tf_dataset_image
  • get_asset
  • train_eval_test_split_pytorch
  • generate_pytorch_text_batch
  • train_eval_test_split_dataframe_2
  • get_dataframe
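As a hedged usage sketch (the exact signatures live in the API documentation; the keyword names below are assumptions), a TensorFlow training script might split a virtual dataset as follows:

from eda.lib import lucd_uds

def load_datasets(args):
    # Assumed signature, for illustration only; consult the lucd_uds API docs.
    # Splits the virtual dataset into train/eval/test sets using the
    # percentages configured in the Lucd GUI (Table 1).
    return lucd_uds.train_eval_test_split_tensorflow(
        args['vds'],
        eval_percent=args['parameters']['eval_percent'],
        test_percent=args['parameters']['test_percent'])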

Important Notes for Implementing Multi-Class Modeling

TensorFlow offers different approaches to building multi-class models; two prominent ones are premade estimators (https://www.tensorflow.org/tutorials/estimator/premade#overview_of_programming_with_estimators) and general techniques such as Keras models and estimators. If one-hot encoded data labels are needed (i.e., to match the number of nodes in a neural network's output layer), use the num_classes parameter when calling the relevant data retrieval functions (e.g., lucd_uds.get_tf_dataset). Note that most (if not all) TensorFlow premade estimator models (e.g., tensorflow.estimator.DNNClassifier) do not require explicitly one-hot encoded data labels for non-binary modeling, so the num_classes argument can be omitted for them. In the case of TensorFlow estimators, developers are encouraged to understand how to shape input for the models; the same goes for modeling with PyTorch or XGBoost.
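For example, a Keras model with a three-node softmax output layer could request one-hot labels as sketched below; apart from num_classes, the call's arguments and surrounding usage are illustrative assumptions.

from eda.lib import lucd_uds

# Sketch: request one-hot encoded labels for a 3-class problem so the label
# shape matches a 3-node softmax output layer. Assumes this runs inside
# main(args); only the num_classes parameter is documented in this guide.
dataset = lucd_uds.get_tf_dataset(args['vds'], num_classes=3)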

Evaluating TensorFlow Models

To help with the evaluation of TensorFlow models, the Lucd ML API (eda.lib.lucd_ml) contains the following functions:

  • get_predictions_classification,

  • get_predictions_regression, and

  • confusion_matrix.

The first two functions use a trained TensorFlow estimator model and a TensorFlow input function to compute predictions from a dataset. confusion_matrix computes and formats a confusion matrix for proper display in the Lucd GUI.
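Below is a hedged sketch of how these pieces might fit together inside an evaluation step; the argument order and return shapes are assumptions, so consult the lucd_ml API documentation for the authoritative signatures.

from eda.lib import lucd_ml

# Assumed usage: compute predictions from a trained estimator and an input
# function, then build the GUI-ready confusion matrix string.
predictions = lucd_ml.get_predictions_classification(estimator, eval_input_fn)
cm_string = lucd_ml.confusion_matrix(true_labels, predictions)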

Submitting Model Evaluation Results

Trained models and metadata can be uploaded back to Lucd via the eda.int.train.update function. The following piece of example code illustrates how to use the function.

from eda.int import train
from eda.lib import lucd_uds

model_filename = lucd_uds.zip_tf_model(_estimator, serving_input_receiver_fn, model_id, graphversion, exportdir)

# Store model graph and performance stats back to the Lucd back-end
with open(model_filename, "rb") as graph_file:
    train.update({tid: {
        'performance': {
            'loss': final_loss,
            'accuracy': accuracy,
            'confusion_matrix': cm_string,
            'macro_precision': precision_macro,
            'macro_recall': recall_macro,
            'macro_f1': f1_macro,
            'micro_precision': precision_micro,
            'micro_recall': recall_micro,
            'micro_f1': f1_micro,
            'weighted_precision': precision_weighted,
            'weighted_recall': recall_weighted,
            'weighted_f1': f1_weighted,
            'precision_recall_f1_per_label': results_string,
        },
        'graph_version': graphversion,
        'graph_file': graph_file.read()
    }})

train.update takes a Python dictionary as its argument, with the train_id, described in Table 1, as the top-level key (tid holds the train_id in the code snippet above). The secondary keys graph_version and graph_file store the graph version and trained graph file, respectively. The secondary key performance stores another dictionary of performance values. Some of the parameters here are specific to the type of model that was trained, i.e., classification or regression (the snippet above addresses classification). The confusion_matrix attribute stores the output from the confusion_matrix function described in the previous section.

As for precision_recall_f1_per_label, an example of how to compute performance statistics per label can be found at https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html and in the example model code. precision_recall_f1_per_label must be formatted as a semicolon-separated string of label-precision,recall,f1 tuples, as in the following example:

precision_recall_f1_per_label: setosa-1.0,0.5,1.0;virginica-1.0,1.0,0.6;versicolor-1.0,0.6666666666666666,0.7499999999999999
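A minimal sketch of producing this string with scikit-learn follows; the class names and label arrays are illustrative, not values from this guide.

from sklearn.metrics import precision_recall_fscore_support

label_names = ['setosa', 'virginica', 'versicolor']  # illustrative class names
y_true = [0, 1, 2, 2, 0, 1]                          # example ground-truth integer labels
y_pred = [0, 1, 2, 1, 0, 1]                          # example predicted integer labels

# Per-label precision/recall/F1, formatted as the semicolon-separated
# "label-precision,recall,f1" string expected by train.update.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=list(range(len(label_names))))
results_string = ';'.join(
    f'{name}-{p},{r},{f}'
    for name, p, r, f in zip(label_names, precision, recall, f1))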

For regression modeling, the performance parameters are rmse (root mean squared error), mae (mean absolute error), and r2 (R2 regression score).

To enable a trained model to be used by the explainability tool in the Lucd GUI, the parameters ordered_feature_names and ordered_class_names must also be defined. ordered_feature_names (not to be confused with training data input column names) holds the ordered names of the inputs to the trained model; for example, for a TensorFlow text classification model, the named input might be "embedding_input". Please see the example code in this documentation for further clarity. As a further note, in the current release the Lucd GUI explainability tool only supports TensorFlow text classification, so ordered_feature_names is only needed for those model types. ordered_class_names must be formatted so that string class names are ordered by their integer representations. For instance, for a binary classification model for which the labels are 0 and 1, the order of strings must be negative and then positive (or whatever string labels you choose).
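For instance, a sketch with illustrative names (the container type shown, a Python list ordered by integer label, is an assumption):

# String class names ordered by their integer label values:
# index 0 maps to label 0, index 1 maps to label 1 (names illustrative).
ordered_class_names = ['negative', 'positive']

# Ordered names of the trained model's inputs, e.g. for a TensorFlow text
# classification model whose named input is "embedding_input".
ordered_feature_names = ['embedding_input']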

Submitting Model Training Status

Another helpful function is eda.int.train.status, which is used for storing the status of a developer's training pipeline. This enables a model's status to be displayed in the Lucd GUI. The function definition is below.

def status(uid, code, message=None):
    """Update model status in the database.

    Args:
        uid: Int representing a model's ID.
        code: 0 - RETRIEVING VDS, 1 - TRAINING, 2 - ANALYZING PERFORMANCE, 3 - STORING MODEL, 4 - TRAINING COMPLETE,
        5 - ERROR, 6 - QUEUED, 7 - STOPPED.
        message: String representing optional custom message to include.

    Returns:
        Status message.

    Raises:
        TypeError: If code is not of type int.
        Exception: If code is invalid.
    """

Next: Example Models
