Lucd Modeling Framework | User Guide | 6.3.0 RC1

Introduction

This section covers the high-level tasks needed to support model training: importing and preparing data, and analyzing model performance. Most of the content here pertains to the full model approach, but some of it (e.g., reporting model status) also applies to PyTorch compact modeling.

Importing and Preparing Data

Data can be imported into a modeling context using the Lucd Unified Dataspace (UDS) API (eda.lib.lucd_uds). This library provides functions for creating datasets of various formats (TensorFlow, PyTorch, Dask dataframe) based on Lucd virtual datasets defined in the Unity client. It also provides the capability to retrieve previously trained word embeddings.

The Lucd UDS functions providing data retrieval are listed below; a usage sketch follows the list. Some are used for straightforward data importing (e.g., get_dataframe), while others prepare framework-specific datasets for AI models (e.g., get_tf_dataset for defining TensorFlow datasets). Refer to the API documentation for full function descriptions, and visit the Lucd Model Shop GitLab project for examples of how to use these functions when developing AI models.

  • get_asset
  • get_dataframe
  • get_tf_dataset
  • get_tf_dataset_image
  • get_tf_dataset_text
  • train_eval_test_split_dataframe
  • train_eval_test_split_pytorch
  • train_eval_test_split_tensorflow
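For instance, a training workflow typically begins by splitting a Lucd virtual dataset into training, evaluation, and testing partitions. The following is a minimal sketch for a TensorFlow workflow; the variable values are placeholders, and the call signature mirrors the confusion matrix example later in this section.

from eda.lib import lucd_uds

# Placeholder values -- supply your own virtual dataset ID and split percentages
virtual_dataset_id = "my_vds_id"
evaluation_dataset_percent = 0.2
testing_dataset_percent = 0.1

# Split the virtual dataset into training, evaluation, and testing partitions
# (delayed Dask dataframes), plus testing labels and the feature count
delayed_values_training, delayed_values_evaluation, delayed_values_testing, \
    my_df_testing_label, num_features = lucd_uds.train_eval_test_split_tensorflow(
        virtual_dataset_id, evaluation_dataset_percent, testing_dataset_percent)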

Important notes for implementing multi-class modeling

TensorFlow offers different approaches to building multi-class models, two prominent ones being pre-made Estimators (https://www.tensorflow.org/tutorials/estimator/premade#overview_of_programming_with_estimators) and general techniques such as custom Keras models and Estimators. If one-hot encoded data labels are needed (i.e., to match the number of nodes in a neural network output layer), the num_classes parameter should be passed when calling the relevant data-retrieval functions (e.g., lucd_uds.get_tf_dataset). Note that most (if not all) TensorFlow pre-made Estimator models (e.g., tensorflow.estimator.DNNClassifier) do not require explicitly one-hot encoded data labels for non-binary modeling, and hence the num_classes argument can be omitted. In the case of TensorFlow Estimators, developers are encouraged to understand how to shape input for their models; the same applies to modeling with PyTorch or XGBoost. A sketch contrasting the two cases follows.
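In the sketch below, feature_dict, delayed_values_training, num_features, and target_type are placeholders prepared elsewhere in a training script, and the num_classes placement is inferred from the examples later in this guide.

from eda.lib import lucd_uds

# Pre-made Estimator (e.g., tf.estimator.DNNClassifier): integer labels suffice,
# so num_classes is omitted
dataset = lucd_uds.get_tf_dataset(feature_dict, delayed_values_training,
                                  num_features, target_type).batch(32)

# Custom Keras model with a 3-node softmax output layer: one-hot encoded labels
# are needed, so pass num_classes to match the output layer width
dataset_one_hot = lucd_uds.get_tf_dataset(feature_dict, delayed_values_training,
                                          num_features, target_type,
                                          num_classes=3).batch(32)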

Analyzing Model Performance

Post-training performance analysis tasks are supported by the Lucd Machine Learning (ML) API (eda.lib.lucd_ml). This library provides functions supporting automatic execution and reporting of critical performance analysis tasks (e.g., creating confusion matrices, ROC curves), preventing the need to repeatedly write such code. The tables and plots created from these library functions can be viewed in the Unity client after the entire model training process has completed.

The Lucd ML functions for performance analysis are listed below. Refer to the API documentation for full function descriptions.

  • get_predictions_classification_pt
  • get_predictions_classification_tf
  • get_predictions_regression_pt
  • get_predictions_regression_tf
  • lucd_precision_recall_curve
  • lucd_roc_curve
  • lucd_confusion_matrix
  • update_plots

Submitting Performance Analysis Results

Trained models and metadata can be uploaded to the Lucd backend via the eda.int.train.update function. The following example code illustrates how to use it.

from eda.lib import lucd_uds
from eda.int import train

# Export and zip the trained TensorFlow model for upload
model_filename = lucd_uds.zip_model_tf(trained_classifier, serving_input_receiver_fn, model_id,
                                       graph_version, log_dir)

# Store model graph and performance stats back to Lucd back-end
with open(model_filename, "rb") as graph_file:
    train.update({tid: {
        'performance': {
            'loss': loss,
            'accuracy': accuracy,
            'macro_precision': precision_macro,
            'macro_recall': recall_macro,
            'macro_f1': f1_macro,
            'micro_precision': precision_micro,
            'micro_recall': recall_micro,
            'micro_f1': f1_micro,
            'weighted_precision': precision_weighted,
            'weighted_recall': recall_weighted,
            'weighted_f1': f1_weighted,
            'precision_recall_f1_per_label': results_string,
        },
        'graph_version': graph_version,
        'graph_file': graph_file.read()
    }})

train.update takes a Python dictionary as its argument, with the train_id, described in Table 1, as the top-level key (tid represents the train_id in the code snippet above). The secondary keys graph_version and graph_file store the graph version and the trained graph file (model), respectively. The secondary key performance stores another dictionary of performance values. There is no restriction on the key-value pairs here; the developer may choose which performance values to store, and they will be viewable in the Unity client afterward. The values shown in the code snippet above are customary for evaluating classification models. Again, see the example models in The Lucd Model Shop for more insights.

As for precision_recall_f1_per_label, an example of how to compute performance statistics per label can be found at https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html. precision_recall_f1_per_label must be formatted as a semicolon-separated string of tuples, with each tuple formatted as “label-precision,recall,f1”. For example:

precision_recall_f1_per_label: setosa-1.0,0.5,1.0;virginica-1.0,1.0,0.6;versicolor-1.0,0.6666666666666666,0.7499999999999999
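A string in that format can be assembled from scikit-learn's per-label statistics roughly as follows (the label names and data below are placeholders):

from sklearn.metrics import precision_recall_fscore_support

# Placeholder ground-truth and predicted labels
y_true = ["setosa", "virginica", "versicolor", "setosa"]
y_pred = ["setosa", "virginica", "setosa", "setosa"]
labels = ["setosa", "virginica", "versicolor"]

# Per-label precision, recall, and F1 (the support values are unused here)
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, labels=labels)

# Assemble the semicolon-separated "label-precision,recall,f1" string
results_string = ";".join(
    f"{lbl}-{p},{r},{f}" for lbl, p, r, f in zip(labels, precision, recall, f1))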

Enabling Model Explainability

To enable a trained model to be used by the explainability tool in the Unity client, some parameters must be defined. For TensorFlow models, ordered_feature_names, ordered_class_names, input_name, and output_name must be defined; PyTorch models only require that ordered_class_names be provided. An illustrative sketch follows the list.

  • ordered_feature_names (not to be confused with training data input column names) is a list of ordered names of the inputs to the trained model, commonly defined in TensorFlow model definitions as tf.feature_column. For example, for a TensorFlow text classification model, the named input might be embedding_input. Please see the example code in The Lucd Model Shop.
  • ordered_class_names is a list in which string class names are ordered by their integer representations (the order of outputs from your model). For instance, for a binary classification model whose labels are 0 and 1, the strings must be ordered negative then positive (or whatever string labels you choose).
  • input_name is the name of the input layer in your TensorFlow model to which your ordered_feature_names data will be passed.
  • output_name is the name of the output layer in your TensorFlow model (by default these can have names like ‘dense_2’ and ‘scores’); it is used to retrieve your model outputs in the proper format for explanation.
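As a minimal sketch, the parameters for a hypothetical binary text classifier might look like the following (all values illustrative):

# Hypothetical values for a binary text classification model (all illustrative)
ordered_feature_names = ["embedding_input"]      # ordered model input names
ordered_class_names = ["negative", "positive"]   # index 0 -> "negative", index 1 -> "positive"
input_name = "embedding_input"                   # input layer name in the TensorFlow model
output_name = "dense_2"                          # output layer name used to fetch scores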

Plots

Lucd allows users to plot associated model training metrics in the Unity client. The plots update in real time during model training, providing insight into the viability, problems, and successes of training runs. Users can store any information they want plotted as part of a training run using the eda.lib.lucd_ml.update_plots function. The provided information must abide by the following rules:

  1. The top-level keys represent individual plots.
  2. Each inner dictionary must specify a labels and a description key, where labels is [“plot x_label”, “plot y_label”] and the description can be any string.
  3. All remaining keys in the inner dictionary are treated as individual lines on the plot, so in the following example "l1" is a line on the plot named "accuracy".

{
    "accuracy": {
        "l1": [ [1, 0.10], [2, 0.15] ],
        "labels": ["epoch", "accuracy"],
        "description": "This is a simple accuracy plot example."
    }
}

The individual elements in the “l1” line above represent [x_val, y_val] pairs. A common example, as shown above, is the following:

"l1": [ [epoch, accuracy], [epoch, accuracy], ... ].

Once created, the dictionary may be stored for plotting with the lucd_ml.update_plots function:

def update_plots(train_or_eval: bool, tid: str, dictionary: dict) -> str:
    ...

dictionary = {
    "accuracy": {
        "l1": [ [1, 0.10], [2, 0.15] ],
        "labels": ["epoch", "accuracy"],
        "description": "This is a simple accuracy plot example."
    }
}
update_plots(True, train_id, dictionary)

Regarding the update_plots function, train_or_eval lets the user specify whether the plot is part of the training or the evaluation (or validation) cycle of model training: train=True, eval=False.

A TensorFlow hook is provided in lucd_ml for automatically parsing generated events files (the same files used by TensorBoard) and passing their contents to update_plots as part of a TensorFlow model. It can be provided as part of a TensorFlow EvalSpec or TrainSpec object as follows (stub included for reference):

class LucdTFEstimatorHook(tf.estimator.SessionRunHook):
    def __init__(self, train_hook: bool, log_dir: str, tid: str, freq: int, last_epoch: int):
        ...

train_spec = tf.estimator.TrainSpec(
    input_fn=lambda: lucd_uds.get_tf_dataset_image(type_dict, training_data, num_features, target_type,
                                                   num_classes).repeat(count=None).shuffle(30).batch(30),
    max_steps=training_steps,
    hooks=[lucd_ml.LucdTFEstimatorHook(train_hook=True, log_dir=log_dir, tid=tid, freq=10,
                                       last_epoch=training_steps)])

train_hook allows a user to specify whether the hook is providing train or eval metrics (train=True, eval=False). log_dir tells the hook where to find the TensorFlow events files. freq is the frequency at which the hook checks the events files for new metrics. last_epoch tells the hook the number of epochs being run so that it can ignore freq for the last epoch.

Two final helper functions are provided as part of the Lucd plotting framework: lucd_roc_curve and lucd_precision_recall_curve. These functions generate ROC curves and precision-recall curves, respectively, and are called selectively when using the compact modeling approach (enabled by the “performance_curves” option in the Unity client). Further documentation for these functions is provided in the API documentation; a usage sketch follows the stubs below.

def lucd_roc_curve(truths: list, scores: list, class_list: list,
                   tid: str = None, write_accumulo: bool = False) -> (dict, dict, dict, dict):
    ...

def lucd_precision_recall_curve(truths: list, scores: list, class_list: list,
                                tid: str = None, write_accumulo: bool = False) -> (dict, dict, dict, dict):
    ...
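As a usage sketch (variable values are placeholders; the four returned dictionaries are described in the API documentation):

# truths and scores as produced by, e.g., lucd_ml.get_predictions_classification_tf;
# class_list gives the string labels in model output order (placeholder values here)
roc_results = lucd_ml.lucd_roc_curve(truths, scores,
                                     class_list=["setosa", "virginica", "versicolor"],
                                     tid=tid, write_accumulo=True)

pr_results = lucd_ml.lucd_precision_recall_curve(truths, scores,
                                                 class_list=["setosa", "virginica", "versicolor"],
                                                 tid=tid, write_accumulo=True)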

Confusion Matrix

Lucd provides an in-depth, interactive confusion matrix for classification model evaluation. Users may select a square in the Unity client to be shown the actual records associated with that square. This capability is enabled by the following function:

def lucd_confusion_matrix(test_set: list or DataLoader, predictions: list, num_classes: int,
                          label_mapping: type(abs), tid: str, write_accumulo: bool) -> (dict, str):

Details of the function arguments are provided below.

  • test_set: Users may directly pass the PyTorch DataLoader or list of delayed dask dataframes returned from the respective train_eval_test_split_pytorch/tensorflow function.
  • predictions: A list of predictions generated by your model (the list returned from the lucd_ml get_predictions_classification functions). The list must be in the same order as the test_set data.
  • num_classes: An integer number of classes for the confusion matrix to represent.
  • label_mapping: A function to map integers to class labels, which is used to map predictions to a human-readable format.
  • tid: Training id to associate confusion matrix with.
  • write_accumulo: Boolean specifying whether to write the dictionary directly to Accumulo. Must be True to enable the Unity client confusion matrix. If False, the generated confusion matrix is only returned, not written to the database.

Further documentation for this function exists in the API documentation. Here is an example usage:

def _label_mapping():
    return {0: 'I. versicolor', 1: 'I. virginica', 2: 'I. setosa'}

...

# Prepare vds data for modeling
delayed_values_training, delayed_values_evaluation, delayed_values_testing, my_df_testing_label, num_features = \
    lucd_uds.train_eval_test_split_tensorflow(virtual_dataset_id, evaluation_dataset_percent, testing_dataset_percent)

...

predictions, scores = lucd_ml.get_predictions_classification_tf(
    _estimator,
    lambda: lucd_uds.get_tf_dataset(feature_dict,
                                    delayed_values_testing,
                                    num_features,
                                    target_type).batch(1),
    classification_mode, 0.5)

...

lucd_ml.lucd_confusion_matrix(delayed_values_testing, predictions, 3, _label_mapping, tid, True)

Submitting Model Training Status

Another helpful function is eda.int.train.status, which is used for storing the status of a developer’s training pipeline. This enables a model’s status to be displayed in the Unity client. The function definition is below.

def status(uid, code, message=None):
    """Update model status in the database.

    Args:
        uid: Int representing a model's ID.
        code: 0 - RETRIEVING VDS, 1 - TRAINING, 2 - ANALYZING PERFORMANCE, 3 - STORING MODEL, 4 - TRAINING COMPLETE,
        5 - ERROR, 6 - QUEUED, 7 - STOPPED.
        message: String representing optional custom message to include.

    Returns:
        Status message.

    Raises:
        TypeError: If code is not of type int.
        Exception: If code is invalid.
    """