Lucd Modeling Framework | User Guide | 6.3.0 RC1

Introduction

The Lucd Modeling Framework (LMF) provides a lightweight, compact modeling approach to training TensorFlow and PyTorch models in Lucd. Boilerplate tasks (e.g., creating confusion matrices and ROC curves) are executed automatically by the LMF, enabling a developer to write and maintain less code. This design also supports modularity, so that developers can effortlessly adopt new data loading, training, and performance measurement logic as it becomes available in the LMF.

Compact modeling differs according to the framework being used for model development, i.e., TensorFlow or PyTorch. The following sections describe each approach in more detail.

Examples illustrating how to use the compact modeling approach are in The Lucd Model Shop.

TensorFlow

For TensorFlow-based modeling, compact modeling requires the developer to implement only two functions: model and label_mapping.

Model Function

The model function contains the code for building TensorFlow Estimator models. There are many examples on the web demonstrating how to build and configure Estimator models; https://www.tensorflow.org/guide/estimator is a good starting point. The LMF passes essential data (e.g., training data, training steps, pre-defined word embeddings) to this function to configure an Estimator model for training. In turn, the function must return the elements the LMF needs for training, performance analysis, etc. Below is a formal description of the model function.

def model(training_data, validation_data, num_features, training_steps, learning_rate, regularization_value, log_dir,
          training_param_map, embedding_matrix, embedding_size, word_index_mapping, max_document_length, pad_value,
          train_id):
    """Function used by LMF for training and analyzing TensorFlow Estimator models.

        Args:
            training_data (list): List of delayed "chunks" of Dask dataframe representing training data.
            validation_data (list): List of delayed "chunks" of Dask dataframe representing validation data.
            num_features (tuple): The shape of the features input for a model.
            training_steps (int): Number of steps for model training.
            learning_rate (float): Model's learning rate.
            regularization_value (float): Model's regularization value.
            log_dir (string): Path designating where checkpoints will be written (needed for training).
            training_param_map (dict): Dictionary containing miscellaneous parameters.
            embedding_matrix (numpy array): 2D numpy array where each row represents an embedding for an indexed word
            (for text classification models).
            embedding_size (int): Size/length of embeddings in embedding_matrix, or length of embeddings to be learned
            (for text classification models).
            word_index_mapping (dict): Dict mapping string words to their int index representations (for text
            classification models).
            max_document_length (int): Int defining the maximum number of tokens to be used for free text input into the
            model for training (for text classification models).
            pad_value (int): Int defining index value used for padding documents for training, validation, and testing
            (for text classification models).
            train_id (str): Unique identifier of the underlying training in the database.

        Returns:
            TensorFlow Estimator object (for training),
            TensorFlow Estimator TrainSpec object (for running training),
            TensorFlow Estimator EvalSpec object (for running validation),
            Dict mapping feature names to feature types (for loading data into the model),
            Type of target/label in training/validation/testing data (for loading data into the model),
            TensorFlow serving_input_receiver_fn suitable for use in serving (for model serving/prediction),
            List of feature names (same order as in training data, GUI display purposes),
            List of class names (same order as their numerical representation in training data, for confusion matrix and
            GUI display purposes),
            Number of classes to use in lucd_uds.get_tf_dataset_* and lucd_ml.get_predictions_* functions for TensorFlow,
            String "input_name" representing the name of the model input layer for use with the TF signature def
            when generating predictions.
    """

Note that the inputs to the model function are defined in the Unity client, and hence should not be altered.
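The return values enumerated in the docstring above form a single ordered tuple. The following pure-Python stand-in (no TensorFlow required) illustrates the expected order; the string and dict values are placeholders for the real Estimator, TrainSpec, EvalSpec, and serving-function objects, and the feature/class names are purely illustrative:

```python
# Sketch of the 10-element tuple a TensorFlow model function returns to LMF.
# Real code returns tf.estimator objects; stand-in strings are used here so
# the structure can be shown without TensorFlow installed.
def model(*args, **kwargs):
    estimator = "tf.estimator.Estimator instance"        # trained by LMF
    train_spec = "tf.estimator.TrainSpec instance"       # drives training
    eval_spec = "tf.estimator.EvalSpec instance"         # drives validation
    feature_types = {"sepal_length": "float32"}          # feature name -> type
    label_type = "int64"                                 # type of target/label
    serving_fn = "serving_input_receiver_fn"             # for serving/prediction
    feature_names = ["sepal_length"]                     # GUI display order
    class_names = ["setosa", "versicolor", "virginica"]  # confusion-matrix labels
    num_classes = 3                                      # for lucd_uds/lucd_ml helpers
    input_name = "dense_input"                           # model input layer name
    return (estimator, train_spec, eval_spec, feature_types, label_type,
            serving_fn, feature_names, class_names, num_classes, input_name)
```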

Label_mapping Function

Return values from the label_mapping function are used by the LMF to compute the confusion matrix and precision/recall statistics. For proper construction of the confusion matrix, the function should return a dict mapping the training data’s integer label values to expressive strings.
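A label_mapping implementation is typically a one-liner. The sketch below assumes a zero-argument signature (the exact signature may differ in your LMF version), and the three iris class names are illustrative:

```python
def label_mapping():
    # Map the integer label values found in the training data to readable
    # strings; LMF uses this mapping when rendering the confusion matrix
    # and precision/recall statistics.
    return {0: "setosa", 1: "versicolor", 2: "virginica"}
```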

PyTorch

For PyTorch-based modeling, the developer is required to implement the same two functions as for TensorFlow: model and label_mapping. The use of the label_mapping function for PyTorch is exactly the same as for TensorFlow, so only the model function is described here.

Model Function

As opposed to TensorFlow-based modeling, in which the model function implements a developer’s AI model, model for PyTorch is primarily used to execute model training and validation logic. This is because with PyTorch, model training logic is designed to be much more under the developer’s control. As in traditional PyTorch style, the actual AI model can be defined as a separate class inside the same Python file. Details of the model function for PyTorch are below.

def model(training_data, validation_data, num_features, training_steps, learning_rate, regularization_value, log_dir,
          training_param_map, embedding_matrix, embedding_size, word_index_mapping, max_document_length, pad_value,
          train_id):
    """Function used by LMF for training and analyzing PyTorch models.

        Args:
            training_data (torch.utils.data.Dataset): PyTorch dataset representing training data.
            validation_data (torch.utils.data.Dataset): PyTorch dataset representing validation data.
            num_features (tuple): The shape of the features input for a model (no use for PyTorch identified).
            training_steps (int): Number of steps for model training.
            learning_rate (float): Model's learning rate.
            regularization_value (float): Model's regularization value.
            log_dir (string): Path designating where checkpoints will be written (needed for training).
            training_param_map (dict): Dictionary containing miscellaneous parameters.
            embedding_matrix (numpy array): 2D numpy array where each row represents an embedding for an indexed word
            (for text classification models).
            embedding_size (int): Size/length of embeddings in embedding_matrix, or length of embeddings to be learned
            (for text classification models).
            word_index_mapping (dict): Dict mapping string words to their int index representations (for text
            classification models).
            max_document_length (int): Int defining the maximum number of tokens to be used for free text input into the
            model for training (for text classification models).
            pad_value (int): Int defining index value used for padding documents for training, validation, and testing
            (for text classification models).
            train_id (str): Unique identifier of the underlying training in the database.

        Returns:
            Trained PyTorch model,
            List of floats representing final model performance statistic values,
            List of class names (same order as their numerical representation in training data, for confusion matrix and
            GUI display purposes).
    """