Introduction

The Lucd Modeling Framework (LMF) provides a lightweight, compact approach to training TensorFlow, PyTorch, scikit-learn, and XGBoost models in Lucd. Boilerplate tasks (e.g., creating confusion matrices and ROC curves) are executed automatically by the LMF, enabling a developer to write and maintain less code. The framework also supports modularity, so developers can effortlessly adopt new data loading, training, and performance-measurement logic as it becomes available in the LMF.

Compact modeling differs according to the framework being used for model development (e.g., TensorFlow vs. PyTorch). The following sections describe each approach in more detail.

Examples illustrating how to use the compact modeling approach are in The Lucd Model Shop.

TensorFlow

For TensorFlow-based modeling, compact modeling requires the developer to implement only two functions: model and label_mapping.

Model Function

The model function contains the code for building a TensorFlow Estimator model. There are many examples on the web demonstrating how to build and configure Estimator models; https://www.tensorflow.org/guide/estimator is a good starting point. The LMF passes essential data (e.g., training data, training steps, pre-defined word embeddings) to this function to configure an Estimator model for training. In turn, the function must return the elements the LMF needs for training, performance analysis, etc. Below is a formal description of the model function.

def model(training_data, validation_data, num_features, training_steps, learning_rate, regularization_value, log_dir,
          training_param_map, embedding_matrix, embedding_size, word_index_mapping, max_document_length, pad_value,
          train_id):
    """Function used by LMF for training and analyzing TensorFlow Estimator models.

        Args:
            training_data (list): List of delayed "chunks" of Dask dataframe representing training data.
            validation_data (list): List of delayed "chunks" of Dask dataframe representing validation data.
            num_features (tuple): The shape of the features input for a model.
            training_steps (int): Number of steps for model training.
            learning_rate (float): Model's learning rate.
            regularization_value (float): Model's regularization value.
            log_dir (string): Path designating where checkpoints will be written (needed for training).
            training_param_map (dict): Dictionary containing miscellaneous parameters.
            embedding_matrix (numpy array): 2D numpy array where each row represents an embedding for an indexed word
            (for text classification models).
            embedding_size (int): Size/length of embeddings in embedding_matrix, or length of embeddings to be learned
            (for text classification models).
            word_index_mapping (dict): Dict mapping string words to their int index representations (for text
            classification models).
            max_document_length (int): Int defining the maximum number of tokens to be used for free text input into the
            model for training (for text classification models).
            pad_value (int): Int defining index value used for padding documents for training, validation, and testing
            (for text classification models).
            train_id (str): Unique identifier of the underlying training in the database.

        Returns:
            TensorFlow Estimator object (for training),
            TensorFlow Estimator TrainSpec object (for running training),
            TensorFlow Estimator EvalSpec object (for running validation),
            Dict mapping feature names to feature types (for loading data into the model),
            Type of target/label in training/validation/testing data (for loading data into the model),
            TensorFlow serving_input_receiver_fn suitable for use in serving (for model serving/prediction),
            List of feature names (same order as in training data, GUI display purposes),
            List of class names (same order as their numerical representation in training data, for confusion matrix and
            GUI display purposes),
            Number of classes to use in lucd_uds.get_tf_dataset_* and lucd_ml.get_predictions_* functions for TensorFlow.
            String "input_name" representing the name of the model input layer for use with TF signature def 
            when generating predictions.
    """

Note that the inputs to the model function are defined in the Unity client, and hence should not be altered.

Label_mapping Function

Return values from the label_mapping function are used by the LMF to compute the confusion matrix and the precision and recall statistics. For proper construction of the confusion matrix, the function should return a dict mapping the training data's integer label values to descriptive strings.

Here’s an example label_mapping definition:

def label_mapping():
    return {0: 'I. setosa', 1: 'I. virginica'}

PyTorch

For PyTorch-based modeling, the developer is required to implement the same two functions as for TensorFlow: model and label_mapping. The label_mapping function is used exactly as in TensorFlow, so only the model function is described here.

Model Function

Unlike TensorFlow-based modeling, where the model function implements a developer's AI model, model for PyTorch is primarily used to execute model training and validation logic. This is because PyTorch places training logic much more directly under the developer's control. As the example code shows, and in keeping with traditional PyTorch style, the AI model itself can be defined as a separate class in the same Python file. Details of the model function for PyTorch are below.

def model(training_data, validation_data, num_features, training_steps, learning_rate, regularization_value, log_dir,
          training_param_map, embedding_matrix, embedding_size, word_index_mapping, max_document_length, pad_value,
          train_id):
    """Function used by LMF for training and analyzing PyTorch Estimator models.

        Args:
            training_data (torch.utils.data.Dataset): PyTorch dataset representing training data.
            validation_data (torch.utils.data.Dataset): PyTorch dataset representing validation data.
            num_features (tuple): The shape of the features input for a model (currently unused for PyTorch).
            training_steps (int): Number of steps for model training.
            learning_rate (float): Model's learning rate.
            regularization_value (float): Model's regularization value.
            log_dir (string): Path designating where checkpoints will be written (needed for training).
            training_param_map (dict): Dictionary containing miscellaneous parameters.
            embedding_matrix (numpy array): 2D numpy array where each row represents an embedding for an indexed word
            (for text classification models).
            embedding_size (int): Size/length of embeddings in embedding_matrix, or length of embeddings to be learned
            (for text classification models).
            word_index_mapping (dict): Dict mapping string words to their int index representations (for text
            classification models).
            max_document_length (int): Int defining the maximum number of tokens to be used for free text input into the
            model for training (for text classification models).
            pad_value (int): Int defining index value used for padding documents for training, validation, and testing
            (for text classification models).
            train_id (str): Unique identifier of the underlying training in the database.

        Returns:
            Trained PyTorch model,
            List of floats representing final model performance statistics values,
            List of class names (same order as their numerical representation in training data, for confusion matrix and
            GUI display purposes).
    """

Scikit-learn

Scikit-learn model functions should return either a scikit-learn estimator or a scikit-learn pipeline. (XGBoost model functions, described in the next section, should return the object returned from xgb.dask.train.)

def model(train_features, train_labels, eval_features, eval_labels, num_features, training_steps, learning_rate, 
            regularization_value, log_dir, training_param_map, embedding_matrix, embedding_size, word_index_mapping, 
            max_document_length, pad_value, train_id):
    """Function used by LMF for training and analyzing sklearn / xgboost models.

        Args:
            train_features (dask.dataframe.DataFrame): Dask DataFrame training data.
            train_labels (dask.dataframe.DataFrame): Dask DataFrame training labels.
            eval_features (dask.dataframe.DataFrame): Dask DataFrame evaluation data.
            eval_labels (dask.dataframe.DataFrame): Dask DataFrame evaluation labels.
            num_features (tuple): The shape of the features input for a model (currently unused).
            training_steps (int): Number of steps for model training.
            learning_rate (float): Model's learning rate.
            regularization_value (float): Model's regularization value.
            log_dir (string): Path designating where checkpoints will be written (needed for training).
            training_param_map (dict): Dictionary containing miscellaneous parameters.
            embedding_matrix (numpy array): 2D numpy array where each row represents an embedding for an indexed word
            (for text classification models).
            embedding_size (int): Size/length of embeddings in embedding_matrix, or length of embeddings to be learned
            (for text classification models).
            word_index_mapping (dict): Dict mapping string words to their int index representations (for text
            classification models).
            max_document_length (int): Int defining the maximum number of tokens to be used for free text input into the
            model for training (for text classification models).
            pad_value (int): Int defining index value used for padding documents for training, validation, and testing
            (for text classification models).
            train_id (str): Unique identifier of the underlying training in the database.

        Returns:
            Trained scikit-learn estimator or pipeline, or trained XGBoost model,
            List of class names (same order as their numerical representation in training data, for confusion matrix and
            GUI display purposes).
    """

XGBoost

XGBoost model functions should return the object returned from xgb.dask.train.

def model(train_features, train_labels, eval_features, eval_labels, num_features, training_steps, learning_rate, 
            regularization_value, log_dir, training_param_map, embedding_matrix, embedding_size, word_index_mapping, 
            max_document_length, pad_value, algorithm, base_score, booster, objective, eval_metric, seed, train_id):
    """Function used by LMF for training and analyzing sklearn / xgboost models.

        Args:
            train_features (dask.dataframe.DataFrame): Dask DataFrame training data.
            train_labels (dask.dataframe.DataFrame): Dask DataFrame training labels.
            eval_features (dask.dataframe.DataFrame): Dask DataFrame evaluation data.
            eval_labels (dask.dataframe.DataFrame): Dask DataFrame evaluation labels.
            num_features (tuple): The shape of the features input for a model (currently unused).
            training_steps (int): Number of steps for model training.
            learning_rate (float): Model's learning rate.
            regularization_value (float): Model's regularization value.
            log_dir (string): Path designating where checkpoints will be written (needed for training).
            training_param_map (dict): Dictionary containing miscellaneous parameters.
            embedding_matrix (numpy array): 2D numpy array where each row represents an embedding for an indexed word
            (for text classification models).
            embedding_size (int): Size/length of embeddings in embedding_matrix, or length of embeddings to be learned
            (for text classification models).
            word_index_mapping (dict): Dict mapping string words to their int index representations (for text
            classification models).
            max_document_length (int): Int defining the maximum number of tokens to be used for free text input into the
            model for training (for text classification models).
            pad_value (int): Int defining index value used for padding documents for training, validation, and testing
            (for text classification models).
            algorithm (str): One of "booster", "xgbclassifier", or "xgbregressor".
            base_score (float): The initial prediction score of all instances (global bias).
            booster (str): One of "gbtree", "dart", or "gblinear".
            objective (str): Objective string to be passed to the booster object.
            eval_metric (str): Evaluation metric string to be passed to the booster object.
            seed (int): Random seed.
            train_id (str): Unique identifier of the underlying training in the database.

        Returns:
            Trained scikit-learn estimator or pipeline, or trained XGBoost model,
            List of class names (same order as their numerical representation in training data, for confusion matrix and
            GUI display purposes).
    """