Lucd Modeling Framework | Lucd Unified Dataspace (UDS) API | 6.2.7

Unified Dataspace API

lucd_uds

LucdPyTorchVirtualDataset()

class lucd_uds.LucdPyTorchVirtualDataset(_type, virtual_dataset_id, feature_list, label, feature_name, process_data, transform=None, dask_dataframe=None)

PyTorch Dataset subclass representing Lucd virtual dataset.

Enables use of pre-defined Lucd virtual dataset for training and evaluating PyTorch models.


LucdPyTorchVirtualDatasetHorovod()

class lucd_uds.LucdPyTorchVirtualDatasetHorovod(_type, feature_list, label, feature_name, data_type, transform=None)

PyTorch Dataset subclass representing Lucd virtual dataset.

Enables use of pre-defined Lucd virtual dataset for training and evaluating PyTorch models when Horovod is used for distributed training.


LucdPyTorchVirtualDatasetText()

class lucd_uds.LucdPyTorchVirtualDatasetText(virtual_dataset_id, feature_list, label, process_data, word_index_mapping, dask_dataframe=None)

PyTorch Dataset subclass representing Lucd virtual dataset.

Enables use of pre-defined Lucd virtual dataset for training and evaluating PyTorch models.


encode_text()

lucd_uds.encode_text(text, word_index_mapping)

Encodes a text string into its index representation for input into a TensorFlow model.

Parameters

  • text (str) – String representing raw text to be encoded.

  • word_index_mapping (dict) – Dict mapping word/token to an index.

Returns

List of encoded tokens (list).
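
The encoding step can be sketched in plain Python. This is a hypothetical illustration of the behavior described above, not the library's implementation; in particular, how out-of-vocabulary tokens are handled is an assumption (they are skipped here).

```python
# Hypothetical sketch of encode_text: map each whitespace-separated token
# to its integer index via word_index_mapping. Tokens missing from the
# mapping are skipped in this sketch; the real function may treat
# out-of-vocabulary words differently.
def encode_text(text, word_index_mapping):
    return [word_index_mapping[token]
            for token in text.split()
            if token in word_index_mapping]

mapping = {"the": 0, "cat": 1, "sat": 2}
print(encode_text("the cat sat", mapping))  # [0, 1, 2]
```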


generate_pytorch_text_batch()

lucd_uds.generate_pytorch_text_batch(batch)

Creates batches of text data formatted for the PyTorch nn.EmbeddingBag layer.

Parameters

  • batch (list) – List of text/features and labels.

Returns

Tensor of text data (torch.tensor), Tensor of offsets (torch.tensor), Tensor of label data (torch.tensor).
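
nn.EmbeddingBag consumes a single flat sequence of token indices plus an offsets sequence marking where each document begins. The collation logic can be sketched with plain Python lists (the real function returns torch.Tensor objects); the function name and the list-based types here are illustrative assumptions.

```python
# Hypothetical sketch of the batching behind generate_pytorch_text_batch:
# flatten all documents' token indices into one list and record the
# starting offset of each document, as nn.EmbeddingBag expects.
def collate_text_batch(batch):
    """batch: list of (token_indices, label) pairs."""
    texts, labels = zip(*batch)
    offsets = [0]
    for tokens in texts[:-1]:
        # Each document starts where the previous one ended.
        offsets.append(offsets[-1] + len(tokens))
    flat = [idx for tokens in texts for idx in tokens]
    return flat, offsets, list(labels)

flat, offsets, labels = collate_text_batch([([4, 7], 0), ([2, 9, 5], 1)])
print(flat, offsets, labels)  # [4, 7, 2, 9, 5] [0, 2] [0, 1]
```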


get_asset()

lucd_uds.get_asset(asset_id)

Retrieves a pre-defined set of word embeddings and other vocabulary data.

Parameters

  • asset_id (str) – ID of the embedding to retrieve from the Dask scheduler.

Returns

Dict mapping vocab word to embedding (dict), NumPy 2-D array containing all embeddings (np.array), embedding dimension (int), Dict mapping vocab word to index (dict), index of the padding token (int).


get_dataframe()

lucd_uds.get_dataframe(virtual_dataset_id)

Returns a Dask dataframe for a Lucd virtual dataset.

Parameters

  • virtual_dataset_id (str) – Virtual dataset ID.

Returns

Dask dataframe (dask.dataframe).


get_tf_dataset()

lucd_uds.get_tf_dataset(input_name_type_dict, training_data, num_features, target_type, num_classes=1)

Returns a TensorFlow Dataset object.

Enables use of pre-defined Lucd virtual dataset for training, testing, or evaluating TensorFlow models. Used for tabular data.

Parameters

  • input_name_type_dict (dict) – Dictionary mapping feature names to their types.

  • training_data (dask.dataframe) – Dask dataframe containing training data.

  • num_features (tuple) – Tuple describing multi-dimensional feature size of training data.

  • target_type (tf.dtype) – TensorFlow type of the target variable (e.g., int32 for classification, float64 for regression).

  • num_classes (int) – Int defining the number of classes represented by the labels of the dataset to be returned. This is for datasets to be used only for “non-canned” TensorFlow estimator models. Internally, this enables one-hot encoding of categorical label data.

Returns

TensorFlow Dataset object (tf.data.Dataset).
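
The one-hot encoding enabled by num_classes can be sketched as follows. This is an illustrative stand-alone function, not the library's internal code; the real function builds the encoding into a tf.data.Dataset pipeline.

```python
# Hypothetical sketch of one-hot encoding a categorical label: an integer
# class label becomes a vector of length num_classes with a 1 at the
# label's index, as needed by many "non-canned" estimator models.
def one_hot(label, num_classes):
    vec = [0] * num_classes
    vec[label] = 1
    return vec

print(one_hot(2, num_classes=4))  # [0, 0, 1, 0]
```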


get_tf_dataset_horovod()

lucd_uds.get_tf_dataset_horovod(input_name_list, input_type_dict, num_features, target_type, batch_size, data_type, feature_name_list, label_name_list, process_data, num_classes=1)

Returns a TensorFlow Dataset object for training using Horovod framework.

Enables use of pre-defined Lucd virtual dataset for training, testing, or evaluating TensorFlow models when Horovod is used for distributed training. Used for tabular data.

Parameters

  • input_name_list (list) – List of strings representing input names for the Dataset.

  • input_type_dict (dict) – Dictionary mapping feature names to their types.

  • num_features (int) – Number of features (int) in training data.

  • target_type (tf.dtype) – TensorFlow type of the target variable (e.g., int32 for classification, float64 for regression).

  • batch_size (int) – Batch size (int) for the TensorFlow dataset.

  • data_type (str) – One of “TRAINING”, “EVALUATION”, or “TESTING”.

  • feature_name_list (list) – Names to use for virtual dataset feature columns.

  • label_name_list (list) – Names to use for virtual dataset label columns.

  • process_data (func) – Function for processing datasets.

  • num_classes (int) – Int defining the number of classes represented by the labels of the dataset to be returned. This is for datasets to be used only for “non-canned” TensorFlow estimator models. Internally, this enables one-hot encoding of categorical label data.

Returns

TensorFlow Dataset object (tf.data.Dataset).


get_tf_dataset_image()

lucd_uds.get_tf_dataset_image(input_name_type_dict, training_data, num_features, target_type, num_classes=1)

Returns a TensorFlow Dataset object containing images.

Enables use of pre-defined Lucd virtual dataset containing images for training, testing, or evaluating TensorFlow models. This function must be used when the feature column utilized contains numpy ndarrays representing images.

Parameters

  • input_name_type_dict (dict) – Dictionary specifying the types of the different inputs.

  • training_data (dask.dataframe) – Dask dataframe containing training data; should contain numpy ndarrays.

  • num_features (tuple) – Tuple containing shape of features (int) in training data.

  • target_type (tf.dtype) – TensorFlow type of the target variable (e.g., int32 for classification, float64 for regression).

  • num_classes (int) – Int defining the number of classes represented by the labels of the dataset to be returned. This is for datasets to be used only for “non-canned” TensorFlow estimator models. Internally, this enables one-hot encoding of categorical label data.

Returns

TensorFlow Dataset object (tf.data.Dataset).


get_tf_dataset_text()

lucd_uds.get_tf_dataset_text(input_name, training_data, word_index_mapping, pad_value, max_document_length, target_type, num_classes=1)

Returns a TensorFlow Dataset object for training purposes.

Enables use of pre-defined Lucd virtual dataset for training, testing, or evaluating TensorFlow models. Used for free-text data.

Parameters

  • input_name (str) – String representing input name for the Dataset.

  • training_data (dask.dataframe) – Dask dataframe containing training data.

  • word_index_mapping (dict) – Dict mapping string words to integer indices.

  • pad_value (int) – Int defining index value to use for padding (post-padding) document inputs.

  • max_document_length (int) – Int defining maximum number of tokens to which a document (input) will be limited.

  • target_type (tf.dtype) – TensorFlow type of the target variable (e.g., int32 for classification, float64 for regression).

  • num_classes (int) – Int defining the number of classes represented by the labels of the dataset to be returned. This is for datasets to be used only for “non-canned” TensorFlow estimator models. Internally, this enables one-hot encoding of categorical label data.

Returns

TensorFlow Dataset object (tf.data.Dataset).
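
The post-padding/truncation applied to each encoded document can be sketched in plain Python. The parameter names mirror those above, but this is an illustrative assumption about the behavior, not the library's implementation.

```python
# Hypothetical sketch of document post-padding: truncate a list of token
# indices to max_document_length, then pad the tail with pad_value so
# every document has the same length.
def pad_document(token_ids, pad_value, max_document_length):
    trimmed = token_ids[:max_document_length]
    return trimmed + [pad_value] * (max_document_length - len(trimmed))

print(pad_document([3, 1, 4], pad_value=0, max_document_length=5))        # [3, 1, 4, 0, 0]
print(pad_document([3, 1, 4, 1, 5, 9], pad_value=0, max_document_length=5))  # [3, 1, 4, 1, 5]
```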


store_checkpoint()

lucd_uds.store_checkpoint(model_id, vds_id, user_id, log_dir)

Stores all necessary content for check-pointing a TensorFlow model.

Parameters

  • model_id (int) – Model ID.

  • vds_id (str) – VDS ID.

  • user_id (str) – User ID.

  • log_dir (str) – Path to which the model was exported after training.

Returns

None.


train_eval_test_split_dataframe()

lucd_uds.train_eval_test_split_dataframe(virtual_dataset_id, evaluation_size, test_size)

Returns Dask dataframes based on a Lucd virtual dataset.

Randomly splits a pre-defined Lucd virtual dataset into training, evaluation, and test datasets.

Parameters

  • virtual_dataset_id (str) – ID of Lucd virtual dataset to be represented by this PyTorch Dataset.

  • evaluation_size (float) – Float (0.0, 1.0) representing the proportion of the virtual dataset to include in the evaluation split.

  • test_size (float) – Float [0.0, 1.0) representing the proportion of the virtual dataset to include in the test split.

Returns

Dask dataframe for training (dask.dataframe), Dask dataframe for evaluation (dask.dataframe), Dask dataframe for testing (dask.dataframe).

Raises

Exception – Error if test_size + evaluation_size is greater than or equal to 1.0.
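
The split-and-validate logic can be sketched with plain lists standing in for the Dask dataframe rows. The function body and name here are illustrative assumptions; only the size validation and the three-way proportional split follow the documentation above.

```python
import random

# Hypothetical sketch of a train/eval/test split: reject proportions that
# leave no room for training, shuffle, then partition by the requested
# fractions.
def train_eval_test_split(rows, evaluation_size, test_size, seed=0):
    if evaluation_size + test_size >= 1.0:
        raise Exception("test_size + evaluation_size must be < 1.0")
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_eval = int(len(shuffled) * evaluation_size)
    n_test = int(len(shuffled) * test_size)
    return (shuffled[n_eval + n_test:],        # training
            shuffled[:n_eval],                 # evaluation
            shuffled[n_eval:n_eval + n_test])  # test

train, eval_, test = train_eval_test_split(list(range(10)), 0.2, 0.1)
print(len(train), len(eval_), len(test))  # 7 2 1
```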


train_eval_test_split_dataframe_2()

lucd_uds.train_eval_test_split_dataframe_2(virtual_dataset_id, evaluation_size, test_size)

Returns Dask dataframes based on a Lucd virtual dataset.

Randomly splits a pre-defined Lucd virtual dataset into training, evaluation, and test datasets. Also forms separate dataframes for features and labels, as defined by the training run parameters specified in the Lucd GUI.

Parameters

  • virtual_dataset_id (str) – ID of Lucd virtual dataset to be represented by this PyTorch Dataset.

  • evaluation_size (float) – Float (0.0, 1.0) representing the proportion of the virtual dataset to include in the evaluation split.

  • test_size (float) – Float [0.0, 1.0) representing the proportion of the virtual dataset to include in the test split.

Returns

Dask dataframe containing feature column for training (dask.dataframe), Dask dataframe containing label columns for training (dask.dataframe), Dask dataframe containing feature column for evaluation (dask.dataframe), Dask dataframe containing label columns for evaluation (dask.dataframe), Dask dataframe containing feature column for testing (dask.dataframe), Dask dataframe containing label columns for testing (dask.dataframe).

Raises

Exception – Error if test_size + evaluation_size is greater than or equal to 1.0.


train_eval_test_split_pytorch()

lucd_uds.train_eval_test_split_pytorch(virtual_dataset_id, evaluation_size, test_size, feature_name, _type, process_data, text_data=False, word_index_mapping=None, horovod_flag=False, transform=None)

Returns training and evaluation PyTorch Datasets.

Randomly splits a pre-defined Lucd virtual dataset into training, evaluation, and test datasets. If horovod_flag is True, the datasets are returned as LucdPyTorchVirtualDatasetHorovod objects; otherwise, LucdPyTorchVirtualDataset objects are returned.

Parameters

  • virtual_dataset_id (str) – ID of Lucd virtual dataset to be represented by this PyTorch Dataset.

  • evaluation_size (float) – Float (0.0, 1.0) representing the proportion of the virtual dataset to include in the evaluation split.

  • test_size (float) – Float [0.0, 1.0) representing the proportion of the virtual dataset to include in the test split.

  • feature_name (str) – String identifying features for PyTorch Dataset.

  • _type (str) – ‘classification’ or ‘regression’.

  • process_data (func) – Function for transforming the data.

  • text_data (bool) – True if the data represented by the virtual_dataset_id is for text classification, False if not.

  • word_index_mapping (dict) – Dict mapping string words to integer indices.

  • horovod_flag (bool) – Set True if returned datasets are to be used for Horovod-enabled training; Default False.

  • transform – PyTorch Transform; Default None.

Returns

PyTorch Dataset object for training (torch.Dataset),
PyTorch Dataset object for evaluation (torch.Dataset),
PyTorch Dataset object for testing (torch.Dataset).

Raises

Exception – Error if test_size + evaluation_size is greater than or equal to 1.0.


train_eval_test_split_tensorflow()

lucd_uds.train_eval_test_split_tensorflow(virtual_dataset_id, evaluation_size, test_size, process_data)

Returns training, evaluation, and test Dask dataframes to be used for creating TensorFlow Datasets.

Randomly splits a pre-defined Lucd virtual dataset into training, evaluation, and test datasets. If evaluation_size and test_size are both 0 (i.e., the training proportion is 1.0), this function returns a single Dataset based on the entire virtual dataset.

Parameters

  • virtual_dataset_id (str) – Virtual dataset ID.

  • evaluation_size (float) – Float [0.0, 1.0).

  • test_size (float) – Float [0.0, 1.0).

  • process_data (func) – function for processing datasets.

Returns

Dask dataframe for training (dask.dataframe), Dask dataframe for evaluation (dask.dataframe), Dask dataframe for testing (None if test_size is 0) (dask.dataframe), Dask dataframe containing only labels for testing (e.g., for computing a confusion matrix; None if test_size is 0) (dask.dataframe), tuple defining the number of features in the original virtual dataset.


zip_checkpoint()

lucd_uds.zip_checkpoint(model_id, graph_version, log_dir)

Packages all necessary content for check-pointing a TensorFlow model.

Parameters

  • model_id (int) – Model ID.

  • graph_version (int) – Graph version.

  • log_dir (str) – Path to which the model was exported after training.

Returns

Name of zipped file (str).
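
The packaging step can be sketched with the standard library. This is a hypothetical stand-in (the function name, archive naming scheme, and archive layout are assumptions), showing only the general pattern of walking a checkpoint directory and zipping its contents.

```python
import os
import zipfile

# Hypothetical sketch: zip everything under log_dir into an archive named
# after the model ID and graph version, storing paths relative to log_dir,
# and return the archive's filename.
def zip_directory(model_id, graph_version, log_dir):
    zip_name = f"{model_id}_{graph_version}.zip"
    with zipfile.ZipFile(zip_name, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(log_dir):
            for name in files:
                path = os.path.join(root, name)
                zf.write(path, os.path.relpath(path, log_dir))
    return zip_name
```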


zip_keras_model()

lucd_uds.zip_keras_model(_keras_model, model_id, graph_version, log_dir)

Exports and packages a Keras sequential or functional model for serving.

Parameters

  • _keras_model – Keras model object.

  • model_id (int) – Model ID.

  • graph_version (int) – Graph version.

  • log_dir (str) – Path to which the model was exported after training.

Returns

Filename of the zip archive (str).


zip_pt_model()

lucd_uds.zip_pt_model(model, model_id, model_path, graph_version)

Zips a PyTorch model.

Parameters

  • model – PyTorch model object.

  • model_id (int) – Model ID.

  • model_path (str) – Path to which the model was exported after training.

  • graph_version (int) – Graph version.

Returns

Name of zipped file (str).

zip_tf_model()

lucd_uds.zip_tf_model(_estimator, serving_input_receiver_function, model_id, graph_version, log_dir)

Exports and packages all necessary content for serving a TensorFlow model.

Parameters

  • _estimator – TensorFlow estimator object used for obtaining the model.

  • serving_input_receiver_function – TensorFlow serving input function.

  • model_id (int) – Model ID.

  • graph_version (int) – Graph version.

  • log_dir (str) – Path to which the model was exported after training.

Returns

Name of zipped file (str).