Lucd Python Client | Lucd Unified Dataspace API | 1.0.1

Unified Dataspace API

eda.lib.lucd_uds

LucdPyTorchVirtualDataset()

class eda.lib.lucd_uds.LucdPyTorchVirtualDataset(_type, virtual_dataset_id, feature_list, label, feature_name, process_data, transform=None, dataframe=None)

PyTorch Dataset subclass representing Lucd virtual dataset.
Enables use of pre-defined Lucd virtual dataset for training and evaluating PyTorch models.


LucdPyTorchVirtualDatasetText()

class eda.lib.lucd_uds.LucdPyTorchVirtualDatasetText(virtual_dataset_id, feature_list, label, process_data, word_index_mapping, dask_dataframe=None)

PyTorch Dataset subclass representing Lucd virtual dataset.
Enables use of pre-defined Lucd virtual dataset for training and evaluating PyTorch models.


encode_text()

eda.lib.lucd_uds.encode_text(text, word_index_mapping)

Encodes the tokens of a text string into their index representations for input into a TensorFlow model.

Parameters

  • text (str) – String representing raw text to be encoded.
  • word_index_mapping (dict) – Dict mapping word/token to an index.

Returns

List of encoded tokens (list).
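Presumably this reduces to a dictionary lookup per token. A minimal pure-Python sketch (the whitespace tokenization and unknown-token fallback here are assumptions, not the library's actual behavior):

```python
def encode_text_sketch(text, word_index_mapping, unknown_index=0):
    """Map each whitespace-separated token to its index; unseen tokens fall back to unknown_index."""
    return [word_index_mapping.get(token, unknown_index) for token in text.split()]

mapping = {"the": 1, "cat": 2, "sat": 3}
encode_text_sketch("the cat sat", mapping)  # -> [1, 2, 3]
encode_text_sketch("the dog sat", mapping)  # -> [1, 0, 3] ("dog" is unknown)
```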


generate_pytorch_text_batch()

eda.lib.lucd_uds.generate_pytorch_text_batch(batch)

Creates batches of text data formatted for the PyTorch nn.EmbeddingBag layer.

Parameters

  • batch (list) – List of text/features and labels.

Returns

Tensor of text data (torch.Tensor), Tensor of offsets (torch.Tensor), Tensor of label data (torch.Tensor).
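nn.EmbeddingBag consumes a single flat tensor of token indices plus per-sample start offsets, rather than a padded 2-D batch. A pure-Python sketch of that layout (the function name and batch shape are illustrative assumptions; the real function returns torch.Tensor objects):

```python
def flatten_for_embedding_bag(batch):
    """batch: list of (token_index_list, label) pairs.
    Returns (flat token indices, start offset of each sample, labels) --
    the layout torch.nn.EmbeddingBag expects."""
    text, offsets, labels = [], [], []
    position = 0
    for token_indices, label in batch:
        offsets.append(position)       # where this sample starts in the flat list
        text.extend(token_indices)
        position += len(token_indices)
        labels.append(label)
    return text, offsets, labels

batch = [([4, 7, 2], 0), ([9, 1], 1)]
flatten_for_embedding_bag(batch)  # -> ([4, 7, 2, 9, 1], [0, 3], [0, 1])
```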


get_asset()

eda.lib.lucd_uds.get_asset(asset_id: str, limit: int = 100)

Retrieves a pre-defined set of word embeddings and other vocabulary data.

Parameters

  • asset_id (str) – ID of the embedding to retrieve from the Dask scheduler.
  • limit (int) – Number of index entries to read from the database.

Returns

Dict mapping vocab word to embedding (dict), 2-D NumPy array containing all embeddings (np.array), embedding dimension (int), Dict mapping vocab word to index (dict), index of the padding token ('') (int).
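A toy illustration of the five returned structures, built here from a small hand-made embedding dict (this mirrors only the shape of the return value; it is not the function's internals):

```python
import numpy as np

# Toy vocabulary with 3-dimensional embeddings; '' is the padding token.
raw = {"": [0.0, 0.0, 0.0], "cat": [0.1, 0.2, 0.3], "dog": [0.4, 0.5, 0.6]}

word_to_embedding = {word: np.array(vec) for word, vec in raw.items()}
embedding_matrix = np.array(list(raw.values()))      # 2-D array, one row per word
embedding_dim = embedding_matrix.shape[1]            # -> 3
word_to_index = {word: i for i, word in enumerate(raw)}
pad_index = word_to_index[""]                        # -> 0
```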


get_dataframe()

eda.lib.lucd_uds.get_dataframe(virtual_dataset_id: str, limit: int = 100)

Returns a Dask dataframe based on a Lucd virtual dataset.

Parameters

  • virtual_dataset_id (str) – Virtual dataset ID.
  • limit (int) – Number of index entries to read from the database.

Returns

Dask dataframe (dask.dataframe)


get_tf_dataset()

eda.lib.lucd_uds.get_tf_dataset(input_name_type_dict, training_data, num_features, target_type, num_classes=1)

Returns a TensorFlow Dataset object.
Enables use of pre-defined Lucd virtual dataset for training, testing, or evaluating TensorFlow models. Used for tabular data.

Parameters

  • input_name_type_dict (dict) – Dictionary mapping feature names to their types.
  • training_data (dask.dataframe) – Dask dataframe containing training data.
  • num_features (int) – Number of features (int) in training data.
  • target_type (tf.dtype) – TensorFlow type of the target variable (e.g., int32 for classification, float64 for regression).
  • num_classes (int) – Int defining the number of classes represented by the labels of the dataset to be returned. This is for datasets to be used only for “non-canned” TensorFlow estimator models. Internally, this enables one-hot encoding of categorical label data.

Returns

TensorFlow Dataset object (tf.data.Dataset).
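The one-hot encoding that num_classes enables has the following effect on integer labels (a NumPy sketch of the transformation, not the function's actual TensorFlow code):

```python
import numpy as np

def one_hot(labels, num_classes):
    """One-hot encode integer class labels, as is done internally for non-canned estimators."""
    encoded = np.zeros((len(labels), num_classes), dtype=np.int32)
    encoded[np.arange(len(labels)), labels] = 1
    return encoded

one_hot([0, 2, 1], 3)
# -> [[1, 0, 0],
#     [0, 0, 1],
#     [0, 1, 0]]
```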


get_tf_dataset_image()

eda.lib.lucd_uds.get_tf_dataset_image(input_name_type_dict: dict, training_data: dask.dataframe, num_features: tuple, target_type, num_classes: int = 1)

Returns a TensorFlow Dataset object containing images.

Enables use of pre-defined Lucd virtual dataset containing images for training, testing, or evaluating TensorFlow models. This function must be used when the feature column utilized contains numpy ndarrays representing images.

Parameters

  • input_name_type_dict (dict) – Specifies the types of the different inputs.
  • training_data (dask.dataframe) – Dask dataframe containing training data; should contain numpy ndarrays.
  • num_features (tuple) – Tuple containing shape of features (int) in training data.
  • target_type (tf.dtype) – TensorFlow type of the target variable (e.g., int32 for classification, float64 for regression).
  • num_classes (int) – Int defining the number of classes represented by the labels of the dataset to be returned. This is for datasets to be used only for “non-canned” TensorFlow estimator models. Internally, this enables one-hot encoding of categorical label data.

Returns

TensorFlow Dataset object (tf.data.Dataset).


get_tf_dataset_text()

eda.lib.lucd_uds.get_tf_dataset_text(input_name, training_data: dask.dataframe, word_index_mapping, pad_value, max_document_length, target_type, num_classes=1)

Returns a TensorFlow Dataset object for training purposes.

Enables use of pre-defined Lucd virtual dataset for training, testing, or evaluating TensorFlow models. Used for free-text data.

Parameters

  • input_name (str) – String representing input name for the Dataset.
  • training_data (dask.dataframe) – Dask dataframe containing training data.
  • word_index_mapping (dict) – Dict mapping string words to integer indices.
  • pad_value (int) – Int defining index value to use for padding (post-padding) document inputs.
  • max_document_length (int) – Int defining maximum number of tokens to which a document (input) will be limited.
  • target_type (tf.dtype) – TensorFlow type of the target variable (e.g., int32 for classification, float64 for regression).
  • num_classes (int) – Int defining the number of classes represented by the labels of the dataset to be returned. This is for datasets to be used only for “non-canned” TensorFlow estimator models. Internally, this enables one-hot encoding of categorical label data.

Returns

TensorFlow Dataset object (tf.data.Dataset).
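Together, pad_value and max_document_length imply truncation and post-padding of each encoded document. A sketch of that behavior (illustrative only, not the library's implementation):

```python
def pad_or_truncate(indices, max_document_length, pad_value):
    """Truncate to max_document_length, or post-pad (pad at the end) with pad_value."""
    truncated = indices[:max_document_length]
    return truncated + [pad_value] * (max_document_length - len(truncated))

pad_or_truncate([5, 9, 3], 5, 0)            # -> [5, 9, 3, 0, 0]
pad_or_truncate([5, 9, 3, 7, 1, 8], 5, 0)   # -> [5, 9, 3, 7, 1]
```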


train_eval_test_split_dataframe()

eda.lib.lucd_uds.train_eval_test_split_dataframe(virtual_dataset_id, evaluation_size, test_size)

Returns Dask dataframes based on a Lucd virtual dataset.

Randomly splits a pre-defined Lucd virtual dataset into training, evaluation, and test datasets.

Parameters

  • virtual_dataset_id (str) – ID of Lucd virtual dataset to be represented by this PyTorch Dataset.
  • evaluation_size (float) – Float [0.0, 1.0) representing the proportion of the virtual dataset to include in the evaluation split.
  • test_size (float) – Float [0.0, 1.0) representing the proportion of the virtual dataset to include in the test split.

Returns

Dask dataframe for training (dask.dataframe), Dask dataframe for evaluation (dask.dataframe), Dask dataframe for testing (dask.dataframe).

Raises

Exception – Error if test_size + evaluation_size is greater than or equal to 1.0.
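The raised exception implies a simple constraint on the split fractions; a sketch of that check (the validation function itself is an assumption about the implementation):

```python
def validate_split_sizes(evaluation_size, test_size):
    """Raise if the evaluation and test fractions leave no data for training."""
    if evaluation_size + test_size >= 1.0:
        raise Exception("evaluation_size + test_size must be less than 1.0")
    return 1.0 - evaluation_size - test_size  # remaining training fraction

validate_split_sizes(0.2, 0.1)  # roughly 0.7 of the data remains for training
```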


train_eval_test_split_dataframe_2()

eda.lib.lucd_uds.train_eval_test_split_dataframe_2(virtual_dataset_id, evaluation_size, test_size)

Returns Dask dataframes based on a Lucd virtual dataset.

Randomly splits a pre-defined Lucd virtual dataset into training, evaluation, and test datasets. Also forms separate dataframes for features and labels, as defined by training run parameters specified in the Lucd GUI.

Parameters

  • virtual_dataset_id (str) – ID of Lucd virtual dataset to be represented by this PyTorch Dataset.
  • evaluation_size (float) – Float [0.0, 1.0) representing the proportion of the virtual dataset to include in the evaluation split.
  • test_size (float) – Float [0.0, 1.0) representing the proportion of the virtual dataset to include in the test split.

Returns

Dask dataframe containing feature column for training (dask.dataframe), Dask dataframe containing label columns for training (dask.dataframe), Dask dataframe containing feature column for evaluation (dask.dataframe), Dask dataframe containing label columns for evaluation (dask.dataframe), Dask dataframe containing feature column for testing (dask.dataframe), Dask dataframe containing label columns for testing (dask.dataframe).

Raises

Exception – Error if test_size + evaluation_size is greater than or equal to 1.0.


train_eval_test_split_pytorch()

eda.lib.lucd_uds.train_eval_test_split_pytorch(virtual_dataset_id, evaluation_size, test_size, feature_name, _type, process_data, text_data=False, word_index_mapping=None, transform=None)

Returns training and evaluation PyTorch Datasets.

Randomly splits a pre-defined Lucd virtual dataset into training, evaluation, and test datasets.

Parameters

  • virtual_dataset_id (str) – ID of Lucd virtual dataset to be represented by this PyTorch Dataset.
  • evaluation_size (float) – Float [0.0, 1.0) representing the proportion of the virtual dataset to include in the evaluation split.
  • test_size (float) – Float [0.0, 1.0) representing the proportion of the virtual dataset to include in the test split.
  • feature_name (str) – String identifying features for PyTorch Dataset.
  • _type (str) – ‘classification’ or ‘regression’.
  • process_data (func) – Function for transforming the data.
  • text_data (bool) – True if the data represented by the virtual_dataset_id is for text classification, False if not.
  • word_index_mapping (dict) – Dict mapping string words to integer indices.
  • transform – PyTorch Transform; defaults to None.

Returns

PyTorch Dataset object for training (torch.Dataset), PyTorch Dataset object for evaluation (torch.Dataset), PyTorch Dataset object for testing (torch.Dataset).

Raises

Exception – Error if test_size + evaluation_size is greater than or equal to 1.0.


train_eval_test_split_tensorflow()

eda.lib.lucd_uds.train_eval_test_split_tensorflow(virtual_dataset_id, evaluation_size, test_size, process_data)

Returns training, evaluation, and test Dask dataframes to be used for creating TensorFlow Datasets.

Randomly splits a pre-defined Lucd virtual dataset into training, evaluation, and test datasets. If the training split is the entire virtual dataset (i.e., evaluation_size and test_size are both 0.0), this function returns a single dataframe based on the entire virtual dataset.

Parameters

  • virtual_dataset_id (str) – Virtual dataset ID.
  • evaluation_size (float) – Float [0.0, 1.0) representing the proportion of the virtual dataset to include in the evaluation split.
  • test_size (float) – Float [0.0, 1.0) representing the proportion of the virtual dataset to include in the test split.
  • process_data (func) – Function for processing datasets.

Returns

Dask dataframe for training (dask.dataframe), Dask dataframe for evaluation (dask.dataframe), Dask dataframe for testing (“None” if test_size is 0) (dask.dataframe), Dask dataframe containing only labels for testing (e.g., for computing a confusion matrix; “None” if test_size is 0), number of features defined in the original virtual dataset (int).
