This documentation describes the Lucd Modeling Framework and its use for training and managing Python-based TensorFlow, PyTorch, and Dask XGBoost machine learning models. Specifically, the framework provides libraries for the following tasks:
- accessing Lucd virtual datasets in Python for model training and evaluation (or other analytic tasks);
- storing structures representing trained models and training checkpoints.
The framework supports the following approach to training models:
- the user supplies a Python script with custom logic for all training, evaluation, and testing tasks;
- the user uses a Lucd Python library for (1) creating datasets based on pre-defined Lucd virtual datasets and (2) storing trained models and checkpoints back to Lucd.
Details on this approach are covered throughout the remainder of this documentation.
Notable Supported Capabilities
The Lucd modeling framework consists of an ever-evolving set of capabilities. The following is a list of notable modeling capabilities (among others) supported in the current release.
TensorFlow Estimator-Based Modeling
TensorFlow generally supports AI modeling using either low-level APIs or easier-to-use high-level estimator APIs. Currently, the Lucd modeling framework supports estimator-based model development. Note that Keras modeling components may be used as well, as long as they are converted to estimators prior to training (see https://www.tensorflow.org/tutorials/estimator/keras_model_to_estimator).
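The Keras-to-estimator conversion mentioned above can be sketched as follows. This is a minimal illustration, assuming the TensorFlow v2.1 release listed in this documentation; the function name `build_estimator`, the layer sizes, and the input shape are hypothetical and not part of the Lucd framework. How the resulting estimator is handed to Lucd is not shown here.

```python
def build_estimator(model_dir=None):
    """Build a small compiled Keras model and wrap it as an estimator."""
    # Imported lazily so this module loads even where TensorFlow is absent.
    import tensorflow as tf

    # Hypothetical model: 4 numeric inputs, 3 output classes.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
        tf.keras.layers.Dense(3, activation="softmax"),
    ])

    # The Keras model must be compiled before it can be converted.
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )

    # Convert the compiled Keras model into a tf.estimator.Estimator.
    return tf.keras.estimator.model_to_estimator(
        keras_model=model, model_dir=model_dir
    )
```

The returned object can then be used like any other estimator, for example by calling its `train` and `evaluate` methods with an input function.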
Classification and Regression Modeling
For TensorFlow modeling, all dataset feature column types are supported, enabling support for a broad range of numeric and categorical features. Regarding support for categorical features, in the current release, the domain of such a feature must be known at training time. For example, if you choose to use a feature “car_make” as a categorical feature, you must know all the possible “makes” when you write your model. This requirement will be removed in a future release. Also, the conversion of non-numerical data to numerical data (e.g., for encoding label/target values) based on a scan of the “entire” dataset is not supported in the current release. However, replacement operations are supported in Lucd’s exploratory data analysis (EDA) framework.
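To make the "domain must be known at training time" requirement concrete, here is a minimal pure-Python sketch of encoding a categorical feature against a fixed vocabulary. It does not use the Lucd or TensorFlow APIs; the list `CAR_MAKES` and the helper `one_hot` are hypothetical names chosen for illustration.

```python
# Hypothetical domain for the "car_make" feature. Every possible value is
# listed up front because it cannot be discovered from the data at training time.
CAR_MAKES = ["ford", "honda", "toyota", "volkswagen"]


def one_hot(value, vocabulary):
    """Return a one-hot list for value, or raise if it is outside the domain."""
    if value not in vocabulary:
        raise ValueError(f"{value!r} is not in the declared domain")
    return [1 if item == value else 0 for item in vocabulary]


encoded = one_hot("honda", CAR_MAKES)
```

In TensorFlow, the same idea is expressed by `tf.feature_column.categorical_column_with_vocabulary_list`, which takes the full vocabulary as an argument when the model is written.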
For TensorFlow modeling, label types are assumed to be TensorFlow int32.
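Because labels are assumed to be int32 and a dataset-wide conversion of non-numerical values is not supported, one option is to map string labels to integers with a hand-written mapping. The following is a minimal pure-Python sketch under that assumption; `LABEL_TO_ID` and `encode_labels` are hypothetical names, not Lucd library functions.

```python
# Hypothetical label mapping, written by hand rather than derived from a
# scan of the dataset.
LABEL_TO_ID = {"negative": 0, "neutral": 1, "positive": 2}

# Bounds of a signed 32-bit integer.
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def encode_labels(labels, mapping=LABEL_TO_ID):
    """Map string labels to integer ids that fit in an int32."""
    ids = []
    for label in labels:
        label_id = mapping[label]  # raises KeyError for an unknown label
        if not INT32_MIN <= label_id <= INT32_MAX:
            raise ValueError(f"label id {label_id} does not fit in int32")
        ids.append(label_id)
    return ids
```

For example, `encode_labels(["positive", "negative"])` yields `[2, 0]`, which can then be cast to `tf.int32` when the dataset is built.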
For TensorFlow and PyTorch modeling, use of embedding data (e.g., word2vec for representing free text) as model input is supported.
For PyTorch, the TorchText library is supported, but n-grams are not supported in the current release.
Note: Currently, when using text input, only the text/embedding input is allowed as a feature, enabling conventional text classification. Future releases will enable the use of multiple feature inputs alongside text data.
For TensorFlow and PyTorch modeling, use of image data (i.e., pixel values) as model input is supported.
Distributed XGBoost using Dask
Distributed training of XGBoost models using the Dask parallel data analytics framework is supported.
Support for TensorFlow and PyTorch distributed training is under development.
The Lucd modeling framework supports the following languages and machine learning-related libraries:
- Python v3.6.5
- TensorFlow (for Python) v2.1
- PyTorch v1.0
Since the Dask and Dask-ML libraries are included in the modeling framework, Dask-based distributed Scikit-learn modules should also work. However, such models have not been tested in the current release of the framework, so distributed Scikit-learn operation may be unpredictable.
Creating Models for Lucd
The following documentation contains further details and examples for developing machine learning models for Lucd.