
Data Science Tutorial: Part 2 of 3

Background on Lucd

The Lucd Enterprise AI Data Science Platform is a highly secure, scalable, open, and flexible platform for persisting and fusing large and numerous datasets and for training production AI models against those datasets. Lucd is an end-to-end platform that can be deployed in public cloud environments or on premise on bare-metal hardware; alternatively, the Lucd multi-tenant PaaS can be accessed directly. The platform consists of:

  • A scalable open data ingest capability
  • A petabyte scale unified data space data repository
  • 3-D Visualization and Exploration
  • An Exploratory Data Analysis Rest Service
  • A Kubernetes environment to train PyTorch and TensorFlow models
  • NLP Word Embedding and Explainable AI Assets
  • Model results visualization and exporting to internal or external serving capability

Introduction and Prerequisites

This tutorial demonstrates the steps required to train an AI model on data using the Lucd Data Science Platform. It is a toy example, built on the IRIS dataset, designed to show the basic steps of training a model. First a Virtual Data Set (VDS) is created, then a custom operation adds a categorical feature to the existing continuous features, and finally a custom PyTorch model is developed and trained in the platform. Both the Lucd 3D UI and the Lucd Python Client are used along the way. The tutorial is broken up into three parts:

  1. Part 1: Creating a Virtual Data Set (VDS) https://github.com/jmstadt/Tutorials/blob/master/Lucd_Part_1_of_3_Data_Science_Tutorial.ipynb
  2. Part 2: Performing a Custom Operation during Exploratory Data Analysis
  3. Part 3: Developing a Custom AI Model and Training in the Lucd Platform https://github.com/jmstadt/Tutorials/blob/master/Lucd_Part_3_of_3_Data_Science_Tutorial-TF_1.ipynb

Prerequisites are:

1. Complete Part 1 of 3

At the end of Part 1 https://github.com/jmstadt/Tutorials/blob/master/Lucd_Part_1_of_3_Data_Science_Tutorial.ipynb, we have a Virtual Data Set (VDS) that we can query with the Lucd Python Client. If you are unfamiliar with using the Lucd Python Client to access a VDS, refer to the following tutorial: https://github.com/jmstadt/Tutorials/blob/master/Lucd_Pulling_a_Virtual_Data_Set_from_the_Lucd_Unified_Data_Space.ipynb

2. Create, Test, and Upload the Custom Operation in the Lucd Python Client

Detailed Instructions for creating a custom operation can be found here: https://community.lucd.ai/hc/en-us/articles/360042224592

In [1]:
import lucd
from eda.lib import lucd_uds

Log in to the Lucd Enterprise AI Platform with the same credentials you used in Part 1:

In [2]:
client = lucd.LucdClient(domain="<your domain>",
                         username="<your username>",
                         password="<your password>",
                         )

Refer to the following tutorial to get the VDS ID of the VDS you created in Part 1: https://github.com/jmstadt/Tutorials/blob/master/Lucd_Pulling_a_Virtual_Data_Set_from_the_Lucd_Unified_Data_Space.ipynb

Alternatively, from the Assets tab, you can double-click on the VDS you created in Part 1 and its ID will be copied to your clipboard for pasting.

Read the VDS into a local Dask dataframe. Note: we limit the size because this data is only used to verify that the custom operation works; the custom operation will be applied to the full VDS from the UI below.

In [4]:
df = lucd_uds.get_dataframe("demo_9223370449703500951", limit=100).reset_index(drop=True)
2020-04-19 16:34:04,617 | root | INFO | dask.py:28 | Creating Dask LocalCluster: http://localhost:60000/status
In [5]:
df.head()
Out[5]:
   flower.petal_length  flower.petal_width  flower.sepal_length  flower.sepal_width  flower.species  flower_mean            std.display  std.model  std.source  std.timestamp
0                  1.4                 0.2                  5.1                 3.5       I. setosa        False     species: I. setosa     flower        IRIS  1574175988109
1                  1.5                 0.1                  4.9                 3.1       I. setosa        False     species: I. setosa     flower        IRIS  1574175988110
2                  5.6                 2.2                  6.4                 2.8    I. virginica         True  species: I. virginica     flower        IRIS  1574175988112
3                  5.1                 2.0                  6.5                 3.2    I. virginica        False  species: I. virginica     flower        IRIS  1574175988112
4                  6.6                 2.1                  7.6                 3.0    I. virginica         True  species: I. virginica     flower        IRIS  1574175988112

Create a custom operation

Per this tutorial, we want to create a new boolean column indicating whether each flower's petal length is greater than the mean petal length.

You can experiment in your notebook until you get what you want. The resulting custom operation will be a function such as the following:

In [7]:
def create_greater_than_mean_column(df):
    # Mean petal length of the rows this function sees (when applied via
    # Map_Partitions, this is computed per partition of the Dask dataframe)
    column_mean = df["flower.petal_length"].mean()
    # Boolean column: True where the petal length exceeds that mean
    new_col = df["flower.petal_length"] > column_mean
    df = df.assign(flower_mean=new_col)
    return df

You can verify locally that the custom operation works:

In [8]:
new_df = df.map_partitions(create_greater_than_mean_column)

As you can see by calling .head(), there is a new "flower_mean" column in your local dataframe.

In [9]:
new_df.head()
Out[9]:
   flower.petal_length  flower.petal_width  flower.sepal_length  flower.sepal_width  flower.species  flower_mean            std.display  std.model  std.source  std.timestamp
0                  1.4                 0.2                  5.1                 3.5       I. setosa        False     species: I. setosa     flower        IRIS  1574175988109
1                  1.5                 0.1                  4.9                 3.1       I. setosa        False     species: I. setosa     flower        IRIS  1574175988110
2                  5.6                 2.2                  6.4                 2.8    I. virginica         True  species: I. virginica     flower        IRIS  1574175988112
3                  5.1                 2.0                  6.5                 3.2    I. virginica         True  species: I. virginica     flower        IRIS  1574175988112
4                  6.6                 2.1                  7.6                 3.0    I. virginica         True  species: I. virginica     flower        IRIS  1574175988112
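
Besides eyeballing the output, a quick programmatic check confirms that the column is present and boolean (a minimal sketch):

In [ ]:
# Confirm the custom operation produced a boolean "flower_mean" column
assert "flower_mean" in new_df.columns
assert new_df["flower_mean"].dtype == bool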

Now upload this custom operation to the Lucd platform:

In [ ]:
from eda.int import custom_operation

data = {
        "operation_name": "create_greater_than_mean_column",
        "author_name": "<your name>",
        "author_email": "<your email>",
        "operation_description": "<your description>",
        "operation_purpose": "<your purpose>",
        "operation_features": ["flower.petal_length"],
        "operation_function": create_greater_than_mean_column
}

response_json, rv = custom_operation.create(data)
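
You can inspect the returned values to confirm the upload succeeded. A minimal check (the exact contents of the response are specific to the Lucd API, so treat this as a sketch):

In [ ]:
# Inspect the upload response; the fields returned depend on the Lucd API
print(rv)
print(response_json)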

3. Perform Your Custom Operation in the Lucd UI

Now, go back to the UI and the search you saved; click on the saved search and select New Op.

On the next screen, select Custom Operation; you should see the custom operation that you created above. Click on it and you can view the code. In the Apply Method selection, select "Map_Partitions" because this is an operation on the full dataframe; for row-wise operations you would select Apply (see the sketch below). Click the green check mark.
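
The distinction mirrors how a function is applied to a Dask dataframe outside the platform. Here is a minimal sketch in plain Dask, using the column name from the VDS above (the row-wise doubling function is just a hypothetical example):

In [ ]:
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"flower.petal_length": [1.4, 1.5, 5.6, 5.1, 6.6]})
ddf = dd.from_pandas(pdf, npartitions=2)

# Map_Partitions: the function receives a whole pandas dataframe
# (one partition at a time), so it can use column-level statistics
whole = ddf.map_partitions(create_greater_than_mean_column)

# Apply: the function receives one row at a time, suitable for
# row-wise logic that needs no cross-row context
rowwise = ddf.apply(lambda row: row["flower.petal_length"] * 2,
                    axis=1, meta=(None, "float64"))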

You will see the new custom operation added as a fork to the saved search

As in Part 1, you can save this new VDS, giving it a name.

As above, from the Assets tab you can copy the new VDS ID to your clipboard and pull the new VDS into your local notebook.

In [14]:
custom_op_df = lucd_uds.get_dataframe("demo_9223370449524874138", limit=100).reset_index(drop=True)

As you can see by calling .head() on the new dataframe, there is now a new "flower_mean" column produced by your custom operation.

In [15]:
custom_op_df.head()
Out[15]:
   flower.petal_length  flower.petal_width  flower.sepal_length  flower.sepal_width  flower.species  flower_mean
0                  4.1                 1.3                  5.7                 2.8   I. versicolor        False
1                  1.5                 0.1                  4.9                 3.1       I. setosa        False
2                  5.6                 1.4                  6.1                 2.6    I. virginica         True
3                  5.7                 2.3                  6.9                 3.2    I. virginica         True
4                  6.6                 2.1                  7.6                 3.0    I. virginica         True

This completes Part 2 of this tutorial. In Part 3, we will use the new dataframe we created to build and train a model that predicts flower species, using both the continuous columns from the existing dataframe and the categorical column we created with our custom operation.
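
As a rough preview, the feature grouping for Part 3 might look like the following (a sketch only; the column names come from the dataframe above, and the grouping itself is an assumption about how Part 3 will use them):

In [ ]:
# Hypothetical feature grouping for the Part 3 model
continuous_features = [
    "flower.petal_length",
    "flower.petal_width",
    "flower.sepal_length",
    "flower.sepal_width",
]
categorical_features = ["flower_mean"]  # added by our custom operation
label_column = "flower.species"         # prediction target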
