
TensorFlow Transform in ElasticDL Explore

Why TF-Transform in ElasticDL

Data preprocessing is an important step before model training in the entire ML pipeline. It must keep the transformation logic consistent between offline training and online serving.

Is the feature column API enough for this work? It can do part of the transform work, but it can't cover all feature engineering requirements, especially in the following two aspects:

Analyzer

Let's take scaling one column of dense data to [0, 1) as an example.

Feature Column

import tensorflow as tf

def _scale_age_to_0_1(input_tensor):
    # The min and max must be hard-coded as constants ahead of time.
    min_age = 1
    max_age = 100
    return (input_tensor - min_age) / (max_age - min_age)

tf.feature_column.numeric_column('age', normalizer_fn=_scale_age_to_0_1)

TensorFlow Transform

import tensorflow_transform as tft
outputs['age'] = tft.scale_to_0_1(inputs['age'])

In the feature column case, we have to hard-code the min and max values of the age column as constants. It's common to refit the model on the latest data, and these statistics change from day to day, so updating the constants in code for every daily job is impractical.
In the TF Transform case, we call just one API, tft.scale_to_0_1. TF Transform first analyzes the whole dataset to compute the min and max, and only then transforms the data using the computed statistics, as sketched below.
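For reference, this two-phase analyze/transform flow follows the pattern of the tensorflow_transform.beam simple example. A minimal sketch, assuming a toy in-memory dataset (the column values and temp dir are illustrative):

import tempfile
import tensorflow as tf
import tensorflow_transform as tft
import tensorflow_transform.beam as tft_beam
from tensorflow_transform.tf_metadata import dataset_metadata, schema_utils

raw_data = [{'age': 15.0}, {'age': 30.0}, {'age': 60.0}]
raw_metadata = dataset_metadata.DatasetMetadata(
    schema_utils.schema_from_feature_spec(
        {'age': tf.io.FixedLenFeature([], tf.float32)}))

def preprocess_fn(inputs):
    # tft.scale_to_0_1 inserts min/max analyzers into the graph; their
    # values are filled in by the analysis pass over the whole dataset.
    return {'age_scaled': tft.scale_to_0_1(inputs['age'])}

with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
    (transformed_data, _), _ = (
        (raw_data, raw_metadata)
        | tft_beam.AnalyzeAndTransformDataset(preprocess_fn))
print(transformed_data)  # each age scaled by the dataset-wide min/max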

Inter Columns Calculation

From the feature column API doc, except crossed_column, all the other feature columns execute the transform on a single column of data. We can't implement inter-column calculations like the following with the Feature Column API:

column_new = column_a * column_b
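With TF Transform, by contrast, inter-column math is just ordinary TF ops inside the preprocess_fn. A minimal sketch (the column names are illustrative):

def preprocess_fn(inputs):
    outputs = dict(inputs)
    # Element-wise product of two dense columns -- not expressible
    # with a single feature column.
    outputs['column_new'] = inputs['column_a'] * inputs['column_b']
    return outputs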

Challenges

Apache Beam and Runtime Engine

Walkthrough Issues

Integration with SQLFlow

Please check the typical SQL expression for model training below. SELECT * FROM iris.train means retrieving data from the data source; it's mapped to a SQL query in a database or an ODPS SQL statement on an ODPS table. COLUMN sepal_length, sepal_width, ... is mapped to the feature_column array.
We need to extend the syntax to fully express the TF-Transform logic. The key of the transform process is a user-defined preprocess_fn. We can use transform APIs and TF ops to define the processing logic; the result is a TF graph that can be serialized into a SavedModel. Since the preprocess_fn is flexible, we recommend defining it in a separate Python file and referring to it from the SQL expression (see the sketch after the SQL below).

SELECT *
FROM iris.train
TO TRAIN DNNClassifier
WITH model.n_classes = 3, model.hidden_units = [10, 20]
COLUMN sepal_length, sepal_width, petal_length, petal_width
LABEL class
INTO sqlflow_models.my_dnn_model;
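A preprocess_fn for the iris query above might live in a separate file like this. A sketch only: the file name and scaling choices are illustrative, and only tft APIs and TF ops are used so the logic traces to a TF graph that can be serialized into the SavedModel:

# preprocess.py (hypothetical file name)
import tensorflow_transform as tft

def preprocess_fn(inputs):
    outputs = {}
    # Analyzer-backed transforms: min/max are computed over the full dataset.
    for name in ('sepal_length', 'sepal_width', 'petal_length', 'petal_width'):
        outputs[name] = tft.scale_to_0_1(inputs[name])
    outputs['class'] = inputs['class']
    return outputs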

Open Questions

  1. In the official tutorial example, TF-Transform is integrated with Estimator. When exporting the model as a SavedModel, we need to construct the serving_input_fn from the tf_transform_output to define the inference signature (see the sketch below). For TF 2.0 we use Keras to define the model, and the inference signature is generated automatically in this case. A model with feature columns works fine this way, but it's not yet clear whether it also works well with TF-Transform + feature columns.
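For reference, the Estimator-era export builds the serving signature roughly as in the official TFT examples. A sketch; the transform output path, feature spec, and label key here are placeholders:

import tensorflow as tf
import tensorflow_transform as tft

tf_transform_output = tft.TFTransformOutput('/path/to/transform_output')
RAW_FEATURE_SPEC = {'age': tf.io.FixedLenFeature([], tf.float32),
                    'label': tf.io.FixedLenFeature([], tf.int64)}
LABEL_KEY = 'label'

def serving_input_receiver_fn():
    # Parse raw tf.Example protos (label excluded at serving time) ...
    feature_spec = dict(RAW_FEATURE_SPEC)
    feature_spec.pop(LABEL_KEY)
    receiver = tf.estimator.export.build_parsing_serving_input_receiver_fn(
        feature_spec)()
    # ... then replay the transform graph on the raw features.
    transformed = tf_transform_output.transform_raw_features(receiver.features)
    return tf.estimator.export.ServingInputReceiver(
        transformed, receiver.receiver_tensors)

Whether the automatically generated Keras signature preserves the transform in the same way is exactly the open question above.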