ElasticDL Serving Solution Explore

Motivation

Besides training, model serving is an important part of the end-to-end machine learning lifecycle. Publishing the trained model as a service in production is what makes the model valuable in the real world.

At the current stage, ElasticDL focuses on the training part. We neither have our own serving infrastructure nor can we directly reuse existing ones to serve our trained models, because models trained with the ElasticDL parameter server keep their embedding parameters in the parameter server rather than in the TensorFlow graph (see Challenges below). Our goal is to figure out a serving solution.

Direction

Store the ElasticDL model in the SavedModel format.
SavedModel is the universal serialization format for TensorFlow models. It is language neutral and can be loaded by multiple frameworks (such as TFServing, TFLite, TensorFlow.js, and so on). We choose to store the ElasticDL model in the SavedModel format. In this way, we can leverage various mature solutions to serve our model in different scenarios.
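
For the small or medium size case, exporting is essentially the standard TensorFlow SavedModel export. A minimal sketch, assuming a plain tf.keras model and TensorFlow 2.x; the toy model and the /tmp path are illustrative only.

```python
import tensorflow as tf

# Toy model standing in for an ElasticDL-trained Keras model.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1),
])

# Export to the SavedModel format. The resulting directory can then be
# served or converted by TFServing, TFLite, TensorFlow.js, and so on.
tf.saved_model.save(model, "/tmp/elasticdl_model/1")

# Reload to verify the exported model is self-contained.
restored = tf.saved_model.load("/tmp/elasticdl_model/1")
```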

The model size varies from several kilobytes to several terabytes in different scenarios. We divide model sizes into two categories: small or medium size, and large size. A small or medium size model can be loaded into a single process, while a large model cannot fit in a single process. The training and serving strategies differ between these two cases, as the following table shows:

|                            | Master Storage | AllReduce  | Parameter Server                         |
| -------------------------- | -------------- | ---------- | ---------------------------------------- |
| Small or Medium Size Model | SavedModel     | SavedModel | SavedModel                                |
| Large Size Model           | N/A            | N/A        | Distributed Parameter Server for Serving  |

Distributed Parameter Server for Serving
This is for the case where the model can't fit in a single process. We partition the model variables into multiple shards and store them in a distributed parameter server for serving. In the serving stage, the inference engine executes the serving graph, queries the variable values from the distributed parameter server as needed, and finishes the calculation.
The latency and SLA requirements are stricter for serving than for training, and the number of parameter server instances is proportional to the QPS of the inference traffic. For serving, the parameter server only needs to look up static embedding tables, which is simpler than training. Therefore we will keep the parameter servers for training and serving separate.
We will cover this solution in a separate design as the next step.
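
To make the serving flow concrete, here is a simplified, hypothetical sketch of a sharded lookup on the serving side. The sharding rule (id % num_shards) and the ps_stubs / Lookup RPC names are assumptions for illustration, not ElasticDL APIs.

```python
import numpy as np

NUM_SHARDS = 4

def shard_of(embedding_id):
    # Assumed partitioning rule: route each id to a shard by modulo.
    return embedding_id % NUM_SHARDS

def lookup_embeddings(ids, ps_stubs):
    """Group ids by shard, issue one lookup RPC per shard, and reassemble
    the vectors in the original order."""
    results = {}
    for shard in range(NUM_SHARDS):
        shard_ids = [i for i in ids if shard_of(i) == shard]
        if shard_ids:
            vectors = ps_stubs[shard].Lookup(shard_ids)  # hypothetical RPC
            results.update(dict(zip(shard_ids, vectors)))
    return np.stack([results[i] for i in ids])
```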

Challenges

  • How to save a model trained with the parameter server as SavedModel?
    For large models, we are designing a parameter server to store the variables and embeddings. Currently we use Redis as a temporary solution. In the model definition, we use ElasticDL.Embedding instead of tf.keras.layers.Embedding to interact with our parameter server. ElasticDL.Embedding uses tf.py_function to issue an RPC call to the parameter server.
    However, when saving the model, the customized ElasticDL.Embedding layer is not mapped to any native TensorFlow op and can't be saved into SavedModel. The embedding vectors stored in the parameter server are lost, so embedding lookup can't work in the serving process. The sketch after this list illustrates the issue.
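
A minimal sketch of why this breaks, assuming a custom layer whose forward pass goes through tf.py_function. fake_ps_lookup below is a stand-in for the real RPC to the parameter server, not ElasticDL code.

```python
import tensorflow as tf

def fake_ps_lookup(ids):
    # Stand-in for the RPC that fetches embedding vectors from the
    # parameter server; returns one 8-dim vector per id.
    return tf.random.normal([tf.shape(ids)[0], 8])

class PSEmbedding(tf.keras.layers.Layer):
    def call(self, ids):
        # tf.py_function wraps an arbitrary Python callback as an op.
        out = tf.py_function(fake_ps_lookup, [ids], Tout=tf.float32)
        out.set_shape([None, 8])
        return out

model = tf.keras.Sequential([PSEmbedding()])
model(tf.constant([1, 2, 3]))

# Exporting this model would only serialize an opaque py_function
# placeholder: the Python callback and the embedding vectors that live in
# the parameter server are not captured in the SavedModel.
```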

Ideas and Experiments

Open Question

  • Is the following scenario possible? The user writes tf.keras.layers.Embedding in the model definition. When the model runs in ElasticDL with the parameter server turned on, the native Keras Embedding layer is replaced with an ElasticDL.Embedding layer that interacts with the parameter server. In this way, the user writes the model with the native TensorFlow API, yet it executes in a distributed way in ElasticDL, which is more user friendly (see the sketch below).
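
One possible way to realize this idea is the clone_function hook of tf.keras.models.clone_model, which rebuilds a functional or Sequential model while swapping individual layers. The following is only a sketch under that assumption; ElasticDLEmbedding is a placeholder class standing in for the real ElasticDL.Embedding, not ElasticDL code.

```python
import tensorflow as tf

class ElasticDLEmbedding(tf.keras.layers.Layer):
    """Placeholder standing in for ElasticDL.Embedding (illustration only)."""

    def __init__(self, input_dim, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.input_dim = input_dim
        self.output_dim = output_dim

    def call(self, ids):
        # The real layer would fetch the embedding vectors from the
        # parameter server via RPC; here we just return zeros.
        return tf.zeros([tf.shape(ids)[0], self.output_dim])

def replace_embeddings(model):
    """Clone `model`, swapping native Embedding layers for ElasticDL ones."""
    def clone_layer(layer):
        if isinstance(layer, tf.keras.layers.Embedding):
            return ElasticDLEmbedding(layer.input_dim, layer.output_dim,
                                      name=layer.name)
        # Default behavior of clone_model for all other layers.
        return layer.__class__.from_config(layer.get_config())

    return tf.keras.models.clone_model(model, clone_function=clone_layer)

# Usage sketch: if the parameter server is turned on, ElasticDL could apply
# the swap transparently before training starts.
# model = replace_embeddings(user_defined_model)
```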