TensorFlow Docker Model Server

TensorFlow Serving is a flexible, high-performance serving system for machine learning models, designed for production environments. TensorFlow Serving makes it easy to deploy new algorithms and experiments, while keeping the same server architecture and APIs. TensorFlow Serving provides out-of-the-box integration with TensorFlow models, but can be easily extended to serve other types of models and data.

TensorFlow Serving with Docker

The serving image can be pulled from Docker Hub and is available with and without GPU support - pick the one you need:

docker pull tensorflow/serving:latest-gpu
docker pull tensorflow/serving:latest

The serving images (both CPU and GPU) have the following properties:

  • Port 8500 exposed for gRPC
  • Port 8501 exposed for the REST API
  • Optional environment variable MODEL_NAME (defaults to model)
  • Optional environment variable MODEL_BASE_PATH (defaults to /models)
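
These environment variables let you serve your own model without building a custom image: the server loads from ${MODEL_BASE_PATH}/${MODEL_NAME}. A minimal sketch, assuming a hypothetical SavedModel exported to ./my_model/1 on the host (TensorFlow Serving expects a numeric version subdirectory inside the model directory; my_model is a placeholder name):

docker run -t --rm -p 8501:8501 \
  -v "$(pwd)/my_model:/models/my_model" \
  -e MODEL_NAME=my_model \
  tensorflow/serving:latest

Here MODEL_BASE_PATH stays at its default of /models, so only MODEL_NAME needs to be set.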

The TensorFlow Serving repository already provides a few test models that we can use:

git clone https://github.com/tensorflow/serving
cd ./serving

Now we can run the docker container and mount one of those models:

docker run -t --rm -p 8501:8501 \
-v "$(pwd)/tensorflow_serving/servables/tensorflow/testdata/saved_model_half_plus_two_gpu:/models/half_plus_two" \
-e MODEL_NAME=half_plus_two \
tensorflow/serving:latest-gpu

This will run the docker container, launch the TensorFlow Serving Model Server, bind the REST API port 8501, and map our desired model from our host to where models are expected in the container. We also pass the name of the model as an environment variable, which will be important when we query the model.

Even though I already installed NVIDIA's GPU support for Docker, I am still getting an error message here:

Failed to start server. Error: UNKNOWN: 1 servable(s) did not become available: {{{name: half_plus_two version: 123} due to error: INVALID_ARGUMENT: Cannot assign a device for operation a: {{node a}} was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0 ]. Make sure the device specification refers to a valid device
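
The reason is that the GPU build of the test model has its ops explicitly pinned to /device:GPU:0, so it cannot fall back to the CPU. As a quick sanity check, the CPU variant of the same test model that ships alongside it in the repository should serve fine on the plain CPU image; a sketch, assuming the saved_model_half_plus_two_cpu directory from the clone above:

docker run -t --rm -p 8501:8501 \
  -v "$(pwd)/tensorflow_serving/servables/tensorflow/testdata/saved_model_half_plus_two_cpu:/models/half_plus_two" \
  -e MODEL_NAME=half_plus_two \
  tensorflow/serving:latest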

Serving with Docker using your GPU

Before serving with a GPU, in addition to installing Docker, you will need:

  • The NVIDIA Container Toolkit (nvidia-docker) installed
  • A working NVIDIA GPU driver, which you can verify with nvidia-smi:

nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.11    Driver Version: 525.60.11    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
|  0%   56C    P0    29W / 130W |    780MiB /  6144MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
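
If the NVIDIA Container Toolkit itself is not installed yet, a minimal install sketch (assuming Ubuntu/Debian with NVIDIA's container repository already added) looks like this:

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker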

Rerunning the GPU image - it seems all that I was missing was the GPU flag --gpus all:

docker run --gpus all -p 8501:8501 \
--mount type=bind,source=$(pwd)/tensorflow_serving/servables/tensorflow/testdata/saved_model_half_plus_two_gpu,target=/models/half_plus_two \
-e MODEL_NAME=half_plus_two -t tensorflow/serving:latest-gpu &

And this time we have lift-off:

I tensorflow_serving/model_servers/server_core.cc:486] Finished adding/updating models
I tensorflow_serving/model_servers/server.cc:118] Using InsecureServerCredentials
I tensorflow_serving/model_servers/server.cc:383] Profiler service is enabled
I tensorflow_serving/model_servers/server.cc:409] Running gRPC ModelServer at 0.0.0.0:8500 ...
I tensorflow_serving/model_servers/server.cc:430] Exporting HTTP/REST API at:localhost:8501 ...
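
Before asking for predictions, we can confirm that the model was actually loaded by querying the model status endpoint of the REST API:

curl http://localhost:8501/v1/models/half_plus_two

A healthy server reports the loaded model version with state AVAILABLE.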

The Half Plus Two model computes 0.5 * x + 2 for the values of x we provide for prediction. This model has its ops bound to the GPU device and will not run on the CPU. We can now make a prediction using the TensorFlow Serving REST API. Sending the values x = 1, 2 and 5, I expect the returned predictions 2.5, 3.0 and 4.5:

curl -d '{"instances": [1.0, 2.0, 5.0]}' \
-X POST http://localhost:8501/v1/models/half_plus_two:predict

Taaadaa:

{
"predictions": [2.5, 3.0, 4.5]
}
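
To inspect a model's inputs and outputs (useful when the signature is less obvious than here), the REST API also exposes a metadata endpoint:

curl http://localhost:8501/v1/models/half_plus_two/metadata

This returns the SavedModel's signature definitions, including the input and output tensor names, dtypes and shapes.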