In our previous post, we demonstrated how GPUs can dramatically reduce the time you need for a TensorFlow job. But what if you want to run this in production, not just from your laptop? You'd want to deploy your TensorFlow service quickly and manage it easily across multiple teams: that's where DC/OS comes in.
In part 2 of this tutorial, we'll:
- Install the TensorFlow service without GPUs.
- Run a neural network example.
- Install TensorFlow with GPUs.
- Run the same neural network example.
- Launch a second TensorFlow instance that uses the remaining GPUs in parallel.
Run TensorFlow on DC/OS without GPUs
First, let's see how easy it is to use TensorFlow on DC/OS, even without GPUs.
Prerequisites
- A DC/OS cluster with 1 private agent with 4 CPUs and 1 public agent with 8 CPUs and 8 Nvidia Tesla K80 GPUs.
- The DC/OS CLI installed.
Deploy the TensorFlow service
First, let's get TensorFlow running on your DC/OS cluster.
- Go to the Services tab of the DC/OS UI.
- Click + to add a service.
- Choose Single Container.
- Toggle to the JSON Editor and paste the following application definition into the editor.
{
  "id": "my-tensorflow-no-gpus",
  "cpus": 4,
  "gpus": 0,
  "mem": 2048,
  "disk": 0,
  "instances": 1,
  "container": {
    "type": "MESOS",
    "docker": {
      "image": "tensorflow/tensorflow"
    }
  }
}
This application definition specifies no GPUs and the standard TensorFlow Docker image.
- Click Review and Run, then Run Service.
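If you prefer the CLI, you can deploy the same definition with the standard Marathon subcommand (this assumes you saved the JSON above to a local file named my-tensorflow-no-gpus.json):
dcos marathon app add my-tensorflow-no-gpus.json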
Run a TensorFlow example
- Exec into the TensorFlow container from the DC/OS CLI. This command allows you to execute commands inside the container and stream the output to your local terminal.
dcos task exec -it my-tensorflow-no-gpus bash
- Now, let's get some examples to run. Install git and then clone the TensorFlow-Examples repository.
apt-get update; apt-get install -y git
git clone https://github.com/aymericdamien/TensorFlow-Examples
- Run and time the same example you ran locally in the last tutorial, the convolutional network example.
cd TensorFlow-Examples/examples/3_NeuralNetworks
time python convolutional_network.py
This took my DC/OS cluster 11 minutes.
Run TensorFlow on DC/OS with GPUs
Deploy the TensorFlow service with GPUs
Now that you've got TensorFlow examples running on your cluster, let's see how performance compares when you configure your service to use GPUs.
- Go to the Services tab of the DC/OS UI.
- Click + to add a service.
- Choose Single Container.
- Toggle to the JSON Editor and paste the following application definition into the editor.
{
  "id": "tensorflow-gpus-1",
  "acceptedResourceRoles": ["slave_public"],
  "cpus": 4,
  "gpus": 4,
  "mem": 2048,
  "disk": 0,
  "instances": 1,
  "container": {
    "type": "MESOS",
    "docker": {
      "image": "tensorflow/tensorflow:latest-gpu"
    }
  }
}
This application definition is largely the same as the last one, except that here you're requesting 4 GPUs and specifying the TensorFlow Docker image built for GPUs.
- Click Review and Run, then Run Service.
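You can also confirm from the CLI that Marathon accepted the GPU request; the app show subcommand prints the deployed app definition as JSON, including the gpus field:
dcos marathon app show tensorflow-gpus-1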
Verify access to GPUs
You'll recall that our cluster's public agent has 8 GPUs, but the service only requested 4 of them. Let's verify that the node has 8 GPUs, and that our service has access to only 4 of them.
- First, use dcos task exec to run a command inside of the container to get the public IP address of the agent node the container is running on.
dcos task exec tensorflow-gpus-1 curl -s ifconfig.co
- Now, use that public IP to SSH into the node and run nvidia-smi to verify the number of GPUs the node has.
ssh <public-ip> nvidia-smi
- You should see 8 GPUs installed and running on the machine. The container for your service, however, should only be able to see 4 of those GPUs.
- Run dcos task exec with the bash option to get a shell inside of your service's container.
dcos task exec -it tensorflow-gpus-1 bash
- Set up environment variables so you can run nvidia-smi from within this shell.
export LD_LIBRARY_PATH=/usr/local/nvidia/lib64
export PATH=$PATH:/usr/local/nvidia/bin
- Run nvidia-smi to verify that even though the machine has 8 GPUs installed, you only have access to 4 of them inside this container.
nvidia-smi
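As an extra check from the same shell, you can ask TensorFlow itself which GPUs it sees. This is a minimal sketch using the device_lib API from the TF 1.x build that ships in this image; the exact device name format varies by version:
python - <<'EOF'
# Ask TensorFlow which GPU devices are visible inside this container.
from tensorflow.python.client import device_lib
gpus = [d.name for d in device_lib.list_local_devices() if d.device_type == 'GPU']
print(gpus)  # expect 4 entries, e.g. '/gpu:0' through '/gpu:3'
EOF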
Run a TensorFlow example with GPUs
Now that you've installed TensorFlow and verified your access to 4 GPUs, let's run the same example as before.
- If you exited the tensorflow-gpus-1 container, reenter it and set up the environment variables by following the steps in the last section.
- Install git and clone the TensorFlow-Examples repository.
apt-get update; apt-get install -y git
git clone https://github.com/aymericdamien/TensorFlow-Examples
- Run and time the same example you ran earlier, the convolutional network example.
cd TensorFlow-Examples/examples/3_NeuralNetworks
time python convolutional_network.py
- Watch the code find the GPUs and execute.
This took my DC/OS cluster about 2 minutes: roughly 5 times faster than before!
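If you're curious which device each operation actually lands on, TF 1.x sessions can log placements. Here's a minimal, self-contained sketch (not part of the example script) you can run inside the same container:
python - <<'EOF'
# Minimal TF 1.x sketch: log each op's device assignment (written to stderr).
import tensorflow as tf
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    a = tf.constant([1.0, 2.0], name='a')
    b = tf.constant([3.0, 4.0], name='b')
    print(sess.run(a + b))  # the add op should be placed on a GPU device
EOF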
Launch Two TensorFlow Instances
You'll recall that we have a cluster with 8 GPUs, but we only requested access to 4 of them. Now, let's launch a second TensorFlow instance that will consume the remaining 4 GPUs in parallel with the first.
Running more than one TensorFlow instance in parallel shows that you can have multiple users on the same cluster with isolated access to the GPUs on it.
- Add a third service to your DC/OS cluster with the following application definition, which is identical to the previous GPU application definition except for its id.
{
  "id": "tensorflow-gpus-2",
  "acceptedResourceRoles": ["slave_public"],
  "cpus": 4,
  "gpus": 4,
  "mem": 2048,
  "disk": 0,
  "instances": 1,
  "container": {
    "type": "MESOS",
    "docker": {
      "image": "tensorflow/tensorflow:latest-gpu"
    }
  }
}
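- Once this second service is running, you can repeat the earlier check to confirm the isolation: exec into the new container, set the same environment variables, and run nvidia-smi. Each container should report only its own 4 GPUs, even though the node has 8.
dcos task exec -it tensorflow-gpus-2 bash
export LD_LIBRARY_PATH=/usr/local/nvidia/lib64
export PATH=$PATH:/usr/local/nvidia/bin
nvidia-smi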
- Verify that your second TensorFlow instance is running by accessing the Jupyter notebook that runs by default in the TensorFlow Docker image. In the application definition above, the acceptedResourceRoles parameter is set to slave_public, which places the container on a public agent, so you can reach it at that agent's public IP.
- Get the public IP of the agent where the task has been launched.
dcos task exec tensorflow-gpus-2 curl -s ifconfig.co
- Go to the STDERR log of the service to get the Jupyter URL: Services > tensorflow-gpus-2 > task-id > paper icon > ERROR (STDERR). You will see a message similar to the following.
Copy/paste this URL into your browser when you connect for the first time, to login with a token: http://localhost:10144/?token=d4f3d8f80eb97299e74b5254d1600c480c3f042d548e51f5
- Replace localhost with the public IP you found earlier to see the Jupyter notebook.
- Click the Getting Started notebook and run some commands.
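If you'd rather type something yourself, a classic TF 1.x smoke test works in a fresh notebook cell (a minimal sketch; the message text is just an example):
import tensorflow as tf
hello = tf.constant('Hello from TensorFlow on DC/OS')  # example message, nothing special
sess = tf.Session()
print(sess.run(hello))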
Thanks for playing along at home!
The next post in the series will show you how to use DC/OS to dynamically request cluster resources and launch a distributed TensorFlow job across multiple agents. When that job completes, the resources it had used are automatically released back to the cluster and made available to other jobs. This dramatically increases efficiency in comparison to traditional TensorFlow deployment strategies.