Imagine you have just finished writing a small Go-based API server, packaged it into a Docker container, and tested your app in a DC/OS cluster by writing an app definition as described in the DC/OS documentation:
{
  "id": "/app-server",
  "description": "App definition version 1.",
  "cpus": 0.5,
  "mem": 32,
  "disk": 0,
  "instances": 1,
  "container": {
    "type": "DOCKER",
    "volumes": [],
    "docker": {
      "image": "joergs/app-server"
    }
  },
  "env": {
    "APP_SECRET": "<ENTER SECRET KEY HERE>"
  }
}
Deploying this app definition to your production cluster with dcos marathon app add app-server-1.json would succeed, but the definition is not yet ready for production. What do you need to change first?
This post describes the most common fields defined in a DC/OS application definition, and what factors to consider when choosing values for those fields in production.
Memory Allocation
Testing: "mem": 32,
In the app definition, "mem" specifies the maximum amount of memory (in MB) your container can use. If the container uses more, DC/OS automatically kills and restarts it.
32 MB is the minimum memory limit Apache Mesos allows you to specify. Actual memory requirements are application specific, but as a rule of thumb you should not go below the Marathon default of 128 MB. For this simple Go app, 256 MB is probably a good starting value.
In production deployments it is good practice to monitor the actual memory usage (and trigger an alert when the actual usage is close to the resource limit). Check the DC/OS Metrics API for more details.
Production: "mem": 256, (and additional monitoring for actual memory)
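The alerting rule described above can be sketched in a few lines of Python. This is only an illustration: the function name memory_alert and the 90% threshold are assumptions, and the sketch does not call the DC/OS Metrics API. You would feed it whatever usage numbers your monitoring system reports.

```python
def memory_alert(used_mb: float, limit_mb: float, threshold: float = 0.9) -> bool:
    """Return True when usage has reached the given fraction of the limit.

    The 0.9 default is an illustrative choice, not a DC/OS setting.
    """
    if limit_mb <= 0:
        raise ValueError("limit_mb must be positive")
    return used_mb / limit_mb >= threshold


# Hypothetical sample: a task using 240 MB against the 256 MB limit.
if memory_alert(used_mb=240, limit_mb=256):
    print("memory usage is close to the limit")
```

Alerting before the limit is reached gives you time to raise "mem" or investigate a leak before DC/OS kills the task.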
Docker tags
Testing: "image": "joergs/app-server"
The "image" field specifies which container image DC/OS uses to start the task. In our case, it is the app-server Docker image in the joergs Docker Hub repo.
The first potential issue here is the missing tag. Docker images can be tagged to distinguish different versions, e.g., ubuntu:16.04 and ubuntu:14.04.
If you do not specify a tag, Docker implicitly assumes :latest, so the image above resolves to joergs/app-server:latest.
This can lead to problems when app-server (and hence app-server:latest) is updated.
Any newly started instance (e.g., after scaling up or after a container failure) would run the new image version, which might not be compatible with the rest of your deployment. Even if it is compatible, it is a nightmare to debug a scenario where 50 of your 100 running instances are based on one image and the other 50 on another.
For that reason you should always specify a tag and treat image tags as immutable (i.e., never push a newer version to an existing tag such as app-server:0.1).
Production: "image": "joergs/app-server:0.1"
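A check like this is easy to automate before deployment. The following Python sketch uses a hypothetical helper, has_pinned_tag, which reports whether an image reference names an explicit tag other than latest. It is a simplified parser written for this post, not part of Docker or DC/OS.

```python
def has_pinned_tag(image: str) -> bool:
    """Return True if the image reference names an explicit tag other than 'latest'."""
    if "@" in image:
        # Digest references (name@sha256:...) are pinned by content.
        return True
    # Only look at the part after the last slash, so a registry port
    # such as registry.example.com:5000/app is not mistaken for a tag.
    last_component = image.rsplit("/", 1)[-1]
    if ":" not in last_component:
        return False  # no tag given, so Docker assumes :latest
    tag = last_component.rsplit(":", 1)[1]
    return tag != "" and tag != "latest"


for image in ("joergs/app-server", "joergs/app-server:0.1"):
    print(image, has_pinned_tag(image))
```

Note that the check cannot tell whether a tag is actually treated as immutable. That remains a convention your team has to follow when pushing images.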
Docker repository
Testing: "image": "joergs/app-server:0.1"
In addition to specifying a version tag, the "image" field also specifies where the image is stored. In this case the Docker image is stored in and retrieved from the joergs Docker Hub account, which is a personal repo, as opposed to a company or project-controlled one. This means that no one can push a new app version (app-server:0.2 for example) while the person who controls the account is unavailable (on vacation or leave). If the person who controls the account leaves the project or company, they will retain control of the image and might let it languish, or even worse, break it.
It is better to store production images in an organizational account where multiple people have access and where individual access can be removed.
Hosting images on Docker Hub also has some downsides: pulling from Docker Hub incurs external network traffic and can add latency to deployments. It also requires your cluster to have direct access to the internet, which some production clusters do not have. So in a production cluster it is best to deploy a private registry running inside, or at least close to, the cluster.
Production: "image": "mesosphere/app-server-demo:0.1" (ideally in a private registry)
Docker Containerizer/Container image
Testing:
"container": {
"type": "DOCKER",
On DC/OS you can choose which containerizer runs your Docker container, as well as whether or not you want to use a container image at all.
The "type" field of the original app definition specifies which containerizer DC/OS should use to run the container. The Docker containerizer is specified, so DC/OS will internally use the Docker runtime to run that container image. The Docker containerizer requires a Docker image.
Another option would be to specify the Universal Containerizer, which is built into DC/OS and can run Docker and other container images natively (i.e., without relying on the Docker runtime). It has the advantage of working with other DC/OS features like Pods, GPUs, and container debugging.
In this case, however, we don't require a container image at all. Because our app is a simple Go app without any dependencies, we can treat the compiled binary as an artifact which is fetched by the task itself.
The Go binary is much smaller than the Docker image:
- Go binary: 3.1MB
- Docker image: 311MB
It is good practice to host artifacts in a scalable artifact store with no single point of failure. In the example below we have chosen S3, but on-prem solutions such as Artifactory also work well.
Note that if the app has package or library dependencies (for example, it relies on Ubuntu 16.04), we should use a container image rather than a bare binary, and we should build that image in a repeatable way (e.g., from a Dockerfile).
Production:
"cmd": "./appServer-linux",
"fetch": [
  {
    "uri": "https://s3.amazonaws.com/downloads.mesosphere.io/dcos-demo/deployments/appServer-linux",
    "extract": true,
    "executable": true,
    "cache": false
  }
],
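To make the change from a container-based definition to a fetched binary concrete, here is a small Python sketch. The helper to_fetch_definition is hypothetical and written for this post. It simply drops the "container" section and adds the "cmd" and "fetch" fields shown above, leaving every other field untouched.

```python
import copy


def to_fetch_definition(app: dict, cmd: str, uri: str) -> dict:
    """Return a copy of the app definition that fetches a binary instead of using a container."""
    result = copy.deepcopy(app)       # leave the caller's definition unchanged
    result.pop("container", None)     # no container image is needed any more
    result["cmd"] = cmd
    result["fetch"] = [
        {"uri": uri, "extract": True, "executable": True, "cache": False}
    ]
    return result


docker_app = {
    "id": "/app-server",
    "mem": 256,
    "container": {"type": "DOCKER", "docker": {"image": "joergs/app-server:0.1"}},
}
binary_app = to_fetch_definition(
    docker_app,
    cmd="./appServer-linux",
    uri="https://s3.amazonaws.com/downloads.mesosphere.io/dcos-demo/deployments/appServer-linux",
)
print(sorted(binary_app))
```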
Secrets
Testing:
"APP_SECRET" : "<ENTER SECRET KEY HERE>"
In the original app definition we added a secret value (e.g., AWS credentials) as an environment variable. While this is better than adding the key to the container or the app itself, it's still not optimal because logs or other outputs could end up containing these secret values.
In production it's better to rely on a dedicated secret store such as Vault, which lets you control and change sensitive data. When using DC/OS open source you need to set up such a secret store manually. This blog post describes an example. DC/OS Enterprise comes with an integrated secret store.
Production:
"env": {
  "APP_SECRET": {
    "secret": "token0"
  }
},
"secrets": {
  "token0": {
    "source": "token"
  }
}
(or an equivalent manual use of a secret store)
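A simple pre-deployment check can catch secrets that are still written in plain text. The Python sketch below is an assumption-laden illustration: the helper plain_text_secrets and its list of name markers are made up for this post. It flags "env" entries whose name looks sensitive and whose value is a literal string rather than a reference to a secret.

```python
# Name fragments that suggest a sensitive value (illustrative, not exhaustive).
SENSITIVE_MARKERS = ("SECRET", "PASSWORD", "TOKEN", "KEY")


def plain_text_secrets(app: dict) -> list:
    """Return the names of env variables that look sensitive but hold a literal string."""
    findings = []
    for name, value in app.get("env", {}).items():
        looks_sensitive = any(marker in name.upper() for marker in SENSITIVE_MARKERS)
        # A secret reference is an object such as {"secret": "token0"}, not a string.
        if looks_sensitive and isinstance(value, str):
            findings.append(name)
    return sorted(findings)


testing_app = {"env": {"APP_SECRET": "<ENTER SECRET KEY HERE>"}}
print(plain_text_secrets(testing_app))
```

A name-based heuristic like this will miss secrets stored under innocuous names, so treat it as a safety net rather than a guarantee.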
Health Check
Testing:
"healthChecks": [],
The original app definition does not specify any health checks, which periodically inform DC/OS about the application's state. If a health check fails, DC/OS will consider the task unhealthy and status-aware load balancers can stop sending traffic to that instance of the app. After a task reaches the maximum number of consecutive failures, DC/OS will kill and restart it.
DC/OS supports different kinds of health checks, but the simplest one just checks whether our API server responds with a successful HTTP code (i.e., 2XX). If you want to learn more about the different health check options, check the documentation or watch Gaston's talk on Mesos health checks.
Production:
"healthChecks": [
  {
    "path": "/",
    "portIndex": 0,
    "protocol": "HTTP"
  }
],
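The "maximum number of consecutive failures" rule can be illustrated with a short Python sketch. This is a simplified model, not Marathon's implementation: the function should_restart is hypothetical, and the default of 3 is an assumption about the configured limit that you should check against your own health check settings.

```python
def should_restart(results, max_consecutive_failures: int = 3) -> bool:
    """Return True once the health check has failed the given number of times in a row.

    `results` is a sequence of booleans, one per health check run (True = healthy).
    """
    consecutive = 0
    for healthy in results:
        # Any successful check resets the failure streak.
        consecutive = 0 if healthy else consecutive + 1
        if consecutive >= max_consecutive_failures:
            return True
    return False


# Two failures followed by a success do not trigger a restart; three in a row do.
print(should_restart([True, False, False, True, False]))
print(should_restart([True, False, False, False]))
```

Counting only consecutive failures means a single slow response does not cause a restart, while an instance that stays unresponsive is replaced quickly.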
A production-ready app definition
After considering the above factors, our app definition is ready to be deployed into production. The new definition looks like this:
{
  "id": "/app-server-2",
  "description": "App definition version 2.",
  "cmd": "./appServer-linux",
  "cpus": 0.5,
  "mem": 256,
  "fetch": [
    {
      "uri": "https://s3.amazonaws.com/downloads.mesosphere.io/dcos-demo/deployments/appServer-linux",
      "extract": true,
      "executable": true,
      "cache": false
    }
  ],
  "env": {
    "APP_SECRET": {
      "secret": "token0"
    }
  },
  "healthChecks": [
    {
      "path": "/",
      "portIndex": 0,
      "protocol": "HTTP"
    }
  ],
  "secrets": {
    "token0": {
      "source": "token"
    }
  }
}
In this post we only looked at the app definition itself. There are many other best practices you should consider for production deployments, for example:
- Versioned storage of the app definition allowing for reproducible deployments
- Regular backup of cluster state (e.g., DC/OS state and state of persistent services). To back up and restore an enterprise cluster see the documentation, or watch Fernando's presentation to the Day 2 Operations working group for open source options.
To hear about these and other pitfalls, watch the talk Nightmares of a Mesos Support Engineer.
Want to see more amazing talks on DC/OS and Apache Mesos? Register for MesosCon Europe, October 25-27th in Prague. See you there!