With Mesosphere's software, it's easy to achieve better monitoring of production infrastructure. In this post, we show how to implement a simple monitoring system in your own datacenter.
Goal
At Mesosphere, we run a handful of internal services that back web applications used by our customers (for example, Mesosphere for Google Cloud Platform). We often want to test our backend services to guard against application-level issues, such as external APIs being inaccessible or connectivity problems with cloud providers or repository mirrors. We'd like our monitoring to be frequent and regular, so that our engineers are alerted promptly if and when things stop working.
In this post, I'll describe in some detail one of the ways we run integration tests against our services. The integration test is deployed using our internal Mesos cluster and Chronos instance.
If you haven't used Chronos before, think of it as a distributed cron service. You can specify scheduled or one-off jobs that are then executed on a Mesos cluster. It has native support for Docker containers and robust scheduling logic.
We use a handful of secondary services to help with monitoring. At Mesosphere we're fans of DataDog, an easy way to collect, aggregate and monitor time-series data. It has great integration with alerting services like PagerDuty, although in this example we trigger PagerDuty directly.
Implementation
The test itself is a locally running Python script that uses the requests and dogapi libraries to interact with our internal service's REST API and DataDog's API, respectively. We attempt an action twice, logging the outcome of each attempt to DataDog (using its provided library). If both attempts fail, we use the PagerDuty REST API to trigger a page to our on-call team.
The script requires several credentials to access our service's REST API: a pre-configured key (or OAuth token), an SSH public key, and the hostname on which to reach the internal service. For convenience, these are read from environment variables using Python's os.environ['MY_VAR'].
Our simple integration testing script (with our proprietary calls stubbed out) is shown below:
[highlight code='python']
#!/usr/bin/python
import json
import os
import requests
import socket
import time
from dogapi import dog_http_api as api

# DataDog settings
api.api_key = '<REPLACEME>'
api.application_key = '<REPLACEME>'


def try_action(ssh_key):
    # access an API -- code snipped
    # (stub: returns a (success, message) tuple)
    return (True, '')


def cleanup_action():
    # access an API -- code snipped
    # (stub: returns a (success, message) tuple)
    return (True, '')


def send_to_datadog(host, success):
    ts = int(time.time())
    if success:
        result = "success"
    else:
        result = "failure"
    metric_name = "example.integration.{}".format(result)
    api.metric(metric_name, (ts, 1), tags=["host:{}".format(host)])
    print("POSTED to DataDog")


def trigger_pagerduty(host, message):
    trigger = {"service_key": "<REPLACEME>",
               "event_type": "trigger",
               "description": "Integration test failure",
               "client": "Example Integration Test",
               "client_url": socket.gethostname(),
               "details": {"failed_host": host,
                           "provider": provider,
                           "message": message}}
    requests.post('https://events.pagerduty.com/generic/2010-04-15/create_event.json',
                  json=trigger)
    print("TRIGGERED PagerDuty")


host = os.environ['HOST']
ssh_key = os.environ['SSH_KEY']
oauth_token = os.environ.get('OAUTH_TOKEN')
provider = os.environ.get('PROVIDER', '')  # referenced in the PagerDuty details

overall_success = False
message = ''

for i in range(2):
    (success, message) = try_action(ssh_key)
    overall_success = (success or overall_success)
    send_to_datadog(host, success)
    (success, message) = cleanup_action()

if not overall_success:
    trigger_pagerduty(host, message)
[/highlight]
Dockerization
Since this script uses multiple libraries and we plan to run it on any of a number of hosts in our Mesos cluster, Docker is a must.
We use a minimal python-monitoring Dockerfile, published as a public image to the Mesosphere Docker Hub account. This is based upon an Ubuntu base image and has various versions of Python installed, along with the necessary libraries for this application.
To run our application in a python-monitoring container, the following works:
[highlight code='bash']
docker run -t -i --entrypoint=/my-repo/integration_test.py \
  -e "PROVIDER=$PROVIDER" \
  -e "HOST=$_HOST" \
  -e "OAUTH_TOKEN=$OAUTH_TOKEN" \
  -e "SSH_KEY=$SSH_KEY" \
  --volume=$(pwd):/my-repo \
  mesosphere/python-monitoring:latest
[/highlight]
This command mounts the current directory at /my-repo inside the container, and passes the current values of the environment variables through to it.
Note how the credentials are passed through as environment variables. This makes it considerably simpler to set up a Chronos job later.
An alternative approach is to have your credentials stored in a securely hosted artifact and include this in your job description. When Mesos runs your Chronos job, it'll fetch this artifact into the current working directory (which is mounted into the container).
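If you take that route, a minimal sketch of the reading side might look like the following. The filename credentials.json and its field names are hypothetical; the only assumption carried over from above is that Mesos fetches listed URIs into the task's working directory, which is mounted into the container:
[highlight code='python']
#!/usr/bin/python
import json
import os

# Mesos fetches the URIs listed in the job description into the task's
# working directory; 'credentials.json' is a hypothetical artifact name.
with open(os.path.join(os.getcwd(), 'credentials.json')) as f:
    creds = json.load(f)

ssh_key = creds['ssh_key']              # hypothetical field names
oauth_token = creds.get('oauth_token')
host = creds['host']
[/highlight]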
Production Setup
To set this up in Chronos, it is necessary to post the JSON job description to our running Chronos instance. The job description is fairly straightforward.
[highlight code='json']
{ "schedule": "R/2015-03-13T00:00:00Z/PT1H", "name": "Example Integration Test", "container": { "type": "DOCKER", "image": "mesosphere/python-monitoring" }, "cpus": "1.0", "mem": "512", "uris": [ "https://path.to.my.script/integration_test.py" ], "command": "cd $MESOS_SANDBOX && ./integration_test.py", "environmentVariables": [ { "name": "PROVIDER", "value": "Google" }, { "name": "HOST", "value": "host.to.test" }, { "name": "OAUTH_TOKEN", "value": "<REPLACEME>" }, { "name": "SSH_KEY", "value": "<REPLACEME>" } ]}
[/highlight]
In the JSON above, we:
Configure the schedule to run (see the Chronos README for in-depth information about specifying ISO-8601 schedules). In this example, we run every hour, beginning at midnight on March 13, 2015; a few more example schedule strings are sketched just after this list.
Name our job (in Chronos, names are IDs, so choose carefully)
Specify the Docker container to pull down
List URIs to pull assets from (i.e. our Python script)
Specify the command to run
Specify various environment variables
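For reference, Chronos schedules are ISO-8601 repeating intervals of the form R[n]/start/period. Here is a short, illustrative sketch of a few variations (the first is the schedule used above; the others are examples, not from our production configuration):
[highlight code='python']
# Illustrative Chronos schedule strings (ISO-8601 repeating intervals).
# Format: R[n]/<start>/<period>; omitting n means "repeat forever".

every_hour_forever = "R/2015-03-13T00:00:00Z/PT1H"  # the schedule used above
every_30_minutes = "R/2015-03-13T00:00:00Z/PT30M"   # half-hourly, forever
five_daily_runs = "R5/2015-03-13T00:00:00Z/P1D"     # run five times, once a day
run_exactly_once = "R1/2015-03-13T00:00:00Z/P1D"    # a one-off job
[/highlight]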
Command
$MESOS_SANDBOX is a special environment variable that the Mesos Docker executor provides to the running task. The executor mounts the task's working directory (its sandbox) into the container, and $MESOS_SANDBOX holds the path at which it is mounted.
In this example, we cd into the $MESOS_SANDBOX directory and execute integration_test.py as a script.
Environment Variables
The name/value pairs within the environmentVariables array are made available to the job and are also implicitly passed through to Docker (much like the -e name=value flags used when running the container directly).
Networking Rules
This part is specific to your setup: you may need to ensure that the Mesos cluster on which your job will execute has access to the service you're testing. In our case, we needed to whitelist access from our cluster to the cloud instances hosting the service under test.
Using a private Docker Hub image
Whilst we used a public image in this example, you'll often want to schedule containers based on private images. This is easy to accomplish: simply add a URI pointing to a valid .dockercfg file to the uris field of your job description, as sketched below.
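As a rough sketch of what that change could look like, here's one way to append a .dockercfg artifact to the job description from earlier (the artifact URL is a hypothetical placeholder):
[highlight code='python']
import json

# Add a .dockercfg artifact so the slave can authenticate against the
# private registry before pulling the image. The URL below is hypothetical.
with open('integration-test.json') as f:
    job = json.load(f)

job['uris'].append('https://secure.artifact.store/.dockercfg')

with open('integration-test.json', 'w') as f:
    json.dump(job, f, indent=2)
[/highlight]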
Running your job
Using the excellent cURL alternative httpie, you can easily post your job description to create a new job:
http -v POST my.chronos.host:8081/scheduler/iso8601 < integration-test.json
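If you'd rather create the job programmatically, a minimal sketch using the requests library (assuming the job description is saved as integration-test.json, as above) might be:
[highlight code='python']
import json
import requests

# POST the job description to Chronos's ISO-8601 scheduler endpoint.
# 'my.chronos.host' is the placeholder hostname used throughout this post.
with open('integration-test.json') as f:
    job = json.load(f)

resp = requests.post('http://my.chronos.host:8081/scheduler/iso8601', json=job)
resp.raise_for_status()
print("Created job: HTTP {}".format(resp.status_code))
[/highlight]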
It'll soon show up in the Chronos UI at http://my.chronos.host:8081. You can force run it through the UI to see if it's all working correctly. If not (if the status changes to failed), you can access task logs through the Mesos UI or using the Mesos command line tool.
Summary
We find this setup invaluable for automatically checking that our services are up and running correctly. While this is a fairly specific example of how we use our infrastructure, it shows how straightforward it is to set up and run any sort of Dockerized batch job with Chronos on a Mesos cluster.