Content Delivery Monitoring in AWS with CloudWatch

This post describes a way of monitoring a Tridion 9 combined Deployer by sending the health checks into a custom metric in CloudWatch in AWS. The same approach can also be used for other Content Delivery services. Once the metric is available in CloudWatch, we can create alarms in case the service errors out or becomes unresponsive.

The overall architecture is as follows:
  • Content Delivery service sends heartbeat (or exposes HTTP endpoint) for monitoring
  • Monitoring Agent checks heartbeat (or HTTP health check) regularly and stores health state
  • AWS Lambda function:
    • runs regularly
    • reads the health state from Monitoring Agent
    • pushes custom metrics into CloudWatch
I am running the Deployer (installation docs) and Monitoring Agent (installation docs) on a t2.medium EC2 instance running CentOS on which I also installed the Systems Manager Agent (SSM Agent) (installation docs).

In my case I have a combined Deployer that I want to monitor. This consists of an Endpoint and a Worker. The Endpoint uses passive monitoring -- the Monitoring Agent accesses the Endpoint URL using HTTP(S) to read the health status. The Worker uses active monitoring -- it sends heartbeats to the Monitoring Agent reporting health status.

Configure Content Delivery Heartbeats

For my Deployer Worker, the monitoring heartbeats are configured in the file deployer-config.xml by adding the following configuration node:

<Monitoring ServiceType="DeployerWorker" Interval="60s" GenerateHeartbeat="true"/>

At the time of writing, the documentation is a bit buggy -- I noticed the settings above work, although they yield validation exceptions in the logs.

Configure the Monitoring Agent

I'm using the Monitoring Agent to check the health status of the Deployer Endpoint. I'm using the following cd_monitor_conf.xml:

<?xml version="1.0" encoding="UTF-8"?>
<MonitoringAgentConfiguration Version="11.0"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

    <StartupPeriod StartupValue="60s"/>

    <HeartbeatMonitoring ListenerPort="20131" EnableRemoteHeartbeats="true">
        <AutomaticServiceRegistration RegistrationFile="RegisteredServices.xml"/>
        <Services/>
    </HeartbeatMonitoring>

    <ServiceHealthMonitorBindings>
        <ServiceHealthMonitorBinding Name="HttpServiceHealthMonitor"
                Class="com.tridion.monitor.polling.HTTPHealthMonitor"/>
    </ServiceHealthMonitorBindings>

    <ServiceHealthMonitors>
        <HttpServiceHealthMonitor ServiceType="DeployerEndpoint" PollInterval="60s" TimeoutInterval="30s">
            <Request URL="http://localhost:8084/mappings" RequestData=""/>
            <Response SuccessPattern="httpupload"/>
        </HttpServiceHealthMonitor>
    </ServiceHealthMonitors>

    <WebService ListenerPort="20132"/>
</MonitoringAgentConfiguration>

Notice that I am not using a Monitoring Agent Web Service, because it is not needed. Instead, I am using the netcat (nc) Unix command to retrieve the statuses from the Monitoring Agent:

echo "<StatusRequest/>" | nc localhost 20132
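
For readers who prefer not to depend on nc, the same request can be sent over a plain TCP socket. The following is a minimal Python 3 sketch, not something taken from the product documentation: the function name query_monitoring_agent is hypothetical, and it assumes the Monitoring Agent writes its reply and then closes the connection once the client has finished sending.

```python
import socket


def query_monitoring_agent(host="127.0.0.1", port=20132, timeout=5.0):
    """Send <StatusRequest/> to the Monitoring Agent and return the reply as text.

    Assumption: the agent closes the connection after writing its response,
    so reading until end-of-stream yields the complete XML document.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"<StatusRequest/>\n")
        # Half-close the write side so the agent sees the end of the request.
        sock.shutdown(socket.SHUT_WR)
        while True:
            data = sock.recv(4096)
            if not data:  # empty read means the peer closed the connection
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8")
```

Calling query_monitoring_agent() on the Deployer machine should return the same XML that the nc command prints. The timeout keeps the call from hanging if the agent is down.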

The Monitoring Agent has a simple built-in service that listens on TCP port 20132 for incoming connections. If a client sends the command <StatusRequest/> to this socket, the Monitoring Agent responds with an XML document containing the statuses of all components it monitors:

<StatusResponse>
    <ServiceStatus>
        <ServiceType>DeployerEndpoint</ServiceType>
        <ServiceInstance></ServiceInstance>
        <ProcessId>-1</ProcessId>
        <Status>OK</Status>
        <StatusChangeTime>2018-12-22T15:51:07Z</StatusChangeTime>
        <LastReportTime>2018-12-22T15:50:07Z</LastReportTime>
        <MonitoredThreadCount>-1</MonitoredThreadCount>
    </ServiceStatus>
    <ServiceStatus>
        <ServiceType>DeployerWorker</ServiceType>
        <ServiceInstance>dummy</ServiceInstance>
        <ProcessId>7152</ProcessId>
        <Status>OK</Status>
        <StatusChangeTime>2018-12-22T15:50:10Z</StatusChangeTime>
        <LastReportTime>2018-12-22T17:21:14Z</LastReportTime>
        <MonitoredThreadCount>3</MonitoredThreadCount>
        <NonRespondingThreads></NonRespondingThreads>
    </ServiceStatus>
</StatusResponse>

The information in this XML is precisely what we want as custom metrics in AWS CloudWatch.
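
To make the mapping from this XML to metric values concrete, here is a small standalone Python 3 sketch that extracts the service type and status from each ServiceStatus node. The function name parse_status_response and the shortened sample document are illustrative only. The status-to-number mapping is the one used by the Lambda function later in this post, and returning None for an unknown status is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

# Same mapping as in the Lambda function: 0 = OK, 1 = Error, 2 = NotResponding.
STATUS_CODES = {"OK": 0, "Error": 1, "NotResponding": 2}

# Shortened version of the StatusResponse shown above.
SAMPLE = """<StatusResponse>
  <ServiceStatus>
    <ServiceType>DeployerEndpoint</ServiceType>
    <Status>OK</Status>
  </ServiceStatus>
  <ServiceStatus>
    <ServiceType>DeployerWorker</ServiceType>
    <Status>OK</Status>
  </ServiceStatus>
</StatusResponse>"""


def parse_status_response(xml_text):
    """Return one dict per ServiceStatus node with its type, status and numeric code."""
    root = ET.fromstring(xml_text)
    results = []
    for node in root.iter("ServiceStatus"):
        service_type = (node.findtext("ServiceType") or "").strip()
        status = (node.findtext("Status") or "").strip()
        results.append({
            "service_type": service_type,
            "status": status,
            # Assumption of this sketch: unknown statuses map to None
            # instead of raising, so the caller decides how to handle them.
            "code": STATUS_CODES.get(status),
        })
    return results


for entry in parse_status_response(SAMPLE):
    print(entry)
```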

The Monitoring Agent server socket listens only on 127.0.0.1, so it can't be accessed remotely. This dictates how we retrieve the XML response and push it into CloudWatch. Enter the Lambda...

AWS Lambda Function

The function is triggered by a scheduled CloudWatch event. In my case, I chose to fire it every minute.

The Lambda uses the boto3 API in order to:
  1. Run the nc command remotely on the Deployer instance using the SSM API and capture its output
  2. Create custom metrics from the XML output using the CloudWatch API
The code is written in Python 2.7 and looks like this:

import boto3
import time
from xml.dom.minidom import parseString

# Numeric values stored in CloudWatch for each Monitoring Agent status
statuses = {"OK": 0, "Error": 1, "NotResponding": 2}
ssmClient = boto3.client('ssm')
cwClient = boto3.client('cloudwatch')

def lambda_handler(event, context):
    # Run the nc command on every instance tagged Type=Deployer
    response = ssmClient.send_command(
        Targets = [{'Key':'tag:Type','Values':['Deployer']}],
        DocumentName = 'AWS-RunShellScript',
        TimeoutSeconds = 30,
        Parameters = { 'commands': ['echo "<StatusRequest/>" | nc localhost 20132'] }
    )

    # Poll until the command has finished on all targets
    commandId = response['Command']['CommandId']
    status = response['Command']['Status']
    while status == 'Pending' or status == 'InProgress':
        time.sleep(2)
        response = ssmClient.list_commands(CommandId = commandId)
        status = response['Commands'][0]['Status']

    response = ssmClient.list_command_invocations(CommandId = commandId)

    for invocation in response['CommandInvocations']:
        instanceId = invocation['InstanceId']
        instanceName = invocation['InstanceName']
        response = ssmClient.get_command_invocation(CommandId = commandId, InstanceId = instanceId)

        # Skip instances that produced no output (e.g. Monitoring Agent not running)
        output = response['StandardOutputContent']
        if not output:
            continue

        dom = parseString(output)
        statusArray = dom.getElementsByTagName('ServiceStatus')

        # One custom metric per monitored service on this instance
        for statusEl in statusArray:
            ServiceType = statusEl.getElementsByTagName('ServiceType')[0].firstChild.data
            MetricName = "SDL" + ServiceType.replace(" ", "") + "Status"
            Status = statusEl.getElementsByTagName('Status')[0].firstChild.data
            StatusNumber = statuses[Status]

            cwClient.put_metric_data(
                Namespace = 'SDL Web',
                MetricData = [{
                    'Dimensions': [{
                        'Name': 'InstanceName',
                        'Value': instanceName
                    }],
                    'MetricName': MetricName,
                    'Value': StatusNumber,
                    'Unit': 'None'
                }]
            )

    return None


Brief code explanation:
  • Send the nc command to all instances whose tag Type equals Deployer, since I don't feel like keeping track of instance IDs. Currently I have only one instance, but in a production environment the Deployer will be separated into an Endpoint and several Worker instances;
  • Wait until the command has finished executing on all target instances, i.e. its status is no longer Pending or InProgress;
  • Read each CommandInvocation of the command, so that we can retrieve the output from each individual instance;
  • Read the StandardOutputContent from each invocation and parse it into a DOM;
  • For each ServiceStatus node in the XML, translate the Status text into a code (0 = OK, 1 = Error, 2 = NotResponding);
  • Push a custom metric into CloudWatch using the instance name as dimension (e.g. dev-deployer.mitza.net), "SDL" + ServiceType + "Status" as metric name (e.g. SDLDeployerWorkerStatus), and the translated Status as metric value.
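
With the metric in place, an alarm can be attached to it. The following Python 3 sketch shows one possible way to do that with the boto3 put_metric_alarm call. The helper names, the alarm naming scheme, the three evaluation periods and the choice to treat missing data as breaching (so that a silent instance also raises the alarm) are all assumptions made for illustration, not settings from my environment.

```python
def build_status_alarm(instance_name, service_type, sns_topic_arn=None):
    """Build put_metric_alarm parameters for one service status metric.

    The metric name, namespace and dimension mirror the Lambda function above.
    Everything else (alarm name, periods, missing-data handling) is an
    illustrative assumption.
    """
    metric_name = "SDL" + service_type.replace(" ", "") + "Status"
    params = {
        "AlarmName": "%s-%s" % (instance_name, metric_name),
        "Namespace": "SDL Web",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "InstanceName", "Value": instance_name}],
        "Statistic": "Maximum",
        "Period": 60,                # the Lambda pushes one data point per minute
        "EvaluationPeriods": 3,      # assumption: alarm after 3 bad minutes
        "Threshold": 1,              # 1 = Error, 2 = NotResponding
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "breaching",  # assumption: no data counts as unhealthy
    }
    if sns_topic_arn:
        params["AlarmActions"] = [sns_topic_arn]
    return params


def create_status_alarm(instance_name, service_type, sns_topic_arn=None):
    """Create (or update) the alarm in CloudWatch; requires AWS credentials."""
    import boto3  # imported here so build_status_alarm works without boto3
    client = boto3.client("cloudwatch")
    client.put_metric_alarm(
        **build_status_alarm(instance_name, service_type, sns_topic_arn)
    )
```

For example, create_status_alarm("dev-deployer.mitza.net", "DeployerWorker") would create an alarm on the SDLDeployerWorkerStatus metric. Passing an SNS topic ARN adds a notification action.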

Eventually, when all is working, the following metrics are available in CloudWatch: