Building a serverless site availability monitoring platform with Telegraf, AWS Fargate and InfluxCloud

Ben Tasker

2022-05-28 14:00

I use a free Uptime Robot account to help keep an eye on the availability of www.bentasker.co.uk.

Every 5 minutes, UptimeRobot places requests to my site (and it's origin) and reports on how long those requests take to complete and updates my status page if there are issues.

The free tier only tests from a single location, but is usually a good indicator of when things are going (or starting to go) wrong. I use my uptime-robot exec plugin for Telegraf to pull stats from UptimeRobot into InfluxDB for use in dashboards.

Because I test against my origin as well as the CDN, it's usually possible to tell (roughly) where an issue lies: if CDN response time increases, but origin doesn't, then the issue is likely on the CDN.

Earlier in the week, I saw a significant and sustained increase in the latency UptimeRobot was reporting for my site, with no corresponding change in origin response times.

UptimeRobot reports increase in latency fetching from my site

This suggests possible CDN issues, but the increase wasn't reflected in response time stats drawn from other sources:

There was no increase in the response time metrics recorded by my Privacy Friendly Analytics system
The Telegraf instance (checking the same endpoints) on my LAN wasn't reporting an increase

Given that pfanalytics wasn't screaming at me and that I couldn't manually reproduce the issue, I felt reasonably confident that whatever this was, it wasn't impacting real users.

But, I decided that it would be useful to have some other geographically distributed availability system that I could use for investigation and corroboration in future.

I chose to use Amazon's Elastic Container Server (ECS) with AWS Fargate to build my solution.

This post walks through the process of setting up a serverless solution which runs Telegraf in a Fargate cluster and writes availability data into a free InfluxDB Cloud account.

Why Fargate?

We could achieve the same aim by running Telegraf on some EC2 instances: setup would be absolutely trivial.

However, Telegraf is only going to be polling periodically and the EC2 instances would be running (and being billed for) constantly.

With Fargate we only pay for the time that a container uses (with a minimum billing period of 1 minute per invocation), so we should be able to reduce costs significantly by running Telegraf's docker image within Fargate.

At time of writing, AWS's pricing in us-east-2 is

EC2 t3.micro charged at $0.0104/hr ($7.20 month)
Fargate: charged at $0.04048 per vcpu/hr + $0.004445 GB RAM/hr

(data transfer is extra, but is the same rate for each)

To work out likely Fargate costs we need to make a few calculations based on the following criteria

average run will always be < 1 minute (in practice it's actually more like 2 seconds, so there's a good margin for error here)
we use 0.25 vcpu and 0.5GB RAM (I tested and this is more than enough for our purposes)
we run at 5 minute intervals

We can work out the expected cost as follows

cpu_seconds = 720 # 1 minute of billed time every 5min (60*12)
num_cpus = 0.25 # 0.25 vcpu
gb_ram = 0.5
hourly_cpu_cost = 0.04048
hourly_ram_cost = 0.004445

# Calculate
cpu_cost_second = hourly_cpu_cost / 3600
billed_cpu_cost = (cpu_seconds * cpu_cost_second) * num_cpus
ram_cost_second = hourly_ram_cost / 3600
billed_ram_cost = (ram_cost_second * cpu_seconds) * gb_ram

# Total
total_cost = billed_ram_cost + billed_cpu_cost
print(f"{total_cost} (cpu: {billed_cpu_cost}, ram: {billed_ram_cost})")

Our expected cost for Fargate is $0.002469/hr ($1.777/month), plus data charges.

Telegraf Config Setup

We're going to use quite a simple Telegraf configuration, relying entirely on Telegraf's http_response plugin.

The configuration for this is as simple as

[[inputs.http_response]]
    urls = [
    "https://www.bentasker.co.uk", 
    "https://origin.bentasker.co.uk"
            ]
    response_status_code = 200

We also want a tag value to report the region the probe is testing from. We'll set the value via environment variable, so we create a section to define the tag and pull the value from the container's environment.

[global_tags]
  region = "${TEST_REGION}"

For flexibility, we want the configuration file to be held centrally (so that each region we test from can pull the same single config file and receive any future updates we make to it).

We don't want to host our config on the infrastructure we're supposed to be monitoring, either, because it'd mean Telegraf probes would go down when we need them most (when we're down).

Luckily, Influx Cloud allows us to create and host Telegraf config files

Telegraf

Selecting http_response gives us a prefilled config template, which we can then adjust to suit our needs

# Configuration for telegraf agent
[agent]
  interval = "20s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  hostname = ""
  # We don't care about the hostname of the probe
  omit_hostname = true

[[inputs.http_response]]
    urls = [
    "https://www.bentasker.co.uk", 
    "https://origin.bentasker.co.uk/"
    ]
    response_status_code = 200

[global_tags]
  region = "${TEST_REGION}"

[[outputs.influxdb_v2]]
  urls = ["https://eu-central-1-1.aws.cloud2.influxdata.com"]
  token = "$INFLUX_TOKEN"
  organization = "<my orgname>"
  bucket = "telegraf"

Once we save, InfluxCloud provides a URL and a token, both of which we'll use later.

Telegraf Invocation

The Telegraf docker container, by default, runs Telegraf as a service.

We don't want to pay for the time between collection intervals though, so we need Telegraf to run once and then exit. Luckily, there's a --once option which tells it to do a metrics run, write to outputs and then exit.

We also need to tell Telegraf where to find our InfluxCloud configuration file (with --config) and export some environment variables.

If we were running on a standard server, the commands we'd use for this would be (change eu-west to whatever string works best for you)

export INFLUX_TOKEN="<token>"
export TEST_REGION="eu-west"
telegraf --once --config "<influxcloud telegraf url>"

Later, we'll use this as a guide to help us configure the container in AWS.

Fargate Setup

Next, we need to do the AWS side setup. The process is actually quite simple.

Log into console.aws.amazon.com and go to the Elastic Containers Service.

Then, create a cluster using the Fargate type (tick to create a new VPC if you want to seperate this out from your other AWS infra)

Cluster Creation

Go to Task Definitions and click Create new Task Definition, choose FARGATE and define a few options

Creating a task definition

Next, scroll down and click Add container

Set the standard container options:

Name: telegraf
Image: telegraf:latest

And optionally define a memory limit

Container Creation

In the Environment section we override a few options:

We set the command to

telegraf,--once,--config,https://eu-central-1-1.aws.cloud2.influxdata.com/api/v2/telegrafs/00fffaaaa1111

Where the URL is the URL to our InfluxCloud hosted Telegraf configuration file. This is the command we designed in Telegraf Invocation with the command segments delimited by commas rather than spaces.

We then set a couple of environment variables

INFLUX_TOKEN: <token>
TEST_REGION: eu-west

These are the exports we used in Telegraf Invocation. Note: These can't be edited without deleting and recreating the container definition, but they can be overridden when the task is scheduled later.

Which should lead to the interface looking something like this

Container Environment

Click Add followed by Create and the task definition is created.

First Run

We can now run the task manually to check that everything is working

Back to the Tasks Definition Page
Check the box for the relevant Task Definition
Actions -> Run Task
Choose FARGATE
Set Operating System Family to Linux
Select your Cluster
Select the VPC
Select a subnet
Click Run Task

The task will be created and triggered

Task Created

If you click the little refresh icon, the task should eventually disappear (and only be visible when you change Desired task status to Stopped).

At this point, there should be data in our InfluxCloud bucket

from(bucket: "telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "http_response")
  |> filter(fn: (r) => r["_field"] == "response_time")

Data in bucket

Scheduling

Next we want to schedule the task to run automatically at a fixed interval

Go to Clusters
Click into the Fargate Cluster
Click the Scheduled Tasks tab
Click Create
Set a recognisable name (I used TelegrafFargate)
Ensure the Schedule rule type is Run at fixed interval
Set the Interval to 5 minutes
Set a recognisable ID (I used telegraf-website-monitor)
Set Launch type to FARGATE
Select your task
Number of tasks: 1
Select a VPC
Select a Subnet

Creating Scheduled Task

Click Create.

Note: the task will also now show up as a rule in AWS Eventbridge.

In most regions, the task will trigger immediately (Singapore seems to be an exception - it waits until the next interval to run) and then will run at 5 minute intervals after that.

Repeat in other regions

We can then follow the exact same process (changing only the value of our TEST_REGION environment variable) within other AWS regions so that we have probes coming from different parts of the globe.

Graphing

Now that we're ingesting data, we need to visualise it.

We're going to use an InfluxCloud Notebook because a Notebook has the option to be publicly shared, meaning that it could potentially be used as a service status page.

To calculate the current service status we take values from the last write by each probe (using last()).

Telegraf's http_response input plugin sets response_status_code_match to 1 if the HTTP response code matches the value we specified in config (in this case, a 200) and 0 if not.

We want to be able to add and remove probes without needing to adjust the query thresholds, so we average the result. Because the maximum possible value is 1 we can treat this as a percentage value.

If the value is 1 (100%), everything is Up
If > 0.75 then things are Mostly Up
If over half (0.5) are up then we're Degraded
If < 0.5 are up, then consider us Down

This logic can be implemented via the following Flux query

from(bucket: "telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "http_response")
  |> filter(fn: (r) => r["_field"] == "response_status_code_match")
  |> filter(fn: (r) => r["server"] == "https://www.bentasker.co.uk")
  |> last()
  |> group()
  |> mean()
  |> map(fn: (r) => ({r with _value: 
    if r._value == 1 then
       "Up"
    else if r._value < 1 and r._value > 0.75 then
       "Mostly Up"
    else if r._value < 0.75 and r._value > 0.5 then
       "Degraded"
    else
       "Down"
  }))

Current Service Status

We can use a comparatively simple query to create a graph of response times

from(bucket: "telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "http_response")
  |> filter(fn: (r) => r["_field"] == "response_time")
  |> filter(fn: (r) => r["server"] == "https://www.bentasker.co.uk")
  |> group(columns: ["region", "server"])
  |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
  |> map(fn: (r) => ({
      _time: r._time,
      _field: r._field,
      _value: r._value * 1000.0,
      region: r.region,
      server: r.server
  }))

We select results for the site, calculate a periodic mean and then use map() to convert the value to milliseconds.

Response Times

Next, we might want a cell displaying some headline stats, the query for this is a little more complex as we need to use join() to get our aggregates into the same table.

// Select the data
data = from(bucket: "telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "http_response")
  |> filter(fn: (r) => r["_field"] == "response_time")
  |> filter(fn:(r) => r["server"] == "https://www.bentasker.co.uk")
  |> group(columns: ["region"])
  |> map(fn:(r) => ({r with _value: r._value * 1000.0}))
  |> keep(columns: ["_value", "region"])

// Calculate aggregates
mean = data
  |> mean()

max = data
  |> max()

min = data
  |> min()

p95 = data
  |> quantile(q:0.95)


// Perform joins
j1 = join(tables: { mean: mean, max: max }, 
      on: ["region"],
      method: "inner" 
   )

j2 = join(tables: { min: min, j1: j1 }, 
      on: ["region"],
      method: "inner" 
    )

joined = join(tables: { min: j2, perc: p95 }, 
      on: ["region"],
      method: "inner" 
    )

// Format for output
joined
  |> map(fn: (r) => ({
      region: r.region,
      min: r._value_min,
      max: r._value_max,
      mean: r._value_mean,
      p95: r._value_perc
  }))
  |> group()

Response Times Headlines

So that we've the ability to prove any variations on/off CDN, we create the same set of cells for the probes that go to the origin: same queries, different filter for the server tag.

from(bucket: "telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "http_response")
  |> filter(fn: (r) => r["_field"] == "response_time")
  |> filter(fn: (r) => r["server"] == "https://origin.bentasker.co.uk")
  |> group(columns: ["region", "server"])
  |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
  |> map(fn: (r) => ({
      _time: r._time,
      _field: r._field,
      _value: r._value * 1000.0,
      region: r.region,
      server: r.server
  }))

Origin Availability Section

Conclusion

I seem to have got through this post without accidentally typoing in either Farscape or Stargate, so let's take a look at what we've done:

Created a Telegraf configuration file hosted on InfluxCloud
Configured a Fargate cluster on Amazon's Elastic Container Service
Created a task definition to run a Telegraf docker container
Configured that container to fetch our InfluxCloud config, do a metrics run based upon it and then exit
Created a Notebook in InfluxCloud

The result is a serverless setup feeding into a page which can be shared publicly to display an overview of current service status.

There are improvements that could be made on the setup I've built:

It'd probably be prudent to move the InfluxCloud token from an environment variable into AWS Secrets Manager
An additional tag and environment variable could be added to note the provider the probe's running on: that way a multicloud setup could be used to probe from different cloud environments

At some point, I might also create a variation of the notebook that can be sent to my CDN provider on trouble tickets.

They inject various response headers

Response headers

Using the http_response plugin's http_header_tags config option would allow these values to be included in the Notebook to facilitate troubleshooting.

Setup of this Fargate based solution is relatively cheap, coming in at around $1.78/month per probe. My total bill for running in 4 regions should be about $7.12 a month, less than a quarter of the $28.80/month cost of a solution built around EC2 t3.micro instances.