Monitoring What Matters

Brad Fair
09.08.20 06:28 PM

I've monitored many different types of systems over the past 20 years: from modem banks, routers, and authentication systems to storage area networks, database clusters, and analytics systems. It’s not always easy to know what to monitor, especially in complicated, interrelated systems that each produce tons of metrics. App developers and product managers still find it hard to know where to start, even when it comes to monitoring their own creations. In lieu of a better alternative, many of them start where I once did – the four core resources:

  • Processor
  • Memory
  • Network
  • Disk

It's a reasonable place to start, because you can trace many performance issues back to those resources. But it can take a long time to figure out which metrics matter, and it misses an entire class of interesting issues. Now I prefer a more focused approach, which I’ve found yields better results faster.

Start with the End in Mind — What's the Work?

I believe that technology should help people accomplish their goals. It should be in service of others. It should generate useful output. It should just work. So, I prefer to start there, with the end in mind. When analyzing any system, technical or otherwise, I ask myself "what work does this system produce?" Consider it from the perspective of the person who consumes that work, too — that's where the system's value is captured, after all. For example:

  • For a visual analytics system such as Tableau or Power BI, the work product might be to render a dashboard, or compile a list of dashboards that a user can select from. It might be to send an email with time-sensitive information in order to give context to an executive who is making a decision.
  • This method works for non-technical systems too. Consider receptionists sitting at an office's front desk. Their work might be to answer incoming phone calls, or to greet office visitors, or to sign for and distribute packages.

This "work" might be considered the system's purpose, and you would probably be interested in knowing when the quality/quantity of work changes.

Measure What Matters — Work Metrics

Once you've named a system’s work product, think about how to measure that output. Common types of measures for work include the following (a small instrumentation sketch follows the list):

  • Throughput - How much work per unit of time is the system doing?
  • Success/Error Rates - What percentage of work is considered successful during a specific time frame? What percentage of work is considered erroneous?
  • Duration - How long does it take to produce an output or unit of work?
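
To make this concrete, here's a minimal sketch of capturing all three work metrics around a single unit of work. It uses only the Python standard library, and the `render_dashboard` function and metric names are hypothetical stand-ins for whatever your system actually does; a real setup would ship these numbers to a monitoring backend instead of keeping them in memory.

```python
import time
from collections import defaultdict

# Tiny in-memory metrics store; illustrative only.
counters = defaultdict(int)      # throughput and success/error counts
durations = defaultdict(list)    # raw durations, for averages or percentiles later

def measure_work(name, func, *args, **kwargs):
    """Run one unit of work and record throughput, success/error counts, and duration."""
    start = time.monotonic()
    try:
        result = func(*args, **kwargs)
        counters[f"{name}.success"] += 1
        return result
    except Exception:
        counters[f"{name}.error"] += 1
        raise
    finally:
        counters[f"{name}.total"] += 1                      # throughput
        durations[name].append(time.monotonic() - start)    # duration

# Hypothetical work product: rendering a dashboard.
def render_dashboard(dashboard_id):
    time.sleep(0.01)   # stand-in for the real rendering work
    return f"rendered {dashboard_id}"

measure_work("dashboard.render", render_dashboard, "sales-overview")
print(dict(counters), dict(durations))
```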


You should also account for the different dimensions associated with the work, to help spot patterns that might otherwise remain hidden. In many monitoring systems, this additional info can be added to metrics as tags; there's a short sketch of the idea after the list. For instance:

  • Visual Analytics System: Break the above metrics down per node, per end-user, per location, per dashboard, etc... This allows you to view the metrics across different dimensions, and quickly isolate the relevant variables.
  • Receptionists: Capture metrics per receptionist, per location, or per interaction type (greeting a caller on the phone vs. greeting an office visitor).
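
In code, a tag is often just a small set of key/value pairs recorded alongside each measurement. Here's a rough sketch; the dimension names and values (node, user, dashboard) are illustrative assumptions, not a prescribed schema.

```python
measurements = []  # each entry: (metric name, value, tags)

def record(metric, value, **tags):
    """Record a measurement along with its dimensions (tags)."""
    measurements.append((metric, value, tags))

# Hypothetical measurements with dimensions attached.
record("dashboard.render.duration", 1.8,
       node="node-2", user="alice", dashboard="sales-overview")
record("dashboard.render.duration", 0.4,
       node="node-1", user="bob", dashboard="inventory")

# Later you can slice by any dimension, e.g. average duration per node.
by_node = {}
for metric, value, tags in measurements:
    if metric == "dashboard.render.duration":
        by_node.setdefault(tags["node"], []).append(value)
print({node: sum(vals) / len(vals) for node, vals in by_node.items()})
```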


You can infer quite a bit about a system's internal state given only these work metrics. For instance, tracking the "duration" metric would let you quickly see when a dashboard takes longer to load than it normally does, so you can get in front of problems before they spiral out of control! We can take it one step further though, so that we can zoom in on the _cause_ of a problem. How do we zoom in? Resource metrics!
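
Before we do, here's a toy illustration of spotting "longer than it normally does" from the duration metric alone: compare each new duration against a rolling baseline. Real monitoring tools do this far more robustly, but the idea is the same; the window size and threshold below are arbitrary assumptions.

```python
from collections import deque
from statistics import mean, stdev

recent = deque(maxlen=100)   # rolling window of recent durations (seconds)

def is_unusually_slow(duration, threshold=3.0):
    """Flag a duration more than `threshold` standard deviations above the
    rolling mean. Purely illustrative; tune the window and threshold for your system."""
    slow = len(recent) >= 10 and duration > mean(recent) + threshold * stdev(recent)
    recent.append(duration)
    return slow

for d in [1.1, 0.9, 1.0, 1.2, 1.0, 0.95, 1.05, 1.1, 0.9, 1.0, 5.0]:
    if is_unusually_slow(d):
        print(f"dashboard took {d}s - investigate!")
```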

Record the Resources

Once you know a system's output, list the resources the system uses to generate that output. It's okay if it's an incomplete list at first, because people tend to be surprisingly good at identifying the important ones. For example:

  • Visual Analytics System: In order to render a dashboard for an end user, the system might depend on a database that contains the dashboard definition; the data to populate the dashboard; a connection pool that maintains open connections to those databases, ready to query; a place to cache results; a way to determine which user can see what data; a way to crunch the numbers; a way to send the results to the end user; the list goes on. Notice that I've included a few of the four core resources, but I'm not focused specifically on them — there's so much more to this system!
  • Receptionists: The phone, pen and paper, the desk, the lobby, a pushcart for packages -- all of these are resources that a receptionist might use to do their work.

Measure What Matters — Resource Metrics

After you're satisfied with your list of resources, determine which metrics would help you understand how each resource is being used (a sketch follows the list):

  • Utilization - The percent of time the resource is not idle, or how much of a resource's finite capacity is used. For example, the percent of time that connections were querying a database; or the percent of time a phone was in use.
  • Saturation - The amount of work waiting to be serviced by the resource, such as disk queue length or the number of calls in the queue for the receptionist.
  • Errors - The number of errors that might not be visible in the work/output itself, like cache misses or TCP retransmits, or failed call transfers due to invalid/misdialed extensions.
  • Availability - The percent of time the resource is available to respond to requests. Alternatively, the percent of time the resource did respond to requests. A server that can handle multiple requests at once might be non-idle yet still available (which is how some utilization measures end up above 100%). On the other hand, a receptionist might be tending to a customer in the lobby, and therefore unable to answer an incoming phone call.
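
Here's a rough sketch of what sampling those four metric types might look like for a database connection pool. The `ConnectionPool` object and its attributes are hypothetical stand-ins for whatever state your resource actually exposes.

```python
from dataclasses import dataclass, field

@dataclass
class ConnectionPool:
    """Hypothetical pool exposing just enough state to sample resource metrics."""
    size: int = 10            # total connections (finite capacity)
    busy: int = 0             # connections currently executing queries
    wait_queue: list = field(default_factory=list)  # requests waiting for a connection
    errors: int = 0           # e.g. failed connection attempts
    reachable: bool = True    # can we talk to the database at all?

def sample_resource_metrics(pool):
    return {
        "utilization": pool.busy / pool.size,         # share of capacity in use
        "saturation": len(pool.wait_queue),           # work waiting for the resource
        "errors": pool.errors,                        # failures not visible in the output
        "availability": 1.0 if pool.reachable else 0.0,
    }

pool = ConnectionPool(busy=7, wait_queue=["q1", "q2"], errors=1)
print(sample_resource_metrics(pool))
```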


Don’t forget to tag these with the same kinds of dimensions you tagged your work metrics with!

Getting Resourceful — Zooming In

You may have realized that a "resource" could be considered another system altogether. From the perspective of a visual analytics system, a database server is a resource it needs to do its work: its query result is one of the inputs needed to render the dashboard. But from the perspective of the database server, a query result is the work, and it uses different resources to generate that output. When we treat each resource as a system of its own, with its own work/resource metrics, we can zoom in and solve problems faster – especially in complicated systems.
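
One way to picture this "systems all the way down" view is a recursive structure: each system has its own work metrics and a list of resources, and each resource is just another system. This is a conceptual sketch with made-up numbers, not a suggestion for how to store metrics.

```python
from dataclasses import dataclass, field

@dataclass
class System:
    name: str
    work_metrics: dict = field(default_factory=dict)   # throughput, errors, duration...
    resources: list = field(default_factory=list)      # each resource is itself a System

database = System("database", {"queries_per_sec": 120, "query_duration_p95": 0.3})
conn_pool = System("connection pool", {"utilization": 0.7, "saturation": 2})
analytics = System("visual analytics", {"dashboards_per_min": 40, "error_rate": 0.01},
                   resources=[database, conn_pool])

def walk(system, depth=0):
    """Zoom in: print each system's work metrics, then its resources'."""
    print("  " * depth + f"{system.name}: {system.work_metrics}")
    for resource in system.resources:
        walk(resource, depth + 1)

walk(analytics)
```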


There are a couple of notable differences between this and the traditional "just watch the four core resources" approach:

  • We can account for logical resources, so we can find issues that don't manifest themselves physically. Connection pool exhaustion is a great example of something that can affect end-user experience while being difficult to spot via traditional IT monitoring tools. 
  • We may identify resources we don't yet have visibility into. For instance, we might be aware that an application has implemented a cache, but have no way to extract its metrics. We might also know that we are using shared equipment in a public cloud environment, but we don't really have a way to know when a "noisy neighbor" is affecting our work. I believe "known unknowns" are better than "unknown unknowns", because we can still trace issues to their probable cause, and we can focus efforts on improving observability when and where it matters most.

Enrich with Events

If you've ever called tech support, you might be familiar with the question "have you made any changes recently?" It's a relevant question, because you can normally trace changes in a system's behavior back to a specific event. Recording events as they happen can help you quickly find an issue’s root cause. When you enrich your metrics data with events, you'll also have the added benefit of being able to measure an event's real impact! For any given system, you should list out the kinds of events that might affect the system's behavior. These commonly fall into a handful of categories:

  • Code Changes - updates, upgrades, and installations
  • Configuration Changes - any change in a system's configuration, whether hardware or software
  • Tasks - recurring tasks like backups, and any one-off tasks that administrators might run
  • Infrastructure Changes - adding or removing RAM, CPU, storage; scaling up/down/out
  • Alerts - any alerts generated by this or other monitoring or management systems


After you've listed these events, think about how you can capture and record them as they occur. When capturing an event, tag it with metadata such as version numbers, task names, timestamps, commit messages, or any other details that might be relevant.
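
Recording events can be as simple as appending structured records to the same place your metrics go. A minimal sketch follows; the categories mirror the list above, and the metadata field names and values are illustrative, not prescriptive.

```python
import json
import time

events = []  # in a real setup these would go to your monitoring backend

def record_event(category, description, **metadata):
    """Record one event with a timestamp and whatever metadata is relevant."""
    events.append({
        "timestamp": time.time(),
        "category": category,      # code change, config change, task, infra change, alert
        "description": description,
        **metadata,
    })

record_event("code_change", "deployed analytics service",
             version="2.4.1", commit="a1b2c3d")   # illustrative values
record_event("task", "nightly backup started", task_name="backup-dashboards-db")

print(json.dumps(events, indent=2))
```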

Putting It All Together

By identifying your relevant work and resource metrics, tags, and events, you’ve taken the most essential step towards having your own powerful monitoring system. Where do you go from here? Here’s what we tend to do: 

  1. Collect the work metrics, resource metrics, tags, and events data in one place,
  2. Add more context by collecting and analyzing your system’s logs as well,
  3. Build a high-level dashboard that helps you immediately understand the state of your systems,
  4. Build dashboards that drill into the most important work/resource metrics for each piece of the system,
  5. Implement alerts to help you address problems while they’re still small,
  6. Instrument your custom code to add helpful context to metrics, events, and logs; and lastly,
  7. Iterate! As the system changes and improves, so should its monitoring. Solidify any lessons learned by integrating them into your monitoring and alerting system.


It might seem like a lot, but when you do it step by step, it feels like a natural progression. And it’s always worth the effort. If we can help you move from one stage to the next, please give us a call or send us an email!

Brad Fair