The Tao of Ops: Monitoring infrastructure with active testing
"If you don't monitor it, you can't manage it."
That's a variation on the business management saying "you can't manage what you can't measure" (often attributed to Peter Drucker). The saying might not always apply to business, but it definitely applies to Operations.
There are a lot of tools you can bring into your organization to help with monitoring your infrastructure, but they usually look at things only from the "inside perspective" of your own systems. To truly know if the path your Operations team is walking is sane, you need to also check on things from the user's point of view. Otherwise you are missing the best chance to fix something before it becomes a problem that leads your customers to take their business elsewhere.
Active testing of your systems from the outside is crucial, and it is easy enough to set up. For each internal system you monitor, ask yourself how you would send a query or request from the outside that exercises that system.
A good example to start with is your web server. To actively test it, you could make a curl request against the URL for your site and trigger an alert if it returns anything other than a "200 OK" response. The same pattern works for many web-facing services. Make sure to include login pages, queries, and anything else your customers use every day.
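The web server check above can be sketched in a few lines of Python using only the standard library. This is a minimal sketch, not a finished monitor: the function names (`fetch_status`, `alert_for`, `check_urls`) are made up for illustration, and what you do with the alert messages (email, chat, pager) is left to you.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses are raised as exceptions but still carry a code.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, TLS error, and so on.
        return None


def alert_for(url, status):
    """Return an alert message, or None when the status is 200."""
    if status == 200:
        return None
    if status is None:
        return f"ALERT: {url} did not respond"
    return f"ALERT: {url} returned HTTP {status}"


def check_urls(urls, fetch=fetch_status):
    """Check every URL and return the list of alert messages (empty if all is well)."""
    alerts = []
    for url in urls:
        message = alert_for(url, fetch(url))
        if message:
            alerts.append(message)
    return alerts
```

You would call `check_urls` with the pages your customers actually use, for example your home page and your login page. Note that `urlopen` follows redirects, so the status you see is the one from the final page, not from the redirect itself.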
Other checks to make include:
- DNS entries
- TLS certificates
- Mixed content warnings for your pages
- Cloud service security settings (e.g., are your S3 buckets open to the public?)
- Firewall settings (e.g., are ports open for expired services?)
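To make the first two items in that list concrete, here is a rough sketch of a DNS check and a TLS certificate expiry check, again with only the Python standard library. The function names, the 14-day warning window, and the idea of comparing against a hand-maintained set of expected addresses are all assumptions made for the example, so adjust them to your own setup.

```python
import socket
import ssl
import time


def resolve(host):
    """Return the set of IP addresses that host currently resolves to."""
    return {info[4][0] for info in socket.getaddrinfo(host, None)}


def dns_problem(host, expected, actual):
    """Return a message if the resolved addresses differ from the expected set."""
    if set(actual) == set(expected):
        return None
    return f"{host} resolves to {sorted(actual)}, expected {sorted(expected)}"


def fetch_not_after(host, port=443, timeout=10):
    """Return the notAfter field of the certificate that host presents."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_remaining(not_after, now=None):
    """Return the number of days until the notAfter timestamp is reached."""
    expires = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires - now) / 86400


def tls_problem(host, not_after, warn_days=14, now=None):
    """Return a message if the certificate expires within warn_days."""
    remaining = days_remaining(not_after, now)
    if remaining >= warn_days:
        return None
    return f"{host} certificate expires in {remaining:.0f} days"
```

A typical run would pass `resolve(host)` into `dns_problem` and `fetch_not_after(host)` into `tls_problem`. Because the default context verifies the certificate, an already expired or mismatched certificate makes `fetch_not_after` raise an `ssl.SSLError`, and that exception deserves an alert of its own.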
Each of these should be worked into a script that runs on a regular schedule, with any exceptions logged and alerted on.
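One way to tie checks like these together is a small runner that calls each one, logs the result, and reports which ones failed. The sketch below assumes a convention invented for this example: every check is a function that takes no arguments and returns `None` when things are healthy or a short message when they are not. The logger name `active_checks` is also just a placeholder.

```python
import logging

logger = logging.getLogger("active_checks")


def run_checks(checks):
    """Run each named check and return the names of the ones that failed.

    checks maps a name to a zero-argument callable that returns None when
    healthy or a message describing the problem.
    """
    failures = []
    for name, check in checks.items():
        try:
            problem = check()
        except Exception as exc:
            # A check that crashes is itself worth an alert.
            problem = f"check crashed: {exc!r}"
        if problem:
            logger.error("%s: %s", name, problem)
            failures.append(name)
        else:
            logger.info("%s: ok", name)
    return failures
```

Run something like this from cron or whatever scheduler you already use, and send the returned list of failures to your alerting channel of choice.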
We will be working through all of these examples in more detail in later blog posts, so watch this space for future installments.