This post outlines basic alerting approaches, “anomaly detection”, and why SLOs are necessary for appropriate monitoring and alerting
Being notified about an issue with your service(s) has traditionally happened in a couple of ways:
I was fortunate enough to be sponsored by my employer to attend AWS re:Invent 2023 (my first conference). This post describes some of what I learned, and thoughts I had, while attending the conference
“Everything fails all the time” –Werner Vogels, CTO of Amazon
Resilience exists on a spectrum: from basic backups (minimally resilient) to active-active deployments (extremely resilient). There are trade-offs for each configuration; if your application requires high resilience (such as multi-region deployments), you must take into account the added complexity, monetary cost, and the requirement to exercise your remediation procedures often. Two suggestions for exercising remediation procedures with multi-region deployments (i.e., failovers) are 1) periodically alternating live traffic between regions, and 2) having preset dates in whi...
Note: this post focuses primarily on monitoring and alerting, not observability
As brash as it sounds: if you don’t have adequate monitoring and alerting built into your systems (along with proper incident response), then you don’t care about your users
Without proper M+A (Monitoring and Alerting) practices, you risk poor user experiences, which lead to lost revenue, reputation, and customer trust. Proper M+A (preferably SLO-driven), along with remediation techniques, is crucial to the success of your business
As a rule of thumb: if it’s not monitored, it’s not ready for production. Unfortunately, engineers commonly implement monitoring solely to “check the box”. This can be even worse than having no monitoring at all – think of the noise generated by false alerts, or spinning your wheels monitoring things that yield no value. It’s vital that your M+A approach is sound an...
Imagine this: you have a problem that needs solving. You scout out several vendors offering similar software solutions. You find a vendor whose offering seems to fit your needs, and they promise you the world. You get purchase approval from your org, then seal the deal with the vendor. You’re likely locked into a contract for some time, and if for whatever reason their solution doesn’t work out, you can always switch vendors in the future, right?
You immediately get some value and excitement from the new solution you purchased, and all is well – until it’s not. You eventually realize there are gaps with the solution. You contact the vendor and three things are likely to ha...
This blog post outlines the three main telemetry signals as of July 2023 (per the OpenTelemetry spec) and how they’re used in harmony to achieve operational excellence (and more!)
Metrics have traditionally been used to monitor infrastructure (how much disk space is remaining? What is my CPU utilization? etc.), and they still are. As we move towards cloud-native components, we (as application engineers) don’t need to worry as much about these underlying infrastructure metrics because our cloud platform maintains the infrastructure for us (think serverless). Because we don’t “own” elastic cloud infrastructure, monitoring and alerting on these metrics isn’t very actionable for us as consumers, and therefore results in noise when things go wrong. This noise often leads to alert fatigue for engineers on your product team (more on this in a future blog post).
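To make that concrete, here’s a minimal sketch of recording an application-level metric with the OpenTelemetry Python SDK, rather than alerting on infrastructure metrics your cloud platform manages. The SDK usage is an assumption for illustration, and the service and metric names are purely hypothetical:

```python
# Minimal sketch, assuming the OpenTelemetry Python SDK is installed
# (pip install opentelemetry-sdk). All names below are illustrative.
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)

# Print metrics to stdout for the demo; a real setup would likely export
# via OTLP to a collector or backend instead.
reader = PeriodicExportingMetricReader(ConsoleMetricExporter(), export_interval_millis=5000)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("checkout-service")  # hypothetical service name

# A user-facing outcome the product team controls and can act on.
checkouts_completed = meter.create_counter(
    "checkouts.completed",
    unit="1",
    description="Number of successfully completed checkouts",
)

def complete_checkout(payment_method: str) -> None:
    # ... business logic would go here ...
    checkouts_completed.add(1, {"payment.method": payment_method})

complete_checkout("card")
```

Alerting on a signal like this (say, a sudden drop in completed checkouts) tends to be far more actionable for a product team than paging on the CPU utilization of infrastructure it doesn’t own.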
...Having been a full-time software engineer since mid-2017, I’ve seen my practices, impact, and mission in the field evolve (for the better).
In the early years of my career I was a general software engineer – a jack of all trades, without specialization. In early 2021 I applied for a software engineer position at an insurance company. One thing that made this role different from the others was that it introduced a concept I hadn’t heard of before: SRE (Site Reliability Engineering). While the role wasn’t exactly an SRE position, it aimed to bring SRE culture and practices to the organization… But that’s a story for another time.
In mid-2022 I heard about OpenTelemetry (OTel). The idea of observability excited me, and what excited me even more: OTel was an open (and evolving) standard. I hacked together a quick PoC demonstrating distributed tracing and metrics. It was cool ...