This post outlines basic alerting approaches, “anomaly detection”, and why SLOs are necessary for appropriate monitoring and alerting
Being notified about an issue with your service(s) has traditionally happened in a couple of ways:
I was fortunate enough to be sponsored by my employer to attend AWS re:Invent 2023 (my first conference). This post describes some of what I learned, and thoughts I had, while attending the conference
“Everything fails all the time” –Werner Vogels, CTO of Amazon
Resilience exists on a spectrum: from basic backups (minimally resilient) to active-active deployments (extremely resilient). There are trade-offs for each configuration; if your application requires high resilience (such as multi-region deployments), you must take into account the added complexity, monetary cost, and the requirement to exercise your remediation procedures often. Two suggestions for exercising remediation procedures with multi-region deployments (i.e., failovers) are 1) periodically alternating live traffic between regions, and 2) having preset dates in whi...
Note: this post focuses primarily on monitoring and alerting, not observability
As brash as it sounds: if you don’t have adequate monitoring and alerting built into your systems (along with proper incident response), then you don’t care about your users
Without proper M+A (Monitoring and Alerting) practices, you risk poor user experiences, which lead to lost revenue, reputation, and customer trust. Proper M+A (preferably SLO-driven), along with remediation techniques, is crucial to the success of your business
As a rule of thumb: if it’s not monitored, it’s not ready for production. Unfortunately, engineers commonly implement monitoring solely to “check the box”. This can be even worse than having no monitoring at all – think of the noise generated by false alerts, or spinning your wheels monitoring things that yield no value. It’s vital that your M+A approach is sound an...
Imagine this: you have a problem that needs solving. You scout out several vendors offering similar software solutions. You find a vendor whose offering seems to fit your needs, and they promise you the world. You get purchase approval from your org, then seal the deal with the vendor. You’re likely locked into a contract for some time, and if for whatever reason their solution doesn’t work out, you can always switch vendors in the future, right?
You immediately get some value and excitement from the new solution you purchased, and all is well – until it’s not. You eventually realize there are gaps with the solution. You contact the vendor and three things are likely to ha...
This blog post outlines the three main telemetry signals as of July 2023 (per the OpenTelemetry spec) and how they’re used in harmony to achieve operational excellence (and more!)
Metrics have traditionally been used to monitor infrastructure (how much disk space is remaining? What is my CPU utilization? etc.), and they still are. As we move towards cloud-native components, we (as application engineers) don’t need to worry as much about these underlying infrastructure metrics because our cloud platform maintains the infrastructure for us (think serverless). Because we don’t “own” elastic cloud infrastructure, monitoring and alerting on these metrics isn’t very actionable for us as consumers, and therefore results in noise when things go wrong. This noise often leads to alert fatigue for engineers on your product team (more on this in a future blog post).
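To make that concrete, here’s a minimal sketch of recording an application-level metric with the OpenTelemetry Python SDK, rather than alerting on infrastructure metrics your cloud platform manages. The SDK usage is an assumption for illustration, and the service and metric names are purely hypothetical:

```python
# Minimal sketch, assuming the OpenTelemetry Python SDK is installed
# (pip install opentelemetry-sdk). All names below are illustrative.
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)

# Print metrics to stdout for the demo; a real setup would likely export
# via OTLP to a collector or backend instead.
reader = PeriodicExportingMetricReader(ConsoleMetricExporter(), export_interval_millis=5000)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("checkout-service")  # hypothetical service name

# A user-facing outcome the product team controls and can act on.
checkouts_completed = meter.create_counter(
    "checkouts.completed",
    unit="1",
    description="Number of successfully completed checkouts",
)

def complete_checkout(payment_method: str) -> None:
    # ... business logic would go here ...
    checkouts_completed.add(1, {"payment.method": payment_method})

complete_checkout("card")
```

Alerting on a signal like this (say, a sudden drop in completed checkouts) tends to be far more actionable for a product team than paging on the CPU utilization of infrastructure it doesn’t own.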
...Having been a full-time software engineer since mid-2017, I’ve seen my practices, impact, and mission in the field evolve (for the better).
In the early years of my career I was a general software engineer – a jack of all trades, without specialization. In early 2021 I applied for a software engineer position at an insurance company. One thing that made this role different from the others was that it introduced a concept I hadn’t heard of before: SRE (Site Reliability Engineering). While the role wasn’t exactly an SRE position, it aimed to bring SRE culture and practices to the organization… But that’s a story for another time.
In mid-2022 I heard about OpenTelemetry (OTel). The idea of observability excited me, and what excited me even more: OTel was an open (and evolving) standard. I hacked together a quick PoC demonstrating distributed tracing and metrics. It was cool ...