AWS re:Invent 2023 - an SRE's experience


I was fortunate enough to be sponsored by my employer to attend AWS re:Invent 2023 (my first conference). This post describes some of the things I learned and the thoughts I had while attending the conference

Learnings

Resilience

“Everything fails all the time” –Werner Vogels, CTO of Amazon

Failovers

Resilience exists on a spectrum: from basic backups (minimally resilient) to active-active deployments (extremely resilient). There are trade-offs for each configuration; if your application requires high resilience (such as multi-region deployments), you must take into account the added complexity, the monetary cost, and the requirement to exercise your remediation procedures often. Two suggestions for exercising remediation procedures with multi-region deployments (i.e.: failovers) are 1) periodically alternating live traffic between regions, and 2) having preset dates on which you intentionally force a failover to test that the remediation still works as intended. After all, if a failover region is unhealthy/unusable when it’s needed, what’s the point of having one?
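
As a rough illustration of the first suggestion, below is a minimal boto3 sketch that flips weighted Route 53 records between two regions for a scheduled failover exercise. The hosted zone ID, record names, and regions are hypothetical placeholders (not anything shown at the conference), and a real setup would also gate the flip on health checks:

```python
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0123456789EXAMPLE"  # hypothetical hosted zone


def _weighted_change(identifier: str, target: str, weight: int) -> dict:
    # One weighted CNAME record per region; traffic is split according to Weight
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "app.example.com",
            "Type": "CNAME",
            "SetIdentifier": identifier,
            "Weight": weight,
            "TTL": 60,
            "ResourceRecords": [{"Value": target}],
        },
    }


def shift_traffic(primary_weight: int, secondary_weight: int) -> None:
    """Reweight DNS between regions so the standby region regularly serves live traffic."""
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Comment": "Scheduled failover exercise",
            "Changes": [
                _weighted_change("us-east-1", "app-us-east-1.example.com", primary_weight),
                _weighted_change("us-west-2", "app-us-west-2.example.com", secondary_weight),
            ],
        },
    )


# e.g. shift_traffic(0, 100) on a preset game day, then shift_traffic(100, 0) to fail back
```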

As much as SREs are concerned with high availability, a presenter at a resilience Chalk Talk gave a refreshing statement: “chances are your application does not need multi-region deployments”. The reason being: a single AWS region with multiple AZs (availability zones) is already pretty resilient. Of course, this statement was followed up by an audience member pointing out the ~3-hour-long us-east-1 outage of June 2023, which the audience and presenters playfully laughed off. This sort of widespread outage is not a common occurrence (and I trust a proper postmortem was conducted to prevent similar issues in the future), but depending on the requirements of your system, you may still want to err on the side of caution

Throttling, timeouts, and retries

When performing throttling, timeouts, or retries: apply them as early in the call chain as possible. Throttling should occur as close to the client as possible to prevent overwhelming the backend (typically a result of buggy clients or malicious denial-of-service attacks). Retries should be implemented with backoff-and-jitter to avoid continuously slamming downstream services, which can easily lead to cascading failures
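
To make the backoff-and-jitter point concrete, here’s a minimal Python sketch of a retry wrapper using capped exponential backoff with full jitter (the function and parameter names are my own, not from any AWS SDK):

```python
import random
import time


def call_with_retries(operation, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry a flaky call with capped exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts -- surface the failure to the caller
            # Cap the exponential backoff, then sleep a random fraction of it (full jitter)
            backoff = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, backoff))
```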

When working in microservice ecosystems, timeouts should be longest nearest the caller and shorten as the request travels downstream. If a downstream component takes its time processing a request (and especially if it performs multiple retries), there’s a chance that by the time said component finally generates a successful response to return, the front-end component has already timed out the client’s request. Throwing away the “successful” response in this scenario is wasted compute
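
One way to reason about this is as a single timeout budget that shrinks at every hop, so no downstream call is allowed to outlive its caller. A rough sketch, assuming hypothetical internal service URLs and an arbitrary 80% headroom factor:

```python
import time

import requests

EDGE_TIMEOUT = 3.0  # seconds the client-facing service is willing to wait overall


def call_downstream(url: str, deadline: float):
    """Give each downstream hop only the time left in the caller's budget, with some headroom."""
    remaining = deadline - time.monotonic()
    if remaining <= 0:
        raise TimeoutError("budget exhausted; skip the call rather than do work nobody will see")
    # Reserve headroom so we can still return an error before the caller itself gives up
    return requests.get(url, timeout=remaining * 0.8)


def handle_request():
    deadline = time.monotonic() + EDGE_TIMEOUT
    orders = call_downstream("https://orders.internal.example/api", deadline)
    pricing = call_downstream("https://pricing.internal.example/api", deadline)  # gets the leftover budget
    return orders, pricing
```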

Resilience techniques that apply to synchronous systems (as described above) do not always apply 1:1 to asynchronous systems. For example: when dealing with failures and retries from messages on queues, it’s common practice for failed messages to be moved to a DLQ (dead-letter queue) after N failed attempts at processing them. These DLQ messages can be purged at some point, or, if they need to be reprocessed (“redriven”), common practice is to: identify the cause of failure, push code/configs that resolve the failure, then redrive the messages to the original queue for reprocessing. When redriving, it’s extremely important that your component performs idempotent operations so that data and operations are not duplicated!
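
One common way to get that idempotency is to claim each message ID with a conditional write before doing any work, so a redriven duplicate is detected and skipped. A rough boto3 sketch, assuming a hypothetical DynamoDB table named processed-messages keyed on messageId:

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")
IDEMPOTENCY_TABLE = "processed-messages"  # hypothetical table keyed on messageId


def process_once(message_id: str, handler, payload: dict) -> None:
    """Claim the message ID with a conditional put; if it already exists, skip the work."""
    try:
        dynamodb.put_item(
            TableName=IDEMPOTENCY_TABLE,
            Item={"messageId": {"S": message_id}},
            ConditionExpression="attribute_not_exists(messageId)",
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return  # already processed (e.g. before the message was dead-lettered) -- safe to drop
        raise
    handler(payload)
```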

AWS offers services (such as Resilience Hub) that allow you to set recovery goals (e.g.: RTO and RPO) and get automatic recommendations (e.g.: use autoscaling, CW alarms, etc.) to help you meet your resilience goals. Additional tools such as FIS (Fault Injection Service) and distributed load testing exist as well to push your systems to their limits and gain confidence in their robustness. While a lot of these features are nice to have out of the box, it’s important not to overlook open-source tools that accomplish similar goals

Further learning resources

After a resilience Chalk Talk, I asked one of the presenters if he had any reading recommendations to learn more about resilience best practices. The resources he mentioned were ServerlessLand, Serverless Resilience References, and “serverlesspresso”

Service Level Objectives

SLOs alone are not enough to define what reliability looks like for an application. Reliability isn’t defined just by engineers, either. The three service-level-*’s along with their (brief) descriptions and corresponding stewards are:

  • SLAs (Agreements) - the Agreement between the business and the customer that the service will offer a certain level of availability. If the service breaches the SLA, repercussions such as credit issuance or refunds to the customer will occur. In some cases, legal action may be taken. SLAs are commonly formulated by sales teams (and likely involve legal teams)
  • SLOs (Objectives) - the Objective/goal for the service to meet; typically a bit tighter than the SLA (e.g.: if an SLA is 99%, the SLO may be 99.9%) to provide a cushion to help protect the SLA. SLOs are commonly designed by product teams and engineers
  • SLIs (Indicators) - the Indicator of how a service is performing, usually calculated as a percentage (typically: sli = goodEventsCount/totalEventsCount) for evaluation against an SLO. Traditionally, metrics have been used to calculate SLIs; however, SLIs can also be derived from anything with a numerical value (e.g.: counts of certain distributed trace spans). SLIs are most often implemented by engineers

The more services that have aligning SLOs, the better. If a critical component does not align with its dependents’ SLOs, the entire system is at risk, since there’s no mutual acknowledgement of the expected service level

Note that SLOs can be moving targets. As technical and customer requirements change, so will the SLO goal. If an SLO goal changes, it’s likely that the system will change as well. In business terms: as requirements and SLOs change, the cost of operating the system may be reduced (in the case of loosening the SLO) or increased (in the case of tightening the SLO)

SLOs in AWS

At a Chalk Talk about SLOs, I asked the presenters “what are the next steps/how to get started with SLOs in AWS?”. I was answered with “nothing exists in AWS to create SLOs as of today, but [strong hint] listen to the keynote on Thursday”. Sure enough, at that keynote, Werner announced CloudWatch “Application Signals”, with one of the features being SLOs!

After re:Invent I hopped into the AWS console to experiment with SLOs in CloudWatch. I was able to create an SLO based on CloudWatch metrics that’d fuel the SLI. Unfortunately, I backed out of my timeboxed exploration shortly after realizing the sandbox account I was in didn’t have any traffic (or at least the metrics I chose didn’t have values to work with) nor components I was familiar with. That’s my bad – if you want more details about SLOs in AWS, check out this AWS blog post

Out of curiosity, I checked the CloudFormation and AWS Terraform provider docs for an SLO resource, but didn’t find anything. Fair enough; I trust as AWS receives feedback on this SLO feature from its users and the product matures, the API may change a bit (no need to write an IaC component twice)

Forget the 9’s (in some cases)!

When thinking about SLO goals, most people immediately reach for 9’s (99%, 99.9%, 99.99%, …). For a highly-available service, that might be fine, but not all systems need that level of availability. Maybe 80% is all your users need to be happy. Best to chat with your product team and stakeholders before building an overly reliable system, and remember that the closer to 100% the SLO goal is, the more expensive the system will be
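
To put rough numbers on that trade-off, here’s a quick back-of-the-envelope calculation of how much downtime each availability target leaves you per 30-day month:

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # ~43,200 minutes in a 30-day month

for target in (80.0, 99.0, 99.9, 99.99):
    allowed_downtime = MINUTES_PER_MONTH * (100.0 - target) / 100.0
    print(f"{target}% availability -> ~{allowed_downtime:,.0f} minutes of downtime per month")

# 80%    -> ~8,640 minutes (6 days)
# 99%    -> ~432 minutes
# 99.9%  -> ~43 minutes
# 99.99% -> ~4 minutes
```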

Also, higher SLO goals aren’t always better – it depends on the system’s expected outcomes. Take, for example, an SLO measuring the correctness of a coin flip application. If the coin lands on heads 99.99% of the time, clearly something is wrong. For this coin flip application, we’d want a correctness SLO with a goal of 50%. The formula for calculating this SLI could be sli = headsResultsCount/totalFlipCount. Giving the SLO some tolerance such as ±2%, the SLI value would have to be between 48% and 52% to ensure the SLO is not breached. A more in-depth article regarding this type of correctness SLO can be found here
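
A tiny sketch of what evaluating that correctness SLO could look like (the 10,000-flip sample size and the function names are just illustrative choices):

```python
import random

TARGET = 50.0    # expected heads percentage for a fair coin
TOLERANCE = 2.0  # +/- band around the target before the SLO is considered breached


def coin_flip_sli(flips: int) -> float:
    """SLI = headsResultsCount / totalFlipCount, as a percentage."""
    heads = sum(random.random() < 0.5 for _ in range(flips))
    return 100.0 * heads / flips


def slo_breached(sli_value: float) -> bool:
    # Breached when the observed heads rate drifts outside 48%..52%
    return abs(sli_value - TARGET) > TOLERANCE


print(slo_breached(coin_flip_sli(10_000)))  # almost always False for a fair coin
```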

Observability

X-Ray (and beyond)

One of the challenges my team has faced was “how do we [distributed] trace events over EventBridge?” While my team was able to implement a solution by injecting the w3c traceparent into the event payload (rather than into the headers, since EventBridge dropped any custom headers at the time), the solution felt a bit hacky. I pressed the presenters at the o11y Chalk Talk about this, and they suggested using X-Ray to fulfill the need. No surprise there, and unfortunately I didn’t probe further on non-X-Ray support for distributed tracing over EventBridge. So for now, we can only hope that AWS will offer better support for non-X-Ray tracing (i.e.: OpenTelemetry) in more of their services in the future
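
For context, the payload-injection workaround looked roughly like the sketch below. This is a simplified illustration rather than my team’s actual code: the event source, detail-type, and the _trace_context field name are made-up placeholders, and it uses the OpenTelemetry Python propagation API to write and read the w3c traceparent:

```python
import json

import boto3
from opentelemetry import propagate

events = boto3.client("events")


def publish_order_event(detail: dict) -> None:
    """Producer side: embed the current trace context in the event payload itself."""
    carrier: dict = {}
    propagate.inject(carrier)           # writes the w3c 'traceparent' (and 'tracestate') keys
    detail["_trace_context"] = carrier  # hypothetical field name for the embedded context
    events.put_events(Entries=[{
        "Source": "orders.service",     # hypothetical source / detail-type
        "DetailType": "OrderCreated",
        "Detail": json.dumps(detail),
        "EventBusName": "default",
    }])


def handle_event(event: dict):
    """Consumer side (e.g. a Lambda target): restore the context so new spans join the trace."""
    return propagate.extract(event.get("detail", {}).get("_trace_context", {}))
```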

I chatted more with the presenters and I mentioned that “active tracing” is baked into several other AWS services (such as API Gateway). “If my team uses X-Ray to get end-to-end distributed traces, my team would still like to see the traces in an o11y backend other than X-Ray. Is this possible?” The presenters revealed that [AWS managed] Grafana has a plugin to display traces from X-Ray. Neat

It also appears that the OTel collector offers an X-Ray receiver. Since this receiver receives spans emitted by the existing X-Ray SDK, which seems to only be configurable in custom code, I don’t believe it’s possible to configure X-Ray settings for managed services (such as API Gateway or EventBridge) to send X-Ray spans to an X-Ray receiver in your own OTel collector. Would be a cool feature to have in the future, though!

While not all AWS services are 100% interoperable with OTel just yet, the Grafana X-Ray plugin and the OTel collector X-Ray receiver give me a feeling that things are heading in the right direction for more interoperability in the future

RUM

Nothing groundbreaking here, just news to me: CloudWatch does offer RUM (Real User Monitoring). As with most browser RUM tools, simply configure some settings then add the provided <script> snippet to your HTML code to enable RUM data to flow in

Before investing fully into CloudWatch RUM, it may be worth exploring OpenTelemetry’s browser instrumentation

On the topic of browser monitoring, though not quite RUM, I learned that CloudWatch also offers “synthetic canaries” to test your webpages and display diffs between what the synthetic requests captured vs. what was expected

Cross-account and inter-region observability

I’ve wondered before: how could we get all of our metrics and logs (from all regions and AWS accounts) in one place without paying an expensive egress cost? This question resurfaced in my mind as I worked through a CloudWatch workshop and noticed that CloudWatch is a regional service, so I raised the question with the facilitator

The answer I got was mostly about CloudWatch unified/cross-account observability. There are allegedly no egress charges for this (for logs and metrics), since the “monitoring” account is simply given IAM permissions to view telemetry (metrics, logs, traces, etc.) residing in connected AWS accounts. After digging around the AWS docs a bit more, it seems that cross-region support also exists, but I wasn’t able to find much detail around pricing

The conference itself

Wishes

The re:Invent [Android] app could’ve been better. Frequently expiring credentials in the app required multiple logins throughout the week of the conference. The app always required internet access to open – if you needed to access your schedule while you didn’t have WiFi or service (e.g.: in underground bus terminals), you were SoL. I heard rumors that AWS contracted this app to be built by a third party, and based on my experience with it, I believe it

Attendees could’ve been less shy. For example, the number of attendees eating alone at lunch was a bit disheartening. The opportunity to network with other professionals from around the world over lunch is a gift. I made a point of meeting a new group of people at each meal, connecting with intelligent, ambitious professionals, and I’m grateful for it

Suggestions for future attendees

Reserve your sessions online ASAP! Otherwise you’ll spend a good amount of time waiting in line for the chance to attend sessions

Sessions are scattered across venues along The Strip. Aim to have most of your sessions at one venue per day (or leave at least an hour between sessions at different venues) to ensure you’re not rushing between them. Bring your issued re:Invent badge everywhere (even after hours) for access to shortcuts that connect venues

Builder sessions/workshops contain a lot of content – likely too much to finish in one sitting. Do the workshops at home before (or after) the conference (or don’t, and simply skim the workshop content), but either way: ask the facilitators questions. That’s what you’re paying for. Whether the facilitators are principal engineers or technical account managers, they know their AWS products inside and out and are more than capable of providing answers that may just blow you away

Try to avoid breakout sessions, as they aren’t very engaging and the recordings are uploaded to YouTube for later watching. A lot of the information from breakout sessions can be found in AWS whitepapers, too

With Chalk Talks, stay after the talk to ask the presenters additional questions and to hear what other attendees have to ask. Also, talk to other attendees! I was fortunate enough to [unknowingly] sit next to and chat with the VP of SRE at a major banking company during a Chalk Talk. Through chatting with him, I learned his company has a 30-40% adoption rate of SRE practices across their services (which I think is impressive)! SRE maturity for their teams is tracked through a scorecard (with criteria such as SLOs, alerts, incident count, and toil) that is used as a data-driven way to nudge teams into compliance with SRE best practices

If you haven’t picked up on this already: talk to everyone! Prioritize chatting with participants in sessions you’re attending, whether you’re waiting in a queue or waiting for the presenters to start their talk. Don’t be afraid to sit and chat with new faces at conference-provided meals. Last but not least, if you’re flying into re:Invent, chat with people (who are also likely traveling for the conference) on the plane (I made a great connection doing this)! Optionally, wear some tech swag (tech-branded clothing) to give others a reason to strike up a conversation with you

If you want to eat meals provided by the conference, the earlier you get to a dining room, the better. Don’t show up any later than 30 minutes before the meal ends

Allocate at least a couple of hours to check out the Expo where hundreds of vendors are set up. Know which vendors are attending, and if you’re lucky enough, you might bump into some of your tech superheroes ;). Also, leave some extra space in your luggage for all the swag you’ll collect

Weeks (or months?) before the conference, search online to find out which re:Invent “parties” are happening in the evenings. Most of these are great networking events sponsored by vendors, and spots can fill up pretty quickly. Unfortunately for me, the parties I wanted to attend were already booked up by the time I found out about them. Either way, it’s Vegas – there’s always something to do. On the last night of the conference, be sure to attend re:Play for a night of fun, free food, and an open bar

Closing

re:Invent 2023 was a wonderful experience, and it is what you make of it. From my perspective, the conference was roughly 40% networking, 35% learning, and 25% sales. The people you’ll meet and the things you’ll learn are invaluable for any engineer’s career

