SREZone Blog

Telemetry Signals


This blog post outlines the three main telemetry signals as of July 2023 (per the OpenTelemetry specification) and how they are used in harmony to achieve operational excellence (and more!).

Signals

Metrics

Metrics have traditionally been used to monitor infrastructure (How much disk space is remaining? What is my CPU utilization?), and they still are. As we move towards cloud-native components, we (as application engineers) don’t need to worry as much about these underlying infrastructure metrics because the cloud platform maintains the infrastructure for us (think serverless). Because we don’t “own” elastic cloud infrastructure, monitoring and alerting on these metrics is not very actionable as a consumer, and therefore produces noise when things go wrong. This noise often leads to alert fatigue for engineers on your product team (more on this in a future blog post).

That’s not to say we shouldn’t use metrics at all. Instead of leaning on infrastructure metrics, we should move up a layer and capture metrics at the application level. More specifically, capture application metrics that are critical to the end user’s journey and to the business. Is the checkout page taking too long to load? Are payments failing to be accepted? Both of those scenarios could negatively affect the business – or, at the core, your customers, who are the lifeblood of your business.
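To make this concrete, here is a minimal sketch of emitting application-level metrics with the OpenTelemetry Python API. The meter name, metric names, units, and attributes are illustrative, and a MeterProvider/exporter is assumed to be configured elsewhere in the application.

    # A minimal sketch of application-level metrics with the OpenTelemetry
    # Python API. Names and attributes are illustrative; a MeterProvider and
    # exporter are assumed to be configured elsewhere.
    from opentelemetry import metrics

    meter = metrics.get_meter("checkout-service")

    # Counts payment attempts, partitioned by outcome.
    payments_processed = meter.create_counter(
        "payments.processed",
        unit="1",
        description="Number of payment attempts, by outcome",
    )

    # Records how long the checkout page takes to load, in milliseconds.
    checkout_duration = meter.create_histogram(
        "checkout.page.duration",
        unit="ms",
        description="Time taken to load the checkout page",
    )

    def record_checkout(duration_ms, payment_ok):
        checkout_duration.record(duration_ms, {"page": "checkout"})
        payments_processed.add(1, {"outcome": "success" if payment_ok else "failure"})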

I’ll ramble on in a separate blog post about metrics, SLOs (Service Level Objectives), and alerting, but the point to take away for now is that it’s not feasible to alert on every bad metric emitted. Instead, we must aggregate metrics and determine whether our service is at risk of becoming unreliable. Once these aggregate metrics (Service Level Indicators, or SLIs) show that reliability is in danger (and an actionable solution is feasible), the on-call engineer(s) are notified to restore the service to health so business can continue as usual.
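As a toy illustration of the aggregation idea (all numbers and thresholds below are made up), an availability SLI is simply the ratio of good requests to total requests, compared against an SLO target:

    # Toy illustration of aggregating raw counts into an availability SLI and
    # comparing it to an SLO target. All numbers are made up.
    total_requests = 1_000_000
    failed_requests = 1_200

    sli = (total_requests - failed_requests) / total_requests   # 0.9988
    slo_target = 0.999                                          # 99.9% availability

    error_budget = 1 - slo_target                               # allowed failure ratio
    budget_consumed = (failed_requests / total_requests) / error_budget

    print(f"SLI: {sli:.4%}, error budget consumed: {budget_consumed:.0%}")
    if sli < slo_target:
        print("SLO at risk of being breached -- time to notify the on-call engineer")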

Distributed Traces

Once an on-call engineer is notified that something is going wrong, they must identify where and why the problem is occurring before they can fully resolve it. Depending on the alert message, the engineer will likely log into their observability backend and start “asking questions”, typically through dashboards or saved queries. A chart might show a spike in latency for certain requests; the engineer might then filter for all requests that returned an error. Whatever the error may be, after some investigation the engineer can see exactly where in the system something is going wrong and start to identify attributes in the telemetry data that may reveal the root cause.

To understand how distributed tracing enables this sort of insight, it’s important to understand the definition and anatomy of a trace. A distributed trace, simply put, is the representation of a request flowing end-to-end (across each component) within (and sometimes across) a system. Observability backends commonly display traces in a waterfall view – if you Google image search ‘distributed tracing waterfall’, you’ll see what I mean.

Each block in the waterfall chart is a span (one unit of work), which at a minimum contains its span_id, its parent span’s span_id, a start_time, a duration, and the corresponding trace_id. (Note: the trace_id, parent span_id, trace_state, and trace_flags make up the span “context”, which is propagated via request headers when crossing boundaries in a system so the observability backend can “stitch” the spans together.) Additional values should be included in the span as well, such as data about the emitting component (e.g.: resource name, deployed region, etc.) and the request itself (e.g.: payload, userId, etc.). Attaching useful attributes to spans is what enables us to identify patterns (and anomalies) to better understand erroneous requests.
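As a rough sketch of that anatomy in practice, here is what creating a span, attaching attributes, and propagating its context to a downstream call looks like with the OpenTelemetry Python API. The service, attribute, and URL names are illustrative, and a TracerProvider is assumed to be configured elsewhere.

    # A rough sketch of span anatomy and context propagation with the
    # OpenTelemetry Python API. Names are illustrative; a TracerProvider is
    # assumed to be configured elsewhere.
    from opentelemetry import trace
    from opentelemetry.propagate import inject

    tracer = trace.get_tracer("order-service")

    with tracer.start_as_current_span("charge-card") as span:
        # Attributes describing the request make the span useful later.
        span.set_attribute("user.id", "32101")
        span.set_attribute("payment.amount", 42.50)

        # When calling a downstream service, inject the span context
        # (trace_id, parent span id, flags/state) into the outgoing headers
        # so the backend can stitch the spans into a single trace.
        headers = {}
        inject(headers)   # with the default propagator, adds a W3C 'traceparent' header
        # http_client.post("https://payments.internal/charge", headers=headers)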

Implementing distributed tracing is fairly straightforward. To get started, “auto-instrumentation” automatically instruments popular libraries (e.g.: HTTP clients, database drivers, etc.) within your application, so you get traces (and other signals) from those libraries with little-to-no work – the application emits telemetry for those libraries at runtime without any changes to your application code. This “auto-instrumentation” is often achieved by libraries being “pre-instrumented” with OpenTelemetry, or by including a metapackage that monkey-patches the specified library at init time. Auto-instrumentation alone is often not enough, though – it should be augmented with manual instrumentation in your code to provide a more detailed view of what your system is doing, particularly around custom business logic.
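For example, assuming the application is launched under the OpenTelemetry Python auto-instrumentation agent (opentelemetry-instrument), a manual span around custom business logic might look something like the sketch below. The function, tracer name, and attributes are illustrative.

    # Sketch: auto-instrumentation covers libraries such as the HTTP client and
    # database driver; a manual span wraps custom business logic that no library
    # can see for us. Assumes the app is launched via the OpenTelemetry agent:
    #   opentelemetry-instrument python app.py
    from opentelemetry import trace

    tracer = trace.get_tracer("checkout-service")   # illustrative tracer name

    def apply_discounts(prices, coupon=""):
        with tracer.start_as_current_span("apply-discounts") as span:
            span.set_attribute("cart.item_count", len(prices))
            total = sum(prices)
            if coupon:
                span.add_event("coupon-applied", {"coupon.code": coupon})
                total *= 0.9   # illustrative 10% discount
            span.set_attribute("cart.total", total)
            return total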

Logs

When distributed traces don’t provide enough detail to fully diagnose an issue – or when distributed traces aren’t present at all – logs are our last resort. Traditionally, sifting through logs required SSHing into a server (sometimes multiple servers) and grepping the log files for issues. To make this process easier, log-forwarding agents were installed on machines to automatically ship logs to a centralized logging tool (e.g. ELK, Splunk, Graylog, etc.). When working with distributed and/or redundant systems, it’s important that resource metadata describing the source of each log (where the log came from) is included in every log record in the centralized logging tool; otherwise it would be difficult to trace an issue back to the affected machine.

In addition to including resource metadata in our logs, it’s important to include rich context about the request in a consistent, structured format. Why? Because structured data is what machines understand best. Single-line, unstructured logs are fine for human readability, but as systems scale, the simple approach of a human reading logs line-by-line does not. With the incredible processing power machines offer, it would be silly not to write structured logs that let machines effortlessly slice and dice our data to find patterns and anomalies. Proper system insight comes from the combination of human intuition and the findings generated by powerful machines (big data!).
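A minimal sketch of what such a structured log record could look like, using only the Python standard library (the helper, field names, and resource values are illustrative):

    # A sketch of writing structured (JSON) logs that carry both resource
    # metadata and request context. Field names and values are illustrative.
    import json
    import logging
    import socket
    from datetime import datetime, timezone

    logging.basicConfig(level=logging.INFO, format="%(message)s")
    logger = logging.getLogger("checkout-service")

    def log_event(message, **attributes):
        record = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "message": message,
            # Resource metadata: where this log came from.
            "resource": {
                "service.name": "checkout-service",
                "host.name": socket.gethostname(),
                "deployment.region": "eu-west-1",   # illustrative
            },
            # Request context: what was going on at the time.
            "attributes": attributes,
        }
        logger.info(json.dumps(record))

    log_event("payment declined", userId=32101, paymentProvider="acme-pay")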

With systems continuously surpassing record transactional throughput (thousands of transactions per second (TPS) are common, and some systems even handle millions), it’s not always feasible to pinpoint a unique issue simply by correlating metrics with logs via timestamp. Context-rich logging libraries often include the corresponding trace_id in generated logs automatically, exposing better insight into the system.
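If you’re not using a logging library that does this for you, a hand-rolled version is straightforward. The sketch below (an illustrative helper, assuming the OpenTelemetry Python API) pulls the active trace_id off the current span and attaches it to each log record:

    # Sketch: pull the active trace_id off the current span so logs and traces
    # can be correlated in the backend. Assumes the OpenTelemetry Python API;
    # the helper and field names are illustrative.
    import json
    import logging
    from opentelemetry import trace

    logger = logging.getLogger("checkout-service")

    def log_with_trace_context(message, **attributes):
        ctx = trace.get_current_span().get_span_context()
        logger.info(json.dumps({
            "message": message,
            "attributes": attributes,
            # 32-hex-char trace id and 16-hex-char span id (zeros if no active span).
            "trace_id": format(ctx.trace_id, "032x"),
            "span_id": format(ctx.span_id, "016x"),
        }))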

Using signals in harmony

  • Metrics feed SLIs/SLOs, which alert us when our service is unreliable
  • View distributed traces to figure out where in the system things are going wrong, and ideally why they are going wrong
  • Review structured logs to get more details about what’s going on within a component
  • Once the root of the issue is identified, a fix is put into place and deployed to production
    • Note: the first “fix” put in place is often a temporary solution to “stop the bleeding” until a more permanent fix is available for deployment

Using telemetry signals in harmony as outlined here keeps Mean Time To Resolution (MTTR) low, which results in happier customers and a more successful business.

Note that it’s important for telemetry signals to correlate with each other in order to provide an efficient troubleshooting process for the production engineer. To achieve this, all telemetry must be stored either A) in a single observability backend, or B) across multiple observability backends that play nicely with each other (i.e., enable correlation of data). Beyond that, the telemetry data itself should correlate via trace_id – where appropriate, metrics should have ‘exemplars’ containing a trace_id, and structured logs should also contain the trace_id.

Recommendations

Metrics

  • Leverage what you can out of the box where it makes sense (e.g.: API Gateway response codes, latency, etc.)
  • Don’t page on-callers based on raw metrics (things constantly fail in complex systems). Rather, alert on meaningful (and actionable) events, such as SLOs being in danger of being breached (more to come on this in a separate blog post)
  • Emit custom (application) metrics that are meaningful for end-user journeys. Just be cautious of creating metrics with high cardinality dimensions, as this can get expensive

Distributed Traces

  • An absolute must for distributed systems/modern architectures
  • OpenTelemetry is the de facto standard for tracing, and supports most popular languages
  • Leverage auto-instrumentation, and augment with manual instrumentation for a better view of your systems

Logs

  • Veer towards including “log” messages as rich events in your distributed tracing spans where it makes sense (when running serverless FaaS, I prefer span events; when running a multi-threaded, long-running application, additional logs seem appropriate to expose behavior that may not tie back to a single request and likely affects multiple requests). Note that how far you take this depends on how the end user will consume the event information. If your team isn’t super mature with its instrumentation, how about a log wrapper that adds all log records as events to the active span? (See the sketch after this list.)
  • Include trace_id as needed (instrumentation should do this automatically)
  • String-concatenated messages aren’t great for machine usability. Rather than logging {"message": "User 32101 has logged in"}, we should log in a high-dimensional way, such as {"message": "User has logged in", "attributes": {"userId": 32101}}. This makes slicing and dicing our log data much easier
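As a sketch of the log-wrapper idea from the first bullet (names are illustrative; it assumes the OpenTelemetry Python API and the standard logging module), a handler can copy every log record onto the active span as a span event:

    # Sketch of a log wrapper: a logging handler that copies every log record
    # onto the active span as a span event. Assumes the OpenTelemetry Python
    # API; names are illustrative.
    import logging
    from opentelemetry import trace

    class SpanEventHandler(logging.Handler):
        def emit(self, record):
            span = trace.get_current_span()
            if span.is_recording():
                span.add_event(
                    record.getMessage(),
                    {"log.level": record.levelname, "log.logger": record.name},
                )

    logging.getLogger().addHandler(SpanEventHandler())
    # Any logging call made inside an active span now also appears as a span event.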

No matter which signal you emit, ensure no sensitive data such as PII or credentials is included – this type of information is an attractive target for attackers.

Outside of operational excellence

You’ve probably heard that “data is beautiful”; telemetry data absolutely doesn’t need to be used only for systems diagnosis. Having telemetry emitted and accessible to anyone on your team (from engineers to product managers to business stakeholders) brings everyone closer to the lens of production. Telemetry data can be used for business intelligence ([How] Are our users using a new feature? Are all of our users having a satisfactory experience with our product? How should we prioritize our work in the future? etc.). Of course, the data points needed to answer some of these questions are business specific – something you wouldn’t get from auto-instrumentation alone ;)

