Observability in distributed systems

Carlos Morales • 2022-05-01

Modern distributed architectures offer many advantages, but they also add an extra dimension of complexity. This makes observability (understanding what happens in a system) even more important.

Importance

Monitoring supports rapid diagnosis to resolve imminent problems. The main reasons why monitoring is important:

Can you know …

Healthy, stable, and reliable systems require that we be able to answer these questions at any time:

  1. Can you know when something wrong is happening?
  2. Can you know how bad the problem is?
  3. Can you know what the root cause is so you can fix it?
  4. Can you do a post-mortem analysis so it does not happen again?

Who would not want to know the answers? Sadly, being able to answer those questions is not a trivial task. Let’s review what makes it hard.

Complexity

There are multiple reasons why observability has become more complex than it used to be. These are my top reasons:

Microservice architectures

When we were running monolithic applications, the complexity resided in the code. In a microservice architecture, the complexity is in the interactions. Monitoring the code may be less important today than it used to be. On the other hand, monitoring the interactions between the services, and between cloud-native services, has become critical.

complexity is moving from the code into the interactions
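
As a minimal sketch of what monitoring an interaction can look like, the wrapper below records the duration and outcome of each call from one service to another. The names `observe_call`, `error_rate`, and `interaction_stats`, and the in-memory store, are assumptions made for illustration; a real system would more likely export these measurements to a metrics backend.

```python
import time
from collections import defaultdict

# In-memory store for this sketch: (caller, callee) -> list of (seconds, succeeded).
interaction_stats = defaultdict(list)


def observe_call(caller, callee, func, *args, **kwargs):
    """Run func and record how long the caller -> callee interaction took."""
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
    except Exception:
        # Record the failed interaction, then let the error propagate.
        interaction_stats[(caller, callee)].append((time.perf_counter() - start, False))
        raise
    interaction_stats[(caller, callee)].append((time.perf_counter() - start, True))
    return result


def error_rate(caller, callee):
    """Fraction of recorded calls between the two services that failed."""
    calls = interaction_stats[(caller, callee)]
    if not calls:
        return 0.0
    failures = sum(1 for _, ok in calls if not ok)
    return failures / len(calls)
```

Keying the measurements by the (caller, callee) pair rather than by a single service is the point of the sketch: it puts the interaction itself under observation.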

Container-based architectures

Container-based architectures include some challenges:

Traceability

Understanding how an incoming request interacts with your system remains a requirement. Tracking the flow of a request across the cloud network and across multiple microservices, perhaps even across multiple data centers, is more difficult.

Distributed tracing is about understanding how the same request travels across multiple distributed systems. There are multiple techniques. One of the most common is to add a unique id that is passed along through each service (e.g. in a specific HTTP header, as part of the logged message, etc.). This requires agreement across all the systems involved.
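
The unique-id technique can be sketched in a few lines. This is an illustration under stated assumptions, not a definitive implementation: the header name `X-Correlation-ID` and the helper functions are hypothetical, and the header name stands in for whatever the participating systems agree on.

```python
import uuid

# Hypothetical header name; the services involved must agree on it.
CORRELATION_HEADER = "X-Correlation-ID"


def ensure_correlation_id(incoming_headers):
    """Return the request's correlation id, creating one if it is absent."""
    # HTTP header names are case-insensitive, so compare in lower case.
    for name, value in incoming_headers.items():
        if name.lower() == CORRELATION_HEADER.lower() and value:
            return value
    # First service to see the request: generate a new id.
    return uuid.uuid4().hex


def outgoing_headers(correlation_id, extra=None):
    """Build headers for a downstream call that carry the same id."""
    headers = dict(extra or {})
    headers[CORRELATION_HEADER] = correlation_id
    return headers


def log_line(correlation_id, service, message):
    """Format a log message that includes the id, so logs can be joined later."""
    return f"correlation_id={correlation_id} service={service} {message}"
```

Each service would call `ensure_correlation_id` on the incoming request, include the id in every log line, and pass it on with `outgoing_headers` when calling the next service. Searching the collected logs for a single id then reconstructs the path of that request.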

Massive generated data

Having many more systems to observe and monitor generates a massive amount of data. We must find cost-efficient and smarter ways to store this data. We must:

Once we have collected and identified this massive amount of data, we must find a way to navigate it, search it, and extract meaning from it.

Compliance

In recent years, many more regulations have appeared that heavily impact the IT sector. Some industries (like finance and health) are required to audit all transactions for a defined period of time. Not everything that is monitored and logged is important for auditing. We must differentiate the collected data, protect or anonymize it when required, and make it accessible if and when the regulators request it.
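
One possible way to protect sensitive values before they reach the logs is to replace them with keyed hashes. The sketch below assumes hypothetical field names (`email`, `account_number`) and a hypothetical helper `pseudonymize`. Strictly speaking this is pseudonymization rather than full anonymization, since anyone holding the key can recompute the same hash for a known value.

```python
import hashlib
import hmac

# Hypothetical field names; which fields are sensitive depends on the regulation.
SENSITIVE_FIELDS = {"email", "account_number"}


def pseudonymize(record, secret_key):
    """Return a copy of the record with sensitive values replaced by keyed hashes."""
    cleaned = {}
    for field, value in record.items():
        if field in SENSITIVE_FIELDS:
            # HMAC-SHA256 gives a stable token per value without storing the value.
            digest = hmac.new(secret_key, str(value).encode("utf-8"), hashlib.sha256)
            cleaned[field] = digest.hexdigest()
        else:
            cleaned[field] = value
    return cleaned
```

Because the same input always maps to the same token, records belonging to one customer can still be correlated during an audit without exposing the original value in the logs.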

Conclusions

Distributed systems are awesome. They allow us to do amazing tasks with high availability.

Sadly, they offer two big disadvantages:

These systems are harder to monitor and to triage than the monoliths that preceded them. As complexity grows, the need to monitor them also grows.