Observability in distributed systems
Carlos Morales • 2022-05-01
Modern distributed architectures offer many advantages, no doubt. One of their disadvantages is an extra dimension of complexity. This makes observability (understanding what is happening inside a system) even more important.
Importance
Monitoring supports rapid diagnosis so that problems can be resolved before they escalate. The main reasons why monitoring matters:
- Security: you cannot defend a system if you do not know what is happening on the perimeter and inside.
- User focus: our services must meet the business needs and keep end users happy. Alerts should tell us when this is not happening.
- Mitigate problems: reduce the time to detect, triage, and mitigate problems. This is achieved by having enough data about the state of our service to accurately confirm its correctness.
- Optimize costs: all of the above must be achieved in a cost-effective way.
Can you know …
Healthy, stable, and reliable systems require that we can answer these questions at any time:
- Can you tell when something is going wrong?
- Can you tell how bad the problem is?
- Can you identify the root cause so you can fix it?
- Can you do a post-mortem analysis so it does not happen again?
Who would not want to be able to answer them? Sadly, answering those questions is not a trivial task. Let’s review what makes it hard.
Complexity
There are multiple reasons why observability has become more complex than it used to be. These are my top ones:
Microservice architectures
When we ran monolithic applications, the complexity resided in the code. In a microservice architecture, the complexity lies in the interactions. Monitoring the code may be less important today than it used to be; on the other hand, monitoring the interactions between services, and with cloud-native managed services, has become critical.
complexity is moving from the code into the interactions
Container-based architectures
Container-based architectures bring some challenges of their own:
- Ephemeral infrastructure: one consequence of Infrastructure as Code is that environments are meant to last for a limited amount of time, gaining reproducibility at the cost of extra complexity. This impacts observability: all monitoring tooling must be described as code too (see the sketch after this list).
- More levels of abstraction: modern applications run on managed services where the underlying OS has been abstracted away. This simplifies the development of the business requirements, which is great. Nevertheless, the abstracted layers sometimes cause issues of their own.
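As a toy illustration of what "monitoring described as code" can look like, here is a minimal Go sketch. The AlertRule type, its fields, and the example expressions are assumptions made up for this sketch, not the schema of any particular monitoring tool; the point is only that the rules live in version control and are applied by the same pipeline that recreates the ephemeral environment.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// AlertRule is a hypothetical, tool-agnostic schema. The rule lives next to
// the service code and is recreated together with the environment, instead
// of being configured by hand in a monitoring UI.
type AlertRule struct {
	Name     string `json:"name"`
	Expr     string `json:"expr"` // query understood by the monitoring backend
	For      string `json:"for"`  // how long the condition must hold before alerting
	Severity string `json:"severity"`
}

func main() {
	rules := []AlertRule{
		{Name: "HighErrorRate", Expr: "error_rate > 0.05", For: "5m", Severity: "page"},
		{Name: "HighLatencyP99", Expr: "latency_p99_ms > 500", For: "10m", Severity: "ticket"},
	}
	out, err := json.MarshalIndent(rules, "", "  ")
	if err != nil {
		log.Fatal(err)
	}
	// This output would be fed to the monitoring stack by the same
	// deployment pipeline that builds the rest of the infrastructure.
	fmt.Println(string(out))
}
```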
Traceability
Understanding how an incoming request interacts with your system remains a requirement. Tracking the flow of a request across the cloud network and through multiple microservices, perhaps even across multiple data centers, is much more difficult.
Distributed tracing is about understanding how the same request travels through multiple distributed systems. There are multiple techniques; one of the most common is to propagate a unique id through each service (e.g. in a well-known HTTP header, as part of every logged message, etc.). This requires agreement across all the systems involved.
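As an illustration, the sketch below shows this technique as Go HTTP middleware. The header name X-Correlation-ID and the helper names are assumptions chosen for the example; in practice the participating teams agree on a header of their own or adopt a standard such as W3C Trace Context.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"log"
	"net/http"
)

// correlationHeader is an assumed header name; any agreed-upon header works
// as long as every service in the chain uses the same one.
const correlationHeader = "X-Correlation-ID"

// withCorrelationID reuses the incoming id or generates a new one, so that
// every log line and every downstream call can carry the same identifier.
func withCorrelationID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get(correlationHeader)
		if id == "" {
			buf := make([]byte, 8)
			rand.Read(buf)
			id = hex.EncodeToString(buf)
		}
		// Echo the id back and keep it on the request for later handlers.
		w.Header().Set(correlationHeader, id)
		r.Header.Set(correlationHeader, id)
		log.Printf("correlation_id=%s method=%s path=%s", id, r.Method, r.URL.Path)
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Any outgoing call to another service would copy the header forward.
		w.Write([]byte("ok, correlation id: " + r.Header.Get(correlationHeader)))
	})
	log.Fatal(http.ListenAndServe(":8080", withCorrelationID(mux)))
}
```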
Massive generated data
As we have many more systems to observe and monitor, they generate a massive amount of data. We must find cost-efficient and smarter ways to store this data. We must:
- assess the value of the data sets being stored,
- define the retention periods, and
- decide how this data is going to be used.
Once we have collected and identified this massive amount of data, we must find a way to navigate it, search it, and extract meaning from it.
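To make the retention point concrete, here is a minimal sketch of such a policy expressed in Go. The data categories and durations are invented for illustration, not recommendations.

```go
package main

import (
	"fmt"
	"time"
)

// retentionPolicy maps an (assumed) data category to how long it is kept.
var retentionPolicy = map[string]time.Duration{
	"debug_logs":   7 * 24 * time.Hour,       // cheap to regenerate, short-lived
	"access_logs":  90 * 24 * time.Hour,      // useful for trend analysis
	"audit_events": 7 * 365 * 24 * time.Hour, // kept for regulatory reasons
}

// expired reports whether a record of the given category, created at t,
// is past its retention period and can be deleted.
func expired(category string, t time.Time) bool {
	keep, ok := retentionPolicy[category]
	if !ok {
		return false // unknown data: keep it until someone classifies it
	}
	return time.Since(t) > keep
}

func main() {
	created := time.Now().Add(-30 * 24 * time.Hour)
	fmt.Println(expired("debug_logs", created))  // true: older than 7 days
	fmt.Println(expired("access_logs", created)) // false: within 90 days
}
```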
Compliance
In recent years, many more regulations have appeared that heavily impact the IT sector. Some industries (like finance and healthcare) require auditing all transactions for a defined period of time. Not everything that is monitored and logged is relevant for auditing. We must differentiate the collected data, protect or anonymize it when required, and make it accessible if and when regulators request it.
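As a small example of the "protect or anonymize" step, the sketch below replaces a personal identifier with a hash before it reaches the logs. The function name and the field being anonymized are assumptions for the example; a real system would use a keyed hash (e.g. HMAC with a secret kept outside the log pipeline) and a proper data-classification policy.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// anonymizeEmail replaces a personal identifier with a deterministic hash,
// so audit records can still be correlated without exposing the raw value.
// Note: a plain hash can be reversed by dictionary attack; use a keyed hash
// (HMAC) in production.
func anonymizeEmail(email string) string {
	sum := sha256.Sum256([]byte(email))
	return hex.EncodeToString(sum[:8])
}

func main() {
	fmt.Printf("audit_event=payment user=%s amount=42.50\n",
		anonymizeEmail("alice@example.com"))
}
```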
Conclusions
Distributed systems are awesome. They allow us to do amazing things with high availability.
Sadly, they come with two big disadvantages:
- as more systems interact, these interactions create more points of failure.
- each independent system may work correctly in isolation while, at the same time, the end-to-end interactions do not.
These problems are harder to monitor and triage than in the monoliths of the past. As complexity grows, the need to monitor it grows as well.