Recently I’ve started working in a mysterious area of the IT world: Observability. It turns out to be quite relevant in modern IT companies :)
In an ideal scenario, whenever IT systems fail, users shouldn’t feel a thing, thanks to modern approaches to system design. The resiliency of distributed systems should let them either self-heal or fail over to backup infrastructure.
However, we don’t live in such a world, and outages and incidents happen quite regularly. Well, most of them are caused by new deployments and software changes :)
Whenever incidents happen, though, engineers need a view inside the system to understand which areas are impacted and to drill down to a specific root cause.
Traditional monitoring will not always work because it is built on pre-defined rules and thresholds. Distributed IT systems fail (or degrade) in all sorts of weird ways that these thresholds cannot always pick up.
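To make this concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the request records, the region names and the 5% alert rule are assumptions, not taken from any real monitoring product. It shows a pre-defined global threshold staying silent while one region is badly degraded.

```python
from collections import defaultdict

# Hypothetical request records: (region, succeeded).
requests = (
    [("eu-west", True)] * 480
    + [("us-east", True)] * 480
    + [("ap-south", False)] * 30
    + [("ap-south", True)] * 10
)

ERROR_RATE_THRESHOLD = 0.05  # hypothetical rule: alert above 5% errors


def error_rate(records):
    """Fraction of the given records that failed."""
    records = list(records)
    failures = sum(1 for _, ok in records if not ok)
    return failures / len(records)


def threshold_alert(records):
    """Pre-defined rule: fire only when the global error rate crosses the threshold."""
    return error_rate(records) > ERROR_RATE_THRESHOLD


def error_rate_by_region(records):
    """Ad hoc question: what does the error rate look like per region?"""
    grouped = defaultdict(list)
    for region, ok in records:
        grouped[region].append((region, ok))
    return {region: error_rate(group) for region, group in grouped.items()}


print("alert fired:", threshold_alert(requests))
print("per region:", error_rate_by_region(requests))
```

In this made-up data set the global error rate is 3%, so the rule never fires, yet three out of four requests in one region fail. Nobody wrote a rule for that question in advance, and that is exactly the kind of state a fixed threshold misses.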
This is where Observability comes into play.
The goal of Observability is to create an environment in which engineers can ask questions about the internal state of the system. With the help of data and visualization tools, engineers should be able to unravel the knot of system complexity.
Observability is a combination of documentation, policies and processes, together with collecting the right sort of data (logs, traces) and having a place to visualize everything. It is essentially a tool set that lets engineers see into complex distributed systems.
This is extremely useful for monitoring deployments and their impact across the whole system, and it helps when resolving major incidents or tracking down smaller system degradations.
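As a rough illustration of what "the right sort of data" can look like, here is a small Python sketch using only the standard library. The service names, field names and helper functions are hypothetical, and real setups would normally rely on a dedicated logging or tracing library rather than hand-rolled JSON. The idea it shows is simply that every log line carries a trace ID, so all events belonging to one request can be pulled together across services.

```python
import json
import uuid


def make_event(trace_id, service, message, **fields):
    """Build one structured log line (JSON) tagged with a trace ID."""
    event = {"trace_id": trace_id, "service": service, "message": message}
    event.update(fields)
    return json.dumps(event)


def events_for_trace(lines, trace_id):
    """Return the parsed events that belong to one request, in emitted order."""
    parsed = (json.loads(line) for line in lines)
    return [event for event in parsed if event["trace_id"] == trace_id]


# Two hypothetical requests passing through hypothetical services.
trace_a = uuid.uuid4().hex
trace_b = uuid.uuid4().hex
log_lines = [
    make_event(trace_a, "gateway", "request received", path="/checkout"),
    make_event(trace_b, "gateway", "request received", path="/search"),
    make_event(trace_a, "payments", "card charge failed", status=502, duration_ms=1840),
    make_event(trace_b, "search", "query served", status=200, duration_ms=35),
]

# Follow a single request through the system.
for event in events_for_trace(log_lines, trace_a):
    print(event["service"], "-", event["message"])
```

Because the events are structured rather than free text, the same data can later answer questions nobody planned for, such as which service was slow for a particular request.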