Starting with observability in a company

When you start implementing observability practices in a company, work with the resources that you already have. Only when they are fully exhausted, or when they start to limit your processes, should you think about buying a new tool to lift those constraints.

Representatives of observability and monitoring vendors often reach out to me offering their solutions. It’s not that simple. Tooling isn’t the answer if you are just starting. Problems won’t get solved with shiny new tools. If a company is not technologically and culturally prepared to embrace the benefits of observability, there’s very little benefit in purchasing new tooling.

Observability work starts with the people. It starts with evolving the teams’ culture so that IT systems are delivered and supported based on data. Data helps you understand IT systems. It brings clarity, not just for an isolated component, but for the whole distributed system that delivers value to the end user.

Teams must use data to understand system performance, to react to and resolve degradations and outages, and to track how deployments affect the service. Chances are that your IT systems already generate some metrics and logs that are not yet fully used. Squeeze as much as possible out of them, and look for opportunities to measure the blind spots in your system.

To see whether you have any blind spots (parts of the system that are measured poorly or not at all), use high-level and low-level system design diagrams. Having documentation is vital. Try to visualize the user’s journey through the parts of the system. Think about business metrics (demand in terms of transactions or users), system performance metrics (response time, latency, load) and IT component utilization (CPU, memory, network, disk). Are these metrics being collected?
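One way to make this check concrete is to keep a small inventory of the components on your design diagrams and note which of the three metric categories each one already reports. The sketch below is only an illustration of that idea: the component names, metric names and the find_blind_spots helper are made up for this example and do not come from any particular tool.

```python
from dataclasses import dataclass, field

# The three metric categories discussed above.
CATEGORIES = ("business", "performance", "utilization")


@dataclass
class Component:
    """One box from a system design diagram (hypothetical inventory entry)."""
    name: str
    # Metrics that are already collected, grouped by category.
    collected: dict[str, list[str]] = field(default_factory=dict)


def find_blind_spots(components: list[Component]) -> dict[str, list[str]]:
    """Map each component name to the categories with no collected metrics."""
    gaps: dict[str, list[str]] = {}
    for component in components:
        missing = [c for c in CATEGORIES if not component.collected.get(c)]
        if missing:
            gaps[component.name] = missing
    return gaps


# Made-up components along a user's journey, for illustration only.
journey = [
    Component("web-frontend", {
        "business": ["page_views"],
        "performance": ["response_time"],
        "utilization": ["cpu", "memory"],
    }),
    Component("checkout-api", {"performance": ["latency"]}),
    Component("orders-db"),
]

for name, missing in find_blind_spots(journey).items():
    print(f"{name}: no {', '.join(missing)} metrics collected")
```

Even a list this simple, kept next to the design documentation, shows where the user's journey passes through parts of the system that nobody is measuring.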

If you collect data from all corners of the system and people actually look at it and use it, you can refine your operational and development processes to rely on data as well.

So when you have used everything at your disposal:
– You are collecting data from all corners of your IT systems
– People are using data to make decisions
– Development and operational processes are based on data
– Incidents and performance degradations are understood as well as the amount and type of data allow
and all of this still isn’t enough for you to deliver good IT service, only then should you consider purchasing a shiny new observability tool that will lift your IT to another level.