how to understand IT systems (my experience)

If you are not from the IT world, it can be difficult to understand what IT systems are and how they work. For the past couple of years I’ve been trying to wrap my head around the complexity of IT systems, and one of the things that helped me was fitting them into simpler concepts.

Essentially, every IT system has to solve some sort of problem and deliver some value to its users. Users provide some input and expect something in return. What happens in between is the job of the IT system.

Personally, I like to break it down into a single simple concept:

In its essence, there are 3 parts – Input, IT system, Output.

  1. The user provides some sort of input – for example, a request to see the contents of a website;
  2. Then the request goes into the IT system (the black box) – a chunk of code hosted on some physical hardware (even the cloud is hosted somewhere physical);
  3. After the code is executed, the user receives an output – the content of the webpage.

Depending on the task the IT system is built to perform, the complexity can vary. At this level of abstraction, it makes sense to me to think of systems at a high level.

If a user wishes to visit a website, they enter a web address, the request gets routed over the internet to a web server, the server queries a database to fetch the website data, and the user gets back the website content.
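
To make this a bit more concrete, here is a minimal sketch of that flow in Python, using only the standard library – a tiny web server that looks up page content in a database and returns it. The file name pages.db and the pages table are made up for illustration, not a real setup:

```python
# Input -> IT system (this handler + a database) -> Output
# A minimal sketch; pages.db and the "pages" table are hypothetical.
import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer

class PageHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Input: the web address (path) the user requested
        conn = sqlite3.connect("pages.db")
        row = conn.execute(
            "SELECT content FROM pages WHERE path = ?", (self.path,)
        ).fetchone()
        conn.close()
        # Output: the website content (or an error if nothing was found)
        body = (row[0] if row else "page not found").encode()
        self.send_response(200 if row else 404)
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), PageHandler).serve_forever()
```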

A simple website is easy to visualize and understand; however, real-world IT systems include many more components – load balancers, rate limiters, message queues, microservices, database replication, content delivery networks and others.

By keeping this simple input-output concept in mind, it’s much easier to grasp the complexity of an IT system.

To put this to further use – we can measure the inputs, the outputs and what happens in between.

By tracking how many requests go into the IT system, you can expect a certain amount of output based on its intended behavior. If the input/output dynamic changes, you can draw insights about the system’s performance. For example, if it takes longer to process inputs before outputs are produced, you can suspect performance degradation, and if there is no output at all – detect a system outage.
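
A toy sketch of this idea in Python – the counters and the 500 ms latency budget are made-up examples, not a real monitoring setup:

```python
import time

requests_in = 0       # inputs going into the system
responses_out = 0     # outputs coming back to users
total_latency = 0.0   # time spent in between

def process(request):
    # placeholder for whatever the "black box" actually does
    return f"handled {request}"

def handle(request):
    global requests_in, responses_out, total_latency
    requests_in += 1
    started = time.monotonic()
    response = process(request)
    total_latency += time.monotonic() - started
    responses_out += 1
    return response

def check_health():
    if requests_in > 0 and responses_out == 0:
        print("inputs but no outputs -> suspect an outage")
    elif responses_out and total_latency / responses_out > 0.5:
        print("outputs take too long -> suspect performance degradation")
```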

Measuring inputs and outputs not only lets you understand how your product is performing, it also helps you detect the inevitable system failures, performance degradation and incidents, and understand user behavior.

This concept is fundamental (at least for me) for understanding IT systems, and it leads to a better understanding of their performance and of user behavior.

deployed changes can cause incidents even after 6 months

did you just deploy a change to prod? critical systems didn’t go down? seems there is no negative impact in general? awesome :)) but..
 
don’t celebrate just yet – although these changes haven’t impacted your systems at the moment, they could set off a butterfly effect.
 
one early sunday morning an unexpected major incident could happen, caused by a change made 6 months earlier that didn’t go off until the right conditions were met.
 
distributed systems often hide their true internal states, which cannot be picked up by monitoring systems.
 
establishing proper observability practices could help you catch these potential warning signs early on and prevent inconveniences for your users.
 
collect high-cardinality, high-dimensionality data about your systems. data has high cardinality when you’ve got many unique identifiers across events (rows), and high dimensionality when each event has many attributes (columns).
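
for example, a single "wide" event describing one request could look like the sketch below – every field name here is made up for illustration:

```python
import datetime
import json
import uuid

# one wide event per request: request_id is unique per event (high
# cardinality), the long list of attributes is what makes it highly
# dimensional. all field names are hypothetical.
event = {
    "request_id": str(uuid.uuid4()),
    "user_id": "user-48213",
    "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "service": "checkout",
    "endpoint": "/api/v1/orders",
    "http_status": 200,
    "duration_ms": 137,
    "db_query_count": 4,
    "deploy_version": "2024-03-12.1",
    "region": "eu-central-1",
    # ...dozens more attributes (columns) in a real setup
}

print(json.dumps(event))  # ship this to wherever you store your telemetry
```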
 
explore it in real time when pushing new code to prod. understand how your systems are intertwined, connect the dots with the rich data. let your engineers ask questions about the internal states of the system – it can help them better understand its behavior before it’s too late.

hello_world!:)

Recently I’ve started working in a mysterious area of the IT world – Observability. It appears to be quite relevant in modern IT companies:)

In an ideal scenario, whenever IT systems fail, users shouldn’t feel a thing, thanks to modern approaches to system design. The resiliency of distributed systems should let them either self-heal or fail over to backup infrastructure.

However, we don’t live in such a world, and outages & incidents happen quite regularly. Well, most of them are caused by new deployments and software changes:)

Whenever incidents do happen, engineers must have a view inside the system to understand which areas are being impacted and drill down to a specific root cause.

Traditional monitoring will not always work because it’s designed around pre-defined rules. Distributed IT systems fail (or degrade) in all sorts of weird states which cannot always be picked up by these thresholds.
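
A caricature of such a pre-defined rule, sketched in Python with an arbitrary threshold:

```python
# Traditional monitoring in a nutshell: a fixed rule on a single metric.
CPU_ALERT_THRESHOLD = 90.0  # percent, an arbitrary example value

def check_cpu(cpu_percent: float) -> None:
    if cpu_percent > CPU_ALERT_THRESHOLD:
        print("ALERT: CPU usage is too high")
    # A failure mode that never pushes CPU past 90% stays invisible here.
```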

This is where Observability comes into play.

The goal of Observability is to create an environment where engineers can ask questions about the internal states of the system. With the help of data & visualization tools, engineers should be able to unravel the knot of system complexity.
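
Assuming you already collect structured events about each request (one record per request, with fields such as endpoint, duration_ms and deploy_version – all hypothetical names), a question like "which endpoints got slower after this deploy?" could be answered with a few lines:

```python
from collections import defaultdict
from statistics import mean

# `events` is a list of dicts, one per request, collected by your tooling.
def slow_endpoints(events, deploy_version, budget_ms=300):
    durations = defaultdict(list)
    for event in events:
        if event["deploy_version"] == deploy_version:
            durations[event["endpoint"]].append(event["duration_ms"])
    return {
        endpoint: mean(values)
        for endpoint, values in durations.items()
        if mean(values) > budget_ms
    }
```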

Observability is a set of documentation, policies and processes, collecting the right sort of data – logs, traces – and having a place to visualize everything. It is essentially a toolset that lets engineers see into complex distributed systems.

This is extremely useful for monitoring deployments & their impact across the whole system, and helpful when resolving major incidents or tracking down smaller system degradations.