Good Strategy

As 2023 draws to a close, I find myself in a reflective state, contemplating the year that has passed and planning for 2024. A central theme that recurs throughout this reflection is the course we’re steering, which is shaped by strategy or, sometimes, by its conspicuous absence. I recently read Good Strategy Bad Strategy (Goodreads) by Richard P. Rumelt, which I consider essential for anyone looking to grasp how strategic thinking can channel collective efforts for the greater good. Here, I’d like to highlight a few essential insights from the book.

The premise is simple yet profound: by harnessing our existing resources and circumstances, we can navigate the complexities of the world much more effectively. The recipe for a solid strategy is a three-step process: diagnosing the challenge (1), formulating guiding policies (2), and executing coherent actions (3). Think of the diagnosis as identifying a disease, while the guiding policy serves as the well-conceived treatment plan that must be consistently implemented until the patient is cured. The author does an exceptional job of demystifying the concept of strategy:

1. This step involves recognising the hurdles your business currently faces, something you are typically aware of to some extent. The key is to distil the complexity of your reality into a few essential components.

2. Once the critical issues are identified, it’s time to devise broader approaches to tackle them. These approaches become your guiding policies and provide direction.

3. Finally, you must define specific initiatives that align with the guiding policies. These actions should complement each other and be integrated across all parts of the organisation: internal & external processes and, most importantly, the user journey of the product you’re creating.

While this may sound overly simplistic, it is far from it. If done correctly, the outcome, as Richard P. Rumelt asserts, is what constitutes a ‘Good Strategy’. He expands on all of these aspects in greater detail, provides concrete examples and offers ways to think about the kernel of strategy. Give it a read – highly recommended.

Capacity Management

Capacity Planning is sort of an art, but mostly maths. After being in the game for a while, you start to sense business demand trends and system loads, but that intuition should be backed up by numbers.
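
To make the “backed up by numbers” part concrete, here is a minimal sketch of the kind of arithmetic involved (my own illustration, all numbers are made up): given an expected peak demand, a measured per-instance throughput and a utilization target, how many instances do you need?

  import math

  peak_demand_rps = 1200      # expected peak requests per second (made-up number)
  per_instance_rps = 150      # measured throughput of a single instance
  target_utilization = 0.6    # keep headroom – plan to run instances at ~60% of their capacity

  # how much traffic one instance can safely absorb while keeping that headroom
  effective_rps = per_instance_rps * target_utilization

  required_instances = math.ceil(peak_demand_rps / effective_rps)
  print(f"Need {required_instances} instances to serve a {peak_demand_rps} rps peak")   # -> 14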

There is an in-depth article by Diego Ballona on Capacity Planning in the ByteByteGo blog that shows how deep this subject can go.

However, from a wider point of view, Capacity Management can be looked at through an economics lens – as the demand for your service changes, what do you do with your supply?

The book Data Science for Supply Chain Forecasting has some great content on forecasting business demand and is beginner friendly. However, I have to agree with some of the criticism – you cannot foresee the future, and forecasting cannot be relied on 100%.
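
Just to show the flavour of such forecasting, here is a generic simple exponential smoothing sketch (my own toy example, not a method copied from the book, and the demand history is made up):

  # simple exponential smoothing:
  # next forecast = alpha * latest_actual + (1 - alpha) * previous_forecast
  def forecast_demand(history, alpha=0.3):
      forecast = history[0]
      for actual in history[1:]:
          forecast = alpha * actual + (1 - alpha) * forecast
      return forecast

  monthly_transactions = [410, 395, 450, 470, 455, 500]    # made-up demand history
  print(round(forecast_demand(monthly_transactions)))      # a naive estimate for next month

And exactly as the criticism goes – the number that comes out is an estimate, not a guarantee.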

Therefore you need to learn how to jazz. As the situation dynamically changes, you need to adapt your system capacity. This is much easier if the systems are designed to be scalable, but, like everywhere, there are always constraints. If you would like to understand more about this topic, I really liked the classic book on the Theory of Constraints called The Goal.
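
As a hedged sketch of what adapting capacity on the fly can look like, here is the proportional scaling rule many autoscalers use (the Kubernetes HPA applies a very similar formula), clamped by a constraint such as a maximum instance count; the numbers and the cap are invented for illustration:

  import math

  def desired_instances(current, current_utilization, target_utilization, max_instances):
      # scale proportionally to how far utilization is from the target,
      # but never beyond the constraint (max_instances)
      desired = math.ceil(current * current_utilization / target_utilization)
      return min(max(desired, 1), max_instances)

  print(desired_instances(current=10, current_utilization=0.85,
                          target_utilization=0.60, max_instances=12))   # -> 12 (would want 15, but constrained)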

And if you would like to learn more about capacity planning, you can dig deeper into The Art of Capacity Planning, which focuses more on cloud systems, or take a look at the more big-enterprise, ITIL-based book – A-Z of Capacity Management.

Starting with observability in a company

When starting to implement observability practices in a company, work with the resources that you have. Only when they are fully exhausted, or they start to limit your processes, should you think of buying a new tool that will lift these constraints.

I often get approached by representatives of observability/monitoring vendors offering their solutions. It’s not that simple. Tooling isn’t the answer if you are just starting. Problems won’t get solved with new shiny tools. If a company is technologically and culturally not prepared to embrace the benefits provided by observability, there’s very little benefit in purchasing new tooling.

Observability work starts with the people. It starts with the evolution of the teams’ culture to deliver & support IT systems based on data. Data is good: it helps you understand IT systems and brings clarity, not just for an isolated system component but for the whole distributed system that brings value to the end user.

The teams must utilize data to understand system performance, react to and resolve system degradations & outages, and track how deployments impact the service. Chances are there are already some metrics & logs generated by the IT systems that are not yet fully utilized. Squeeze as much as possible out of them and look for opportunities to measure the blind spots in your system.
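
As an example of squeezing what is already there, the access logs most web servers write by default can be turned into basic request and error counts without buying anything. A rough sketch – the file path and the log layout are assumptions, adjust them to whatever your servers actually emit:

  from collections import Counter

  status_counts = Counter()

  # assumes a common access-log layout where the HTTP status code is the 9th field,
  # e.g. '203.0.113.7 - - [28/Dec/2023:06:14:02 +0000] "GET /checkout HTTP/1.1" 500 1234'
  with open("access.log") as log:                  # hypothetical file path
      for line in log:
          fields = line.split()
          if len(fields) > 8 and fields[8].isdigit():
              status_counts[fields[8]] += 1

  total = sum(status_counts.values())
  server_errors = sum(c for status, c in status_counts.items() if status.startswith("5"))
  if total:
      print(f"{total} requests, {server_errors / total:.2%} server errors")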

To see if you have any blind spots (parts of the system that are poorly measured or not measured at all), leverage high-level and lower-level system design charts. Having documentation is vital. Try to visualize the user’s journey through the parts of the system. Think about business metrics (demand in terms of transactions/users), system performance metrics (response time, latency, loads) and IT component utilization (CPU/Memory/Network/Disk) – are these metrics getting collected?
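
One low-tech way to make the blind spots visible is to list each component from the design chart against those three metric categories and see what comes up empty – a sketch with made-up components and coverage:

  # which metric categories are actually collected per component (made-up inventory)
  coverage = {
      "load balancer": {"business", "performance"},
      "web app":       {"business", "performance", "utilization"},
      "database":      {"utilization"},
      "message queue": set(),                      # nothing collected -> blind spot
  }

  expected = {"business", "performance", "utilization"}

  for component, collected in coverage.items():
      missing = expected - collected
      if missing:
          print(f"{component}: missing {sorted(missing)}")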

If you collect data from all corners of the system and people are actually looking at it and using it, you can polish your operational and development processes to leverage that data as well.

So when you utilize everything at your disposal:
– You start collecting data from all corners of IT systems
– People are leveraging and making decisions based on data
– Development & operational processes are based on data
– Incidents & performance degradations are understood as much as possible given the amount and type of data
And if all of this still isn’t enough for you to deliver a good IT service – only then might you consider purchasing a new shiny observability tool that will enable you to lift your IT to another level.

how to understand IT systems (my experience)

If you are not from the IT world, it can be difficult to understand what IT systems are and how they work. For the past couple of years I’ve been trying to wrap my head around the complexity of IT systems, and one of the things that helped me was fitting it into simpler concepts.

Essentially, all IT systems must solve some sort of problem and deliver some value to users. Users provide some input and expect something in return. What happens in between is the job of the IT system.

Personally, I like to break it down into a single simple concept:

In its essence, there are 3 parts – Input, IT system, Output.

  1. The user provides some sort of input, for example – a request to see the contents of a website;
  2. Then the request goes into the IT system (Black Box) – a chunk of code that is hosted on some physical hardware (even the cloud is hosted somewhere physical);
  3. After the code is executed, the user receives an output – sees the content of the webpage;

Depending on the task that the IT system is built to perform, the complexity can vary. For me, it makes sense to think of systems at this high level of abstraction.

If a user wishes to visit a website, they enter a web address; the request gets routed via the internet to a web server, which queries a database to fetch the website data and returns the website content to the user.
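
To make the black box a little less abstract, here is a toy version of that flow – the “database” is just a dictionary and the routing is a plain function call, which is of course a big simplification:

  # a toy "database" holding the website content
  database = {"/home": "<h1>Welcome!</h1>", "/about": "<h1>About us</h1>"}

  def web_server(path):
      # Input: the address the user requested -> Output: the page content (or an error)
      page = database.get(path)                 # the "query the database" step
      return (200, page) if page else (404, "<h1>Page not found</h1>")

  status, body = web_server("/home")            # the user's request going in
  print(status, body)                           # the output the user sees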

A simple website is easy to visualize and understand; however, real-world IT systems include many more components – load balancers, rate limiters, messaging queues, microservices, database replication, content delivery networks and others.

With this simple input-output concept in mind, it’s much easier to grasp IT system complexity.

To put this into further use – we can measure the Inputs, the Outputs and the in-between.

By tracking how many requests go into the IT system, you can expect a certain amount of output based on its intended behavior. If the input/output dynamic changes, you can draw insights about system performance. For example, if it takes longer to process inputs before outputs are produced, you can suspect a performance degradation; if there is no output at all, you can detect a system outage.
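
A sketch of that reasoning in code – the thresholds here are arbitrary examples I picked for illustration, not recommendations:

  def assess(inputs_per_min, outputs_per_min, p95_latency_ms, baseline_latency_ms=200):
      # requests keep arriving but nothing comes out -> likely an outage
      if inputs_per_min > 0 and outputs_per_min == 0:
          return "outage suspected"
      # outputs are produced, but much slower than the usual baseline -> degradation
      if p95_latency_ms > 2 * baseline_latency_ms:
          return "performance degradation suspected"
      return "looks healthy"

  print(assess(inputs_per_min=600, outputs_per_min=0,   p95_latency_ms=0))     # outage suspected
  print(assess(inputs_per_min=600, outputs_per_min=590, p95_latency_ms=950))   # degradation suspected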

Measuring inputs and outputs not only lets you understand how your product is performing, it also helps you detect the inevitable system failures, performance degradations and incidents, and understand user behavior.

This concept is fundamental to understanding IT systems (at least for me), which in turn leads to a better understanding of their performance and of user behavior.

deployed changes can cause incidents even after 6 months

did you just deploy a change to prod? critical systems didn’t go down? seems there is no negative impact in general? awesome :)) but..
 
don’t congratulate yourself just yet – although the change didn’t impact your systems at the moment, it could have set a butterfly effect in motion.
 
one early sunday morning an unexpected major incident could happen, caused by a change made 6 months earlier that didn’t set off until the conditions were right.
 
distributed systems often hide their true internal states, which cannot be picked up by monitoring systems.
 
having established proper observability practices could help you catch these potential warning signs early on and prevent inconveniences for your users.
 
collect high-cardinality, high-dimensionality data about your systems. the data has high cardinality when you’ve got many unique identifiers across events (rows); and it has high dimensionality when each event has many attributes (columns).
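
a concrete way to picture such a “wide event” (all field names and values below are made up for illustration): one row per request, where the unique trace id drives the cardinality and the number of attributes drives the dimensionality.

  # one event (row) per request: the unique trace_id drives cardinality,
  # the number of attributes (columns) drives dimensionality
  event = {
      "trace_id": "9f2c4e7a-1b3d-4a6f-8e21-7c5d90ab12ef",   # unique per request
      "timestamp": "2023-12-28T06:14:02Z",
      "service": "checkout",
      "endpoint": "/api/v1/payment",
      "status_code": 500,
      "duration_ms": 1834,
      "customer_id": "cust_48211",
      "app_version": "2.41.0",
      "region": "eu-central-1",
      "feature_flag_new_pricing": True,
  }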
 
explore it in real time when pushing new code to prod. understand how your systems are interlaced, connect the dots with the rich data. let your engineers ask questions about the internal states of the system which can help them better understand its behavior before it’s too late.

hello_world!:)

Recently I’ve started working in a mysterious area of the IT world – Observability. It appears to be quite relevant in modern IT companies:)

In an ideal scenario, whenever IT systems fail, users shouldn’t feel a thing thanks to the modern approach to system design. The resiliency of distributed systems should help them either self-heal or fail over to backup infrastructure.

However, we don’t live in such a world, and outages & incidents happen quite regularly. Well, most of them are caused by new deployments and software changes:)

When incidents do happen, engineers must have a view inside the system to understand which areas are impacted and to drill down to the specific root cause.

Traditional monitoring will not always work because it is designed to operate on pre-defined rules. Distributed IT systems fail (or degrade) in all sorts of weird states that cannot always be picked up by these thresholds.

This is where Observability comes into play.

The goal of Observability is to create an environment for engineers to ask questions about the internal states of the system. With the help of data & visualization tools, engineers should be able to unravel the knot of system complexity.
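
A toy illustration of what “asking questions” can mean when rich events are kept (the events and the question are invented): an ad-hoc query that no pre-defined threshold would have asked in advance.

  # raw request events kept with their attributes (invented sample data)
  events = [
      {"endpoint": "/api/pay",  "status": 500, "app_version": "2.41.0", "region": "eu"},
      {"endpoint": "/api/pay",  "status": 200, "app_version": "2.40.9", "region": "eu"},
      {"endpoint": "/api/pay",  "status": 500, "app_version": "2.41.0", "region": "us"},
      {"endpoint": "/api/cart", "status": 200, "app_version": "2.41.0", "region": "eu"},
  ]

  # ad-hoc question: "are the payment errors tied to a particular app version?"
  errors = [e for e in events if e["endpoint"] == "/api/pay" and e["status"] >= 500]
  errors_by_version = {}
  for e in errors:
      errors_by_version[e["app_version"]] = errors_by_version.get(e["app_version"], 0) + 1
  print(errors_by_version)   # -> {'2.41.0': 2}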

Observability is a set of documentation, policies and processes, collecting the right sort of data – metrics, logs, traces – and having a place to visualize everything. It is essentially a toolset for engineers to see into complex distributed systems.

This is extremely useful for monitoring deployments & their impact across the whole system, and it is helpful when resolving major incidents or tracking down smaller system degradations.