deployed changes can cause incidents even after 6 months

did you just deploy a change to prod? critical systems didn’t go down? no negative impact anywhere, as far as you can tell? awesome :)) but..

don’t congratulate yourself just yet. although the change didn’t impact your systems right away, it could have set a butterfly effect in motion.

one early sunday morning you could get hit by an unexpected major incident caused by a change made 6 months earlier, one that stayed dormant until the conditions were right.

distributed systems often hide their true internal states, and monitoring that only watches predefined metrics and thresholds often can’t pick them up.

having proper observability practices in place could help you catch these warning signs early on and keep them from turning into problems for your users.

collect high cardinality, high dimensionality data about your systems. data has high cardinality when its fields hold many unique values, such as a unique identifier for every event (row); and it has high dimensionality when every event carries many attributes (columns).
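
to make the two terms concrete, here is a minimal python sketch. it assumes your events are plain dicts, one per request; the field names (`request_id`, `user_id`, `build` and so on) are made up for illustration and don’t come from any particular tool.

```python
from collections import defaultdict

# hypothetical wide events: one dict per request, field names are invented
events = [
    {"request_id": "r1", "user_id": "u1", "endpoint": "/cart", "build": "v41", "status": 200},
    {"request_id": "r2", "user_id": "u2", "endpoint": "/cart", "build": "v42", "status": 500},
    {"request_id": "r3", "user_id": "u1", "endpoint": "/pay", "build": "v42", "status": 200},
]


def cardinality(events):
    """count the distinct values seen for each field."""
    seen = defaultdict(set)
    for event in events:
        for field, value in event.items():
            seen[field].add(value)
    return {field: len(values) for field, values in seen.items()}


def dimensionality(events):
    """count the distinct fields (attributes) across all events."""
    return len({field for event in events for field in event})


field_cardinality = cardinality(events)
field_count = dimensionality(events)

print(field_cardinality)
print(field_count)
```

in this toy data `request_id` is the high cardinality field (every event has its own value), while the number of keys per event is the dimensionality. real systems would have far more of both.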

explore it in real time when pushing new code to prod. understand how your systems are interlaced and connect the dots with the rich data. let your engineers ask questions about the internal states of the system, so they can understand its behavior before it’s too late.
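
one way to picture "asking questions" of that data: slice the same events by any attribute and compare error rates. this is only a sketch under assumptions, with invented events and field names (`build`, `region`, `status`) and a hypothetical helper `error_rate_by`; a real observability tool would run this kind of query for you.

```python
from collections import Counter

# hypothetical events collected around a deploy; all values are invented
events = [
    {"build": "v41", "region": "eu", "status": 200},
    {"build": "v41", "region": "us", "status": 200},
    {"build": "v41", "region": "eu", "status": 200},
    {"build": "v42", "region": "eu", "status": 500},
    {"build": "v42", "region": "us", "status": 200},
    {"build": "v42", "region": "eu", "status": 500},
]


def error_rate_by(events, field):
    """group events by one attribute and return the share of 5xx responses per group."""
    totals = Counter()
    errors = Counter()
    for event in events:
        key = event.get(field, "<missing>")
        totals[key] += 1
        if event.get("status", 0) >= 500:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


by_build = error_rate_by(events, "build")
by_region = error_rate_by(events, "region")

print(by_build)
print(by_region)
```

because every event keeps all its attributes, the same question can be re-asked along any dimension (build, region, user, endpoint) without deciding in advance which one will matter.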