Rootconf 2015

DevOps and scaling infrastructure

Dhananjay Sathe

@dhananjaysathe

Let’s Talk About Time - Data Driven Organic Monitoring @Directi

Submitted Apr 10, 2015

  • Describe a radical rethink of how we perceive alerts and monitoring and the profound implications it has on how we describe and interact with our infrastructure.
  • Present its production implementation and how it gives us fresh insight into, and analysis of, how we deal with problems.
  • Discuss the benefits, challenges and constructs it brings.

Outline

Humans perceive dimensions and space, and time in particular, very differently from machines. Yet almost every monitoring system today ignores this key distinction and floods us with thousands of mechanical streams of data-points that are increasingly a burden for an operator to interpret and react to. These systems often do not account for vital data about the operational response, which is key to how infrastructure is run in any enterprise.

The perception of time is central to this concept. By clustering diverse data-points, we build an abstract construct that closely mimics how humans perceive and react to events and situations. While raw data-streams are modeled as low-dimension immutable facts that systems can rapidly and effectively interpret, this abstraction is mutable and modeled as an FSM (finite state machine), enabling it to hold several derived dimensions that are of great value. These attributes/dimensions are derived through transformation functions triggered by the event stream.
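A minimal Python sketch of this idea follows. It is an illustration under stated assumptions, not Slant's actual implementation: the `Event`, `Issue`, `State` and `cluster` names, the three event kinds, the transition table and the 300-second clustering window are all hypothetical. Raw events are immutable facts, while an issue is a mutable FSM whose derived attributes (alert count, time to acknowledge, time to resolve) are updated by the event stream.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, List, Optional, Tuple


class State(Enum):
    OPEN = "open"
    ACKNOWLEDGED = "acknowledged"
    RESOLVED = "resolved"


@dataclass(frozen=True)
class Event:
    """An immutable, low-dimension fact from the raw stream (hypothetical shape)."""
    timestamp: float  # seconds
    source: str       # e.g. a host or check name
    kind: str         # "alert", "ack" or "recovery"


# Allowed transitions: (current state, event kind) -> next state.
TRANSITIONS: Dict[Tuple[State, str], State] = {
    (State.OPEN, "alert"): State.OPEN,
    (State.OPEN, "ack"): State.ACKNOWLEDGED,
    (State.OPEN, "recovery"): State.RESOLVED,
    (State.ACKNOWLEDGED, "alert"): State.ACKNOWLEDGED,
    (State.ACKNOWLEDGED, "recovery"): State.RESOLVED,
}


@dataclass
class Issue:
    """A mutable FSM that accumulates derived dimensions from events."""
    source: str
    opened_at: float
    last_seen: float
    state: State = State.OPEN
    alert_count: int = 0
    time_to_ack: Optional[float] = None
    time_to_resolve: Optional[float] = None

    def apply(self, event: Event) -> None:
        next_state = TRANSITIONS.get((self.state, event.kind))
        if next_state is None:
            return  # event is not valid in the current state; ignore it
        # Transformation functions: each event kind updates a derived attribute.
        if event.kind == "alert":
            self.alert_count += 1
        elif event.kind == "ack":
            self.time_to_ack = event.timestamp - self.opened_at
        elif event.kind == "recovery":
            self.time_to_resolve = event.timestamp - self.opened_at
        self.state = next_state
        self.last_seen = event.timestamp


def cluster(events: Iterable[Event], window: float = 300.0) -> List[Issue]:
    """Group events per source into issues; a gap longer than `window` starts a new one."""
    issues: List[Issue] = []
    active: Dict[str, Issue] = {}
    for event in sorted(events, key=lambda e: e.timestamp):
        issue = active.get(event.source)
        stale = (
            issue is None
            or issue.state is State.RESOLVED
            or event.timestamp - issue.last_seen > window
        )
        if stale:
            if event.kind != "alert":
                continue  # an ack or recovery with no live issue to attach to
            issue = Issue(event.source, event.timestamp, event.timestamp)
            issues.append(issue)
            active[event.source] = issue
        issue.apply(event)
    return issues
```

With this shape, an operator looks at a handful of issues, each carrying its own response history, rather than at every individual data-point.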

This abstraction has had a profound impact on how we at Directi interact with our issues and infrastructure, and it enables us to explore possibilities that didn’t exist before. We shall take a sneak peek at the λ-architecture and Materialized Views in Slant (our platform), which abstracts the standard monitoring layer into a mesh of highly composable and flexible constructs.
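For readers unfamiliar with the λ-architecture, the general pattern can be sketched as below. This is a generic illustration, not a description of Slant's internals, which this abstract does not detail; `batch_view`, `SpeedView` and `query` are hypothetical names. A batch view is recomputed from the immutable event log up to a cutoff, a speed view is updated incrementally for newer events, and a query merges the two into one materialized answer.

```python
from collections import Counter
from typing import Iterable, Tuple

# A raw event here is just (timestamp, source); this shape is an assumption.
RawEvent = Tuple[float, str]


def batch_view(log: Iterable[RawEvent], up_to: float) -> Counter:
    """Batch layer: recompute per-source event counts from the full log before `up_to`."""
    return Counter(source for timestamp, source in log if timestamp < up_to)


class SpeedView:
    """Speed layer: incrementally count events at or after the batch cutoff."""

    def __init__(self, since: float) -> None:
        self.since = since
        self.counts: Counter = Counter()

    def ingest(self, event: RawEvent) -> None:
        timestamp, source = event
        if timestamp >= self.since:  # older events are already in the batch view
            self.counts[source] += 1


def query(batch: Counter, speed: SpeedView, source: str) -> int:
    """Serving layer: merge batch and real-time counts for one source."""
    return batch[source] + speed.counts[source]
```

The same split applies to richer views than counts; the point is that historical and real-time data are answered through one query path.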

This has empowered us to ask and answer complex questions such as: How do I define the virality/relative score of an issue? How did the operations team respond? What caused my team to lose sleep, and how did they react to and resolve these issues? What deployment, support and policy changes impacted operations, and how? How do I deduce optimal alerts and escalations? Which issues are critical, and what was the root cause?

Finally, we shall look at real-world examples, such as the ability to identify hot points, overlay diverse but related data, and define auto-aggregations in a more natural form, doing away with trigger-level redundancy. We shall also see how this enables conversation, allows us to organically explore issues, and gives us a unified top-level view, both real-time and historical, of issues in our infrastructure that we can visualize.

Requirements

No specific requirements; familiarity with monitoring and alerting systems will help.

Speaker bio

Dhananjay Sathe is a BITS Pilani graduate currently working as a Sr Operations Engineer on the Platform Team at Directi/Endurance, where he architects and builds the central operations platform and toolchain.
In the past he has contributed to open-source projects such as Samba (through GSoC) and GNOME, and was one of the developers behind the RoboEarth Cloud Engine. His prior speaking engagements include multiple highly rated talks at PyCon India, Rootconf 14 and GoogleIOx.
His favourite hobbies include programming, travelling, exploring adventure sports and craft brews (in no particular order of preference).

Slides

http://goo.gl/sgmbMM
