Anomaly Detection at Scale: Architectural Choices for Data Pipelines for 7B events per day

Submitted Jul 2, 2019

Session type: Full talk of 40 mins

Cloud-native applications. Multiple cloud providers. Hybrid cloud. Thousands of VMs and containers. Complex network policies. Millions of connections and requests in any given time window. This is the typical situation faced by a Security Operations Center (SOC) analyst every single day. In this talk, the speaker describes the highly available and highly scalable data pipelines that he built for the following use cases:

Denial of Service: A device in the network stops working.
Data Loss: For example, a rogue agent in the network transmits IP data outside the network.
Data Corruption: A device starts sending erroneous data.

The above can be modeled through anomaly detection models. The main challenge here is the data engineering pipeline. With almost 7 billion events occurring every day (roughly 81,000 events per second on average), processing and storing them for further analysis is a significant challenge. The machine learning models (for anomaly detection) have to be updated every few hours, which requires the pipeline to build the feature store within a short time window.
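To make the scale concrete, the following back-of-the-envelope sketch converts the daily volume into an average rate and a per-retraining-interval volume. Only the 7 billion events per day figure comes from the talk; the 4-hour retraining interval and the 500-byte average event size are assumptions chosen purely for illustration, and real traffic is bursty, so peak rates will be higher than the average.

```python
# Back-of-the-envelope sizing. Only EVENTS_PER_DAY comes from the talk.
EVENTS_PER_DAY = 7_000_000_000
SECONDS_PER_DAY = 24 * 60 * 60

# Hypothetical values, assumed only for illustration.
RETRAIN_INTERVAL_HOURS = 4   # "every few hours" taken as 4 hours
AVG_EVENT_BYTES = 500        # assumed average serialized event size


def average_events_per_second(events_per_day: int) -> float:
    """Average ingest rate, ignoring bursts and daily seasonality."""
    return events_per_day / SECONDS_PER_DAY


def events_per_interval(events_per_day: int, hours: float) -> float:
    """Events that arrive during one model-retraining interval."""
    return events_per_day * hours / 24


def raw_bytes_per_day(events_per_day: int, avg_event_bytes: int) -> int:
    """Uncompressed daily volume before replication or indexing overhead."""
    return events_per_day * avg_event_bytes


if __name__ == "__main__":
    rate = average_events_per_second(EVENTS_PER_DAY)
    per_interval = events_per_interval(EVENTS_PER_DAY, RETRAIN_INTERVAL_HOURS)
    daily_tb = raw_bytes_per_day(EVENTS_PER_DAY, AVG_EVENT_BYTES) / 1e12
    print(f"average rate: {rate:,.0f} events/s")
    print(f"events per {RETRAIN_INTERVAL_HOURS}h interval: {per_interval:,.0f}")
    print(f"raw volume: {daily_tb:.1f} TB/day")
```

Under these assumptions, each retraining cycle has to turn more than a billion new events into features, which is why the feature-building step is the tight constraint.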

The core components of the data engineering pipeline are:

Apache Pulsar
Apache Flink
Apache Kafka
Apache Druid
Apache Spark
Apache Cassandra

Apache Pulsar is the pub-sub messaging system. It provides unified queuing and streaming; think of it as a combination of Kafka and RabbitMQ. The event logs are stored in Druid through a Kafka topic, since Druid supports an Apache Kafka based indexing service for real-time data ingestion. Druid has only basic capabilities for computing sliding time-window statistics, so more complex real-time statistics are computed using Flink. Apache Flink is a stream-processing engine that provides high throughput and low latency. Spark jobs are used for batch processing. Cassandra serves as the data warehouse as well as the final database.
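As a rough illustration of what a sliding time-window statistic looks like, here is a minimal single-process sketch in plain Python. It is not Flink or Druid code and does not reflect the speaker's implementation; the class name `SlidingWindowDetector`, the z-score rule, and all parameter defaults are hypothetical choices made for this example. It assumes events arrive in timestamp order, which a real streaming engine does not get to assume.

```python
from collections import deque
from statistics import fmean, pstdev


class SlidingWindowDetector:
    """Flags values that deviate strongly from a trailing time window.

    Hypothetical helper for illustration; assumes in-order timestamps.
    """

    def __init__(self, window_seconds: float, threshold: float = 3.0,
                 min_samples: int = 5):
        self.window_seconds = window_seconds
        self.threshold = threshold      # z-score above which a value is flagged
        self.min_samples = min_samples  # do not judge with too little history
        self._window = deque()          # (timestamp, value), oldest first

    def observe(self, timestamp: float, value: float) -> bool:
        """Score `value` against the current window, then add it to the window."""
        # Evict events that have fallen out of the time window.
        cutoff = timestamp - self.window_seconds
        while self._window and self._window[0][0] <= cutoff:
            self._window.popleft()

        values = [v for _, v in self._window]
        anomalous = False
        if len(values) >= self.min_samples:
            mean = fmean(values)
            std = pstdev(values)
            if std > 0:
                anomalous = abs(value - mean) / std > self.threshold
            else:
                # A perfectly flat history: any change at all is suspicious.
                anomalous = value != mean

        self._window.append((timestamp, value))
        return anomalous


if __name__ == "__main__":
    detector = SlidingWindowDetector(window_seconds=60)
    # Requests per second from one device, one reading per second.
    readings = [100, 101, 99, 100, 102, 98, 500]
    for t, rps in enumerate(readings):
        if detector.observe(t, rps):
            print(f"t={t}s: {rps} req/s looks anomalous")
```

A stream processor has to maintain this kind of state per key (for example per device or per connection), cope with out-of-order events, and survive failures, which is the work delegated to Flink in the pipeline described above.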

The speaker talks through the architectural decisions and shows how to build a modern real-time stream processing data engineering pipeline using the above tools.

Outline

The problem: overview
Different Architecture Choices
The final architecture - a brief explanation
Real-Time Processing
- Apache Pulsar
  - Queueing and Streaming
  - Pulsar vs Kafka vs RabbitMQ
  - Why Pulsar for this application?
- Apache Flink
  - Micro-batching vs Streaming?
  - Basic Spark Streaming Micro Batching With State
  - Flink ADS - Asynchronous Distributed Snapshot
  - Why Flink for this application?
- Apache Druid
  - What is OLAP?
  - ClickHouse vs Druid
  - Why Druid for this application?
Batch Processing
- Apache Spark
  - Data Engineering + Machine Learning
  - ML and MLlib
- Apache Cassandra
  - What is OLTP?
  - Cassandra vs HBase vs Couchbase vs MongoDB
  - Why Cassandra for this application?
A short demo

Speaker bio

Tuhin Sharma is a co-founder of Binaize Labs, an AI-based cyber security start-up. He worked at IBM Watson and Red Hat as a data scientist, where he mainly worked on social media analytics, demand forecasting, retail analytics and customer analytics. He also worked at multiple start-ups, where he built personalized recommendation systems to maximize customer engagement using ML and DL techniques across domains such as FinTech, EdTech, media and e-commerce. He completed his postgraduate studies in Computer Science and Engineering, specializing in Data Mining, at the Indian Institute of Technology Roorkee. He has filed 5 patents and published 4 research papers in the fields of natural language processing and machine learning. In his leisure time, he loves to play table tennis and guitar. His favourite quote is “Life is Beautiful.” You can tweet him at @tuhinsharma121.

Slides

https://docs.google.com/presentation/d/1ZCaETragZsRntQTK1nIpFmcNV_ZJVfz2rrh4omeD-24/edit?usp=sharing

The Fifth Elephant round the year submissions for 2019