Design a real-time anomaly detection application using Spark and Machine Learning
Our team works on a streaming application that produces outputs in the form of real-time comparisons between us and our competitors.
The comparisons are consumed by our market managers and hotel partners and are used by them for making their day-to-day decisions and prioritize their business actions for the day.
Owing to the importance and scale of the data, it is important to ensure that our Streaming application always produces the correct comparisons(free of anomalies) or if there are anomalies we quickly come to know and fix the root cause.
That motivated us to build an application that can read the comparisons in real-time and quickly detect anomalies in them.
We are using Spark Streaming as the framework and leveraging the scikit-learn library in Python for the machine learning algorithms.
The best part about our journey is that we started off with just the name of the machine learning algorithm that we wanted to implement and chalked out all the design for implementing it for our Streaming data in less than a week and that is what motivated me to share this journey with you guys.
The session as mentioned in the abstract is primarily about our journey wherein we just started with the motivation to identify anomolous streaming outputs produced by our system.
Ideas to be presented and what should everyone do to implement and design such pipelines quickly. I will be elaborating on these steps during my talk :
Step 1: Find a motivation. Machine learning actually comes into the picture when static rules-based learning from the past data does not work. There are many dimensions to the data and complex relationships between the dimensions.
Step 2: Think of what you are trying to do intuitively. Like in our case we are trying to find anomolous data so naturally it follows that some form of Outlier Detection is what we need. The next step is to study the different algorithms that are available/ already implemented and study them. We chose Local Outlier Factor algorithm for our use case.
Step 3 : Know your data. The system design would heavily depend on the kind of data on which you want to train, how frequently you want to train and what is the kind of predicted data that you want. In our case, we train on a combination of pre-computed and streaming data and the predictions are real-time.
Step 4 : Choose a framework of choice depending on the scale and complexity/specifics of the problem. This is the part wherein we chose to make a Spark Application and further because we were implementing pure Outlier Detection(with novelty set to False) so we went for Structured Streaming because things like windowing, watermarking, concept of unbounded data frame were something that were a direect fit to our problem.
Step 5 : Write the code. First we developed a small prototype in Python and then tested each and every line of code in PySpark. We particularly checked that each and every element/library function works fine with the Spark Dataframes or not in terms of whether it serializable or not.. whether it runs into performance issues..
I will try to explain each of the steps with diagrams so that they very untuitive to understand.
No requirements as this is a crisp talk..
I am Ankit Jain. I completed by B.Tech in Computer Engineering from Delhi Technoloigical University in 2015.
So, I have been working as a Software Developer at the Expedia Group, Gurgaon since then.
My interests include Spark, Scala, Streaming Systems and Big Data. The most recent addition to my list of interests is Machine learning and I am astounded by the kind of cool problems that we can solve using Machine learning.
- This is my first session to an audience outside of my company. So, no previous session details I can provide :)
- Link to my likedin profile : https://www.linkedin.com/in/ankit-jain-03b17578/