Approximate Query Processing
Submitted by DEEPAK GOYAL (@zonker) on Saturday, 31 March 2018
Data analysts are constantly exploring different forms of data in search of new insights that lead to better business decisions. The email marketing team at Walmart relies heavily on Customer Segmenter, an in-house tool that determines which customers are best suited for an email advertisement based on various attributes. Running these analytics was very costly, however, in both time and cluster resources: even a simple query could take anywhere from minutes to hours to complete. Most of the time, analysts cannot tell whether a query will give them the information they need until they actually run it and see the results, and quite often they have to modify the query several times. The good news is that they do not need exact results. To spare the marketing team this painstaking experience, we use Verdict, a next-generation query processor that can reduce the computational cost on an existing cluster by 100x-200x. Verdict provides an immediate answer that is 99.9% accurate, whereas our analysts were comfortable with an error bound of five percent. An immediate result helps our analysts decide whether to go ahead and run the full query or modify it to better fit an email campaign. Verdict is compatible with all existing SQL-based databases and big data engines such as Hive and Spark.
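To make the idea of an approximate answer with an error bound concrete, here is a minimal sketch in Python. It uses plain uniform sampling, a generic approximate-query technique, and is not a description of how Verdict itself is implemented. The data set, the function name `approximate_mean`, and the 1% sample fraction are all assumptions made for illustration.

```python
import math
import random


def approximate_mean(values, sample_fraction=0.01, z=1.96, seed=0):
    """Estimate the mean of `values` from a uniform random sample.

    Returns (estimate, margin), where margin is the half-width of an
    approximate 95% confidence interval (normal approximation, z = 1.96).
    """
    rng = random.Random(seed)
    n = max(2, int(len(values) * sample_fraction))
    sample = rng.sample(values, n)  # uniform sample without replacement
    mean = sum(sample) / n
    variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
    margin = z * math.sqrt(variance / n)
    return mean, margin


# Hypothetical data: 200,000 synthetic order values (not real Walmart data).
data_rng = random.Random(42)
order_values = [data_rng.uniform(5.0, 200.0) for _ in range(200_000)]

# Exact answer: scans every row.
exact = sum(order_values) / len(order_values)

# Approximate answer: scans roughly 1% of the rows.
estimate, margin = approximate_mean(order_values)
relative_error = abs(estimate - exact) / exact

print(f"exact mean       = {exact:.2f}")
print(f"approximate mean = {estimate:.2f} +/- {margin:.2f}")
print(f"relative error   = {relative_error:.2%}")
```

The approximate query touches only a small fraction of the rows, and the reported margin tells the analyst how far to trust the estimate before deciding whether the full query is worth running.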
- Intro Slide
- About me.
- Data Analytics
- Email Marketing at Walmart
- Problem Statement: Conducting data analytics is a time- and resource-consuming job. Most of the time, analysts do not know whether a query will give them the information they need until they actually run it and see the results, and quite often they have to modify the query several times. An approximate result would help them make an early decision.
- Introduction to VerdictDB
- Compatibility with existing SQL engines
- Features of Verdict
I’m sharing my experience at Walmart of solving the problem of long-running, resource-consuming, and often futile Customer Segmentation queries faced by our Email Marketing team. Currently, I’m working on an in-house distributed database/streaming platform built on top of Kafka Streams for Walmart.