The Fifth Elephant round the year submissions for 2019

Submit a talk on data, data science, analytics, business intelligence, data engineering and ML engineering

Building a large-scale Data as a Service (DaaS) platform to consistently deliver high-quality datasets

Submitted by Aayushi Pathak (@09aayushi) (proposing) on May 1, 2019

This is a proposal requesting someone to speak on this topic. If you’d like to speak, leave a comment.

Session type: Short talk of 20 mins
Status: Rejected

Abstract

As a provider of Competitive Intelligence as a Service to eCommerce businesses and consumer
brands, DataWeave aggregates and analyzes product catalog data from eCommerce websites each
day at massive scale. Once aggregated, this data is fed into a complex process of extraction,
transformation, machine learning, and analysis. These operations run on a regular schedule
to provide our customers with easily consumable and actionable insights.
Specifically, we aggregate over 200 million data points across 2,000+ web sources to deliver 200+
reports each day.
The web sources span multiple verticals, including eCommerce, travel, classified listings, mobile
apps, and more. The data transformation ranges from simple ETL to more complex ML models.
Having a unified framework to collect, transform, analyze, validate, and deliver the data at a large
scale is a significant challenge.

Outline

As organizations continue to invest in and consume Big Data, challenges in capturing, structuring, and processing data in a meaningful and cost-effective manner are growing causes for concern. The scope of the problem is wide-ranging, especially since Big Data is now mission-critical in diverse industries, such as healthcare, banking, IoT, retail, and more.
Smart organizations are looking for ways to capture and store data (both internal and third-party) at scale, process it efficiently, and generate actionable insights consistently – all without reaching for deeper pockets.
This talk will shed light on DataWeave’s journey in building a data processing framework that handles diverse datasets at scale and consistently delivers accurate, high-quality insights to its retail customers.

Talk flow:
Evolution of the platform
1. Data collection at scale
   Managing the challenges of collecting publicly available data
   Normalization
2. Data processing
   AI/ML
   Data transformation
3. Data delivery
   Data validation and retry
   Delivering custom reports
4. Airflow: scheduling the entire pipeline
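
To make the collect, normalize, validate, and retry stages in the talk flow concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not DataWeave's implementation: the record fields (`source`, `title`, `price`), the function names, and the retry policy are all hypothetical, and it does not use Airflow. In a scheduled deployment, each stage would more likely be a separate task with retries configured in the scheduler.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


def normalize(raw: Record) -> Record:
    """Map a raw collected record onto a common schema (hypothetical fields)."""
    price = raw.get("price")
    return {
        "source": str(raw.get("source", "")).strip().lower(),
        "title": str(raw.get("title", "")).strip(),
        "price": float(price) if price not in (None, "") else None,
    }


def validate(records: List[Record]) -> List[str]:
    """Return a list of problems; an empty list means the batch passes."""
    problems: List[str] = []
    if not records:
        problems.append("batch is empty")
    for i, rec in enumerate(records):
        if not rec["title"]:
            problems.append(f"record {i}: missing title")
        if rec["price"] is None or rec["price"] < 0:
            problems.append(f"record {i}: bad price")
    return problems


def run_with_retry(
    collect: Callable[[], List[Record]], max_attempts: int = 3
) -> List[Record]:
    """Collect, normalize, and validate a batch; re-collect if validation fails."""
    problems: List[str] = []
    for _ in range(max_attempts):
        batch = [normalize(r) for r in collect()]
        problems = validate(batch)
        if not problems:
            return batch  # only validated data moves on to delivery
    raise RuntimeError(
        f"validation failed after {max_attempts} attempts: {problems}"
    )
```

The design choice illustrated here is that validation gates delivery: a batch that fails its checks is collected again rather than shipped, and the pipeline fails loudly once the attempts are exhausted.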

Speaker bio

Rahul Ramesh, Architect, DataWeave
I work as an architect in the data platforms team at DataWeave, a provider of Competitive Intelligence as a Service for eCommerce businesses and consumer brands. I design and manage dataflows to various ‘Datastores’ maintained by the company. I also ensure that all datastores are working at optimum capacity, and data consistency is maintained across them.
I have more than 12 years of experience in the software industry, with extensive experience in building core networks in the telecommunications domain. I hold a master’s degree from IIIT-Bangalore.

Slides

https://drive.google.com/file/d/1ZUa0LdR4se-hOx57mA8sLaozQ4hlHX0t/view?usp=sharing

Preview video

https://www.dropbox.com/sh/p2pmuppfseuxtix/AADhCGKRl1gf6snVogWw0Dfna/Rahul/2019-Rahul-V2.mp4?dl=0
