The Fifth Elephant round the year submissions for 2019

Submit a talk on data, data science, analytics, business intelligence, data engineering and ML engineering


Building a large-scale Data as a Service (DaaS) platform to consistently deliver high-quality datasets

Submitted by Aayushi Pathak (@09aayushi) (proposing) on Wednesday, 1 May 2019

This is a proposal requesting someone to speak on this topic. If you’d like to speak, leave a comment.


Session type: Short talk of 20 mins

Abstract

As a provider of Competitive Intelligence as a Service to eCommerce businesses and consumer brands, DataWeave aggregates and analyses product catalog data from eCommerce websites each day at massive scale. Once aggregated, this data is fed into a complex process of extraction, transformation, machine learning, and analyses. These operations are performed on a consistent basis to provide our customers with easily consumable and actionable insights.
To be precise, we aggregate over 200 million data points across 2000+ web sources to deliver 200+ reports each day.
The web sources span multiple verticals ranging across eCommerce, travel, classified listings, mobile apps, and more. The data transformation may range from a simple ETL to more complex ML models. Having a unified framework to collect, transform, analyze, validate and deliver the data at a large scale is a significant challenge.

Outline

As organizations continue to invest in and consume Big Data, challenges in capturing, structuring, and processing data in a meaningful and cost-effective manner are growing causes for concern. The scope of the problem is wide-ranging, especially since Big Data is now mission-critical in diverse industries, such as healthcare, banking, IoT, retail, and more.
Smart organizations are looking for ways to capture and store data (both internal and third-party) at scale, process it efficiently, and generate actionable insights consistently – all without reaching for deeper pockets.
This talk will shed light on DataWeave’s journey in building a data processing framework that handles diverse datasets at scale and consistently delivers accurate, high-quality insights to its retail customers.

Talk flow:
Evolution of the platform
1. Data collection at scale
   • Managing the challenges of collecting publicly available data
   • Normalization
2. Data processing
   • AI/ML
   • Data transformation
3. Data delivery
   • Data validation and retry
   • Delivering custom reports
4. Airflow: for scheduling the entire pipeline
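
As a rough illustration of the "Data validation and retry" step in the talk flow, here is a minimal sketch in plain Python. It is not taken from DataWeave's platform: the function names (`validate`, `deliver_with_retry`), the validation rule (non-empty dataset, positive `price` on every record), and the default of three attempts are all assumptions made for the example.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]


def validate(records: List[Record]) -> bool:
    """Hypothetical rule: the dataset is non-empty and every record has a positive price."""
    return bool(records) and all(
        isinstance(r.get("price"), (int, float)) and r["price"] > 0
        for r in records
    )


def deliver_with_retry(
    collect: Callable[[], List[Record]],
    max_attempts: int = 3,
) -> List[Record]:
    """Re-run collection until the dataset passes validation or attempts run out."""
    for _ in range(max_attempts):
        records = collect()
        if validate(records):
            return records  # only validated data moves on to delivery
    raise RuntimeError(f"dataset failed validation after {max_attempts} attempts")
```

In a scheduler such as Airflow, the retry loop would more likely be expressed through the task's own `retries` setting, with the validation running as a separate task that fails when the check does not pass; the loop here only makes the idea concrete.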

Speaker bio

Rahul Ramesh, Architect, DataWeave
I work as an architect in the data platforms team at DataWeave, a provider of Competitive Intelligence as a Service for eCommerce businesses and consumer brands. I design and manage dataflows to various ‘Datastores’ maintained by the company. I also ensure that all datastores are working at optimum capacity, and data consistency is maintained across them.
I have more than 12 years of experience in the software industry, with extensive experience in building core networks in the telecommunications domain. I hold a master’s degree from IIIT-Bangalore.

Slides

https://drive.google.com/file/d/1ZUa0LdR4se-hOx57mA8sLaozQ4hlHX0t/view?usp=sharing

Preview video

https://www.dropbox.com/sh/p2pmuppfseuxtix/AADhCGKRl1gf6snVogWw0Dfna/Rahul/2019-Rahul-V2.mp4?dl=0

Comments

  • Abhishek Balaji (@booleanbalaji) Reviewer a month ago

    Hello Aayushi/Rahul,

    Thank you for submitting a proposal. To proceed with evaluation, we need to see more details in your slides. Your slides must cover the following:

    • Problem statement/context, which the audience can relate to and understand. The problem statement has to be a problem (based on this context) that can be generalized for all.
    • What were the tools/options available in the market to solve this problem? How did you evaluate alternatives, and what metrics did you use for the evaluation?
    • Why did you pick the option that you did?
    • Explain what the situation was before the solution you picked/built and how it changed after you implemented it. Show before-after scenario comparisons and metrics.
    • What compromises/trade-offs did you have to make in this process?
    • What is the one takeaway that you want participants to go back with at the end of this talk? What is it that participants should learn/be cautious about when solving similar problems?
    • Is the tool free/open-source? If not, what can the audience take away from the talk?

    We need to see the updated slides on or before 21 May in order to close the decision on your proposal. If we do not receive an update by 21 May we’ll move the proposal for consideration at a future event.

    • Aayushi Pathak (@09aayushi) Proposer 28 days ago

      @Abhishek, the deck has been updated according to the feedback. Kindly review the same.

      Thanks,
      Aayushi

  • Aayushi Pathak (@09aayushi) Proposer 19 days ago

    @Abhishek,
    Just wanted to follow up on the proposals we had submitted. When can we expect the results to be out?
