The Fifth Elephant 2017

On data engineering and application of ML in diverse domains

Using data pipelines to navigate your data ocean

Submitted by Vipul Mathur (@vipulmathur) on Thursday, 27 April 2017

Technical level

Beginner

Section

Full talk for data engineering track

Status

Submitted

Abstract

One of the main challenges facing companies that adopt a data-driven, analytics-based approach to their business is how to scale the development and adoption of data products throughout the company. In our experience, managed data pipelines are one approach that has emerged to address this challenge. This talk will introduce data pipelines and illustrate how they address it, balancing concepts, examples, and demos. Anyone interested in learning best practices to increase productivity and agility for analytics-based data products would benefit.

Intended audience

Architects, data scientists, and engineers looking to scale their analytics work within a company.

Key takeaway: Data Pipelines

  • Are an emerging paradigm/ industry trend in the development of data products
  • Address several challenges in developing/ scaling data products today
  • Provide a framework that enforces best practices for cloud-native apps
  • Have the potential to future-proof your data products
  • Are supported by most analytics frameworks, stacks, and cloud platforms today  

Call to action

Let us know in the comments below what you would expect from this session.

Outline

Teaser: TL;DR [1min]

  • A TL;DR version of what this talk is all about  

Motivation: WHY [8min]

  • Why do we need to evolve the way analytics solutions are built and deployed?
  • Illustration of challenges, with examples
    • Show what ad-hoc/ one-off solutions (without pipelines) look like
    • Dealing with multiple apps, multiple roles (DS/ DE/ BA/ …)
    • Lifecycle of apps, leveraging/ reusing existing work
    • Parallels with scaling software development via collaboration and reuse (e.g., GitHub)

Introduction to Data Pipelines: WHAT [10min]

  • What are Data Pipelines?
    • Establish pipeline concepts with examples
  • Data Pipelines available today (pointers only)
    • Cloud-based solutions: AWS, Azure, GCP
    • Analytics stacks: Hadoop, Spark
  • Data Pipelines vs Task Pipelines
    • Pipelines are not new, but data pipelines are special
  • Developing Data Products using Data Pipelines
    • Conceptual steps: develop/ deploy/ operate/ improve
    • Show, with examples, how pipelines help at each step
  • Benefits of Data Products developed as Data Pipelines
    • Future proofing
    • Best practices  
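
To make the pipeline concept above concrete, here is a minimal sketch in Python. It is not taken from the talk and does not use CDAP or any other framework; the stage names (`source`, `drop_invalid`, `add_kilobytes`), the `run_pipeline` helper, and the sample records are all hypothetical. It only illustrates the general idea of a data pipeline as a source followed by small, reusable transform stages applied in order.

```python
from typing import Callable, Iterable, Iterator, List

Record = dict
Stage = Callable[[Iterable[Record]], Iterator[Record]]


def source(rows: Iterable[Record]) -> Iterator[Record]:
    """Hypothetical source stage: emit a copy of each raw record."""
    for row in rows:
        yield dict(row)


def drop_invalid(records: Iterable[Record]) -> Iterator[Record]:
    """Transform stage: skip records that have no byte count."""
    for record in records:
        if record.get("bytes") is not None:
            yield record


def add_kilobytes(records: Iterable[Record]) -> Iterator[Record]:
    """Transform stage: derive a 'kb' field from the 'bytes' field."""
    for record in records:
        yield {**record, "kb": record["bytes"] / 1024}


def run_pipeline(records: Iterable[Record], stages: List[Stage]) -> List[Record]:
    """Chain the stages in order and collect the final records."""
    for stage in stages:
        records = stage(records)
    return list(records)


# Made-up sample data, standing in for machine-generated logs.
raw = [
    {"host": "a", "bytes": 2048},
    {"host": "b", "bytes": None},
    {"host": "c", "bytes": 512},
]

result = run_pipeline(source(raw), [drop_invalid, add_kilobytes])
print(result)
```

Because each stage only consumes and produces records, stages can be reordered, reused across applications, or swapped out without touching the others, which is the kind of reuse the outline refers to.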

Managed Data Pipelines using CDAP: HOW [10min]

  • Introduction to CDAP and Data Pipelines in CDAP
  • Demo of analytics app development and deployment in CDAP using Data Pipelines
  • Advantages of CDAP and Alternatives to CDAP  

Wrap-up: NEXT [3min]

  • Apache Beam and the future of Data Pipelines (noteworthy future item)
  • Links and references (for audience to follow-up later)
  • Key take-home points (over Q&A)  

Q&A [7min]

Speaker bio

7+ years of experience wading through petabytes of machine-generated data to extract insights and business value using machine learning and other analytics techniques. Learnt the hard way how to make this work at scale within a large organization.

Links

Preview video

https://www.youtube.com/watch?v=zWKQe409slU

Comments

  • Zainab Bawa (@zainabbawa), Reviewer, a year ago

    Hello Vipul, please upload and share a link to a two-minute preview video explaining what your talk is about and what the key takeaway is for the audience.
