The Fifth Elephant 2017

On data engineering and application of ML in diverse domains

Vipul Mathur

@vipulmathur

Using data pipelines to navigate your data ocean

Submitted Apr 27, 2017

One of the main challenges facing companies that adopt a data-driven, analytics-based approach to their business is how to scale the development and adoption of data products throughout the company. In our experience, managed data pipelines are one approach that has emerged to address these challenges. This talk will introduce data pipelines and illustrate how they address these challenges. The talk balances concepts, examples, and demos. Anyone interested in learning best practices to increase productivity and agility for analytics-based data products would benefit.

Intended audience

Architects/ data scientists/ engineers looking to scale their analytics work within a company.

Key takeaway: Data Pipelines

  • Are an emerging paradigm/ industry trend in development of data products
  • Address several challenges in developing/ scaling data products today
  • Provide a framework that enforces best practices for cloud-native apps
  • Have the potential to future-proof your data products
  • Are supported by most analytics frameworks, stacks, and cloud platforms today

Call to action

Let us know in the comments below what you would expect from this session.

Outline

Teaser: TL;DR [1min]

  • A TL;DR version of what this talk is all about

Motivation: WHY [8min]

  • Why do we need to evolve the way analytics solutions are built and deployed?
  • Illustration of challenges, with examples
    • Show what ad-hoc/ one-off solutions (without pipelines) look like
    • Dealing with multiple apps, multiple roles (DS/ DE/ BA/ ...)
    • Lifecycle of apps, leveraging/ reusing existing work
    • Parallels with scaling software development via collaboration and reuse (e.g. GitHub)

Introduction to Data Pipelines: WHAT [10min]

  • What are Data Pipelines?
    • Establish pipeline concepts with examples
  • Data Pipelines available today (pointers only)
    • Cloud-based solutions: AWS, Azure, GCP
    • Analytics stacks: Hadoop, Spark
  • Data Pipelines vs Task Pipelines
    • Pipelines are not new, but data pipelines are special
  • Developing Data Products using Data Pipelines
    • Conceptual steps: develop/ deploy/ operate/ improve
    • Show how pipelines help with examples
  • Benefits of Data Products developed as Data Pipelines
    • Future proofing
    • Best practices

Managed Data Pipelines using CDAP: HOW [10min]

  • Introduction to CDAP and Data Pipelines in CDAP
  • Demo of analytics app development and deployment in CDAP using Data Pipelines
  • Advantages of CDAP and Alternatives to CDAP

Wrap-up: NEXT [3min]

  • Apache Beam and the future of Data Pipelines (noteworthy future item)
  • Links and references (for audience to follow-up later)
  • Key take-home points (over Q&A)

Q&A [7min]

Speaker bio

7+ years of experience wading through petabytes of machine-generated data to extract insights and business value using machine learning and other analytics techniques. Learnt the hard way how to make this work at scale within a large organization.

Hosted by

Jump starting better data engineering and AI futures