Using data pipelines to navigate your data ocean

Jul 2017

27 Thu 08:15 AM – 10:00 PM IST

28 Fri 08:15 AM – 06:25 PM IST

MLR Convention Centre, Whitefield, Bengaluru

Submitted Apr 27, 2017

Section: Full talk for data engineering track

Technical level: Beginner

One of the main challenges facing companies that adopt a data-driven, analytics-based approach to their business is how to scale the development and adoption of data products throughout the company. In our experience, managed data pipelines are one approach that has emerged to address these challenges. This talk will introduce data pipelines and illustrate how they address these challenges. The talk will balance concepts, examples, and demos. Anyone interested in learning best practices to increase productivity and agility for analytics-based data products would benefit.

Intended audience

Architects, data scientists, and engineers looking to scale their analytics work within a company.

Key takeaway: Data Pipelines

Are an emerging paradigm/ industry trend in development of data products
Address several challenges in developing/ scaling data products today
Provide a framework that enforces best practices for cloud-native apps
Have the potential to future-proof your data products
Are supported by most analytics frameworks, stacks, and cloud platforms today

Call to action

Let us know in the comments below what you would expect from this session.

Outline

Teaser: TL;DR [1min]

A TL;DR version of what this talk is all about

Motivation: WHY [8min]

Why do we need to evolve the way analytics solutions are built and deployed?
Illustration of challenges, with examples
- Show what ad-hoc/ one-off solutions (without pipelines) look like
- Dealing with multiple apps, multiple roles (DS/ DE/ BA/ ...)
- Lifecycle of apps, leveraging/ reusing existing work
- Parallels with scaling software development via collaboration and reuse (e.g. GitHub)

Introduction to Data Pipelines: WHAT [10min]

What are Data Pipelines?
- Establish pipeline concepts with examples
Data Pipelines available today (pointers only)
- Cloud-based solutions: AWS, Azure, GCP
- Analytics stacks: Hadoop, Spark
Data Pipelines vs Task Pipelines
- Pipelines are not new, but data pipelines are special
Developing Data Products using Data Pipelines
- Conceptual steps: develop/ deploy/ operate/ improve
- Show how pipelines help with examples
Benefits of Data Products developed as Data Pipelines
- Future proofing
- Best practices

Managed Data Pipelines using CDAP: HOW [10min]

Introduction to CDAP and Data Pipelines in CDAP
Demo of analytics app development and deployment in CDAP using Data Pipelines
Advantages of CDAP and Alternatives to CDAP

Wrap-up: NEXT [3min]

Apache Beam and the future of Data Pipelines (noteworthy future item)
Links and references (for audience to follow-up later)
Key take-home points (over Q&A)

Q&A [7min]

Speaker bio

7+ years of experience wading through petabytes of machine-generated data to extract insights and business value using machine learning and other analytics techniques. Learnt the hard way how to make this work at scale within a large organization.

The Fifth Elephant 2017