Using Data to make data processing reliable again

Jul 2018

26 Thu 07:45 AM – 06:15 PM IST

27 Fri 07:45 AM – 05:35 PM IST

NIMHANS Convention Centre, Bengaluru

Submitted Mar 21, 2018

Section: Full talk
Technical level: Intermediate

Data-driven performance management of Big Data infrastructure is very different from performance management of standard applications like web servers. A single cluster receives multiple simultaneous, discrete applications, each of which can comprise up to hundreds of thousands of tasks of varying complexity. If these jobs are not tuned properly, it is easy either to blow up costs by running an underutilized cluster or to starve the jobs of resources and miss SLAs.

This talk is targeted towards engineers who administer Big Data Clusters and would like to improve the efficiency and utilization of their clusters using a data-driven methodology.

Say you have been storing the job characteristics of the SQL queries that run on your cluster:

Query
Query Schedule, start and end times
Number of Map and Reduce tasks
Cumulative CPU seconds and Memory seconds
Data scanned, processed, and written

You also know the layout of the data that forms the input to these queries:

Column types, shape and range
Partitioned columns and size of those partitions
Data serialization format
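As a rough sketch, the two datasets listed above could be modelled as records like the following. The class and field names here are illustrative assumptions chosen to mirror the bullet points, not the schema actually used at Qubole.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class JobStats:
    """Per-query job characteristics (hypothetical field names)."""
    query: str
    start: datetime
    end: datetime
    map_tasks: int
    reduce_tasks: int
    cpu_seconds: float
    memory_seconds: float
    bytes_scanned: int
    bytes_written: int

    @property
    def wall_seconds(self) -> float:
        # Elapsed wall-clock time between the recorded start and end.
        return (self.end - self.start).total_seconds()


@dataclass
class TableLayout:
    """Layout of one input table (hypothetical field names)."""
    table: str
    column_types: dict      # column name -> type name
    partition_sizes: dict   # partition value -> size in bytes
    serialization_format: str


# Made-up sample values, purely for illustration.
job = JobStats(
    query="SELECT count(*) FROM orders",
    start=datetime(2018, 7, 26, 10, 0, 0),
    end=datetime(2018, 7, 26, 10, 5, 0),
    map_tasks=40,
    reduce_tasks=1,
    cpu_seconds=1800.0,
    memory_seconds=7200.0,
    bytes_scanned=10_000_000,
    bytes_written=8,
)
layout = TableLayout(
    table="orders",
    column_types={"id": "bigint", "dt": "string"},
    partition_sizes={"2018-07-25": 4_000_000, "2018-07-26": 6_000_000},
    serialization_format="ORC",
)
```

Keeping job statistics and table layout as separate records makes it possible to join them later on the table names a query references.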

With these two datasets, stored over a period of time, we will try to answer the following questions:

What do we know about the most expensive jobs running on our cluster?
Can we identify the most common anti-patterns in our ad hoc workload and take defensive action against the suspect queries?
Can we identify clusters of tables that are frequently joined together and recommend a better data layout or schema to reduce database load?
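To make the first and third questions concrete, here is a minimal sketch that ranks jobs by cumulative CPU seconds and counts how often pairs of tables appear in the same query. The record shape and the sample values are made up for illustration, and simple co-occurrence counting is only a stand-in for whatever clustering method is ultimately applied.

```python
from collections import Counter
from itertools import combinations

# Hypothetical job records: (query_id, cpu_seconds, tables referenced together)
jobs = [
    ("q1", 5400.0, {"orders", "customers"}),
    ("q2", 120.0, {"orders", "customers", "items"}),
    ("q3", 98000.0, {"clicks"}),
    ("q4", 7600.0, {"orders", "items"}),
]


def most_expensive(jobs, k=2):
    """Return the k jobs with the highest cumulative CPU seconds."""
    return sorted(jobs, key=lambda job: job[1], reverse=True)[:k]


def co_joined_pairs(jobs):
    """Count how often each unordered pair of tables appears in one query."""
    counts = Counter()
    for _, _, tables in jobs:
        # Sorting gives every pair a canonical (alphabetical) order.
        for pair in combinations(sorted(tables), 2):
            counts[pair] += 1
    return counts


top = most_expensive(jobs)
pairs = co_joined_pairs(jobs)
```

Pairs with high counts are candidates for a shared partitioning scheme or a denormalized layout; the counts can also serve as edge weights if the tables are later grouped with a graph clustering algorithm.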

Other parameters, such as cluster configuration and cluster resource allocation, also affect a job’s performance, but we will limit the scope of this talk to job statistics and data layout. We will also discuss only the analysis of SQL workloads, which make up the majority of jobs running on Hive, Spark or Presto clusters.

To serve these needs, we built Tenali, Qubole’s SQL parser and analyzer, which we intend to open source shortly. Tenali is a collection of scoping rules and heuristics that, given a set of queries and their corresponding job characteristics, generates insights to improve the jobs’ efficiency.
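Tenali's interface is not described in this proposal, so the following is not its implementation. It is a deliberately naive sketch, using a hypothetical `referenced_tables` helper and a regular expression, of the kind of table-reference extraction that a real SQL parser with scoping rules performs far more robustly.

```python
import re

# Matches the identifier that follows FROM or JOIN (optionally schema-qualified).
# Assumption: a regex is enough for illustration only. It misreads CTE names as
# tables and is fooled by expressions such as EXTRACT(year FROM col).
_TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def referenced_tables(sql):
    """Return the lower-cased set of names that follow FROM or JOIN."""
    return {name.lower() for name in _TABLE_REF.findall(sql)}


sql = "SELECT o.id FROM sales.orders o JOIN customers c ON o.cid = c.id"
tables = referenced_tables(sql)
```

The failure cases noted in the comment are exactly why scoping rules are needed: only a parser that tracks aliases, subqueries and common table expressions can tell a physical table from a name defined inside the query.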

Outline

Types of data and how we capture them at Qubole
The design of Tenali and its approach to capturing table lineage and data flows
Some well-known algorithms and their performance on these datasets
Examples of how we use this data to improve the efficiency of our data offerings at Qubole

Requirements

Understanding of data tools like Hadoop, Hive, Spark, etc.
Familiarity with ML nomenclature like Classification, Clustering, Nearest Neighbour, etc.

Speaker bio

Devjyoti works at Qubole as a Data Engineer and helps the company gain more insight into the performance of its data processing tools.

The Fifth Elephant 2018