Using Data to make data processing reliable again
Submitted by devjyoti (@kprotocol) on Wednesday, 21 March 2018
Data Driven performance management of Big Data Infrastructure is very different from performance management of standard applications like web servers. A single cluster is submitted multiple simultaneous discrete applications where each of these applications can comprise up to hundreds of thousands of tasks of varying complexities. If these jobs are not tuned properly, then it’s easy to both blow up the costs because of an underutilized cluster or starve the jobs and miss SLA’s because of shortage of resources.
This talk is targeted towards engineers who administer Big Data Clusters and would like to improve the efficiency and utilization of their clusters using a data-driven methodology.
Say, You have been storing the job characteristics for SQL queries that are run on you cluster
2. Query Schedule, start and end times
3. Number of Map and Reduce tasks
4. Cumulative CPU seconds and Memory seconds
5. Data scanned, processed, and written
And you also know the layout of the data which form the input to these queries
1. Column types, shape and range
2. Partitioned columns and size of those partitions
3. Data serialization format
With these two datasets, stored over a period of time, we will try to answer the following questions:
1. What do we know about the most expensive jobs running on our cluster?
2. Can we identify the most common anti-patterns in our adhoc workload and take some defensive action against those suspect queries.
3. Can we identify clusters of tables that are frequently joined together and recommend a better data layout/schema to reduce database load.
Though, there are other parameters like Cluster Configuration and Cluster Resource Allocation which also affect the job’s performance, but we will keep the scope of this talk limited to the Job Statistics and Data Layout. Also, we are going to discuss analysis of only the SQL workloads, which form the major percentage of jobs running on Hive, Spark or Presto clusters.
To serve these needs, we built Tenali, Qubole’s SQL parser and analyzer which we intend to open source shortly. Tenali is a collection of scoping rules and heuristics, that given a set of queries and corresponding job characteristics, generate insights to improve the jobs efficiency.
- Types of Data and how we capture them at Qubole
- Discuss design of Tenali and its approach for capturing table lineage and data flows.
- Discuss some well known algorithms and their performance on these datasets
- Examples of how we use this data to improve the efficiency of our data offerings at Qubole
Understanding of Data tools like Hadoop, Hive, Spark, etc.,
Familiarity with ML nomenculature like Classification, Clustering, Nearest Neighbour, etc.,
Devjyoti is working with Qubole as Data Engineer and helps the company gain more insights into the performance of its data processing tools.