Generating Data Analytics Reports Using a Scalable Config-Driven Framework
Submitted by Satish Gopalani (@satishg) on Tuesday, 4 September 2018
Technical level: Intermediate
Generating a large number of analytics reports across hundreds of dimensions and metrics, for customers and internal stakeholders, is a critical responsibility of the Big Data Analytics team at PubMatic.
Writing a custom job for each analytics report leads to repetitive effort and duplicates business logic across many jobs.
Another challenge is scaling the platform, which already processes 500 billion transactions (50 terabytes of data) per day on a 900-node cluster, with volume still growing.
Therefore, we built a platform for creating configuration-driven data processing pipelines from highly reusable business functions. It is extensible, so it can adopt new technologies as the big data ecosystem changes. The platform enables our development teams to build robust batch data processing pipelines that power analytics dashboards. It also lets novice users supply a configuration specifying facts and dimensions and generate ad-hoc reports in a single data processing job. The framework identifies and reuses existing business functions based on user input. It also provides an abstraction layer that keeps core business logic unaffected by technology changes. The framework is currently powered by Spark, but it can be configured to run on other technologies.
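To make the idea concrete, here is a minimal sketch of a configuration-driven report, written in plain Python in place of Spark so it runs anywhere. It is not the PubMatic framework: the config keys (`fact`, `dimensions`, `metrics`), the `business_function` registry, and the sample `ad_events` data are all hypothetical names chosen for illustration.

```python
from collections import defaultdict

# Hypothetical registry of reusable business functions, keyed by metric name.
BUSINESS_FUNCTIONS = {}


def business_function(name):
    """Register a function so any report config can refer to it by name."""
    def register(fn):
        BUSINESS_FUNCTIONS[name] = fn
        return fn
    return register


@business_function("impressions")
def impressions(rows):
    # One row per ad impression in this toy data set.
    return len(rows)


@business_function("revenue")
def revenue(rows):
    return sum(row["price"] for row in rows)


def run_report(config, facts):
    """Group the configured fact by its dimensions and apply each metric."""
    unknown = [m for m in config["metrics"] if m not in BUSINESS_FUNCTIONS]
    if unknown:
        raise KeyError(f"no business function registered for: {unknown}")

    groups = defaultdict(list)
    for row in facts[config["fact"]]:
        key = tuple(row[d] for d in config["dimensions"])
        groups[key].append(row)

    return {
        key: {m: BUSINESS_FUNCTIONS[m](rows) for m in config["metrics"]}
        for key, rows in groups.items()
    }


# Hypothetical fact table and a user-supplied report configuration.
facts = {
    "ad_events": [
        {"publisher": "p1", "country": "IN", "price": 2},
        {"publisher": "p1", "country": "IN", "price": 3},
        {"publisher": "p2", "country": "US", "price": 5},
    ]
}
config = {
    "fact": "ad_events",
    "dimensions": ["publisher", "country"],
    "metrics": ["impressions", "revenue"],
}

report = run_report(config, facts)
for key, values in sorted(report.items()):
    print(key, values)
```

In this sketch a new report needs only a new `config`; the metric logic is written once and looked up by name, which is the kind of reuse the abstract describes. Swapping the execution engine would mean replacing `run_report` while leaving the registered functions' names and the config format unchanged.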
- Overview of Data Pipelines @ PubMatic
- Scale and its issues
- Data Framework Details
- Uses of the framework and future use cases
A Machine Learning/AI and Distributed Systems engineer who enjoys solving complex problems and designing applications and systems that work at scale. I have worked on a variety of complex engineering projects, including building a predictive ML project for online advertising, deriving interesting insights on the IPL (Indian Premier League), building connectors to offload data to Hadoop, and even modifying the Hadoop HDFS source code to make the NameNode more scalable. I have a B.Tech in Computer Science from VIT, Pune, and a specialization in “Big Data Analytics” from IIM Bangalore.
A Big Data Engineer with ample experience working at scale with Spark, MapReduce and HDFS. Handled more than 60TB of data streaming every day in a 900-node cluster with 45PB under management. Deeply interested in designing and implementing complex, scalable data processing pipelines. Interests range from big data, analytics and software engineering to food blogging.