Running Spark on Kubernetes
Submitted by Abhishek A Amralkar (@aamralkar) on Wednesday, 10 July 2019
Section: Full talk (40 mins) Category: Distributed systems Status: Rejected
Apache Spark is an essential tool for data scientists,
offering a robust platform for a variety of applications ranging from large-scale data transformation to
analytics to machine learning.
Each time a data scientist brings a new application or model, it uses a different set of libraries and dependencies.
We run a standalone, self-managed Spark cluster, so distributing these
dependencies across the cluster every time is becoming difficult.
Running multiple jobs in parallel is also becoming tricky because of these dependencies.
Data scientists and ML engineers are now adopting container-based applications to improve their workflow,
packaging dependencies and creating reproducible artefacts.
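As a sketch of this packaging step, a minimal Dockerfile could bundle a PySpark job together with its Python dependencies into one reproducible image. The base image tag, requirements.txt and app.py names here are illustrative assumptions, not part of the talk:

```dockerfile
# Hypothetical example: bake a PySpark job and its dependencies into
# one image, so nothing has to be distributed to the cluster at run time.
# Base image name/tag and file names are assumptions for illustration.
FROM apache/spark-py:v3.4.0

# Install the job's Python dependencies inside the image.
COPY requirements.txt /opt/app/requirements.txt
RUN pip install --no-cache-dir -r /opt/app/requirements.txt

# Copy the application code itself.
COPY app.py /opt/app/app.py
```

Building this once gives every run of the job the exact same libraries, which is what makes the artefact reproducible.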
We are living in the container deployment era.
With containers, it becomes easy to bundle your application along with all its dependencies and run it on any cloud
or on-premise. Containers are ephemeral, which means they can get killed at any time;
when you run your application in containers, you need to make sure there is no downtime
and that a replacement container starts on its own.
This is where a tool like Kubernetes comes in and plays an important role in managing containers with zero downtime.
Kubernetes can take care of scaling requirements, failover, deployment patterns, and more.
Kubernetes is one of the fastest growing and most widely adopted technologies in the DevOps ecosystem.
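To make the failover and scaling point concrete, a Kubernetes Deployment declares a desired number of replicas and the cluster keeps that many containers running, restarting any that die. This is a minimal hedged sketch; the names, labels, image and replica count are assumptions for illustration:

```yaml
# Hypothetical example: Kubernetes maintains 3 replicas of this pod
# and automatically replaces any that are killed.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app            # illustrative name
spec:
  replicas: 3               # desired number of running pods
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      containers:
      - name: demo-app
        image: demo-app:1.0   # illustrative image
```

The declarative part is the key design choice: you state the desired end state, and Kubernetes continuously reconciles reality toward it.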
What is Kubernetes?
Kubernetes is a portable, extensible, open-source platform for managing containerized workloads and
services that facilitates both declarative configuration and automation. It has a large,
rapidly growing ecosystem, and Kubernetes services, support, and tools are widely available.
We will go through why organizations should consider running Spark jobs on Kubernetes rather than on Spark's built-in resource manager:
Better resource utilization
Overcoming the standalone scheduler limitations
We will walk through a demo of running a basic Spark job and how to monitor it.
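As a hedged sketch of what such a submission could look like, Spark can be pointed at a Kubernetes API server so the driver and executors run as pods. The API server address, image name and jar path below are placeholders, not details from the talk:

```shell
# Hypothetical sketch: submit Spark's bundled SparkPi example to a
# Kubernetes cluster. <k8s-apiserver-host> and <your-spark-image>
# are placeholders you would replace with real values.
spark-submit \
  --master k8s://https://<k8s-apiserver-host>:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=<your-spark-image> \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.4.0.jar
```

Because executors are ordinary pods, their resource usage and logs can then be monitored with standard Kubernetes tooling such as kubectl.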
Big Data Architect
Abhishek leads the Cloud Infrastructure / DevSecOps team at Talentica Software, where he designs the next generation of cloud infrastructure in a cost-effective and reliable manner without compromising on infrastructure and application security. He has experience working across various technology domains, including data center security, cloud operations, cloud automation, writing tools around infrastructure, and cloud security.
His current focus is on Security Operations and Clojure.