Rootconf Pune edition

On security, network engineering and distributed systems



Abhishek A Amralkar


Running Spark on Kubernetes

Submitted Jul 10, 2019

Apache Spark is an essential tool for data scientists,
offering a robust platform for a variety of applications, ranging from large-scale data transformation to
analytics to machine learning.

Each time data scientists bring us an application or model, it uses a different set of libraries and dependencies.
We run a standalone, self-managed Spark cluster, so it is becoming difficult to
distribute those dependencies across the cluster every time.
Running multiple jobs in parallel is also becoming tricky because of these dependencies.

Data scientists and ML engineers are now adopting container-based applications to improve their workflow,
packaging dependencies and creating reproducible artefacts.

We are living in the container deployment era

With containers, it is becoming very easy to bundle your application with all of its dependencies and run it on any cloud
or on-premises. Containers are ephemeral, which means they can be killed at any time,
so when you run your application in containers you need to make sure there is no downtime
and that a replacement container starts on its own.

This is where a tool like Kubernetes comes into play, with an important role in managing containers with zero downtime.
Kubernetes can take care of scaling requirements, failover, deployment patterns, and more.

Kubernetes is one of the fastest-growing and most adaptable technologies in the DevOps space.

What is Kubernetes?

Kubernetes is a portable, extensible, open-source platform for managing containerized workloads and
services, that facilitates both declarative configuration and automation. It has a large,
rapidly growing ecosystem. Kubernetes services, support, and tools are widely available.


We will go through why organizations should consider running Spark jobs on Kubernetes rather than on Spark's built-in resource manager:

  • Containerization

  • Better resource utilization

  • Overcoming the standalone scheduler limitations

We will walk through a demo of running a basic Spark job and show how to monitor it.
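
As a rough illustration of what submitting a basic Spark job to Kubernetes can look like, the sketch below assembles a `spark-submit` command line for cluster mode. It is not the demo from the talk. The helper name `build_spark_submit_args`, the API server address, the container image and the jar path are all hypothetical placeholders; only the `spark-submit` options themselves come from Spark's Kubernetes support.

```python
from typing import List


def build_spark_submit_args(
    api_server: str,
    image: str,
    app_jar: str,
    main_class: str,
    app_name: str = "spark-pi",
    executors: int = 2,
) -> List[str]:
    """Assemble (but do not run) a spark-submit command for Kubernetes cluster mode.

    This helper is hypothetical and exists only for illustration.
    """
    if executors < 1:
        raise ValueError("executors must be at least 1")
    return [
        "spark-submit",
        # The k8s:// prefix tells Spark to use the Kubernetes API server as the master.
        "--master", f"k8s://{api_server}",
        # In cluster mode the driver itself runs in a pod on the cluster.
        "--deploy-mode", "cluster",
        "--name", app_name,
        "--class", main_class,
        "--conf", f"spark.executor.instances={executors}",
        # The image bundles Spark together with the job's own dependencies.
        "--conf", f"spark.kubernetes.container.image={image}",
        app_jar,
    ]


if __name__ == "__main__":
    args = build_spark_submit_args(
        api_server="https://kubernetes.example.com:6443",  # placeholder address
        image="registry.example.com/spark:latest",  # placeholder image
        app_jar="local:///path/to/examples.jar",  # placeholder path inside the image
        main_class="org.apache.spark.examples.SparkPi",
    )
    print(" ".join(args))
```

Because each job names its own container image, two jobs with conflicting library versions can run side by side, which is the dependency problem described earlier. The list returned by the helper could be passed to `subprocess.run` on a machine that has Spark installed and access to the cluster.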



Speaker bio

Sandesh Deshmane
Big Data Architect
Talentica Software


Abhishek leads the Cloud Infrastructure / DevSecOps team at Talentica Software, where he designs the next generation of cloud infrastructure in a cost-effective and reliable manner without compromising on infrastructure and application security. He has experience working across various technology domains, including data center security, cloud operations, cloud automation, and writing tools around infrastructure and cloud security.
His current focus is on security operations and Clojure.



