Running Spark on Kubernetes

Sat, 21 Sep 2019, 08:55 AM – 06:20 PM IST

St. Laurn Hotel, Pune

Submitted Jul 10, 2019

Section: Full talk (40 mins) Category: Distributed systems

Apache Spark is an essential tool for data scientists,
offering a robust platform for a variety of applications ranging from large-scale data transformation to
analytics to machine learning.

Each time data scientists bring in an application or model, it uses a different set of libraries and dependencies.
We use a standalone, self-managed Spark cluster, so it is becoming difficult to
distribute dependencies across the cluster every time.
Running multiple jobs in parallel is also becoming tricky because of these dependencies.

Data scientists and ML engineers are now adopting container-based applications to improve their workflow,
packaging dependencies and creating reproducible artefacts.

We are living in the container deployment era

With containers it is becoming very easy to bundle your application along with all its dependencies and run it on any cloud
or on-premises. Containers are ephemeral, which means they can be killed at any time,
so when you run your application in containers you need to make sure there is no downtime
and that a replacement container starts on its own.

That is where a tool like Kubernetes comes in and plays an important role in managing containers with zero downtime.
Kubernetes can take care of scaling requirements, failover, deployment patterns, and more.

Kubernetes is one of the fastest-growing and most adaptable technologies in the DevOps
universe.

What is Kubernetes?

Kubernetes is a portable, extensible, open-source platform for managing containerized workloads and
services, that facilitates both declarative configuration and automation. It has a large,
rapidly growing ecosystem. Kubernetes services, support, and tools are widely available.

Outline

We will go through why organizations should consider running Spark jobs on Kubernetes rather than on Spark's built-in resource manager.

Containerization
Better resource utilization
Overcoming the standalone scheduler limitations
We will walk through a demo of running a basic Spark job and how to monitor it.
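
As a rough illustration of what submitting a basic Spark job to Kubernetes can look like, here is a minimal sketch that assembles a `spark-submit` command line in cluster mode. It is not the demo from the talk. The helper name `build_spark_submit_args`, the API server address, the image name, and the jar path are all hypothetical placeholders; the `k8s://` master URL, `--deploy-mode cluster`, and the `spark.executor.instances` and `spark.kubernetes.container.image` settings are the options Spark documents for its Kubernetes scheduler.

```python
from typing import Dict, List, Optional


def build_spark_submit_args(
    api_server: str,
    image: str,
    app_resource: str,
    main_class: Optional[str] = None,
    app_name: str = "spark-pi",
    executors: int = 2,
    extra_conf: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Assemble a spark-submit argument list for a Kubernetes cluster.

    Hypothetical helper for illustration only: it builds the command
    but does not run it.
    """
    conf = {
        # Number of executor pods to request.
        "spark.executor.instances": str(executors),
        # Container image that bundles Spark plus the job's dependencies.
        "spark.kubernetes.container.image": image,
    }
    conf.update(extra_conf or {})

    args = [
        "spark-submit",
        # The k8s:// prefix tells Spark to use the Kubernetes scheduler.
        "--master", f"k8s://{api_server}",
        # In cluster mode the driver also runs in a pod.
        "--deploy-mode", "cluster",
        "--name", app_name,
    ]
    if main_class:
        args += ["--class", main_class]
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    # A local:// URI points at a file already present inside the image.
    args.append(app_resource)
    return args


if __name__ == "__main__":
    # All values below are placeholders, not real endpoints or images.
    command = build_spark_submit_args(
        api_server="https://k8s.example.com:6443",
        image="example/spark:latest",
        app_resource="local:///opt/spark/examples/jars/spark-examples.jar",
        main_class="org.apache.spark.examples.SparkPi",
    )
    print(" ".join(command))
```

Because each job names its own container image, two jobs with conflicting library versions can run side by side, which is the dependency problem described above. The resulting list can be handed to `subprocess.run` on a machine that has a Spark distribution and cluster credentials.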

Requirements

NIL

Speaker bio

Sandesh Deshmane
Big Data Architect
Talentica Software
https://www.linkedin.com/in/sandesh-deshmane-79997718/

AND

Abhishek leads the Cloud Infrastructure / DevSecOps team at Talentica Software, where he designs the next generation of cloud infrastructure in a cost-effective and reliable manner without compromising on infrastructure and application security. He has experience working across various technology domains, including data center security, cloud operations, cloud automation, infrastructure tooling, and cloud security.
His current focus is on Security Operations and Clojure.

Slides

https://docs.google.com/presentation/d/1u5EQ4Z6Z9CTdxuRD1JMP2OJW15B5T_E1cJp9APxsYXs/edit?usp=sharing

Rootconf Pune edition