A Scalable Alternative to PostGIS in a Distributed Environment

The ninth edition of The Fifth Elephant will be held in Bangalore on 16 and 17 July 2020.

The Fifth Elephant brings together over one thousand data scientists, ML engineers, data engineers and analysts to discuss:

Data governance.
Data privacy and engineering for privacy, including engineering for the Personal Data Protection (PDP) bill.
Data cleaning, annotation, instrumentation and productionizing data science.
Identifying and handling fraud, and data security at scale.
Feature engineering and ML platforms.
What it takes to create data-driven cultures in organizations of different scales.

Event details:

Dates: 16-17 July 2020
Venue: NIMHANS Convention Centre, Dairy Circle, Bangalore

Why you should attend:

Network with peers and practitioners from the data ecosystem.
Share approaches to solving expensive problems such as cleanliness of training data, annotation, model management and versioning data.
Demo your ideas in the demo sessions.
Join Birds of a Feather (BOF) sessions to have productive discussions on focussed topics, or start your own BOF session.

Contact details:
For more information about The Fifth Elephant, call +91-7676332020 or email sales@hasgeek.com

Hosted by

The Fifth Elephant

The Fifth Elephant, known as one of the best data science and machine learning conferences in Asia, has transitioned into a year-round forum for conversations about data and ML engineering, data science in production, and data security and privacy practices.


A Scalable Alternative to PostGIS in a Distributed Environment

Submitted May 29, 2020

An OSM (OpenStreetMap) DB is needed to make spatial sense of the raw location data flowing in. It can answer questions ranging from the simple, such as “which state does a location lie in?”, to the more involved, such as “which city has the densest road network?”
At Zendrive, our risk modelling uses the fraction of each trip that was driven on a highway. For reference, OSM provides 31M road entries for the US. To compute this fraction, a trip is broken down into segments using representative points, and each segment is marked as highway or non-highway. An hour-long trip can have about 200 such points to query, so this is not a fast query to execute. We use a Spark job to process approximately 300K trips daily, and the number of containers is bounded by the number of concurrent connections the RDBMS allows, which is around 300. This becomes a bottleneck, and the batch job can take a few hours to run. As incoming traffic increases, this time grows linearly.
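A rough back-of-the-envelope estimate of the query load implied by these figures can be sketched as follows. It assumes every trip needs about 200 point lookups, whereas the text gives that figure only for an hour-long trip, so treat the result as an upper-bound illustration rather than a measured number. The function names are hypothetical.

```python
# Figures quoted in the text.
TRIPS_PER_DAY = 300_000
MAX_DB_CONNECTIONS = 300

# Assumption: every trip needs about 200 point lookups. The text gives
# this figure for an hour-long trip, so shorter trips would need fewer.
POINTS_PER_TRIP = 200


def daily_point_queries(trips, points_per_trip):
    """Total point lookups the database must answer in one day."""
    return trips * points_per_trip


def queries_per_connection(total_queries, connections):
    """Average daily lookups served by each database connection."""
    return total_queries / connections


total = daily_point_queries(TRIPS_PER_DAY, POINTS_PER_TRIP)
per_connection = queries_per_connection(total, MAX_DB_CONNECTIONS)

print(f"{total:,} point queries per day")
print(f"{per_connection:,.0f} queries per connection per day")
```

Under these assumptions each connection serves roughly 200,000 lookups a day, and adding containers beyond the connection limit does not reduce that number.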
We will talk about a method that uses Shapely for running geo queries and Rtree for indexing. With it, we have removed the dependency on OSM DB while limiting the amount of data that needs to be loaded into each Spark executor, using our spatial partitioning technique. We will see how this approach scales horizontally almost without limit in practice, and how we tuned it to use thousands of containers for our use case. We will also talk about other applications where this spatial partitioning can be used, and how similar principles apply in cases where external data has to be loaded.
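To make the idea of spatial partitioning concrete, here is a minimal sketch of grouping points by map tile, so that each worker would only need the road data for the tiles it owns. The proposal does not say which tiling scheme or zoom level is used. This sketch assumes the standard Web Mercator "slippy map" tile formula and a zoom level of 12, and the function names `lonlat_to_tile` and `partition_by_tile` are hypothetical.

```python
import math
from collections import defaultdict


def lonlat_to_tile(lon, lat, zoom):
    """Return the (zoom, x, y) slippy-map tile containing a point.

    Assumption: Web Mercator tiling; the talk may use another scheme.
    """
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so points on the map edge stay inside the tile grid.
    x = min(max(x, 0), n - 1)
    y = min(max(y, 0), n - 1)
    return zoom, x, y


def partition_by_tile(points, zoom=12):
    """Group (lon, lat) points by tile key.

    In a Spark job the tile key could serve as the partition key, so
    that one executor loads only the roads overlapping its tiles.
    """
    buckets = defaultdict(list)
    for lon, lat in points:
        buckets[lonlat_to_tile(lon, lat, zoom)].append((lon, lat))
    return dict(buckets)


points = [
    (-122.4194, 37.7749),  # San Francisco
    (-122.4190, 37.7750),  # a few metres away, same tile
    (-74.0060, 40.7128),   # New York, a different tile
]
for tile, members in partition_by_tile(points).items():
    print(tile, len(members))
```

A real pipeline would also have to handle segments that cross tile boundaries, for example by assigning roads to every tile they overlap. That detail is not covered by this sketch.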

Outline

Introduction:
About OSM DB
How OSM is serving different use cases at Zendrive

Problem statement:
Zendrive pipeline scale and OSM DB bottlenecks

The solution:
Intro to Shapely and Rtree
How they can be used in a simple reverse geocoding problem
Spatial partitioning using map tiles
Calculating the highway ratio using spatial partitioning and Shapely

Limitations of this method
Further applications and use cases of spatial partitioning
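As an illustration of the highway ratio mentioned in the outline, the sketch below computes it once each trip segment has already been marked as highway or non-highway. The proposal does not say whether the fraction is weighted by distance, time, or segment count. Weighting by segment length is an assumption here, and the `Segment` type and `highway_ratio` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    length_m: float   # segment length in metres (assumed weighting)
    is_highway: bool  # result of the spatial lookup for this segment


def highway_ratio(segments):
    """Fraction of total trip length that lies on highway segments."""
    total = sum(s.length_m for s in segments)
    if total == 0:
        # An empty or zero-length trip has no highway share.
        return 0.0
    on_highway = sum(s.length_m for s in segments if s.is_highway)
    return on_highway / total


trip = [Segment(1000.0, True), Segment(3000.0, False)]
print(highway_ratio(trip))
```

In the approach described in the talk, the `is_highway` flag would come from a Shapely and Rtree lookup against the roads loaded for the segment's tile rather than from a database query.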

Speaker bio

After completing my BTech at IIT Kanpur in 2012 and spending a couple of years working on Photoshop Express at Adobe, I have been at Zendrive for almost 6 years now. I have worked across all the teams and am currently leading the customer products team. One important aspect of my job is to make sure the distributed jobs scale properly and to have the foresight to fix things before they start to break. This talk is about one such fundamental change we have made in our systems.
