The Fifth Elephant 2020 edition

On data governance, engineering for data privacy and data science

The ninth edition of The Fifth Elephant will be held in Bangalore on 16 and 17 July 2020.

The Fifth Elephant brings together over one thousand data scientists, ML engineers, data engineers and analysts to discuss:

  1. Data governance
  2. Data privacy and engineering for privacy, including engineering for the Personal Data Protection (PDP) bill.
  3. Data cleaning, annotation, instrumentation and productionizing data science.
  4. Identifying and handling fraud + data security at scale
  5. Feature engineering and ML platforms.
  6. What it takes to create data-driven cultures in organizations of different scales.

Event details:

Dates: 16-17 July 2020
Venue: NIMHANS Convention Centre, Dairy Circle, Bangalore

Why you should attend:

  1. Network with peers and practitioners from the data ecosystem.
  2. Share approaches to solving expensive problems such as cleanliness of training data, annotation, model management and versioning data.
  3. Demo your ideas in the demo sessions.
  4. Join Birds of a Feather (BOF) sessions to have productive discussions on focussed topics, or start your own BOF session.

Contact details:
For more information about The Fifth Elephant, call +91-7676332020 or email sales@hasgeek.com


Hosted by

The Fifth Elephant - known as one of the best data science and Machine Learning conferences in Asia - has transitioned into a year-round forum for conversations about data and ML engineering; data science in production; data security and privacy practices.

Vishal Verma

@vishalzendrive

A Scalable Alternative to PostGIS in a Distributed Environment

Submitted May 29, 2020

The OSM (OpenStreetMap) DB is needed to make spatial sense of the raw location data flowing in. It can answer questions ranging from something as simple as “which state does a location lie in” to something like “which city has the densest road network”.
At Zendrive, our risk modelling needs the fraction of each trip that was driven on a highway. For reference, OSM provides 31M road entries for the US. To compute this fraction, a trip is broken down into segments using representative points, and each segment is marked as highway or non-highway. An hour-long trip can have about 200 such points to query, so this is not a fast query to execute. We use a Spark job to process approximately 300K trips daily, and the number of containers is bound by the number of concurrent connections the RDBMS allows, which is around 300. This becomes a bottleneck, and the batch job can take a few hours to process. As incoming traffic increases, this processing time will grow linearly.
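To make the highway-fraction computation concrete, here is a minimal sketch, not Zendrive's implementation. It assumes each segment between consecutive representative points already carries a highway flag, and that the fraction is weighted by great-circle distance. The names `haversine_m` and `highway_fraction` are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, an approximation


def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def highway_fraction(points, segment_on_highway):
    """Distance-weighted share of a trip driven on highways.

    points: list of (lat, lon) representative points, in trip order.
    segment_on_highway: one boolean per segment points[i] -> points[i + 1].
    """
    if len(points) < 2:
        return 0.0  # no segments, so nothing was driven on a highway
    if len(segment_on_highway) != len(points) - 1:
        raise ValueError("expected one flag per segment")

    total = 0.0
    on_highway = 0.0
    for start, end, flag in zip(points, points[1:], segment_on_highway):
        length = haversine_m(start, end)
        total += length
        if flag:
            on_highway += length
    return on_highway / total if total > 0 else 0.0
```

The expensive part in practice is producing the per-segment flags, since each one requires a spatial lookup against the road data; the arithmetic above is cheap by comparison.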
We will talk about a method that uses Shapely for running geo queries and Rtree for indexing. With it, we have removed the dependency on the OSM DB while limiting the amount of data that needs to be loaded into each Spark executor, using our spatial partitioning technique. We will see how this approach scales horizontally with practically no upper bound, and how we tuned it to allow us to use 1000s of containers for our use case. We will also talk about other applications where this spatial partitioning can be used, and how similar principles can be applied to cases where external data has to be loaded.
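One way to picture spatial partitioning with map tiles is sketched below. It assumes the standard Web Mercator ("slippy map") tile scheme and an arbitrary zoom level of 12; the proposal does not say which tiling or zoom level Zendrive uses, and `tile_for` and `partition_by_tile` are hypothetical names.

```python
import math
from collections import defaultdict


def tile_for(lat, lon, zoom):
    """Return the (zoom, x, y) Web Mercator tile containing a lat/lon point."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so points on the edge of the projection still map to a valid tile.
    x = min(max(x, 0), n - 1)
    y = min(max(y, 0), n - 1)
    return (zoom, x, y)


def partition_by_tile(points, zoom=12):
    """Group (trip_id, lat, lon) records by the tile they fall in."""
    buckets = defaultdict(list)
    for trip_id, lat, lon in points:
        buckets[tile_for(lat, lon, zoom)].append((trip_id, lat, lon))
    return dict(buckets)
```

With a key like this, each worker would only need the road data for the tiles assigned to it, rather than a connection to a shared database. The zoom level trades off the number of partitions against the amount of road data per partition.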

Outline

Introduction:

  • About OSM DB
  • How OSM is serving different use cases at Zendrive

Problem statement:

  • Zendrive pipeline scale and OSM DB bottlenecks

The solution:

  • Intro to Shapely and Rtree
  • How they can be used in a simple reverse geocoding problem
  • Spatial partitioning using map tiles
  • Calculating the highway ratio using spatial partitioning and Shapely
  • Limitations of this method
  • Further applications / use cases of spatial partitioning
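The reverse geocoding item above follows a filter-then-refine pattern: a spatial index narrows the candidates by bounding box, then an exact geometric test confirms the match. The sketch below illustrates that pattern with the standard library only. It is a stand-in, not the Shapely or Rtree API: a linear bounding-box scan takes the place of the R-tree, and a ray-casting test takes the place of a Shapely containment check. All names are hypothetical.

```python
def bbox(polygon):
    """Axis-aligned bounding box (min_x, min_y, max_x, max_y) of a polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (min(xs), min(ys), max(xs), max(ys))


def point_in_polygon(x, y, polygon):
    """Ray-casting test: count edge crossings of a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


class RegionIndex:
    """Maps a point to the name of the region that contains it."""

    def __init__(self, regions):
        # regions: dict of name -> list of (lon, lat) polygon vertices
        self._entries = [(bbox(poly), name, poly) for name, poly in regions.items()]

    def locate(self, lon, lat):
        for (min_x, min_y, max_x, max_y), name, poly in self._entries:
            # Cheap bounding-box filter first; an R-tree would do this step
            # without scanning every entry.
            if min_x <= lon <= max_x and min_y <= lat <= max_y:
                if point_in_polygon(lon, lat, poly):  # exact refinement
                    return name
        return None
```

The index only has to be built once per worker for the regions that worker is responsible for, which is where the spatial partitioning from the outline comes in.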

Speaker bio

After completing my BTech at IIT Kanpur in 2012 and spending a couple of years working on the Photoshop Express team at Adobe, I have been at Zendrive for almost 6 years now. I have worked across all the teams and currently lead the customer products team. One important aspect of my job is making sure our distributed jobs scale properly and having the foresight to fix things before they start to break. This talk is about one such fundamental change we have made in our systems.
