The Fifth Elephant 2020 edition
On data governance, engineering for data privacy and data science
Vishal Verma
An OSM (OpenStreetMap) DB is needed to make spatial sense of the raw location data flowing in. It can answer questions ranging from something as simple as “which state does a location lie in” to something like “which city has the densest road network”.
At Zendrive, for our risk modelling, we calculate the fraction of each trip that was taken on a highway. For reference, OSM provides 31M road entries for the US. To compute this fraction, a trip is broken down into segments using representative points, and each segment is marked as highway or non-highway. Considering that an hour-long trip can have about 200 such points to query, this is not the fastest query to execute. We use a Spark job to process approximately 300K trips daily, and the number of containers is bound by the number of concurrent connections the RDBMS allows, which is around 300. This becomes a bottleneck, and the batch job can take a few hours to process. As incoming traffic increases, this time will keep going up linearly.
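To make the numbers above concrete, here is a minimal sketch of the highway-fraction calculation and the resulting lookup volume. The `Segment` type, its field names, and the choice to weight segments by length are assumptions made for illustration; the volume estimate also assumes every trip has the roughly 200 points quoted for an hour-long trip, so it is a rough upper figure rather than a measured one.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    length_m: float   # segment length in metres (hypothetical field)
    is_highway: bool  # outcome of the road lookup for this segment


def highway_fraction(segments):
    """Share of the trip's length that lies on highway segments."""
    total = sum(s.length_m for s in segments)
    if total == 0:
        return 0.0  # empty or zero-length trip
    on_highway = sum(s.length_m for s in segments if s.is_highway)
    return on_highway / total


# A toy trip: 400 m of city street, 1200 m of highway, 400 m of city street.
trip = [Segment(400.0, False), Segment(1200.0, True), Segment(400.0, False)]
fraction = highway_fraction(trip)

# Back-of-the-envelope lookup volume from the figures quoted in the text.
TRIPS_PER_DAY = 300_000
POINTS_PER_TRIP = 200    # quoted for an hour-long trip; assumed for all trips
MAX_CONNECTIONS = 300    # concurrent connections the RDBMS allows

lookups_per_day = TRIPS_PER_DAY * POINTS_PER_TRIP
lookups_per_connection = lookups_per_day // MAX_CONNECTIONS
```

Under these assumptions the job issues on the order of tens of millions of point lookups a day, all funnelled through a few hundred database connections, which is why the connection limit rather than the cluster size sets the run time.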
Solution (shapely + RTree): We will talk about how, by using shapely for running geo queries and RTree for indexing, we removed the dependency on the OSM DB while limiting the amount of data that needs to be loaded into each Spark executor through our spatial partitioning technique. We will see how this approach is, in practice, almost infinitely horizontally scalable, and how we tuned it to allow us to use 1000s of containers for our use case. We will also talk about other applications where this spatial partitioning can be used and how similar principles can be applied in cases where external data has to be loaded.
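The abstract does not spell out the partitioning scheme, so the following is only a sketch of one simple possibility: a fixed latitude/longitude grid. Roads and trip points are keyed by grid cell, so a worker handling one cell only needs the roads that touch that cell. The cell size, the function names, and the sample data are all hypothetical, and a plain bounding-box scan stands in for the RTree index to keep the example dependency-free.

```python
import math
from collections import defaultdict

CELL_DEG = 0.5  # hypothetical grid cell size in degrees


def cell_key(lon, lat, cell_deg=CELL_DEG):
    """Grid cell containing a point."""
    return (math.floor(lon / cell_deg), math.floor(lat / cell_deg))


def cells_for_bbox(min_lon, min_lat, max_lon, max_lat, cell_deg=CELL_DEG):
    """All grid cells touched by a bounding box."""
    x0, y0 = cell_key(min_lon, min_lat, cell_deg)
    x1, y1 = cell_key(max_lon, max_lat, cell_deg)
    return [(x, y) for x in range(x0, x1 + 1) for y in range(y0, y1 + 1)]


def partition_roads(roads):
    """Assign each road to every cell its bounding box overlaps."""
    by_cell = defaultdict(list)
    for road_id, bbox, is_highway in roads:
        for key in cells_for_bbox(*bbox):
            by_cell[key].append((road_id, bbox, is_highway))
    return by_cell


def candidates(point, roads_in_cell):
    """Roads in this cell whose bounding box contains the point.

    Stand-in for an RTree lookup; an exact geometry check would follow.
    """
    lon, lat = point
    return [
        road_id
        for road_id, (min_lon, min_lat, max_lon, max_lat), _ in roads_in_cell
        if min_lon <= lon <= max_lon and min_lat <= lat <= max_lat
    ]


# Hypothetical sample data: (id, (min_lon, min_lat, max_lon, max_lat), is_highway)
roads = [
    ("hwy_1", (-122.45, 37.70, -121.90, 37.95), True),
    ("street_2", (-122.42, 37.77, -122.40, 37.78), False),
]
road_cells = partition_roads(roads)

p1 = (-122.41, 37.775)
p2 = (-122.0, 37.9)
near_p1 = candidates(p1, road_cells[cell_key(*p1)])
near_p2 = candidates(p2, road_cells[cell_key(*p2)])
```

In a setup like the one described, the bounding-box scan would presumably be replaced by an RTree index built per cell, with shapely doing the exact point-to-road geometry test on the few candidates the index returns. Because each cell carries only its own roads, adding more executors does not add load on any shared database.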
Introduction:
After completing my BTech at IIT Kanpur in 2012 and spending a couple of years working on Photoshop Express at Adobe, I have been at Zendrive for almost 6 years now. I have worked across all the teams and am currently leading the customer products team. One important aspect of my job is making sure the distributed jobs scale properly and having the foresight to fix things before they start to break. This talk is about one such fundamental change we have made in our systems.