The Fifth Elephant 2020 edition
On data governance, engineering for data privacy and data science
Sayan Biswas
For effective OOH (Out of Home) advertisement targeting, advertisers are interested in aggregate-level statistics about the places people visit (restaurants, shopping malls, theatres, etc.) in any location. In this session, we’ll talk about the challenges and ways of deriving these statistics from anonymized daily commute data. We’ll cover the data preparation and augmentation phases, the high-level algorithms, and the optimization strategies for geohash-based analysis. We’ll also talk about some of the implementation nuances involving Spark and Airflow, so familiarity with these technology stacks is going to be helpful. The key takeaway from this session is how to design a geospatial algorithm that overcomes the challenges of noisy data and runs at scale in a cost-effective way.
Introduction [5 mins]
Introduction to the problem at hand. Explain why we’re trying to find the places people visit, given their anonymized geolocation data, using geohash-based analysis. Explain how OOH advertising depends on solving this problem.
Algorithmic Challenges [7-8 min]
This section explains the algorithm at a high level, along with its challenges, which mostly lie in finding a good geohash representation of inaccurate GPS data. Fortunately, a good approximation can be achieved if we let go of the restriction that a data point has to be exactly at the centre of its geohash cell. Clustering the observation data, i.e. the anonymized users’ commute data, on the time axis, followed by a join with the POI (Places of Interest) dataset, gives us potential visits to POIs. POIs are simply places, e.g. McDonald’s, Sainsbury’s, etc., which are of interest to the OOH client. These potential visits are then ranked, and each one gets a score that can be incrementally updated as more data arrives periodically.
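To make the pipeline concrete, here is a minimal PySpark sketch of the idea, not the production code from the talk: encode noisy GPS pings into geohash cells, group pings per user into time windows to form candidate stays, and join those stays with a POI table keyed at the same geohash precision. The library (pygeohash), paths, column names, precision, and thresholds are all illustrative assumptions.

```python
import pygeohash as pgh
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("poi-visits-sketch").getOrCreate()

GEOHASH_PRECISION = 7  # assumed precision; roughly 150m cells, tuned for GPS accuracy

geohash_udf = F.udf(
    lambda lat, lon: pgh.encode(lat, lon, precision=GEOHASH_PRECISION),
    StringType(),
)

# Observations: anonymized commute pings (user_id, ts, lat, lon); path is hypothetical.
obs = spark.read.parquet("s3://bucket/observations/")
obs = obs.withColumn("geohash", geohash_udf("lat", "lon"))

# Crude time clustering: bucket pings per user and geohash cell into 10-minute
# windows; several pings in the same cell and window form a candidate stay.
stays = (
    obs.groupBy("user_id", "geohash", F.window("ts", "10 minutes"))
    .agg(F.count("*").alias("ping_count"))
    .filter(F.col("ping_count") >= 3)
)

# POIs pre-encoded at the same geohash precision: (poi_id, name, geohash).
pois = spark.read.parquet("s3://bucket/pois/")
candidate_visits = stays.join(pois, on="geohash", how="inner")
```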
Data Specific Challenges [3-4 min]
There are some challenges regarding the data as well. In most cases, the observation dataset is pretty noisy, because it is GPS data. This naturally leads to more false positives in terms of visits to a place. But incremental scoring helps reduce the false positives to some extent over time.
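As an illustration of how incremental scoring can damp one-off noisy matches, here is a hedged sketch of a score update, not the talk’s actual formula: each run adds today’s evidence to a decayed running score per (user, POI), so repeated visits accumulate while spurious single matches fade below a reporting threshold. The decay factor and column names are assumptions.

```python
from pyspark.sql import DataFrame, functions as F

DECAY = 0.8  # assumed weight on the previous score; tuned offline in practice

def update_scores(prev_scores: DataFrame, todays_visits: DataFrame) -> DataFrame:
    """prev_scores: (user_id, poi_id, score); todays_visits: (user_id, poi_id, ...)."""
    fresh = todays_visits.groupBy("user_id", "poi_id").agg(
        F.count("*").alias("evidence")
    )
    return (
        prev_scores.join(fresh, ["user_id", "poi_id"], "full_outer")
        .fillna({"score": 0.0, "evidence": 0})
        # Repeated visits keep adding evidence; a one-off noisy match decays away.
        .withColumn("score", DECAY * F.col("score") + F.col("evidence"))
        .select("user_id", "poi_id", "score")
    )
```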
Output Validation Challenges [3-4 min]
Because of the unsupervised nature of the algorithm, validating the results is challenging. We had an older version of the algorithm, which acted as a baseline model for us. We then computed various aggregate metrics for both algorithms and used them to tune our new model.
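In the absence of ground-truth labels, this kind of comparison happens at the distribution level. A rough sketch of such a check, with assumed column names and metrics (the talk’s actual metrics may differ):

```python
from pyspark.sql import DataFrame, functions as F

def aggregate_metrics(visits: DataFrame, suffix: str) -> DataFrame:
    # Distribution-level summaries per POI category for one model's output.
    return visits.groupBy("poi_category").agg(
        F.countDistinct("user_id").alias(f"unique_visitors_{suffix}"),
        F.count("*").alias(f"total_visits_{suffix}"),
    )

# old_model_visits / new_model_visits: DataFrames of scored visits from each model.
comparison = aggregate_metrics(old_model_visits, "old").join(
    aggregate_metrics(new_model_visits, "new"), on="poi_category", how="outer"
)
comparison.show()
```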
Spark Job Specific Challenges [3-4 min]
Commentary on the incremental nature of the job, which is a double-edged sword. Because of its incremental nature, each run takes considerably less time. But to run the algorithm for the i-th iteration, we need the output of the (i-1)-th iteration to be present. For this reason, backfilling the job for a specific time period becomes costly. Luckily, the cost and time savings per run are worth that initial backfilling cost, both in time and money.
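A minimal sketch of why a backfill must be replayed day by day: each run reads the previous day’s score output as a hard input. The paths, date partitioning scheme, and the update_scores helper (from the scoring sketch above) are assumptions.

```python
from datetime import date, timedelta
from pyspark.sql import SparkSession

def run_iteration(spark: SparkSession, run_date: date) -> None:
    prev_day = run_date - timedelta(days=1)
    # The previous iteration's output is a hard input of this one, so a
    # backfill over N days is N sequential runs, not one parallel run.
    prev_scores = spark.read.parquet(f"s3://bucket/scores/{prev_day:%Y-%m-%d}/")
    todays_visits = spark.read.parquet(f"s3://bucket/visits/{run_date:%Y-%m-%d}/")
    updated = update_scores(prev_scores, todays_visits)  # hypothetical helper, as sketched earlier
    updated.write.mode("overwrite").parquet(f"s3://bucket/scores/{run_date:%Y-%m-%d}/")
```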
Implementations and Cost Savings [3-4 min]
We’ll give an overview of the implementation of our algorithm in PySpark. We’ll see how to divide the task at hand into multiple logical units, which are Spark jobs in our case, and orchestrate them using Apache Airflow in production. We’ll also look at the cost savings relative to the old model we had in production.
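To show what such an orchestration could look like, here is a hedged Airflow sketch, not the production DAG from the talk: three Spark jobs chained into a daily DAG, with depends_on_past capturing the iteration-to-iteration dependency. The DAG id, operator import (Airflow 2 Spark provider), and script paths are assumptions.

```python
from datetime import datetime
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="poi_visit_scoring",              # hypothetical name
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=True,                            # backfill runs day by day
    default_args={"depends_on_past": True},  # run i waits for run i-1
) as dag:
    prepare = SparkSubmitOperator(task_id="prepare_data", application="jobs/prepare_data.py")
    detect = SparkSubmitOperator(task_id="detect_visits", application="jobs/detect_visits.py")
    score = SparkSubmitOperator(task_id="update_scores", application="jobs/update_scores.py")

    prepare >> detect >> score
```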
Conclusion [2 min]
Despite the challenges, the geohash-based algorithm consistently runs much faster without reducing accuracy. This gives us a shorter turnaround time and faster feedback from our clients on their advertisement campaigns.
I am an engineer who is primarily curious about Data Science, Data Platforms, and (Post-Quantum) Cryptography. I got my Master’s degree from IISc Bangalore in CSA. Since then, for almost 5 years, I’ve been focusing on building products around Data Platforms and Data Analytics. I’ve always found it very rewarding to solve challenging business problems within the constraints of budget and time. Throughout the last year, my involvement in Sahaj has revolved around setting up the foundations of a data pipeline, along with implementing Data Science algorithms using Spark jobs that are efficient and, as a consequence, cheaper to run on cloud infrastructure.