The Fifth Elephant 2020 edition

On data governance, engineering for data privacy and data science

The ninth edition of The Fifth Elephant will be held in Bangalore on 16 and 17 July 2020.

The Fifth Elephant brings together over one thousand data scientists, ML engineers, data engineers and analysts to discuss:

  1. Data governance.
  2. Data privacy and engineering for privacy, including engineering for the Personal Data Protection (PDP) bill.
  3. Data cleaning, annotation, instrumentation and productionizing data science.
  4. Identifying and handling fraud, and data security at scale.
  5. Feature engineering and ML platforms.
  6. What it takes to create data-driven cultures in organizations of different scales.

Event details:

Dates: 16-17 July 2020
Venue: NIMHANS Convention Centre, Dairy Circle, Bangalore

Why you should attend:

  1. Network with peers and practitioners from the data ecosystem.
  2. Share approaches to solving expensive problems such as cleanliness of training data, annotation, model management and versioning data.
  3. Demo your ideas in the demo sessions.
  4. Join Birds of a Feather (BOF) sessions to have productive discussions on focussed topics. Or, start your own BOF session.

Contact details:
For more information about The Fifth Elephant, call +91-7676332020 or email sales@hasgeek.com


Hosted by

The Fifth Elephant - known as one of the best data science and Machine Learning conferences in Asia - has transitioned into a year-round forum for conversations about data and ML engineering; data science in production; data security and privacy practices.

Sayan Biswas

@sayanbiswas

Challenges of understanding people’s places of visits using unsupervised geospatial techniques

Submitted May 31, 2020

For effective OOH (Out of Home) advertisement targeting, advertisers want aggregate-level statistics about the places people visit (restaurants, shopping malls, theatres, etc.) in any location. In this session, we’ll talk about the challenges of deriving these statistics from people’s anonymized daily commute data, and ways to address them. We’ll cover the data preparation and augmentation phases, the high-level algorithms, and the optimization strategies for geohash-based analysis. We’ll also talk about some of the implementation nuances involving Spark and Airflow, so familiarity with these technologies will be helpful. The key takeaway from this session is how to design a geospatial algorithm that overcomes the challenges of noisy data and runs at scale in a cost-effective way.

Outline

Introduction [5 min]
Introduction to the problem at hand. Explain why we’re trying to find the places people visit, given their anonymized geolocation data, using geohash-based analysis. Explain how OOH advertising depends on the solution to this problem.

Algorithmic Challenges [7-8 min]
This section explains the algorithm at a high level and its challenges, which mostly concern finding a good geohash representation of inaccurate GPS data. Fortunately, a good approximation can be achieved if we let go of the restriction that the datapoint has to be exactly at the centre of its boundary. Clustering the observation data, i.e. the anonymized users’ commute data, along the time dimension, followed by a join with the POI (Places of Interest) dataset, gives us potential visits to POIs by people. POIs are simply places, e.g. McDonald’s, Sainsbury’s etc., that are of interest to the OOH client. These visits are then ranked and each potential visit gets a score, which can be incrementally updated as we periodically get more data.
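The cluster-then-join idea described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the speaker's PySpark implementation: the function names (`geohash_encode`, `cluster_stays`, `join_with_pois`), the tuple layouts, the geohash precision, the time-gap threshold and the minimum dwell time are all hypothetical. The encoder follows the standard public geohash scheme; a point is treated as belonging to its cell regardless of where in the cell it falls.

```python
from collections import defaultdict

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=7):
    """Encode a latitude/longitude pair as a standard base-32 geohash."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits are interleaved, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits, lon_lo = (bits << 1) | 1, mid
            else:
                bits, lon_hi = bits << 1, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits, lat_lo = (bits << 1) | 1, mid
            else:
                bits, lat_hi = bits << 1, mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


def cluster_stays(pings, max_gap_s=600, precision=7):
    """Cluster each user's pings over time into stays inside one geohash cell.

    pings: iterable of (user_id, unix_ts, lat, lon) tuples (assumed layout).
    Returns a list of (user_id, cell, start_ts, end_ts) tuples.
    """
    stays = []
    open_stay = {}  # user_id -> [cell, start_ts, end_ts]
    for user, ts, lat, lon in sorted(pings, key=lambda p: (p[0], p[1])):
        cell = geohash_encode(lat, lon, precision)
        stay = open_stay.get(user)
        if stay and stay[0] == cell and ts - stay[2] <= max_gap_s:
            stay[2] = ts  # same cell, small gap: extend the open stay
        else:
            if stay:
                stays.append((user, *stay))
            open_stay[user] = [cell, ts, ts]
    for user, stay in open_stay.items():
        stays.append((user, *stay))
    return stays


def join_with_pois(stays, pois, precision=7, min_dwell_s=300):
    """Join stays with POIs in the same cell; keep stays that last long enough.

    pois: iterable of (name, lat, lon) tuples (assumed layout).
    """
    pois_by_cell = defaultdict(list)
    for name, lat, lon in pois:
        pois_by_cell[geohash_encode(lat, lon, precision)].append(name)
    visits = []
    for user, cell, start, end in stays:
        if end - start >= min_dwell_s:
            for name in pois_by_cell.get(cell, []):
                visits.append((user, name, start, end))
    return visits
```

Calling `join_with_pois(cluster_stays(pings), pois)` yields candidate `(user, poi, start, end)` visits. Joining on cell equality is the approximation at work here: a ping close to a POI but across a cell border is missed, which a fuller version might handle by also checking neighbouring cells.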

Data Specific Challenges [3-4 min]
There are some challenges regarding data as well. In most cases, the observation dataset is quite noisy, because it consists of GPS readings. This naturally leads to more false positives in terms of visits to a place. But the incremental scoring helps reduce the false positives to some extent over time.
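One way to picture how incremental scoring can dampen false positives is a decayed running score per (user, POI) pair. The proposal does not specify the scoring rule, so the exponential decay, the `decay` value, the threshold and the function names below are assumptions made purely for illustration.

```python
def update_scores(prev_scores, todays_visits, decay=0.8):
    """Fold one batch of candidate visits into the running scores.

    prev_scores: dict mapping (user_id, poi) -> score from the previous run.
    todays_visits: iterable of (user_id, poi) candidate visits in this batch.
    Old evidence fades by `decay`; each new candidate visit adds 1.0.
    """
    scores = {key: score * decay for key, score in prev_scores.items()}
    for key in todays_visits:
        scores[key] = scores.get(key, 0.0) + 1.0
    return scores


def likely_visits(scores, threshold=1.0):
    """Keep only pairs whose accumulated score exceeds the threshold."""
    return {key for key, score in scores.items() if score > threshold}
```

Under this rule, a pair that shows up once because of a noisy reading fades away across later runs, while a pair that keeps recurring accumulates score and stays above the threshold.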

Output validation Challenges [3-4 min]
Because of the unsupervised nature of the algorithm, validating the result is challenging. We had an older version of the algorithm, which acted as a baseline model for us. We then computed various aggregate metrics for both algorithms to tune our new model.

Spark Job Specific Challenges [3-4 min]
Commentary on the incremental nature of the job, which is a double-edged sword. Because of its incremental nature, each run takes considerably less time. But to run the algorithm for the i-th iteration, we now need the output of the (i-1)-th iteration to be present. For this reason, backfilling the job for a specific time period becomes costly. Fortunately, the cost and time savings per run are worth that initial backfilling cost in terms of time and money.
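The dependency between consecutive runs can be illustrated with a small sketch. It assumes an in-memory dict standing in for persisted job output and a simple additive counter as the per-run state; `run_for_day`, `backfill`, `store` and `batches` are hypothetical names, and the real job is a Spark job rather than plain Python.

```python
def run_for_day(day, store, batches):
    """Run iteration `day`, reusing the stored output of iteration day - 1.

    store: dict mapping day -> state produced by that day's run.
    batches: list where batches[day] holds that day's new observations.
    """
    if day == 0:
        prev = {}
    elif day - 1 in store:
        prev = store[day - 1]
    else:
        raise LookupError(f"output of run {day - 1} is missing; backfill first")
    state = dict(prev)  # start from yesterday's result
    for key in batches[day]:  # process only today's data
        state[key] = state.get(key, 0) + 1
    store[day] = state
    return state


def backfill(first_day, last_day, store, batches):
    """Backfill a period by running every day in order, one after another."""
    for day in range(first_day, last_day + 1):
        run_for_day(day, store, batches)
```

Each call touches only one day's batch, which is where the per-run savings come from. The chain of dependencies is also why a backfill has to run sequentially: day i cannot start before day i-1 has written its output.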

Implementations and Cost Savings [3-4 min]
We’ll give an overview of the implementation of our algorithm in PySpark. We’ll see how to divide the task at hand into multiple logical units, which are Spark jobs in our use case, and orchestrate them using Apache Airflow in production. We’ll also look at the cost savings w.r.t. the old model we had in production.

Conclusion [2 min]
Despite the challenges, the geohash-based algorithm consistently runs much faster without reducing accuracy. This gives us a shorter turnaround time, and faster feedback from our clients based on the advertisement campaigns.

Requirements

N/A

Speaker bio

I am an engineer who is primarily curious about Data Science, Data Platforms, and (Post-Quantum) Cryptography. I got my Master’s Degree from IISc Bangalore in CSA. Since then, for almost 5 years, I’ve been focusing on building products around Data Platforms and Data Analytics. I’ve always found it very rewarding to solve challenging business problems within the constraints of budget and time. Throughout the last year, my work at Sahaj has revolved around setting up the foundations of a data pipeline, along with implementing Data Science algorithms as Spark jobs that are efficient and, as a consequence, cheaper to run on cloud infrastructure.
