Ghostbusters: Optimizing debt collections with survival models
Submitted by Fasih Khatib (@fasihsimpl) on May 13, 2019
Abstract
A pay-later solution like Simpl comes with risk: some customers don’t pay their bill on time. When this happens, our collections team calls them up and gently reminds them that their bill is due. Some people even try to vanish (they ghost us) without paying their bill, which results in escalation to our skip trace team.
In this talk I’ll go over how we use survival models to optimize our calling team by deciding who has skipped (and needs a trace), who should get a gentle reminder, and in what order of priority.
Outline
TL;DR
This talk is about using survival models to optimize the process of making collection calls (“dear sir, please pay your bill, it’s overdue”).
Context:
 An overview of how the calling process is structured. This will give an understanding of what we’re trying to optimize.
 Discuss why moving people from one level of the process to another automatically and optimally is important for recovering money.
 Get an understanding of why data-backed decisions are important for overall efficiency. Is it worth it to make 7 calls per user or should you escalate after 4 calls?
 Understand how using panel data for user behavior is significantly different from more standard classifiers, which use cross-sectional data.
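As a rough illustration of the panel-data point above, the sketch below collapses a per-call log (one row per user per call) into one duration and one event flag per user, which is the shape survival models typically expect. The schema, the user ids, and the helper name `to_survival_rows` are hypothetical and are not taken from the talk.

```python
from collections import defaultdict

# Hypothetical call log in panel form: (user_id, call_number, paid_after_call).
# This schema is an assumption for illustration only.
call_log = [
    ("u1", 1, False), ("u1", 2, True),
    ("u2", 1, False), ("u2", 2, False), ("u2", 3, False),
    ("u3", 1, True),
]

def to_survival_rows(log):
    """Collapse per-call rows into one (duration, observed) pair per user."""
    by_user = defaultdict(list)
    for user_id, call_number, paid in log:
        by_user[user_id].append((call_number, paid))

    rows = {}
    for user_id, calls in by_user.items():
        calls.sort()
        paid_calls = [n for n, paid in calls if paid]
        if paid_calls:
            # Event observed: the user paid after this many calls.
            rows[user_id] = (paid_calls[0], True)
        else:
            # Right-censored: still unpaid at the last call we have on record.
            rows[user_id] = (calls[-1][0], False)
    return rows

survival_rows = to_survival_rows(call_log)
```

A cross-sectional classifier would see one snapshot per user; here the history of calls is kept as a duration, and users who have not yet paid are retained as censored observations instead of being dropped.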
A brief introduction to survival models:
 What survival models are, and where they are traditionally used. Get an introduction to basic terminology like survival function, hazard rate, censoring, etc.
 Take a look at non-traditional applications of survival models in fields like sales lead prioritization, marketing automation, etc.
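For reference, the terminology above has standard textbook definitions, written out below. The notation ($T$ for the time until the event, $f$ for its density, $c_i$ for a censoring time) is a common convention and is assumed here, not taken from the slides.

```latex
% T: time until the event of interest (e.g. calls or days until a bill is paid)
\begin{align}
  S(t) &= \Pr(T > t)
       && \text{(survival function)} \\
  h(t) &= \lim_{\Delta t \to 0}
          \frac{\Pr(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}
        = \frac{f(t)}{S(t)}
        = -\frac{d}{dt} \log S(t)
       && \text{(hazard rate)}
\end{align}
% Right censoring: for subject i we may only observe that T_i > c_i,
% i.e. the event had not yet happened when observation stopped.
```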
How we use survival models:
 How math concepts are directly relevant to the business: a hazard function is directly useful as a lead score, while a survival function tells us who the ghosts are. Math => business decisions.
 Constructing hazard curves via parametric (Weibull) and non-parametric (Kaplan-Meier) methods and connecting them to our real data.
 Cox proportional hazards model
 Data limitations force us to use models that account for censoring.
 Take a look at productionizing these models; how to use this information to make better decisions. One model can solve many problems (escalation, lead scoring, write-off, etc.)
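To make the parametric and non-parametric pieces above concrete, here is a minimal from-scratch sketch of the textbook Kaplan-Meier estimator and the Weibull hazard function. It is not the production code described in the talk; the function names and the toy data (calls until payment, with `False` marking customers who had not paid when observation ended) are assumptions for illustration.

```python
from collections import Counter

def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct time with at least one observed event."""
    events = Counter(t for t, seen in zip(durations, observed) if seen)
    curve = []
    survival = 1.0
    for t in sorted(events):
        # Everyone whose duration is at least t is still at risk at time t,
        # including subjects censored at exactly t.
        at_risk = sum(1 for d in durations if d >= t)
        survival *= 1.0 - events[t] / at_risk
        curve.append((t, survival))
    return curve

def weibull_hazard(t, shape, scale):
    """Weibull hazard h(t) = (shape / scale) * (t / scale) ** (shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

# Hypothetical toy data: number of calls until payment; False = censored.
durations = [1, 2, 2, 3, 4, 5]
observed = [True, True, False, True, False, True]
curve = kaplan_meier(durations, observed)
```

With a Weibull shape parameter below 1 the hazard decreases over time, so the longer a customer has gone without paying, the less likely they are to pay on the next call. A shape of exactly 1 gives a constant hazard, and a shape above 1 gives an increasing one.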
Requirements
This talk is accessible to those with some prior experience in statistics and/or machine learning.
Speaker bio
Fasih is a data scientist at Simpl, India’s top pay later platform. When he’s not busy playing video games, he’s busy writing about all-things-Bayes and functional programming. Prefers adrak-wali-chai over coffee, suggests ordering from Tata Cha over Chai Point, and paying using Simpl.
Links
 Implementing TSum: An Algorithm for Table Summarization - http://fasihkhatib.com/2018/10/21/ImplementingTSumAnAlgorithmforTableSummarization
 Frequentism vs Bayesianism - http://fasihkhatib.com/2019/05/10/TheMachineLearningNotebookFrequentismvsBayesianism/
Slides
https://docs.google.com/presentation/d/1qnMPPi3caa_9y1pShqxf7mbEbJsk33fpEgYX6WDQ0Y/edit?usp=sharing
Preview video
https://www.youtube.com/watch?v=_KRRgKWi2WA
Comments


Zainab Bawa (@zainabbawa)
Looks like an interesting talk, Fasih. The slides are yet to be completed. I’d like to see:
 Where panel data is needed and where it is not needed?
 Preparing panel data – is there only one way of doing this? Or is the approach you have outlined one of many approaches that can be used for preparing panel data?
 What is it that the audience can take away from this talk?

Fasih Khatib (@fasihsimpl) Proposer
Hey, Zainab. I’ve updated the link to share my detailed slides. It should answer most of the questions you have. If there’s anything missing still, do let me know.

Abhishek Balaji (@booleanbalaji)
Hi Fasih,
Thanks for taking the time out for the rehearsal. Here’s the summarized feedback from the rehearsal:
 Time taken: 18 mins
 The slides need a lot of work: add flow charts and diagrams where needed, color-code formulae and equations, and show the infra.
 Show how the data flows and where exactly the model is placed in the flow
 Map how the theoretical concepts relate to the use case in this project. (Survival = User picking up call etc)
 Add some context about who the presentation is for and set the agenda on what you’re gonna talk about
 Add prompts for people to ask you questions. What can they talk to you about?
 Add more details on the models used and Lindy effect
 Add the metrics on the project/model. How did the distribution or model change after Ghostbusters was implemented.
 Explain why it’s only one survivor model and what the alternatives were
 Explain what the big picture is and elaborate on what you are going to do
We need to see your revised slides and proposal by Jun 7, 2019. We will communicate the next steps and decision on your rehearsal after we evaluate your revised slides.

Fasih Khatib (@fasihsimpl) Proposer
Hey, Abhishek. I’ve updated my slides to incorporate the suggestions. I’ve also changed the duration of my talk to 40 minutes, since I won’t be able to cover the topic in reasonable detail in 25 minutes.

Abhishek Balaji (@booleanbalaji)
Thanks Fasih. Confirming your talk. We’ll finalize the duration based on a second rehearsal. For now, I’ve added it to the schedule as well.

Fasih Khatib (@fasihsimpl) Proposer
Sounds good.

Abhishek Balaji (@booleanbalaji)
Hi Fasih,
Here’s some more feedback to incorporate into your slides:
 Good that the Cox proportional hazards model has been included
 Issues with the Weibull distribution still persist
 Definite results for the Cox model are still missing
 Slide 30 shows a plot with two users that seems hypothetical, but beyond that, where do you show that it does any better than the empirical hazard ratio?
 Also, the math is very unclear: for example, on the right column of slide 27, there is no matching index j on the right-hand side, and on the next slide (28) there is a floating index i with no reference to the j described. These are not just typos; even conceptually, you haven’t dealt with it properly.
Do incorporate this feedback into your slides before next week. We’ll be scheduling the second rehearsal next week.

Abhishek Balaji (@booleanbalaji)
Hi Fasih,
Some of the feedback still hasn’t been incorporated into the slides. Listing it here:
 There’s too much text on the slides, and it is too verbose. This needs to change, with the content summarized as points.
 There’s a disconnect between the math shown on the slides and what’s eventually applied/coded to solve the problem. What is also lacking is an explanation of how the equations are encoded in the software and how they are used
 The eventual gains are not clear from the slides
 Use of “hazard” here is questionable and might have to be changed since this would confuse participants.

Fasih Khatib (@fasihsimpl) Proposer
There’s too much text on the slides and is too verbose.
I’ve reworked the content to reduce the amount of text.
There’s disconnect between the math shown on the slides and what’s eventually applied
I’ve added pseudocode to show how we generate the priority scores. There’s also a flowchart to show how the Cox and Weibull models get put together into the priority score.
The eventual gains are not clear from the slides
There’s a calibration curve that shows how the model performs.

Abhishek Balaji (@booleanbalaji)
Thanks, Fasih!




Hi Fasih,
Thank you for submitting a proposal. I’m unable to access the slides. Could you make them public? In addition, make sure your proposal covers the following aspects:
As next steps, we’d need to see the detailed and/or updated slides by 21 May, in order to close the decision on your proposal. If we don’t receive an update by 21 May, we’d have to move the proposal for consideration for a future conference.
Hey, Abhishek. You should be able to see the slides now.