The Fifth Elephant 2019

Gathering of 1000+ practitioners from the data ecosystem

Ghostbusters: Optimizing debt collections with survival models

Submitted by Fasih Khatib (@fasihsimpl) on May 13, 2019

Session type: Full talk of 40 mins
Status: Confirmed & Scheduled


A pay-later solution like Simpl comes with risk - some customers don’t pay their bill on time. When this happens, our collections team calls them up and gently reminds them that their bill is due. Some people even try to vanish - they ghost us - without paying their bill, resulting in escalation to our skip trace team.

In this talk I’ll go over how we use survival models to optimize our calling team by deciding who has skipped (and needs a trace), who should get a gentle reminder, and in what order of priority.



This talk is about using survival models to optimize the process of making collection calls (“dear sir, please pay your bill, it’s overdue”).


  • An overview of how the calling process is structured. This will give an understanding of what we’re trying to optimize.
  • Discuss why moving people from one level of the process to another automatically and optimally is important for recovering money.
  • Get an understanding of why data-backed decisions are important for overall efficiency. Is it worth it to make 7 calls per user or should you escalate after 4 calls?
  • Understand how modeling user behavior with panel data (repeated observations of the same user over time) differs significantly from standard classifiers, which use cross-sectional data.
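
To make the panel-data point concrete, here is a minimal sketch of collapsing a call log (one row per user per call attempt) into one duration record per user. The data, the column layout, the `to_durations` helper, and the choice of "payment" as the event are all assumptions made for illustration; they are not taken from Simpl's pipeline.

```python
from collections import defaultdict

# Hypothetical call log in panel form: (user_id, attempt_number, paid_after_call).
call_log = [
    ("u1", 1, False),
    ("u1", 2, False),
    ("u1", 3, True),   # u1 paid after the 3rd call
    ("u2", 1, False),
    ("u2", 2, False),  # u2 has not paid yet, so this user is right-censored
    ("u3", 1, True),
]


def to_durations(rows):
    """Collapse panel rows into one (duration, event_observed) pair per user.

    duration is the number of calls until payment, or the number of calls
    made so far when no payment has been observed (a censored record).
    """
    by_user = defaultdict(list)
    for user_id, attempt, paid in rows:
        by_user[user_id].append((attempt, paid))

    durations = {}
    for user_id, attempts in by_user.items():
        attempts.sort()
        paid_attempts = [a for a, paid in attempts if paid]
        if paid_attempts:
            durations[user_id] = (paid_attempts[0], True)
        else:
            durations[user_id] = (attempts[-1][0], False)
    return durations


durations = to_durations(call_log)
```

A cross-sectional classifier would see one row per user and lose the ordering of attempts; the duration records keep both how long each user was followed and whether the event was actually observed.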

A brief introduction to survival models:

  • What survival models are, and where they are traditionally used. Get an introduction to basic terminology like survival function, hazard rate, censoring, etc.
  • Take a look at non-traditional applications of survival models in fields like sales lead prioritization, marketing automation, etc.
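
For reference, these are the standard textbook definitions of the terminology above, with T the random time until the event, f its density, and C a censoring time. The symbols are the usual conventions, not notation taken from the talk's slides.

```latex
\begin{aligned}
S(t) &= \Pr(T > t)
  && \text{(survival function)} \\
h(t) &= \lim_{\Delta t \to 0}
        \frac{\Pr(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}
      = \frac{f(t)}{S(t)}
  && \text{(hazard rate)} \\
S(t) &= \exp\!\left(-\int_0^t h(u)\,du\right)
  && \text{(link between the two)} \\
Y &= \min(T, C), \qquad \delta = \mathbf{1}[T \le C]
  && \text{(right-censored observation)}
\end{aligned}
```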

How we use survival models:

  • How math concepts are directly relevant to the business - a hazard function is directly useful as a lead score, while a survival function tells us who the ghosts are. Math => business decisions.
  • Constructing survival and hazard curves via parametric (Weibull) and non-parametric (Kaplan-Meier) methods, and connecting them to our real data.
  • The Cox proportional hazards model.
  • Data limitations force us to use models that handle censored observations.
  • Take a look at productionizing these models; how to use this information to make better decisions. One model can solve many problems (escalation, lead scoring, write-off, etc.)
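
The three building blocks named above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the production model: the durations, the Weibull parameters, the coefficients `beta`, and the user covariates are all made up, and the helper names are hypothetical.

```python
import math


def kaplan_meier(durations, events):
    """Kaplan-Meier estimate: [(t, S(t))] at each distinct event time.

    durations: time (e.g. number of calls) at which each user was last seen
    events:    True if the event was observed, False if censored
    """
    event_times = sorted({t for t, e in zip(durations, events) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)
        observed = sum(1 for d, e in zip(durations, events) if d == t and e)
        survival *= 1.0 - observed / at_risk
        curve.append((t, survival))
    return curve


def weibull_hazard(t, shape, scale):
    """Weibull hazard h(t) = (k / lam) * (t / lam) ** (k - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def cox_relative_hazard(coefficients, covariates):
    """Relative hazard exp(beta . x) under a Cox proportional hazards model."""
    return math.exp(sum(b * x for b, x in zip(coefficients, covariates)))


# Toy data: 7 users, three of them censored.
durations = [1, 2, 2, 3, 4, 4, 5]
events = [True, True, False, True, False, True, False]
km_curve = kaplan_meier(durations, events)

# Made-up coefficients and covariates, used only to show ranking by hazard.
beta = [0.8, -0.3]
users = {"u1": [1.0, 2.0], "u2": [0.0, 1.0]}
priority = sorted(
    users, key=lambda u: cox_relative_hazard(beta, users[u]), reverse=True
)
```

With a Weibull shape parameter below 1 the hazard decreases over time, and with shape equal to 1 it is constant. Sorting users by relative hazard is one way a hazard can act as a lead score, as the first bullet in this list describes.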


This talk is accessible to those with some prior experience in statistics and/or machine learning.

Speaker bio

Fasih is a data scientist at Simpl, India’s top pay later platform. When he’s not busy playing video games, he’s busy writing about all-things-Bayes and functional programming. Prefers adrak-wali-chai over coffee, suggests ordering from Tata Cha over Chai Point, and paying using Simpl.





  • Abhishek Balaji (@booleanbalaji) a year ago

    Hi Fasih,

    Thank you for submitting a proposal. I’m unable to access the slides. Could you make them public? In addition, make sure your proposal covers the following aspects:

    • Problem statement/context, which the audience can relate to and understand. The problem statement has to be a problem (based on this context) that can be generalized for all.
    • What were the tools/options available in the market to solve this problem? How did you evaluate these, and what metrics did you use for the evaluation? Why did you decide to build your own ML model?
    • Why did you pick the option that you did?
    • Explain what the situation was before the solution you picked/built, and what the level of fraud/ghosting was after implementing it. Show before-after scenario comparisons & metrics.
    • What compromises/trade-offs did you have to make in this process?
    • What are the privacy, regulatory and ethical considerations when building this solution?
    • What is the one takeaway that you want participants to go back with at the end of this talk? What is it that participants should learn/be cautious about when solving similar problems?

    As next steps, we’d need to see the detailed and/or updated slides by 21 May, in order to close the decision on your proposal. If we don’t receive an update by 21 May, we’d have to move the proposal for consideration for a future conference.

    • Fasih Khatib (@fasihsimpl) Proposer a year ago

      Hey, Abhishek. You should be able to see the slides now.

  • Zainab Bawa (@zainabbawa) a year ago

    Looks like an interesting talk, Fasih. The slides are yet to be completed. I’d like to see:

    1. Where panel data is needed and where it is not needed?
    2. Preparing panel data – is there only one way of doing this? Or is the approach you have outlined one of many approaches that can be used for preparing panel data?
    3. What is it that the audience can take away from this talk?
    • Fasih Khatib (@fasihsimpl) Proposer a year ago

      Hey, Zainab. I’ve updated the link to share my detailed slides. It should answer most of the questions you have. If there’s anything missing still, do let me know.

  • Abhishek Balaji (@booleanbalaji) a year ago

    Hi Fasih,

    Thanks for taking the time out for the rehearsal. Here’s the summarized feedback from the rehearsal:
    - Time taken: 18 mins
    - The slides need a lot of work in adding flow charts, diagrams where needed, color coding formulae and equations, and the infra.
    - Show how the data flows and where exactly the model is placed in the flow
    - Map how the theoretical concepts relate to the use-case in this project. (Survival = User picking up call etc)
    - Add some context about who the presentation is for and set the agenda on what you’re gonna talk about
    - Add prompts for people to ask you questions. What can they talk to you about?
    - Add more details on the models used and Lindy effect
    - Add the metrics on the project/model. How did the distribution or model change after Ghostbusters was implemented?
    - Explain why it’s only one survivor model and what the alternatives were
    - Explain what the big picture is and elaborate on what you are going to do

    We need to see your revised slides and proposal by Jun 7, 2019. We will communicate the next steps and decision on your rehearsal after we evaluate your revised slides.

  • Fasih Khatib (@fasihsimpl) Proposer a year ago

    Hey, Abhishek. I’ve updated my slides to incorporate the suggestions. I’ve also edited the duration of my talk to be 40 minutes since I won’t be able to speak about the topic in reasonable detail in the 25 minute duration.

    • Abhishek Balaji (@booleanbalaji) a year ago

      Thanks Fasih. Confirming your talk. We’ll finalize on the duration based on a second rehearsal. For now, I’ve added it on the schedule as well.

      • Fasih Khatib (@fasihsimpl) Proposer a year ago

        Sounds good.

        • Abhishek Balaji (@booleanbalaji) a year ago

          Hi Fasih,

          Here’s some more feedback to incorporate into your slides:

          • Good that the Cox proportional hazards model has been included
          • Issues with the Weibull distribution still persist
          • Definite results for the Cox model are still missing
          • Slide 30 shows a plot with two users that seems hypothetical, but beyond that, where do you show that it does any better than the empirical hazard ratio?
          • Also, the math is very unclear. For example, in the right column of slide 27, there is no matching index j on the right-hand side, and on the next slide (28) there is a floating index i with no reference to j described. These are not just typos; even conceptually, the indices haven’t been dealt with properly.

          Do incorporate this feedback into your slides before next week. We’ll be scheduling the second rehearsal next week.

          • Abhishek Balaji (@booleanbalaji) a year ago

            Hi Fasih,

            Some of the feedback still hasn’t been incorporated into the slides. Listing them here:

            • There’s too much text on the slides and it is too verbose. This needs to change, with the content summarized as points.
            • There’s a disconnect between the math shown on the slides and what’s eventually applied/coded to solve the problem. What is also lacking is an explanation of how the equations are encoded as part of the software and how they are utilized
            • The eventual gains are not clear from the slides
            • Use of “hazard” here is questionable and might have to be changed since this would confuse participants.
            • Fasih Khatib (@fasihsimpl) Proposer 11 months ago

              There’s too much text on the slides and it is too verbose.

              I’ve reworked the content to reduce the amount of text.

              There’s a disconnect between the math shown on the slides and what’s eventually applied

              I’ve added pseudocode to show how we generate the priority scores. There’s also a flowchart to show how the Cox and Weibull models get put together into the priority score.

              The eventual gains are not clear from the slides

              There’s a calibration curve that shows how the model performs.

              • Abhishek Balaji (@booleanbalaji) 11 months ago

                Thanks, Fasih!
