The Fifth Elephant 2019

Gathering of 1000+ practitioners from the data ecosystem

GuidedLDA: A Python Package using Semi-Supervised Topic Modelling by Incorporating Lexical Priors

Submitted by Nandan Thakur (@nthakur20) on Friday, 14 June 2019


Session type: Tutorial

Abstract

Topic Models have a great potential for helping users understand document corpora. This potential is impeded by their purely unsupervised nature, which often leads to topics that are neither entirely meaningful nor effective in extrinsic tasks. In this talk, I plan to explain how we wrote our own form of Latent Dirichlet Allocation (LDA) in order to guide topic models to learn topics of specific interest to a user. I will also talk about why we proposed a simple and effective solution known as Semi-Supervised Guided Topic Model (GuidedLDA), and the process of open sourcing everything on GitHub.

Outline

[0-15mins]: Introduction to topic modeling and the intuition behind LDA (Latent Dirichlet Allocation), with some business use cases and an easy-to-understand news article example.

[15-20mins]: What is Guided LDA, and why did we choose it? The problems with regular unsupervised LDA that led us to shift to semi-supervised GuidedLDA.

[20-30mins]: How does generic LDA work? An overview of how generic LDA works, using Bayesian probability and pertinent examples.

[30-35mins]: What happens when we seed the document? A detailed explanation of how GuidedLDA works in terms of Bayesian probability, with relevant examples, and how it improves on generic LDA.

[35-37mins]: Using GuidedLDA. How to use the GuidedLDA Python package available on GitHub, illustrated with sample code.

[37-40mins]: Conclude with GuidedLDA stats and key takeaways. Motivate the audience by showing that a small idea can emerge from anywhere, even from a small startup in Bangalore.

[40-90mins]: Show a small application where we first clean up a publicly available dataset and then perform topic modeling using both regular LDA and GuidedLDA.
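
To make the seeding step in the outline concrete, here is a minimal pure-Python sketch of LDA trained by collapsed Gibbs sampling, where seed words are nudged toward a chosen topic at initialisation. This is an illustration only and not the GuidedLDA package's code: the function name `seeded_lda`, its parameters, the toy corpus, and the choice to seed by biasing the initial topic assignment with probability `seed_confidence` are all assumptions made for this example.

```python
import random


def seeded_lda(docs, n_topics, seed_topics, seed_confidence=0.7,
               alpha=0.1, beta=0.01, n_iter=200, rng=None):
    """Collapsed Gibbs sampling for LDA with a seeded initialisation.

    docs: list of documents, each a list of integer word ids.
    seed_topics: dict mapping word id -> preferred topic id (hypothetical).
    Returns (doc_topic, topic_word) count matrices as nested lists.
    """
    rng = rng or random.Random(0)
    vocab_size = max(w for doc in docs for w in doc) + 1
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments = []

    # Initialisation: a seeded word goes to its seed topic with probability
    # seed_confidence; every other token gets a uniformly random topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            if w in seed_topics and rng.random() < seed_confidence:
                k = seed_topics[w]
            else:
                k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    # Standard collapsed Gibbs sweeps: remove a token's current assignment,
    # then resample its topic from the conditional distribution.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


# Toy corpus (made up for this sketch).
vocab = ["goal", "match", "team", "vote", "election", "party"]
word2id = {w: i for i, w in enumerate(vocab)}
texts = [
    "goal match team goal",
    "team match goal",
    "vote election party vote",
    "party election vote",
]
docs = [[word2id[w] for w in text.split()] for text in texts]

# Hypothetical seed lists: topic 0 for sports words, topic 1 for politics words.
seed_topic_list = [["goal", "match"], ["vote", "election"]]
seed_topics = {word2id[w]: t
               for t, words in enumerate(seed_topic_list) for w in words}

doc_topic, topic_word = seeded_lda(docs, n_topics=2, seed_topics=seed_topics)
```

Setting `seed_confidence=0` in this sketch reduces it to regular unsupervised LDA with a random start, which makes it easy to compare the two behaviours on the same corpus.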

Requirements

Participants must clone or download the following Jupyter notebook: https://github.com/NThakur20/topic-modeling
Participants must bring their own laptops and should have a basic idea of how to run virtual environments and Jupyter notebooks.

Speaker bio

I am a perpetual, quick learner, keen to explore the realm of data analytics and data science. I am deeply excited about the times we live in and the rate at which data is being generated and transformed into an asset. I am well versed in domains such as Natural Language Processing, Machine Learning, and Signal Processing, and I have a keen interest in learning interdisciplinary concepts involving Machine Learning.

Links

Slides

https://docs.google.com/presentation/d/1YZvRif9kstTegPlqwJQkRV-jlOS134P4-DOOT_FY_bY/edit?usp=sharing

Preview video

https://youtu.be/F6rxyVHGSdk

Comments

  • Abhishek Balaji (@booleanbalaji) Reviewer 4 months ago

    Hi Nandan,

    Thank you for submitting a proposal. We need to see detailed slides and a preview video to evaluate your proposal. Your slides must cover the following:

    • Problem statement/context, which the audience can relate to and understand. The problem statement has to be a problem (based on this context) that can be generalized for all.
    • What were the tools/frameworks available in the market to solve this problem? How did you evaluate these, and what metrics did you use for the evaluation? Why did you pick the option that you did?
    • Explain how the situation was before the solution you picked/built and how it changed after implementing the solution you picked and built? Show before-after scenario comparisons & metrics.
    • What compromises/trade-offs did you have to make in this process?
    • What is the one takeaway that you want participants to go back with at the end of this talk? What is it that participants should learn/be cautious about when solving similar problems?

    We need your updated slides and preview video by Jun 27, 2019 to evaluate your proposal. If we do not receive an update, we’d be moving your proposal for evaluation under a future event.

    • Nandan Thakur (@nthakur20) Proposer 4 months ago

      Hi Abhishek,

      Thanks for all the feedback, it helped me prepare my slides.

      I have now created my slides and attached them to this proposal. Please view them and let me know where to improve, if required.

      Thanks,
      Nandan

      • Abhishek Balaji (@booleanbalaji) Reviewer 4 months ago

        Thanks Nandan, the slides look good! Just had a couple more questions:

        • You mentioned that you wrote your own form of Latent Dirichlet Allocation (LDA). Is that Guided LDA? If so, please add more information on the existing solutions available for this and why they did not work in your case?

        • Is this package being used in production? If so, can you add metrics from a production deployment? It’s useful to have the metrics for the examples you’ve provided, but evaluating these is hard since it’s a very small sample and not reflective of the real world. This will help the audience relate to the problem at hand.

        • Alternatively, this looks good as a tutorial where you can walk the audience through the entire workflow, starting from installing the package, to using a sample data set. Since you’re the maintainer/developer on this, this could be a great form for the proposal. Tutorials would be alongside the talks, but would be for a more intimate audience at the conference.

        Let me know what you think and your decision by 27 June so we can take it forward.

        • Nandan Thakur (@nthakur20) Proposer 4 months ago

          Hey Abhishek,

          Thanks for all the feedback.

          In reply to your first question: yes, the solution is Guided LDA. At the time the package was developed, there was no other good Python package offering the unique solution that GuidedLDA provides, although the idea for the repo stemmed from a research paper. I have also shown in the slides, with examples, how I initially started with regular LDA to solve my use case, but it didn’t work out well. Gensim and scikit-learn have their own implementations of regular LDA, but both lack good documentation and an intuitive, easy-to-understand example.

          Yes, the package has been used many times in production for multiple use cases, but due to company policy I am unable to share the metrics. Even for a semi-supervised model, we cannot come up with definitive metrics to quantify the performance of the package. We have manually checked the topics for thousands ourselves and evaluated the results, and Guided LDA turned out to be really accurate.

          I like the option of doing a small tutorial. Before I start preparing my Jupyter notebook tutorial and submit it by tomorrow: is it okay if I mostly talk (using my presentation slides) and end with a short hands-on tutorial on the workflow using a sample data set? Would this then be called a tutorial, or would it remain a talk? Let me know if this sounds feasible, and I will prepare the tutorial accordingly.

          Waiting eagerly for a reply.

          Thanks in advance.

          Best,
          Nandan

        • Nandan Thakur (@nthakur20) Proposer 4 months ago

          Hey Abhishek,

          I have successfully created the Jupyter notebook and converted this proposal into a mostly talk-based session with a hands-on tutorial covering the topic. Let me know whether it is being moved to evaluation now.

          Thanks,
          Nandan

          • Abhishek Balaji (@booleanbalaji) Reviewer 4 months ago

            Yes, thanks for the update. Will get this evaluated this week and share an update with you.

            • Nandan Thakur (@nthakur20) Proposer 3 months ago

              Hey Abhishek, any updates? Do let me know.
              Sorry to bother you.

              Best,
              Nandan

              • Abhishek Balaji (@booleanbalaji) Reviewer 3 months ago

                Hey Nandan,

                Thanks for checking on this. The topic and the proposal are very detailed and would warrant a workshop/tutorial of their own. We can’t accommodate this session at the conference this time, and hence will keep it on the waitlist for when we’re doing other events around The Fifth Elephant. Would you be interested in participating in a discussion on intent classification/personalization at the conference?

                • Nandan Thakur (@nthakur20) Proposer 3 months ago

                  Hi Abhishek, what do you mean when you say participating in a discussion at the conference? Could you shed some light on it?

                  • Abhishek Balaji (@booleanbalaji) Reviewer 3 months ago

                    Sure. We have off-the-record discussions scheduled in parallel with the talks. These are to encourage free discussions at the event, without the constraints of having to put together a talk.

                    The one where you could share your valuable insights is https://hasgeek.com/fifthelephant/2019/proposals/intent-classification-and-personalization-BnggTjRvFFayM3gELESRED

                    I will issue a 10% discount code to you over email, which you can use to purchase a ticket to the conference. If you’re there, you can join any of the sessions or start your own session as well. Let me know once you’ve made a decision and I’ll add you to the discussions regarding the intent classification and personalization BoF.

                • Abhishek Balaji (@booleanbalaji) Reviewer 3 months ago (edited 3 months ago)

                  Nandan, the other scope for participation is if you’d want to present a flash talk on this to introduce folks to this tool/approach. If there’s more interest, you can start an impromptu Birds of a Feather session on the same topic.

                  We’re unable to issue a free ticket to attend the conference. Proposals that don’t make it to the schedule are eligible for a 10% discount on tickets.

                  Let me know the decision you make by 18 July, so we can close on this decision and add you to the flash talk session if you choose to do so.

                  • Nandan Thakur (@nthakur20) Proposer 3 months ago

                    Hey Abhishek, thanks for the offer, but since I start from the basics, the time required to make others understand would definitely be more than 30 mins, and compressing it into a flash talk would not do justice to the audience. So sorry, but I am not interested in the flash talk.

                    • Abhishek Balaji (@booleanbalaji) Reviewer 3 months ago

                      Fair enough. It is quite the effort and a bummer that we can’t accommodate this as a session. I’ve added a response to your other question, and hope to meet you at the conference!
