The Fifth Elephant 2023 Monsoon

On AI, industrial applications of ML, and MLOps




Akshat Gupta


Harmonising Art and AI: Crafting Jazzy and Juicy Video Snippets through AI

Submitted Jun 30, 2023


Live-streaming platforms, where creators broadcast live content to users, have grown in popularity in recent times. Videos on these platforms typically run from 15 minutes to an hour. Our research found that a sizable share of users drop off within the first 30 seconds of a video. Separate research shows that, on average, a user has an attention span of only 30 seconds, and that this number is even lower for Gen Z, our main target audience. To address this, we want to identify the juiciest segments of each video and add external features that prompt a user to land on the base video, increasing overall user engagement and the jazziness of the video. We also want a customizable framework that caters not only to snippets but also to trailers, mashups, etc.

Literature Review

We began by surveying the tools that already exist on the market. We found no single tool or solution, anywhere globally, that addresses this problem end to end. Several solutions tackle it in bits and pieces, but none fully. We then read research papers on end-to-end approaches, which gave us a couple of ideas to try.


We broke our solution into two parts: how to get the base snippet (the juiciest part of a video), and which post-processing techniques to apply to it. In summary:

Base snippet:

  1. Used a state-of-the-art (SOTA) speech-to-text model to transcribe each video.
  2. Optimised this model using ctranslate for faster inference.
  3. Used Flan T5 XXL to generate a summary of the transcribed sentences.
  4. Used simple transformer-based models to score the similarity between each transcript sentence and the summary.
  5. Applied a moving average to the cosine similarity scores to find the timestamp range that best matches the summary.
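
The last two steps can be sketched as follows. This is a minimal illustration, not the deployed implementation: it assumes sentence and summary embeddings have already been computed by some sentence-embedding model, and the function names (`cosine`, `best_snippet`), the input layout, and the window size of 3 are all hypothetical choices made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_snippet(sentences, summary_vec, window=3):
    """Pick the run of consecutive sentences that best matches the summary.

    sentences: list of (start_sec, end_sec, embedding) tuples (assumed layout).
    Returns (start_sec, end_sec, mean_score) for the best window.
    """
    if not sentences:
        raise ValueError("no sentences to score")
    window = min(window, len(sentences))

    # One cosine score per transcript sentence.
    scores = [cosine(vec, summary_vec) for _, _, vec in sentences]

    # Moving average: smooths single-sentence spikes so the chosen
    # segment is consistently on-topic rather than one lucky sentence.
    best_start, best_avg = 0, float("-inf")
    for i in range(len(scores) - window + 1):
        avg = sum(scores[i:i + window]) / window
        if avg > best_avg:
            best_start, best_avg = i, avg

    start_sec = sentences[best_start][0]
    end_sec = sentences[best_start + window - 1][1]
    return start_sec, end_sec, best_avg
```

The window size controls snippet length: a larger window yields longer, smoother segments, while a smaller one favours short, sharply matching moments.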

Post processing:

  1. Identified key moments in the video using CLIP-based models, driven by a prompt and user interactions.
  2. Used frame-level analysis (phasing) to detect shot boundaries, where sudden changes happen.
  3. Added stickers and GIFs based on context from the Flan model.
  4. Created an in-house solution for memes using Stable Diffusion.
  5. Used a Stable Diffusion-based model for artistic video generation.
  6. Used ESRGANs to upsample videos and increase their quality.
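
To make the shot-detection step concrete, here is a minimal sketch of finding sudden changes between consecutive frames. The proposal does not describe its frame-level method in detail, so this example swaps in a plain mean absolute pixel difference as a stand-in. The function names, the flat grayscale frame representation, and the threshold of 40 are assumptions made for illustration only.

```python
def frame_difference(prev, curr):
    """Mean absolute difference between two equally sized grayscale frames.

    Each frame is a flat list of pixel intensities in 0..255 (assumed layout).
    """
    if not prev or len(prev) != len(curr):
        raise ValueError("frames must be non-empty and the same size")
    return sum(abs(p - c) for p, c in zip(prev, curr)) / len(prev)


def detect_shot_changes(frames, threshold=40.0):
    """Return the indices of frames that start a new shot.

    A frame starts a new shot when it differs from the previous frame
    by more than `threshold` (an arbitrary value chosen for this sketch).
    """
    changes = []
    for i in range(1, len(frames)):
        if frame_difference(frames[i - 1], frames[i]) > threshold:
            changes.append(i)
    return changes
```

In practice the threshold would need tuning per content type, since fast camera motion or lighting changes can also produce large frame-to-frame differences.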

Impact and Future Work

We deployed our solution at scale (500 video snippets per day) in India and saw an increase of close to 80% in overall time spent and user engagement. As next steps, we plan to scale the solution to Indonesia and then to the US, and to create a new feed just for these videos. We will also keep improving both base snippets and post-processing.


