Roposo is a live video platform with over 200 million end users and roughly 1,000 live videos uploaded every day, each lasting 15 minutes to 3 hours. To increase engagement and improve the user experience, we are building a central video feed with assets that can be easily consumed. This requires converting our events and creator-led videos into shorter formats such as trailers and short clips, for which we process the videos with the help of AI to extract the most important segments.
Videos can be very diverse; their content ranges from:
- people having arguments, dancing or singing (Bigg Boss, with Glance as the smart lock screen partner)
- people just having conversations, as in an interview (creator-led shows, exclusive content for Roposo)
- a fashion show where a supermodel simply walks a runway (Lakme Fashion Week, with Glance as a partner).
As a solution to this, we bifurcated videos based on the density of speech in them and built separate pipelines for speech-heavy and visual-heavy videos.
For a speech-heavy video, we use transcription to select the most important segments, while for a visual-heavy video, we break the video into shots and generate visual descriptions of each shot to select the most important segments.
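The routing step can be sketched as a simple word-rate check on the transcript. This is a minimal illustration, not our production logic; the function name and the 60 words-per-minute threshold are hypothetical.

```python
# Sketch: route a video to the speech-heavy or visual-heavy pipeline
# based on its transcript word rate. Threshold is an assumed value.
def classify_video(transcript_words: int, duration_minutes: float,
                   wpm_threshold: float = 60.0) -> str:
    """Label a video 'speech-heavy' or 'visual-heavy' by words per minute."""
    if duration_minutes <= 0:
        raise ValueError("duration must be positive")
    wpm = transcript_words / duration_minutes
    return "speech-heavy" if wpm >= wpm_threshold else "visual-heavy"

print(classify_video(2400, 30))  # interview-style video -> speech-heavy
print(classify_video(300, 30))   # runway-show-style video -> visual-heavy
```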
We are leveraging the following for our use case:
- Faster Whisper (using CTranslate2) for audio transcription.
- BLIP and GIT for image captioning.
- ViLT for visual question answering.
- Color histograms for shot boundary detection.
- GPT-3.5-turbo for text highlights and summarisation.
- Sentence-BERT embeddings and cosine similarity for retrieval.
- All of the above models optimised to run on a single T4 GPU, using a custom dataloader for parallel processing.
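Shot boundary detection with colour histograms can be sketched as follows: compute a per-frame colour histogram and declare a cut wherever the distance between consecutive histograms spikes. This is a minimal NumPy illustration on synthetic frames; the L1 distance metric and the 0.5 threshold are assumptions, not our production settings.

```python
import numpy as np

def color_histogram(frame: np.ndarray, bins: int = 8) -> np.ndarray:
    """Per-channel histogram of an HxWx3 uint8 frame, normalised to sum to 1."""
    hists = [np.histogram(frame[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    h = np.concatenate(hists).astype(float)
    return h / h.sum()

def shot_boundaries(frames, threshold: float = 0.5):
    """Indices i where frame i starts a new shot (histogram-distance spike)."""
    hists = [color_histogram(f) for f in frames]
    cuts = []
    for i in range(1, len(hists)):
        # L1 distance between consecutive normalised histograms, in [0, 2]
        if np.abs(hists[i] - hists[i - 1]).sum() > threshold:
            cuts.append(i)
    return cuts

# Synthetic demo: 5 dark frames followed by 5 bright frames -> one cut at index 5
dark = np.zeros((16, 16, 3), dtype=np.uint8)
bright = np.full((16, 16, 3), 255, dtype=np.uint8)
frames = [dark] * 5 + [bright] * 5
print(shot_boundaries(frames))  # [5]
```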
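The retrieval step ranks candidate segments by cosine similarity between embeddings. In our pipeline the embeddings come from Sentence-BERT; the sketch below uses toy 2-d vectors in their place, and the function names are hypothetical.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k_segments(query_emb: np.ndarray, segment_embs, k: int = 3):
    """Return indices of the k segments most similar to the query embedding."""
    scores = [cosine_similarity(query_emb, e) for e in segment_embs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

# Toy embeddings standing in for Sentence-BERT output
query = np.array([1.0, 0.0])
segs = [np.array([0.9, 0.1]), np.array([0.0, 1.0]), np.array([1.0, 0.2])]
print(top_k_segments(query, segs, k=2))  # [0, 2]
```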
To enhance the viewer experience, we are post-processing our short videos with AI-generated music, custom transitions between shots, animations, stickers, subtitles and a lot more.
The end-to-end processing takes 5-10 minutes for a 30-minute video.
- Increased content liquidity on our platform by 300%.
- Increased average play duration (APD) on short videos by 44%.
- Increased viewership for original content by 23%.
- Introduction of multi-modality for describing segments.
- Generalization across more diverse videos.
Data Scientists and ML Engineers