The Fifth Elephant 2023 Winter edition will cover topics on the research, engineering, and business aspects of AI, exploring the practical implementation and economic implications of these systems.
In 2020, OpenAI released a Large Language Model (LLM) called GPT-3, which has 175 billion parameters. With the minimal, intuitive user interface released alongside it, GPT-3 caught the imagination and attention of AI communities and researchers all over the world.
One by one, domain use cases such as coding co-pilots, creative AI, and other downstream tasks were shown to be fast-tracked by generative AI models and LLMs. As a result, there is wide-ranging interest in large language models and the applications built around them across domains and use cases in the AI space. Experiments to find optimal hyperparameters, and to deal with underfitting and overfitting, are carried out regularly; more barriers are broken down every day.
The winter edition of The Fifth Elephant will showcase talks, discussions and demos across generative and multimodal AI, and other classic AI/ML/DL applications on the below themes.
Share approaches and case studies covering the following use cases:
- Products and platforms using LLMs, generative AI, ML, and deep learning techniques, and business models built around AI engineering.
- Conversational AI and search, automatic speech recognition, healthcare, e-commerce, fintech, media and OTT, and other verticals.
- Multilingual needs in India's digital products and platforms: model training, fine-tuning, RLHF, RAG, quantization techniques, dataset curation and augmentation, pipeline challenges, evaluation metrics, future roadmaps, and applications such as multilingual voice bots using ASR/STT and text-to-speech for accessibility.
Share case studies and experiential talks on data science operations: scaling and fine-tuning challenges, lessons learned, and best practices for incorporating ethics, safety, and bias mitigation.
Show demos of features/products that leverage AI and LLM-based APIs and models, whether from the creative AI and generative AI space or from other verticals with relevant use cases.
The December edition will be held in-person. Attendance is open to The Fifth Elephant members only. Pick a membership to attend the in-person conference, and to support The Fifth Elephant’s community activities.
- AI/ML/Data Science Ops engineers who want to learn about state-of-the-art tools and techniques, especially from domains such as healthcare, e-commerce, automobile, agri-tech and industrial verticals.
- Data scientists who want a deeper understanding of model deployment/governance.
- Architects who are building ML workflows that scale.
- Tech founders and CTOs who are building products and platforms that leverage AI, ML and LLMs.
- Product managers who want to learn about the process of building AI/ML products.
- Directors, VPs and senior tech leadership who are building AI/ML teams.
Sponsorship slots are open for:
- Infrastructure (GPU, CPU and cloud providers) and developer productivity tool makers who want to evangelise their offering to developers and decision-makers.
- Companies who want to do tech branding among AI and ML developers.
- Venture Capital (VC) firms and investors who want to scan the landscape of innovations and innovators in AI, and source leads for investment in the AI and ML space.
If you are interested in sponsoring The Fifth Elephant, email firstname.lastname@example.org.
Video Highlights Generation
Roposo is a live video platform with ~200 million end users and ~1,000 live videos uploaded every day, each lasting 15 minutes to 3 hours. To increase engagement and improve user experience, we are building a central video feed of assets that can be easily consumed. This requires converting our event and creator-led videos into shorter formats such as trailers and short clips, for which we process the videos with AI to extract the most important segments.
Videos can be very diverse, with content ranging from:
- people having arguments, dancing or singing (Bigg Boss, with Glance as the smart lock screen partner)
- people simply having conversations, as in an interview (creator-led shows, exclusive content for Roposo)
- a fashion show where a supermodel just walks a runway (Lakme Fashion Week, with Glance as a partner).
To handle this, we bifurcate videos based on the density of speech in them and built separate solutions for speech-heavy and visual-heavy videos.
For a speech-heavy video, we use transcription to select the most important segments, while for a visual-heavy video, we break the video into shots and generate visual descriptions of the shots to select the most important ones.
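The routing step above can be sketched as a simple heuristic: compare the fraction of the video covered by transcribed speech against a cutoff. This is an illustrative sketch, not the production logic; the `route_video` helper and the 0.5 threshold are assumptions.

```python
def speech_density(segments, video_duration):
    """Fraction of the video covered by transcribed speech.

    `segments` is a list of (start, end) times in seconds, as a
    transcription model such as Whisper would return them.
    """
    spoken = sum(end - start for start, end in segments)
    return spoken / video_duration

def route_video(segments, video_duration, threshold=0.5):
    """Route a video to the speech-heavy or visual-heavy pipeline."""
    if speech_density(segments, video_duration) >= threshold:
        return "speech-heavy"   # rank segments by transcript importance
    return "visual-heavy"       # split into shots, caption, then rank
```

For example, a 100-second video with speech in (0, 40) and (50, 90) has a speech density of 0.8 and would go down the speech-heavy path.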
We are leveraging the following for our use-case:
- Faster Whisper using CTranslate2 for audio transcription.
- BLIP and GIT for image captioning.
- ViLT for visual question answering.
- Color histograms for shot boundary detection.
- gpt-3.5-turbo for text highlights and summarisation.
- Sentence-BERT embeddings and cosine similarity for retrieval.
- All of the above models optimised to run on a single T4 GPU, using a custom dataloader for parallel processing.
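Histogram-based shot boundary detection can be illustrated with plain NumPy (a real pipeline would read frames with a video library such as OpenCV; the bin count and 0.5 threshold here are illustrative assumptions): compute a normalised colour histogram per frame and flag a boundary wherever the histogram intersection with the previous frame drops below the threshold.

```python
import numpy as np

def frame_histogram(frame, bins=8):
    """Normalised per-channel colour histogram of an RGB frame (H, W, 3)."""
    hists = [np.histogram(frame[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    hist = np.concatenate(hists).astype(float)
    return hist / hist.sum()

def shot_boundaries(frames, threshold=0.5):
    """Indices where a new shot starts, based on histogram intersection.

    Intersection of two normalised histograms is 1.0 for identical
    colour distributions and approaches 0.0 for disjoint ones.
    """
    boundaries = []
    prev = frame_histogram(frames[0])
    for i, frame in enumerate(frames[1:], start=1):
        cur = frame_histogram(frame)
        if np.minimum(prev, cur).sum() < threshold:
            boundaries.append(i)
        prev = cur
    return boundaries
```

A run of dark frames followed by a run of bright frames would yield a single boundary at the first bright frame.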
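Retrieval by embedding similarity can likewise be sketched with plain NumPy; the vectors below stand in for Sentence-BERT segment embeddings (the toy 2-D vectors and `top_k` helper are illustrative assumptions).

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k(query, segment_embeddings, k=2):
    """Indices of the k segments most similar to the query embedding."""
    scores = [cosine_similarity(query, e) for e in segment_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```

Given a query embedding for "most important moment" and one embedding per candidate segment, the highest-scoring indices are the segments kept for the highlight reel.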
To enhance the viewer experience, we are post-processing our short videos with AI-generated music, custom transitions between shots, animations, stickers, subtitles and a lot more.
End-to-end processing takes 5-10 minutes for a 30-minute video.
- Increased content liquidity on our platform by 300%.
- Increased average play duration (APD) on short videos by 44%.
- Increased viewership for original content by 23%.
- Introduction of multi-modality for describing segments.
- Generalization across more diverse videos.
Data Scientists and ML Engineers