Mayank Kumar
This article was inspired by discussions in the open-source channel of HasGeek’s WhatsApp Community. The conversation offers a glimpse into the current state and future direction of AI voice technologies, including podcast generation and voice assistants. It also shows the collaborative nature of the open-source community, where developers and enthusiasts share their experiences, compare technologies, and discuss potential improvements and applications.
A notable player in the voice conversation space is Play.ai. Their approach involves recording audio in chunks and transferring it to the server, where Voice Activity Detection (VAD) is likely used to process the audio and generate responses. The system compresses the audio in the gzip (.gz) format to keep file sizes small: a few minutes of audio can be compressed to less than 50 KB.
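As a rough illustration of the chunk-and-compress idea, here is a minimal sketch using Python's standard gzip module. It is not Play.ai's actual code; the function names and the one-second silent test chunk are assumptions made for the example, and real speech will compress far less than silence does.

```python
import gzip


def compress_chunk(chunk: bytes) -> bytes:
    """Gzip-compress one recorded audio chunk before upload."""
    return gzip.compress(chunk)


def decompress_chunk(payload: bytes) -> bytes:
    """Restore the original audio bytes on the receiving side."""
    return gzip.decompress(payload)


# Assumed test input: one second of 16-bit mono silence at 16 kHz
# (16,000 samples x 2 bytes per sample). Silence is highly repetitive,
# so it compresses much better than real speech would.
silence = bytes(16_000 * 2)
payload = compress_chunk(silence)

print(f"raw: {len(silence)} bytes, gzipped: {len(payload)} bytes")
```

The round trip is lossless, so the server can recover exactly the bytes that were recorded.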
ElevenLabs offers a conversational AI solution that operates over WebSockets. It uses 16-bit PCM (PCM16) audio at 16 kHz, which is continuously buffered, chunked, and then base64 encoded before being sent to the server. The server sends VAD scores, voice responses (base64-encoded PCM16), and transcriptions back over the WebSocket.
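The buffer, chunk, and base64-encode flow described above can be sketched as follows. This is an illustration only, not the ElevenLabs client: the 250 ms chunk duration and the "audio_chunk" JSON field name are hypothetical, and the WebSocket transport itself is left out.

```python
import base64
import json
from typing import Iterator

SAMPLE_RATE = 16_000   # samples per second (16 kHz)
BYTES_PER_SAMPLE = 2   # PCM16 uses 2 bytes per sample
CHUNK_MS = 250         # hypothetical chunk duration


def chunk_pcm16(pcm: bytes, chunk_ms: int = CHUNK_MS) -> Iterator[bytes]:
    """Split a mono PCM16 buffer into fixed-duration chunks."""
    chunk_bytes = SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


def encode_message(chunk: bytes) -> str:
    """Wrap one chunk as a JSON text frame; the field name is hypothetical."""
    return json.dumps({"audio_chunk": base64.b64encode(chunk).decode("ascii")})


def decode_message(message: str) -> bytes:
    """Inverse of encode_message, as a receiver would apply it."""
    return base64.b64decode(json.loads(message)["audio_chunk"])


# One second of placeholder audio bytes (32,000 bytes) yields four 250 ms chunks.
pcm = bytes(range(256)) * 125
messages = [encode_message(c) for c in chunk_pcm16(pcm)]
```

Base64 inflates each payload by about a third, which is the price of sending binary audio inside text frames.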
OpenAI’s offering in this space is their GPT real-time voice technology. Specific details weren’t provided in the discussion, but it was mentioned as a point of comparison for other voice conversation technologies.
When comparing the costs of these technologies:
It’s worth noting that OpenAI’s TTS API, which is more comparable to Play.ai and other TTS offerings, is significantly cheaper at $15 per 1 million characters. This translates to approximately $0.012 per minute, assuming 800 characters are spoken per minute.
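The per-minute figure follows directly from the per-character price. A small sketch of that arithmetic, where the function name is illustrative and the 800 characters-per-minute speaking rate is the article's own assumption:

```python
def tts_cost_per_minute(price_per_million_chars: float,
                        chars_per_minute: float) -> float:
    """Convert a per-million-character TTS price into a per-minute cost."""
    return price_per_million_chars * chars_per_minute / 1_000_000


# $15 per 1 million characters at an assumed 800 characters spoken per minute.
cost = tts_cost_per_minute(15.0, 800)
print(f"${cost:.3f} per minute")
```

Changing the assumed speaking rate scales the result linearly, so a faster speaker at 1,000 characters per minute would cost $0.015 per minute at the same price.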
For those looking to implement their own pipeline, a combination of Whisper for speech-to-text, GPT-4 for language processing, and Google TTS for text-to-speech can bring costs down to around $0.015/min. While this approach sacrifices real-time capability, it allows for longer conversations at a fraction of the cost.
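The shape of such a pipeline can be sketched as three stages run in sequence. The class and field names below are hypothetical, and the stages are passed in as plain callables so that no particular vendor API is assumed; in practice they would wrap Whisper, GPT-4, and Google TTS calls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """Hypothetical turn-based pipeline: speech in, speech out."""
    transcribe: Callable[[bytes], str]   # speech-to-text stage, e.g. Whisper
    respond: Callable[[str], str]        # language model stage, e.g. GPT-4
    synthesize: Callable[[str], bytes]   # text-to-speech stage, e.g. Google TTS

    def run_turn(self, audio_in: bytes) -> bytes:
        """Run one conversational turn through all three stages in order."""
        text = self.transcribe(audio_in)
        reply = self.respond(text)
        return self.synthesize(reply)


# Stand-in stages so the sketch runs without any external service.
pipeline = VoicePipeline(
    transcribe=lambda audio: audio.decode("utf-8"),
    respond=lambda text: text.upper(),
    synthesize=lambda reply: reply.encode("utf-8"),
)
audio_out = pipeline.run_turn(b"hello")
```

Because each stage waits for the previous one to finish, latency adds up across the three calls, which is why this design gives up real-time capability.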
Creating natural-sounding podcasts with multiple AI-generated voices remains a challenge. However, some promising developments are on the horizon:
Several platforms are available for those looking to implement or integrate voice conversation technologies:
As these technologies continue to evolve, we can expect:
The field of AI-driven voice technologies is rapidly advancing, promising exciting developments in podcast generation, voice assistants, and beyond. As costs decrease and quality improves, we can anticipate these technologies becoming increasingly prevalent in our daily lives and various industries.