[GoodChats] The Evolving Landscape of AI Voice Technologies: Insights from the Open Source Community

Sep 2024

9 Mon

10 Tue

11 Wed

12 Thu

13 Fri 07:00 PM – 08:00 PM IST

14 Sat

15 Sun

Oct 2024

21 Mon

22 Tue

23 Wed

24 Thu

25 Fri 07:00 PM – 08:00 PM IST

26 Sat

27 Sun

All submissions

Previous Next

[GoodChats] The Evolving Landscape of AI Voice Technologies: Insights from the Open Source Community

Submitted Nov 27, 2024

This article has been inspired by discussions in the open source channel of HasGeek’s Whatsapp Community. The conversation provides a fascinating glimpse into the current state and future directions of AI voice technologies, including podcast generation and voice assistants. It showcases the collaborative nature of the open source community, where developers and enthusiasts share their experiences, compare different technologies, and discuss potential improvements and applications.

Voice Conversation Technologies

Play.ai

A notable player in the voice conversation space is Play.ai. Their approach involves recording audio in chunks and transferring it to the server, where Voice Activity Detection (VAD) is likely used to process and generate audio responses. The system employs efficient compression techniques, using the gz format to keep file sizes small - just a few minutes of audio can be compressed to less than 50kb.

ElevenLabs Conversational AI

ElevenLabs offers a conversational AI solution that operates on websockets. It uses PCM16 @ 16KHz audio, which is continuously buffered, chunked, and then base64 encoded before being sent to the server. The server communicates VAD scores, voice responses (base64 encoded, PCM16), and transcriptions over the websocket.

OpenAI’s GPT Real-time Voice

OpenAI’s offering in this space is their GPT real-time voice technology. While specific details weren’t provided in the discussion, it’s mentioned as a point of comparison for other voice conversation technologies.

Cost Comparison

When comparing the costs of these technologies:

ElevenLabs: $0.1/min
Play.ai: $0.18/min
OpenAI’s GPT Real-time Voice: $0.3/min

It’s worth noting that OpenAI’s TTS API, which is more comparable to Play.ai and other TTS offerings, is significantly cheaper at $15 per 1 million characters. This translates to approximately $0.012 per minute, assuming 800 characters are spoken per minute.

Alternative Approaches

For those looking to implement their own pipeline, a combination of Whisper for speech-to-text, GPT-4 for language processing, and Google TTS for text-to-speech can bring costs down to around $0.015/min. While this approach sacrifices real-time capability, it allows for longer conversations at a fraction of the cost.

Podcast Generation

Creating natural-sounding podcasts with multiple AI-generated voices remains a challenge. However, some promising developments are on the horizon:

Metaskepsis is an open-source project that shows potential in this area.
Google’s model used in NotebookLM has been praised for its natural-sounding voices, although it’s currently limited to two default voices.
Some users have experimented with prompting ChatGPT to add filler words like “um” and “ah” to scripts, then using specific voices from ElevenLabs to achieve a more natural sound.

Platforms and Integration

Several platforms are available for those looking to implement or integrate voice conversation technologies:

Khoj is exploring the integration of various voice technologies.
Livekit and Vocode can be used to abstract the real-time components using WebRTC and/or Twilio.
Replicate is being used by some to serve large language models like Llama 3 70B and 405B.

Future Directions

As these technologies continue to evolve, we can expect:

Improved natural-sounding AI voices for podcast generation
More cost-effective real-time voice conversation solutions
Better integration of voice technologies in various applications
Advancements in processing parallel requests to further reduce costs

The field of AI-driven voice technologies is rapidly advancing, promising exciting developments in podcast generation, voice assistants, and beyond. As costs decrease and quality improves, we can anticipate these technologies becoming increasingly prevalent in our daily lives and various industries.