New AI Models Enhance Voice Interactions with Large Language Models

Researchers from Alibaba have unveiled FunAudioLLM, a framework designed to support natural voice interactions between humans and large language models (LLMs). The system comprises two key components: SenseVoice for voice understanding and CosyVoice for voice generation.

SenseVoice, available in Small and Large variants, excels in multilingual speech recognition, emotion recognition, and audio event detection. SenseVoice-Small offers low-latency ASR for five languages, while SenseVoice-Large supports high-precision ASR for over 50 languages.

CosyVoice, by contrast, handles voice generation, covering multilingual synthesis, zero-shot in-context learning, cross-lingual voice cloning, and instruction following. It supports five languages: Chinese, English, Japanese, Cantonese, and Korean.

The integration of these models with LLMs enables various applications, including speech-to-speech translation, emotional voice chat, interactive podcasts, and expressive audiobook narration.
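As a rough illustration of how such an integration could be wired together for speech-to-speech translation, here is a minimal sketch. It assumes a three-stage chain (recognise, translate with an LLM, synthesise) and does not use the real SenseVoice or CosyVoice APIs: every name in it (`Transcript`, `speech_to_speech_translate`, and the `fake_*` stand-ins) is hypothetical and exists only to show the data flow.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Transcript:
    """Hypothetical output of the voice-understanding stage."""
    text: str       # recognised words
    language: str   # detected source language code
    emotion: str    # detected emotion label


def speech_to_speech_translate(
    audio: bytes,
    target_language: str,
    recognize: Callable[[bytes], Transcript],
    translate: Callable[[str, str, str], str],
    synthesize: Callable[[str, str, bytes], bytes],
) -> bytes:
    """Chain the three stages: understand, translate, generate."""
    # Stage 1: voice understanding (the role SenseVoice plays).
    transcript = recognize(audio)
    # Stage 2: an LLM translates the recognised text.
    translated = translate(transcript.text, transcript.language, target_language)
    # Stage 3: voice generation (the role CosyVoice plays). The original
    # audio is passed along as a reference for voice cloning.
    return synthesize(translated, target_language, audio)


# Stand-in stages so the sketch runs without any model. These are
# assumptions for illustration, not real model calls.
def fake_recognize(audio: bytes) -> Transcript:
    return Transcript(text=audio.decode("utf-8"), language="en", emotion="neutral")


def fake_translate(text: str, source: str, target: str) -> str:
    return f"[{source}->{target}] {text}"


def fake_synthesize(text: str, language: str, reference_audio: bytes) -> bytes:
    return text.encode("utf-8")


output = speech_to_speech_translate(
    b"hello", "zh", fake_recognize, fake_translate, fake_synthesize
)
print(output.decode("utf-8"))
```

Passing the stages in as callables keeps the sketch independent of any particular model release; in a real system each stand-in would be replaced by a call into the corresponding model.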

Experimental results show that SenseVoice outperforms existing models such as Whisper on many benchmarks. For instance, SenseVoice-Small is more than 5 times faster than Whisper-small and more than 15 times faster than Whisper-large on speech recognition tasks.

CosyVoice demonstrates high-quality speech synthesis, with generated speech matching or exceeding the original utterances in content consistency and speaker similarity.

The researchers have open-sourced the SenseVoice and CosyVoice models on ModelScope and Hugging Face, along with training, inference, and fine-tuning code on GitHub.

While the system shows promising results, the researchers acknowledge some limitations. These include lower performance for under-resourced languages, lack of streaming transcription capabilities, and the need for improvement in expressive emotional changes while maintaining original voice timbre.

Alibaba previously released an image generator called Tongyi Wanxiang, positioned as a rival to Midjourney and DALL-E. FunAudioLLM extends the company's generative models into voice.

