SpeechGPT2

A fully end-to-end, human-like speech dialogue model

Perceives and expresses emotions
Provides voice responses in various styles, such as rap, drama, robot, funny, and whisper
Uses an ultra-low bit rate speech codec (750 bps)
Multi-input multi-output language model (MIMO-LM)
Generating one second of speech requires 25 autoregressive decoding steps
Pre-trained on over 100,000 hours of academic and in-the-wild speech data
High-quality multi-turn spoken dialogue data

Product Details

SpeechGPT2 is an end-to-end speech dialogue language model developed by the School of Computer Science at Fudan University. It can perceive and express emotions, and it provides appropriate speech responses in multiple styles based on context and human instructions. The model uses an ultra-low bit rate speech codec (750 bps) to model both semantic and acoustic information, and it is initialized through a multi-input multi-output language model (MIMO-LM). At present, SpeechGPT2 is still a turn-based dialogue system; a full-duplex real-time version is under development and has shown some promising progress. Because of limited computing and data resources, SpeechGPT2 still has shortcomings in noise robustness for speech understanding and in sound quality stability for speech generation. The team plans to open-source the technical report, code, and model weights in the future.
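
To put the two quoted figures side by side (a 750 bps codec and 25 autoregressive decoding steps per second of speech), the short Python sketch below works out the arithmetic. It is only an illustration, not SpeechGPT2 code: the function names are hypothetical, and it assumes the bit rate is spread evenly across decoding steps. How those bits are actually organized inside the codec is not stated here.

```python
import math

# Figures quoted in this description
CODEC_BITRATE_BPS = 750  # codec bit rate, in bits per second
STEPS_PER_SECOND = 25    # autoregressive decoding steps per second of speech


def bits_per_step(bitrate_bps: float = CODEC_BITRATE_BPS,
                  steps_per_second: float = STEPS_PER_SECOND) -> float:
    """Average codec bits covered by one decoding step.

    Assumption: bits are distributed evenly over the steps.
    """
    return bitrate_bps / steps_per_second


def decoding_steps(duration_s: float,
                   steps_per_second: float = STEPS_PER_SECOND) -> int:
    """Decoding steps needed for a response of the given length (rounded up)."""
    return math.ceil(duration_s * steps_per_second)


def payload_bytes(duration_s: float,
                  bitrate_bps: float = CODEC_BITRATE_BPS) -> float:
    """Size in bytes of the codec stream for a response of the given length."""
    return duration_s * bitrate_bps / 8


if __name__ == "__main__":
    print(bits_per_step())      # 30.0 bits per step
    print(decoding_steps(10))   # 250 steps for a 10-second response
    print(payload_bytes(10))    # 937.5 bytes for a 10-second response
```

Under these assumptions, each decoding step accounts for 30 bits of codec output, and a 10-second reply takes 250 steps and occupies under 1 KB as a codec stream.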
