
SpeechGPT2
A fully end-to-end, human-like speech dialogue model
- Perceive and express emotions
- Provide voice responses in various styles, such as rap, drama, robotic, funny, and whispered
- Ultra-low bit rate speech codec (750 bps)
- Multiple Input Multiple Output Language Model (MIMO-LM)
- Generating one second of speech requires 25 autoregressive decoding steps
- Pre-trained on over 100,000 hours of academic and in-the-wild speech data
- High-quality multi-turn spoken dialogue data
Product Details
SpeechGPT2 is an end-to-end speech dialogue language model developed by the School of Computer Science at Fudan University. It can perceive and express emotions, and it provides appropriate speech responses in multiple styles based on context and human instructions. The model uses an ultra-low bit rate speech codec (750 bps) that models both semantic and acoustic information, and it is initialized with a Multiple Input Multiple Output Language Model (MIMO-LM). At present, SpeechGPT2 is still a turn-based dialogue system; a full-duplex real-time version is under development and has shown some promising progress. Because of limited computing and data resources, SpeechGPT2 still has shortcomings in noise robustness for speech understanding and in sound-quality stability for speech generation. The team plans to open-source the technical report, code, and model weights in the future.
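
The two figures quoted above, a 750 bps codec and 25 autoregressive decoding steps per second of speech, can be combined into a rough per-step budget. The sketch below is back-of-the-envelope arithmetic only. It assumes that each decoding step emits exactly one codec frame, which the description does not state, and the function names are hypothetical, not part of any SpeechGPT2 code.

```python
# Rough arithmetic from the two figures quoted in the description.
# Assumption (not confirmed by the source): one decoding step emits one codec frame.
CODEC_BITRATE_BPS = 750        # codec bit rate stated in the description
DECODE_STEPS_PER_SECOND = 25   # autoregressive steps per second of speech


def bits_per_step(bitrate_bps=CODEC_BITRATE_BPS,
                  steps_per_second=DECODE_STEPS_PER_SECOND):
    """Bits of codec payload produced per decoding step."""
    return bitrate_bps / steps_per_second


def decoding_steps(duration_seconds, steps_per_second=DECODE_STEPS_PER_SECOND):
    """Number of autoregressive steps needed for a clip of the given length."""
    return round(duration_seconds * steps_per_second)


def payload_bytes(duration_seconds, bitrate_bps=CODEC_BITRATE_BPS):
    """Size in bytes of the codec stream for a clip of the given length."""
    return duration_seconds * bitrate_bps / 8


if __name__ == "__main__":
    print(bits_per_step())     # 30.0 bits per step
    print(decoding_steps(10))  # 250 steps for a 10-second reply
    print(payload_bytes(10))   # 937.5 bytes for a 10-second reply
```

Under that assumption, each step carries 30 bits, and a 10-second reply takes 250 decoding steps and under 1 KB of codec data. How those 30 bits are divided among codebooks or MIMO-LM output streams is not given in the description, so the sketch does not model it.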