VALL-E is a neural codec language model using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather. VALL-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as a prompt. We also extend VALL-E and train a multi-lingual conditional codec language model. VALL-E X can generate high-quality speech in the target language via just one speech utterance in the source language as a prompt while preserving the unseen speaker’s voice, emotion, and acoustic environment.
数据统计
数据评估
本站AI工具库提供的VALL-E都来源于网络,不保证外部链接的准确性和完整性,同时,对于该外部链接的指向,不由AI工具库实际控制,在2025年2月14日 下午9:37收录时,该网页上的内容,都属于合规合法,后期网页的内容如出现违规,可以直接联系网站管理员进行删除,AI工具库不承担任何责任。