Speech-to-speech large language models (SLLMs) are attracting increasing attention. Built by extending text-based large language models (LLMs), SLLMs often exhibit degraded knowledge and reasoning capabilities. We hypothesize that this stems from current SLLM training paradigms failing to bridge the acoustic-semantic gap in the feature representation space. To address this, we propose EchoX, which adheres to semantic representations and dynamically generates speech training targets. This approach integrates acoustic and semantic learning, enabling EchoX to preserve strong reasoning ability as a speech LLM. Experiments demonstrate that EchoX, trained on only about ten thousand hours of data, achieves advanced performance on multiple knowledge question-answering benchmarks.
Current speech-to-speech language models face a fundamental challenge: while text LLMs excel at semantic alignment (e.g., treating "Hello" and "Hi" as near-synonyms), speech LLMs sacrifice semantic understanding because their training is dominated by acoustic modeling. EchoX addresses this acoustic-semantic gap through echo training, which preserves both reasoning capability and speech generation quality.
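To make the idea concrete, below is a minimal PyTorch-style sketch of an echo-training step under stated assumptions: the module names (`sllm`, `echo_decoder`, `speech_head`), tensor shapes, and interfaces are hypothetical illustrations, not EchoX's actual implementation.

```python
# A minimal, hypothetical sketch of echo training (PyTorch-style).
# All module names, shapes, and interfaces are illustrative assumptions,
# not EchoX's actual API.
import torch
import torch.nn.functional as F

def echo_training_step(sllm, echo_decoder, speech_head, batch):
    """One hypothetical echo-training step."""
    # 1) Run the speech LLM and keep its *semantic* hidden states,
    #    rather than supervising directly in acoustic space.
    hidden = sllm(batch["input_speech"])      # (B, T, D) semantic states

    # 2) Dynamically generate speech-token targets from those semantic
    #    states with a pretrained semantic-to-codec ("echo") decoder.
    with torch.no_grad():
        target_units = echo_decoder(hidden)   # (B, T) discrete unit ids

    # 3) Train the speech head against the generated targets, so the
    #    acoustic objective stays anchored to the semantic representation.
    logits = speech_head(hidden)              # (B, T, V) unit logits
    return F.cross_entropy(logits.transpose(1, 2), target_units)
```

Because the targets are derived from the model's own semantic states rather than fixed acoustic labels, the speech objective does not pull the representation away from the text LLM's semantic space, which is the intuition behind preserving reasoning ability.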
Key Contributions:
- Echo training: speech training targets are generated dynamically from semantic representations, mitigating the acoustic-semantic gap.
- Integrated acoustic and semantic learning that preserves the reasoning ability of the underlying text LLM.
- Advanced performance on multiple knowledge question-answering benchmarks with only about ten thousand hours of training data.
Demo:
- Daily Conversation
- Reasoning
- Story Telling
- Question Answering
If you use EchoX in your research or projects, please cite our paper:
@misc{zhang2025echoxmitigatingacousticsemanticgap,
  title={EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs},
  author={Yuhao Zhang and Yuhao Du and Zhanchen Dai and Xiangnan Ma and Kaiqi Kou and Benyou Wang and Haizhou Li},
  year={2025},
  eprint={2509.09174},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.09174},
}