EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs

Yuhao Zhang   Yuhao Du   Zhanchen Dai   Xiangnan Ma   Kaiqi Kou   Benyou Wang*   Haizhou Li
The Chinese University of Hong Kong, Shenzhen
https://github.com/FreedomIntelligence/EchoX
* Corresponding author.

Abstract

Speech-to-speech large language models (SLLMs) are attracting increasing attention. Built upon text-based large language models (LLMs), SLLMs often exhibit degraded knowledge and reasoning capabilities. We hypothesize that this is because current training paradigms for SLLMs fail to bridge the acoustic-semantic gap in the feature representation space. To address this, we propose EchoX, which adheres to semantic representations and dynamically generates speech training targets. This approach integrates both acoustic and semantic learning, enabling EchoX to preserve strong reasoning ability as a speech LLM. Experimental results demonstrate that EchoX, with about six thousand hours of training data, achieves advanced performance on multiple knowledge-based question-answering benchmarks.

Overview

Current speech-to-speech language models face a fundamental challenge: while text-based LLMs excel at semantic alignment (mapping paraphrases such as "Hello" and "Hi" to nearby representations), speech LLMs suffer from weakened semantic understanding because their training is dominated by acoustic modeling. EchoX addresses this acoustic-semantic gap through a novel echo training approach that preserves both reasoning capabilities and speech generation quality.
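To make the idea of semantic alignment concrete, here is a minimal toy sketch (using synthetic NumPy vectors, not EchoX's actual representations) of how cosine similarity can probe whether two representations are aligned:

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two representation vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic stand-ins for hidden states of "Hello" and "Hi".
# In a text LLM, paraphrases tend to lie close together (semantic alignment);
# representations tied to acoustic speech tokens are typically much farther
# apart, which is the acoustic-semantic gap EchoX targets.
rng = np.random.default_rng(0)
text_hello = rng.normal(size=768)
text_hi = text_hello + 0.1 * rng.normal(size=768)   # near-duplicate direction
speech_hello = rng.normal(size=768)                  # unrelated direction

print("text vs text:  ", cosine_similarity(text_hello, text_hi))      # close to 1
print("text vs speech:", cosine_similarity(text_hello, speech_hello)) # near 0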

Key Contributions:

  • Problem Identification: Reveals that the reasoning and knowledge abilities of current speech LLMs degrade due to conflicting acoustic learning demands.
  • Echo Training Solution: Integrates both acoustic and semantic learning by dynamically generating speech tokens from semantic representations, mitigating the degradation problem while preserving the reasoning abilities of LLMs.
  • Efficient Performance: Achieves state-of-the-art performance on knowledge-based question-answering tasks while requiring only 6k hours of training data.

Audio Demos

Daily Conversation

Speech Input
I'm just planning to go on a trip to London.
EchoX Response
That sounds like a great idea! London is a fantastic city with so much to see and do. What are you most looking forward to?

Reasoning

Speech Input
Sarah has a bag with 5 red marbles and 3 blue marbles. She gives 2 red marbles to her friend Tom. Then her friend Lisa gives Sarah 1 blue marble. How many marbles does Sarah have now, and what colors are they?
EchoX Response
To figure out how many marbles Sarah has now, we need to add the marbles she already had to the ones she got from Tom and Lisa. Sarah started with five red marbles and three blue marbles. She gave two red marbles to Tom, so she has five minus two, which is three red marbles left. Then Lisa gave her one blue marble, so she now has three plus one, which is four blue marbles. So, Sarah has a total of three red marbles and four blue marbles, making a total of seven marbles. The colors of the marbles are red and blue.

Story Telling

Speech Input
Tell me the Cinderella story.
EchoX Response
The Cinderella story is a classic fairy tale about a young woman named Cinderella who faces a series of challenges and ultimately finds true love with a prince. Here's how the story goes: Once upon a time, there was a beautiful and kind girl named Cinderella. She lived with her wicked stepmother and two stepsisters, who treated her poorly and made her do all the household chores. Cinderella's father had passed away, and her stepmother and stepsisters had taken over the family's estate. Despite her hardships, Cinderella remained a sweet and gentle soul, always looking out for others and never losing hope. One day, the king invited all the young women in the kingdom to a grand ball, where the prince would choose his bride. Cinderella's stepsisters were excited to attend, but Cinderella was not allowed to go. Her stepmother and stepsisters were cruel to her, making her stay home and do all the cleaning and cooking. However, Cinderella's fairy godmother appeared and helped her get ready for the ball. The fairy godmother turned a pumpkin into a beautiful carriage, mice into horses, and a rat into a coachman. She also gave Cinderella a beautiful gown and off she went to the ball. At the ball, Cinderella danced with the prince and they fell deeply in love. But when the clock struck midnight, Cinderella had to leave in a hurry, leaving behind one of her glass slippers. The prince searched everywhere for the girl whose foot fit the slipper, but none of the other women could fit it. Finally, he came to Cinderella's house and tried the slipper on her foot. It fit perfectly, and the prince was overjoyed to have found his true love. They got married and lived happily ever after. The story of Cinderella has been retold and adapted in many ways over the years, but the core message remains the same: kindness, hard work, and true love can overcome even the greatest challenges.

Question Answering

Speech Input
Could you give me a quick rundown of that attention is all you need paper?
EchoX Response
The Attention is All You Need paper is a research paper published in two thousand seventeen by Vaswani and others. It introduced the Transformer model, which is a type of neural network architecture that has become very popular in natural language processing tasks. The paper proposed a new way to process sequential data, like text, using self-attention mechanisms instead of traditional recurrent neural networks. The Transformer model uses self-attention to weigh the importance of different parts of the input sequence, allowing it to focus on the most relevant information. This approach has several advantages, including: One, parallelization: The Transformer model can process the entire input sequence in parallel, which makes it much faster than traditional recurrent neural networks. Two, flexibility: The self-attention mechanism can be applied to any type of sequential data, not just text. Three, interpretability: The attention weights can be used to understand which parts of the input sequence are most important for the model's predictions. The paper introduced several key components of the Transformer model, including: One, self-attention: A mechanism that allows the model to weigh the importance of different parts of the input sequence. Two, encoder-decoder architecture: A two-stage architecture where the encoder processes the input sequence and the decoder generates the output sequence. Three, position-wise fully connected feed-forward network: A fully connected feed-forward network that is applied to each position in the input sequence. The Attention is All You Need paper has had a significant impact on the field of natural language processing, and the Transformer model has become a widely used architecture for many tasks, including machine translation, text classification, and question answering.
You're welcome to experience EchoX yourself in our Hugging Face Space.

Approach

EchoX addresses the challenges of current speech-to-speech models through a comprehensive approach that integrates both training and inference innovations. Our approach consists of two main components:
Training Pipeline
We propose a novel three-stage training pipeline (a conceptual code sketch follows the list):
  • Stage I: Speech-to-Text: Converts speech to text, enabling the model to capture semantic information from spoken inputs.
  • Stage II: Text-to-Codec: Transforms text into speech tokens, bridging the gap between text and speech representations.
  • Stage III: Echo Training: Combines the outputs of previous stages, training the model to generate speech from semantic understanding while preserving core language intelligence.
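As a rough illustration of how the three stages can share one backbone, the sketch below wires a speech encoder, an LLM stand-in, and separate text and codec heads together; the module names, dimensions, and the GRU stand-in for the LLM are assumptions for illustration, not the EchoX implementation.

# Minimal, illustrative sketch of the three-stage pipeline; module and loss
# choices are hypothetical and do not mirror the EchoX codebase.
import torch
import torch.nn as nn

class EchoXSketch(nn.Module):
    def __init__(self, vocab_size: int = 32000, codec_size: int = 4096, d: int = 512):
        super().__init__()
        self.speech_encoder = nn.Linear(80, d)           # maps speech features into the semantic space
        self.backbone = nn.GRU(d, d, batch_first=True)   # stand-in for the text LLM backbone
        self.text_head = nn.Linear(d, vocab_size)        # Stage I target: text tokens
        self.codec_head = nn.Linear(d, codec_size)       # Stage II/III target: speech (codec) tokens

    def forward(self, speech_feats: torch.Tensor):
        h, _ = self.backbone(self.speech_encoder(speech_feats))
        return self.text_head(h), self.codec_head(h)

model = EchoXSketch()
feats = torch.randn(2, 100, 80)                          # a batch of log-mel-like features
text_logits, codec_logits = model(feats)
print(text_logits.shape, codec_logits.shape)

# Stage I trains against text labels (speech-to-text), Stage II against codec
# labels (text-to-codec), and Stage III (echo training) derives the codec
# targets on the fly from the model's own semantic representations, so that
# acoustic and semantic learning are coupled rather than in conflict.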
Streaming Inference
Given that speech sequences are significantly longer than text, streaming generation becomes essential:
  • Semantic Completeness: Ensures each generated segment is semantically complete, avoiding disjointed speech output.
  • Trigger Mechanism: Determines the optimal timing for speech generation based on the cosine similarity of semantic representations (see the sketch below).
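A minimal sketch of such a trigger follows; the running-mean segment representation and the fixed threshold are illustrative assumptions, not the exact EchoX mechanism.

# Illustrative similarity-based trigger for streaming speech generation.
import torch
import torch.nn.functional as F

def should_emit_segment(hidden_states: torch.Tensor, threshold: float = 0.85) -> bool:
    """Fire when the newest semantic representation drifts away from the
    pending segment, i.e. the segment is likely semantically complete."""
    if hidden_states.size(0) < 2:
        return False
    segment_mean = hidden_states[:-1].mean(dim=0)    # representation of the pending segment
    newest = hidden_states[-1]
    sim = F.cosine_similarity(segment_mean, newest, dim=0).item()
    return sim < threshold                           # low similarity -> segment boundary

# Usage: accumulate hidden states token by token and flush a speech segment
# whenever the trigger fires.
states = torch.randn(6, 512)
print(should_emit_segment(states))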
This integrated approach ensures that EchoX preserves the reasoning abilities of LLMs while excelling at speech-based tasks with real-time capabilities.

Performance

EchoX demonstrates exceptional performance on knowledge-based question answering, achieving strong results with minimal training data and setting a new benchmark for efficiency in speech-to-speech language models.
Key highlights
  • Efficiency: Matches or surpasses larger models with only 6k training hours.
  • User Experience: Human evaluations confirm superior helpfulness, naturalness, and overall satisfaction.
These results show EchoX bridges the acoustic–semantic gap while remaining efficient and user-friendly.

BibTeX

If you use EchoX in your research or projects, please cite our paper:


@misc{zhang2025echoxmitigatingacousticsemanticgap,
      title={EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs}, 
      author={Yuhao Zhang and Yuhao Du and Zhanchen Dai and Xiangnan Ma and Kaiqi Kou and Benyou Wang and Haizhou Li},
      year={2025},
      eprint={2509.09174},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.09174}, 
}