Decoding the Ear (DeEAR): A Framework for Objectifying Expressiveness from Human Preference Through Efficient Alignment

Zhiyu Lin¹,², Jingwen Yang¹,², Jiale Zhao², Meng Liu², Sunzhu Li², Zhengjun Yue¹,³, Benyou Wang¹,*

¹The Chinese University of Hong Kong, Shenzhen, China | ²Li Auto Inc., China | ³Shenzhen Loop Area Institution, China

Audio Samples for DeEAR

This page provides audio demonstrations to help readers intuitively understand the DeEAR framework proposed in our paper.

The page contains two parts:

1. DeEAR Score Examples: audio samples at High, Medium, and Low expressiveness levels.
2. S2S Model Fine-Tuning: a comparison of model outputs before and after fine-tuning (Section 4.3 of the paper).

Part 1: DeEAR Score Examples

Level | Audio Sample | DeEAR Scores (0-100)
Low | [audio] | Overall (expressive): 3.43, Emo: 53.72, Pros: 49.00, Spon: 38.52
Low | [audio] | Overall (expressive): 4.21, Emo: 38.41, Pros: 32.83, Spon: 34.66
Low | [audio] | Overall (expressive): 24.52, Emo: 33.35, Pros: 23.47, Spon: 70.09
Low | [audio] | Overall (expressive): 29.82, Emo: 29.51, Pros: 26.05, Spon: 73.69
Medium | [audio] | Overall (expressive): 36.46, Emo: 30.23, Pros: 22.17, Spon: 82.84
Medium | [audio] | Overall (expressive): 52.15, Emo: 54.51, Pros: 50.59, Spon: 93.88
Medium | [audio] | Overall (expressive): 55.22, Emo: 62.18, Pros: 60.02, Spon: 92.30
High | [audio] | Overall (expressive): 74.05, Emo: 77.91, Pros: 71.00, Spon: 100.00
High | [audio] | Overall (expressive): 74.23, Emo: 74.62, Pros: 75.50, Spon: 99.62
High | [audio] | Overall (expressive): 80.06, Emo: 92.63, Pros: 66.50, Spon: 95.32
High | [audio] | Overall (expressive): 96.43, Emo: 86.93, Pros: 71.50, Spon: 94.59
High | [audio] | Overall (expressive): 98.46, Emo: 100.00, Pros: 72.50, Spon: 96.98
High | [audio] | Overall (expressive): 99.08, Emo: 96.79, Pros: 70.50, Spon: 95.54
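
For readers who want to work with these numbers programmatically, the following is a minimal sketch of turning one row's score text into a dictionary. It is not part of DeEAR itself: the names `SCORE_PATTERN` and `parse_scores` are hypothetical, and the only assumption is that each score appears as a `Label: value` pair, as in the rows above.

```python
import re

# Hypothetical helper pattern (not from the DeEAR codebase): matches
# "Label: number" pairs such as "Emo: 53.72" or "Overall (expressive): 3.43".
SCORE_PATTERN = re.compile(r"([A-Za-z]+(?: \([a-z]+\))?):\s*(\d+(?:\.\d+)?)")


def parse_scores(line: str) -> dict[str, float]:
    """Map each score label found in `line` to its value on the 0-100 scale."""
    return {label: float(value) for label, value in SCORE_PATTERN.findall(line)}


# Works whether or not the pairs are separated by commas or spaces.
row = "Overall (expressive): 3.43Emo: 53.72Pros: 49.00Spon: 38.52"
scores = parse_scores(row)
print(scores["Overall (expressive)"], scores["Spon"])
```

The pattern only relies on the label-colon-number shape, so it tolerates rows where the pairs run together without separators.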

Part 2: S2S Model Fine-Tuning

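Input text | Output (S2S-Base) | Output (S2S-FT)
嗯嗯，听起来不错，我会试试看的。哎，但我还是觉得时间不够用。 (Mm-hm, that sounds good, I'll give it a try. Ah, but I still feel like there is never enough time.) | [audio] | [audio]
你当然是我师傅啦。一日为师，终身为师。师傅。你这些年过得好吗？哎呀，你看我问的。看海天悦这么大的规模，就知道师傅这几年一定过得很好啊。 (Of course you are my master. Once a teacher, always a teacher. Master, how have you been all these years? Oh, listen to me asking. Seeing how big 海天悦 (Haitianyue) has become, I can tell you must have done very well these past few years.) | [audio] | [audio]
多宝啊，这不是这两天西安鸿乐园那边事儿多吗？他帮忙去了。 (Duobao? Well, things have been busy over at 西安鸿乐园 (Hongleyuan in Xi'an) these past couple of days, so he went to help out.) | [audio] | [audio]
那倒是您的厨艺可是大厨级别的。那我吃了啊。 (That's true, your cooking really is at a professional chef's level. Well then, I'll dig in.) | [audio] | [audio]
我们什么时候去呢？我这个周末我没事。 (When are we going? I'm free this weekend.) | [audio] | [audio]
哎呦，我是真戒了，洛林坚持让我戒了。 (Oh, I really have quit. 洛林 (Luolin) insisted that I quit.) | [audio] | [audio]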