MSceneSpeech: A Multi-Scene Speech Dataset For Expressive Speech Synthesis

Abstract

We introduce MSceneSpeech (Multiple Scene Speech Dataset), an open-source, high-quality Mandarin TTS dataset intended to provide resources for expressive speech synthesis. MSceneSpeech comprises numerous audio recordings and transcripts performed and recorded according to daily-life scenarios. Each scenario includes multiple speakers and a diverse range of prosodic styles, making the dataset suitable for speech synthesis that entails multi-speaker style and prosody modeling. We have established a robust baseline that, through a prompting mechanism, can effectively synthesize speech characterized by both user-specific timbre and scene-specific prosody from arbitrary text input. The open-source MSceneSpeech dataset and audio samples from our baseline are available at https://speechai-demo.github.io/MSceneSpeech/.

MSceneSpeech Dataset

We introduce MSceneSpeech, a high-quality, monolingual Mandarin, multi-domain speech dataset. The recording texts are highly correlated with their scenes, and the audio is recorded by professional voice actors in a clean acoustic environment, so the dataset captures the nuances of expressive speech across multiple scenarios. It includes four distinct scenes (Chat, News, QA, and Storytelling) with approximately 15 hours of audio in total. The dataset aims to provide a comprehensive resource for training TTS models to generate speech with varying prosody, reflecting the variation encountered in day-to-day communication.


The dataset has four main categories, described as follows:

  • Chat: casual conversations, including informal dialogues, interactive discussions, and crosstalk.
  • QA: question-and-answer interactions from online shopping platforms, and queries pertaining to website construction.
  • News: news segments from national television broadcasts in China.
  • Story: stories for children and adults, encompassing diverse storytelling styles and themes.
Please note that although the first two categories represent two-person interactions, each recording is performed by a single speaker.

Dataset Stats:

The detailed dataset stats are as follows.

Scene   Number of speakers   Total time (hr)   Number of clips
Chat    4                    5.32              2162
News    2                    2.06              747
QA      3                    2.22              907
Story   4                    3.35              1286

The figure below summarizes the audio attributes of the different scenes. Speed, pitch, and energy all vary considerably across scenes, which indicates highly varied prosody.

[Figure: distributions of speed, pitch, and energy across the four scenes.]
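For concreteness, the sketch below shows one way to estimate per-clip speed, pitch, and energy with librosa. The exact tools and feature definitions behind the figure are not specified on this page, so the choices here (Mandarin characters per second as speed, pYIN F0 as pitch, RMS as energy) are illustrative assumptions.

```python
# Illustrative per-clip attribute extraction (assumed tooling; not necessarily
# what was used to produce the figure above).
import librosa
import numpy as np

def clip_attributes(wav_path: str, transcript: str):
    y, sr = librosa.load(wav_path, sr=16000)
    duration = len(y) / sr

    # Speed: Mandarin characters per second, a simple speaking-rate proxy.
    speed = len(transcript) / duration

    # Pitch: mean F0 (Hz) over voiced frames, estimated with pYIN.
    f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                 fmax=librosa.note_to_hz("C6"), sr=sr)
    pitch = float(np.nanmean(f0[voiced])) if voiced.any() else float("nan")

    # Energy: mean RMS amplitude.
    energy = float(librosa.feature.rms(y=y).mean())

    return {"speed": speed, "pitch": pitch, "energy": energy}
```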

Below is a comparison of the mean, variance, and skew of pitch across three datasets: DidiSpeech (DS), Aishell3 (AS3), and our MSceneSpeech (MS). The statistics represent metrics averaged over individual speakers within each dataset. As can be observed from the figure, MSceneSpeech has higher pitch variance and skewness than the other two datasets, indicating a more diverse pitch distribution.

[Figure: mean, variance, and skew of pitch for DidiSpeech (DS), Aishell3 (AS3), and MSceneSpeech (MS).]
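The per-speaker pitch statistics compared above can be computed roughly as follows. This is a minimal sketch assuming voiced-frame F0 arrays (Hz) extracted per clip, e.g. with pYIN as in the previous snippet; the published figure's exact analysis pipeline may differ.

```python
# Illustrative per-speaker pitch statistics, averaged over speakers
# (mirrors the DS / AS3 / MS comparison above; exact pipeline may differ).
import numpy as np
from scipy.stats import skew

def speaker_pitch_stats(f0_arrays):
    """f0_arrays: list of per-clip F0 arrays (Hz) for one speaker; NaN = unvoiced."""
    f0 = np.concatenate([f[~np.isnan(f)] for f in f0_arrays])
    return {"mean": float(f0.mean()),
            "variance": float(f0.var()),
            "skew": float(skew(f0))}

def dataset_pitch_stats(per_speaker_stats):
    """Average each statistic over the speakers within one dataset."""
    return {k: float(np.mean([s[k] for s in per_speaker_stats]))
            for k in ("mean", "variance", "skew")}
```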

Dataset demo:

MSceneSpeech's utility lies not only in its prosodic richness but also in the uniformity of voice timbre across different prosodic contexts. This duality allows for nuanced voice synthesis, enabling TTS models to generate varied speech outputs with disentangled representations for timbre and prosody. Below, we present demo audio clips from the different scenes.

Scene   Audio_Speaker1    Audio_Speaker2
Chat    [audio sample]    [audio sample]
News    [audio sample]    [audio sample]
QA      [audio sample]    [audio sample]
Story   [audio sample]    [audio sample]

Organization of the dataset:

The dataset is provided with a predefined train-test split.
The directory structure is hierarchical, organized first by scene and then by speaker. Within each speaker folder, we provide the audio together with its corresponding text and pronunciation (pinyin). The following diagram depicts this structure:

├── readme.txt
├── train/
│   ├── chat/
│   └── ...
└── test/
    ├── qa/
    │   └── 客服问答一_犀牛有角/
    │       ├── 0.wav
    │       ├── 0.txt
    │       ├── 0_pinyin.txt
    │       └── ...
    └── ...
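As a reading aid, here is a minimal sketch of how this layout can be iterated in Python. It assumes every utterance follows the <index>.wav / <index>.txt / <index>_pinyin.txt naming shown in the example above; the function name is ours, not part of the release.

```python
# Minimal iterator over the released layout: <split>/<scene>/<speaker>/<index>.*
# Assumes the <index>.wav / <index>.txt / <index>_pinyin.txt naming shown above.
from pathlib import Path

def iter_mscenespeech(root: str, split: str = "train"):
    for wav in sorted(Path(root, split).glob("*/*/*.wav")):
        yield {
            "scene": wav.parent.parent.name,   # e.g. "qa"
            "speaker": wav.parent.name,        # speaker folder name
            "audio": wav,
            "text": wav.with_suffix(".txt").read_text(encoding="utf-8").strip(),
            "pinyin": wav.with_name(wav.stem + "_pinyin.txt")
                         .read_text(encoding="utf-8").strip(),
        }

# Example usage: list QA test utterances.
# for item in iter_mscenespeech("MSceneSpeech", split="test"):
#     if item["scene"] == "qa":
#         print(item["speaker"], item["text"])
```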

Access:

The dataset is available here. If you have any questions about the dataset, please contact us at the email address below.

Model Overview

[Figure: overall architecture of our baseline.]

The overall architecture of our baseline. Duration, pitch, and energy are extracted from the prompt (during training, the unmasked part; during inference, the reference speech) and serve as conditions for their respective predictors. Losses are computed only on the masked part.
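To make the caption concrete, the sketch below illustrates the general training objective it describes: prosody features are taken from the unmasked prompt, the predictors are conditioned on them, and losses are computed only on masked frames. The module interface, tensor shapes, and loss choices are assumptions for illustration, not the released implementation; at inference time the prompt is simply replaced by the reference speech.

```python
# Illustrative prompt-based training step: prosody from the unmasked prompt
# conditions the predictors, and losses are computed only on masked frames.
# Interfaces, shapes, and losses are assumptions, not the authors' code.
import torch

def masked_l1(pred, target, mask):
    """Mean absolute error over masked frames only (mask: (B, T) bool)."""
    m = mask.unsqueeze(-1).float()                    # (B, T, 1)
    return (torch.abs(pred - target) * m).sum() / (m.sum() * pred.size(-1) + 1e-8)

def training_step(model, mel, prosody, mask):
    """
    mel:     (B, T, n_mels)  ground-truth mel-spectrogram
    prosody: (B, T, 3)       frame-level duration / pitch / energy
    mask:    (B, T) bool     True where frames must be reconstructed
    """
    keep = (~mask).unsqueeze(-1).float()
    prompt_mel = mel * keep            # unmasked part acts as the prompt (training)
    prompt_prosody = prosody * keep    # prosody conditions extracted from the prompt
    pred_mel, pred_prosody = model(prompt_mel, prompt_prosody, mask)
    # Losses only on the masked part, as described in the caption.
    return masked_l1(pred_mel, mel, mask) + masked_l1(pred_prosody, prosody, mask)
```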

Prosody Transfer In Adaptive TTS

Ref-Spk Ref-Prosody Text Our baseline

(Ref-Spk)

(Storytelling)
突然,蝴蝶停在了一朵花上,波波小心翼翼地靠近,屏住呼吸,他伸出手,轻轻地抓住了蝴蝶。 (Suddenly, the butterfly landed on a flower. Bobo crept closer very carefully, held his breath, reached out his hand, and gently caught the butterfly.)
朋友们鼓励波波,只要你相信自己,不放弃,你一定能抓得到蝴蝶。 (Bobo's friends encouraged him: as long as you believe in yourself and never give up, you will surely catch the butterfly.)

(Live Commerce)
大家好,欢迎来到今天的直播间,我是你们的主播珊珊,很高兴能够在这里与大家见面。 (Hello everyone, welcome to today's livestream. I'm your host Shanshan, and I'm very glad to meet you all here.)
如果你也喜欢这个小零食,记得关注我们的直播间,我们每周都会有很多美味又健康的零食分享给大家。 (If you like this little snack too, remember to follow our livestream; every week we share lots of tasty and healthy snacks with you.)

(News Broadcasting)
知识产权含金量明显提升,是近年来我国知识产权高质量发展的特征之一。 (The marked rise in the value of intellectual property is one of the hallmarks of China's high-quality IP development in recent years.)
作为全球首个发明专利有效量超三百万件的国家,我国发明专利有效量已位居全球第一。 (As the first country in the world with more than three million valid invention patents, China now ranks first globally in valid invention patents.)

(Customer Service)
您好,感谢您选择我们的产品,请问有什么我可以帮助您的吗? (Hello, thank you for choosing our product. Is there anything I can help you with?)
我遇到了一些技术问题,无法完成我需要的任务。 (I have run into some technical problems and cannot complete the task I need to do.)

(Education Teaching)
在语文学习中,阅读理解是必不可少的一部分,它是对文本深度理解与感悟的关键。 (In Chinese language learning, reading comprehension is an essential part; it is the key to deep understanding and appreciation of a text.)
阅读理解不仅需要理解和分析文本的能力,还要求具备批判性思维和解决问题的能力。 (Reading comprehension requires not only the ability to understand and analyze texts, but also critical thinking and problem-solving skills.)

(Ref-Spk)

(The same five prosody references and texts as above, synthesized with a second reference speaker.)

Zero-Shot Style Transfer

Note: We test the zero-shot style transfer ability by using the completely unseen ESD dataset as the prosody reference.


Ref-Spk Ref-Prosody Text Our baseline

(Ref-Spk: Mandarin)

(ESD Dataset Unseen)
当然是了,我现在快饿死了。 (Of course! I'm starving right now.)
得了吧,别这么胆小啦。 (Come on, don't be such a coward.)

(ESD Dataset Unseen)

(The same texts as above, with another unseen ESD prosody reference.)

(Ref-Spk: Mandarin)

(ESD Dataset Unseen)
听说你要去香港看你叔叔。 (I heard you're going to Hong Kong to see your uncle.)
这可真不像是场英超比赛。 (This really doesn't look like a Premier League match.)

(ESD Dataset Unseen)

(The same texts as above, with another unseen ESD prosody reference.)

Prosody Transfer In Adaptive TTS For English And English-Chinese Mixed Text

Note: Here we test the model's cross-lingual ability. Because the only English data the model sees is VCTK during the pretraining stage, while both the prosody style and the speaker are fine-tuned on Mandarin, the English pronunciation and the style transfer quality are limited.


Ref-Spk Ref-Prosody Text Our baseline

(Ref-Spk: Mandarin)

(Storytelling: Mandarin)
Who's been repeating all that hard stuff to you?
I thank you for this mercy!
Your path now goes south.
AI for Data, Data for AI等技术的落地,将使我们在海量数据中探索更深层次的价值。 (The adoption of technologies such as AI for Data and Data for AI will enable us to explore deeper value in massive amounts of data.)

(Customer Service: Mandarin)
Who's been repeating all that hard stuff to you?
I thank you for this mercy!
Your path now goes south.
AI for Data, Data for AI等技术的落地,将使我们在海量数据中探索更深层次的价值。 (The adoption of technologies such as AI for Data and Data for AI will enable us to explore deeper value in massive amounts of data.)

(Ref-Spk: Mandarin)

(The same prosody references and texts as above, synthesized with a second Mandarin reference speaker.)

Model Comparison

Ref-Spk A³T AdaSpeech 4 Our baseline