MSceneSpeech: A Multi-Scene Speech Dataset For Expressive Speech Synthesis


We introduce MSceneSpeech (Multiple Scene Speech Dataset), an open-source, high-quality Mandarin TTS dataset intended to provide resources for expressive speech synthesis. MSceneSpeech comprises numerous audio recordings and texts performed and recorded according to daily life scenarios. Each scenario includes multiple speakers and a diverse range of prosodic styles, making it suitable for speech synthesis that entails multi-speaker style and prosody modeling. We also establish a strong baseline, built on a prompting mechanism, that can synthesize speech with both user-specific timbre and scene-specific prosody from arbitrary text input. The open-source MSceneSpeech dataset and audio samples of our baseline are available at

MSceneSpeech Dataset

We introduce MSceneSpeech, a high-quality, monolingual Mandarin multi-domain speech dataset. The recording scripts are closely matched to each scene, and the audio is recorded by professional voice actors and actresses against a clean acoustic background. The dataset is designed to capture the nuances of expressive speech across multiple scenarios. It includes four distinct scenes (Chat, News, QA, and Storytelling) with approximately 15 hours of audio. This dataset aims to provide a comprehensive resource for training TTS models to generate speech with varying prosody, reflecting the variations encountered in day-to-day communication.

The dataset has four main categories, with a detailed description as follows:

  • Chat: Casual conversations, including informal dialogues, interactive discussions, and crosstalk.
  • QA: Question-and-answer interactions from online shopping platforms, and queries pertaining to website construction.
  • News: News segments from national television broadcasts in China.
  • Story: Stories for children and adults, encompassing diverse storytelling styles and themes.
Please note that while the first two categories represent two-person interactions, they are recorded by one speaker.

Dataset Stats:

The detailed dataset stats are as follows.

Scene    Number of speakers    Total time (hr)    Number of clips
Chat     4                     5.32               2162
News     2                     2.06               747
QA       3                     2.22               907
Story    4                     3.35               1286
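
As a quick sanity check on the table, the per-scene figures can be totalled and turned into an average clip length. The sketch below only does arithmetic on the numbers listed above; the dictionary layout and the function name `summarize` are illustrative choices, not part of the dataset tooling.

```python
# Per-scene statistics copied from the table above.
SCENES = {
    "Chat": {"speakers": 4, "hours": 5.32, "clips": 2162},
    "News": {"speakers": 2, "hours": 2.06, "clips": 747},
    "QA": {"speakers": 3, "hours": 2.22, "clips": 907},
    "Story": {"speakers": 4, "hours": 3.35, "clips": 1286},
}


def summarize(scenes):
    """Return total hours, total clips, and mean clip length (seconds) per scene."""
    total_hours = sum(s["hours"] for s in scenes.values())
    total_clips = sum(s["clips"] for s in scenes.values())
    mean_clip_seconds = {
        name: s["hours"] * 3600 / s["clips"] for name, s in scenes.items()
    }
    return total_hours, total_clips, mean_clip_seconds


total_hours, total_clips, mean_clip_seconds = summarize(SCENES)
print(f"total: {total_hours:.2f} h across {total_clips} clips")
for name, seconds in mean_clip_seconds.items():
    print(f"{name}: {seconds:.1f} s per clip on average")
```

Speaker counts are not summed here, because the table does not say whether any speaker appears in more than one scene.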

Here are the audio attributes of the different scenes. The scenes vary widely in speed, pitch, and energy, all of which indicate highly varied prosody.

[Figure: speed, pitch, and energy of the different scenes]

Here is a comparison of the mean, variance, and skew of pitch across three datasets: DidiSpeech (DS), Aishell3 (AS3), and our MSceneSpeech (MS). The statistics represent averaged metrics from individual speakers within each dataset. As can be observed from the figure, MSceneSpeech has a higher pitch variance and skewness than the other two datasets, indicating a more diverse pitch distribution.

[Figure: mean, variance, and skew of pitch for DidiSpeech (DS), Aishell3 (AS3), and MSceneSpeech (MS)]
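
The per-speaker averaging described above can be sketched as follows. This is a minimal illustration, not the procedure actually used for the figure: it assumes that F0 contours have already been extracted by some pitch tracker, that unvoiced frames are marked with 0, and that variance and skewness are the population (biased) moments. The function names and the toy contours are hypothetical.

```python
def pitch_moments(f0_hz):
    """Mean, variance, and skewness of the voiced frames of one speaker's F0 values."""
    voiced = [f for f in f0_hz if f > 0]  # assumption: 0 marks unvoiced frames
    if not voiced:
        raise ValueError("no voiced frames")
    n = len(voiced)
    mean = sum(voiced) / n
    var = sum((f - mean) ** 2 for f in voiced) / n
    third = sum((f - mean) ** 3 for f in voiced) / n
    skew = third / var ** 1.5 if var > 0 else 0.0
    return mean, var, skew


def dataset_pitch_stats(f0_by_speaker):
    """Average the per-speaker moments over all speakers in one dataset."""
    per_speaker = [pitch_moments(f0) for f0 in f0_by_speaker.values()]
    count = len(per_speaker)
    return tuple(sum(stats[i] for stats in per_speaker) / count for i in range(3))


# Toy contours in Hz, invented purely for illustration.
toy = {
    "speaker_a": [0, 180, 190, 200, 260, 0, 210],
    "speaker_b": [120, 0, 125, 130, 135, 170],
}
mean, var, skew = dataset_pitch_stats(toy)
print(f"mean={mean:.1f} Hz, variance={var:.1f}, skew={skew:.2f}")
```

Computing the moments per speaker before averaging keeps speakers with many clips from dominating the dataset-level numbers, which matches the "averaged metrics from individual speakers" wording above.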

Dataset demo:

MSceneSpeech's utility lies not only in its prosodic richness but also in the uniformity of voice timbre across different prosodic contexts. This duality allows for nuanced voice synthesis, enabling TTS models to generate varied speech outputs with disentangled representations for timbre and prosody. This section presents demo audio clips from different scenes.

Scene Audio_Speaker1 Audio_Speaker2

Organization of the dataset:

The dataset is provided with a predefined train-test split.
Within each split, the directory structure is hierarchical, organized first by scene and then by speaker. Each speaker folder contains the audio clips together with their corresponding text and pronunciation (pinyin) files. The following ASCII diagram depicts this structure:

├── readme.txt
├── train/
│   └── chat/
│       ...
└── test/
    └── qa/
        └── 客服问答一_犀牛有角/
            ├── 0.wav
            ├── 0.txt
            └── 0_pinyin.txt
            ...
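
A loader for this layout might look like the sketch below. It assumes the pattern shown in the diagram holds throughout the dataset, namely `<split>/<scene>/<speaker>/<id>.wav` with `<id>.txt` and `<id>_pinyin.txt` alongside; the function name `iter_utterances`, the folder name `speaker_x`, and the pinyin string in the demo are hypothetical. The demo builds a tiny temporary tree so that the example runs without the real data.

```python
import tempfile
from pathlib import Path


def iter_utterances(root):
    """Yield one dict per clip found under root/<split>/<scene>/<speaker>/."""
    for wav in sorted(Path(root).glob("*/*/*/*.wav")):
        speaker_dir = wav.parent
        text_path = wav.with_suffix(".txt")
        pinyin_path = speaker_dir / f"{wav.stem}_pinyin.txt"
        if not (text_path.exists() and pinyin_path.exists()):
            continue  # skip clips that lack either transcript file
        yield {
            "split": speaker_dir.parent.parent.name,
            "scene": speaker_dir.parent.name,
            "speaker": speaker_dir.name,
            "wav": wav,
            "text": text_path.read_text(encoding="utf-8").strip(),
            "pinyin": pinyin_path.read_text(encoding="utf-8").strip(),
        }


# Demo on a throwaway tree that mirrors the diagram (placeholder contents).
with tempfile.TemporaryDirectory() as tmp:
    clip_dir = Path(tmp) / "test" / "qa" / "speaker_x"  # hypothetical speaker folder
    clip_dir.mkdir(parents=True)
    (clip_dir / "0.wav").write_bytes(b"")
    (clip_dir / "0.txt").write_text("你好", encoding="utf-8")
    (clip_dir / "0_pinyin.txt").write_text("ni3 hao3", encoding="utf-8")
    demo = list(iter_utterances(tmp))

print(demo[0]["split"], demo[0]["scene"], demo[0]["speaker"], demo[0]["text"])
```

To use it on the real data, pass the directory that contains `readme.txt`, `train/`, and `test/` as `root`.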


The dataset is available here. If you have any questions about this dataset, please contact the following email.

Model Overview

[Figure: overall architecture of the baseline]

The overall architecture of our baseline. Duration, pitch, and energy are extracted from the prompt (in training, the unmasked part; in inference, the reference speech) and serve as conditions for their respective predictors. Losses are calculated only on the masked part.
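
The idea of computing losses only on the masked part can be illustrated with a small sketch. The source does not say which loss function the baseline uses; mean squared error is assumed here purely for illustration, and the function name `masked_mse` and the toy pitch values are hypothetical.

```python
def masked_mse(predicted, target, mask):
    """Mean squared error over the positions where mask is True (the masked part)."""
    errors = [(p - t) ** 2 for p, t, m in zip(predicted, target, mask) if m]
    if not errors:
        raise ValueError("mask selects no positions")
    return sum(errors) / len(errors)


# Toy frame-level pitch values. The first three frames play the role of the
# visible prompt (unmasked); the last two are the masked part to be predicted.
target_pitch = [200.0, 210.0, 205.0, 190.0, 180.0]
predicted_pitch = [150.0, 150.0, 150.0, 188.0, 184.0]
is_masked = [False, False, False, True, True]

loss = masked_mse(predicted_pitch, target_pitch, is_masked)
print(loss)
```

The predictions on the prompt frames are far from the targets, yet they do not affect the loss, because only the masked frames are scored.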

Prosody Transfer In Adaptive TTS

Ref-Spk Ref-Prosody Text Our baseline


(Story Telling)

(Live Commerce)

(News Broadcasting)

(Customer Service)

(Education Teaching)


(Story Telling)

(Live Commerce)

(News Broadcasting)

(Customer Service)

(Education Teaching)

Zero-Shot Style Transfer

Note: We test zero-shot style transfer by using the entirely unseen ESD dataset as the prosody reference.

Ref-Spk Ref-Prosody Text Our baseline


(ESD Dataset Unseen)

(ESD Dataset Unseen)


(ESD Dataset Unseen)

(ESD Dataset Unseen)

Prosody Transfer In Adaptive TTS For English And Mixed English-Chinese Text

Note: This section tests the model's cross-lingual ability. English data (VCTK) is used only in the pretraining stage, while both style fine-tuning and speaker fine-tuning are done on Mandarin data. As a result, English pronunciation and style transfer both perform poorly.

Ref-Spk Ref-Prosody Text Our baseline


(Story Telling: Mandarin)
Who is been repeating all that hard stuff to you?
I thank you for this mercy!
Your path now goes south.
AI for Data, Data for AI等技术的落地,将使我们在海量数据中探索更深层次的价值。

(Customer Service: Mandarin)
Who is been repeating all that hard stuff to you?
I thank you for this mercy!
Your path now goes south.
AI for Data, Data for AI等技术的落地,将使我们在海量数据中探索更深层次的价值。


(Story Telling: Mandarin)
Who is been repeating all that hard stuff to you?
I thank you for this mercy!
Your path now goes south.
AI for Data, Data for AI等技术的落地,将使我们在海量数据中探索更深层次的价值。

(Customer Service: Mandarin)
Who is been repeating all that hard stuff to you?
I thank you for this mercy!
Your path now goes south.
AI for Data, Data for AI等技术的落地,将使我们在海量数据中探索更深层次的价值。

Model Comparison

Ref-Spk A³T AdaSpeech 4 Our baseline