We introduce MSceneSpeech, a high-quality, monolingual Mandarin multi-domain speech dataset. The recorded texts are highly correlated with their scenes, and the audio is recorded by professional voice actors against a clean background, capturing the nuances of expressive speech across multiple scenarios. The dataset covers four distinct scenes (Chat, News, QA, and Storytelling) with approximately 15 hours of audio. It aims to provide a comprehensive resource for training TTS models to generate speech with varying prosody, reflecting the variation encountered in day-to-day communication.
The dataset has four main categories, with a detailed description as follows:
- Chat: Casual conversations, including informal dialogues, interactive discussions, and crosstalk.
- QA: Question-and-answer interactions from online shopping platforms, along with queries about website construction.
- News: News segments from national television broadcasts in China.
- Story: Stories for children and adults, encompassing diverse storytelling styles and themes.
Dataset Stats:
The detailed dataset stats are as follows.
Scene | Number of speakers | Total time (hr) | Number of clips |
---|---|---|---|
Chat | 4 | 5.32 | 2162 |
News | 2 | 2.06 | 747 |
QA | 3 | 2.22 | 907 |
Story | 4 | 3.35 | 1286 |
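For quick reference, the per-scene figures in the table above can be aggregated with a few lines of Python (the numbers are copied directly from the table; the totals are simple sums):

```python
# Per-scene statistics from the table above:
# scene -> (number of speakers, total hours, number of clips)
stats = {
    "Chat":  (4, 5.32, 2162),
    "News":  (2, 2.06, 747),
    "QA":    (3, 2.22, 907),
    "Story": (4, 3.35, 1286),
}

total_hours = sum(hours for _, hours, _ in stats.values())
total_clips = sum(clips for _, _, clips in stats.values())
print(f"{total_hours:.2f} hours across {total_clips} clips")
# 12.95 hours across 5102 clips
```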
Here are the audio attributes of the different scenes. The scenes vary substantially in speed, pitch, and energy, each of which serves as a strong indicator of highly variant prosody.
Here is a comparison of the mean, variance, and skew of pitch across three datasets: DidiSpeech (DS), Aishell3 (AS3), and our MSceneSpeech (MS). The statistics represent averaged metrics from individual speakers within each dataset. As can be observed from the figure, MSceneSpeech has higher pitch variance and skewness than the other two datasets, indicating a more diverse pitch distribution.
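The exact extraction pipeline for these statistics is not specified on this page; the sketch below shows one way to compute per-speaker pitch statistics, assuming a pitch contour (F0 values in Hz) has already been extracted by a pitch tracker. The function name and the synthetic contour are illustrative only.

```python
def pitch_stats(f0):
    """Mean, variance, and (Fisher) skewness of the voiced frames of a pitch contour.

    Unvoiced frames are conventionally marked with f0 = 0 and are excluded.
    """
    voiced = [x for x in f0 if x > 0]
    n = len(voiced)
    mean = sum(voiced) / n
    var = sum((x - mean) ** 2 for x in voiced) / n
    skew = sum((x - mean) ** 3 for x in voiced) / (n * var ** 1.5)
    return mean, var, skew

# Synthetic contour in Hz; in practice f0 would come from a pitch tracker.
contour = [0.0, 180.0, 200.0, 220.0, 300.0, 0.0]
mean, var, skew = pitch_stats(contour)
```

Averaging these three statistics over every speaker in a dataset yields the kind of per-dataset comparison described above.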
Dataset demo:
MSceneSpeech's utility lies not only in its prosodic richness but also in the uniformity of voice timbre across different prosodic contexts. This duality enables nuanced voice synthesis, allowing TTS models to generate varied speech outputs with disentangled representations of timbre and prosody. Below are demo audios for each scene.
Scene | Audio_Speaker1 | Audio_Speaker2 |
---|---|---|
Chat | ||
News | ||
QA | ||
Story | | |
Organization of the dataset:
The dataset is provided with a predefined train-test split.
The directory structure is organized hierarchically, first by scene and then by speaker. Within each speaker folder, we provide the audio together with its corresponding text and pronunciation (pinyin). The following ASCII diagram depicts this structure:
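An illustrative sketch of this layout follows; the folder and file names are hypothetical, and the actual dataset may use different naming conventions:

```
MSceneSpeech/
├── Chat/
│   ├── speaker_01/
│   │   ├── 000001.wav       # audio clip
│   │   ├── 000001.txt       # transcript (Mandarin text)
│   │   └── 000001.pinyin    # pronunciation (pinyin)
│   └── speaker_02/
├── News/
├── QA/
└── Story/
```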
Access:
The dataset is available here. If you have any questions about this dataset, please contact us at the email address below.