

This paper introduces PFlow-VC, a conditional flow matching voice conversion model that leverages fine-grained discrete pitch tokens and target speaker prompt information for expressive voice conversion (VC). Previous VC work focuses primarily on speaker conversion; enhancing expressiveness (such as prosody and emotion) during timbre conversion still needs further exploration. Unlike previous methods, we adopt a simple and efficient approach to enhance the style expressiveness of voice conversion models. Specifically, we pretrain a self-supervised pitch VQVAE model to discretize speaker-irrelevant pitch information, and we use a masked pitch-conditioned flow matching model for Mel-spectrogram synthesis. This gives the speaker conversion model in-context pitch modeling capabilities and effectively improves its voice style transfer capacity. Additionally, we improve timbre similarity by combining global timbre embeddings with time-varying timbre tokens. Experiments on the unseen LibriTTS test-clean set and the emotional speech dataset ESD show the superiority of PFlow-VC in both timbre conversion and style transfer.
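To make the masked flow matching objective mentioned above more concrete, here is a minimal sketch in plain Python. It assumes the commonly used straight-line (optimal-transport) conditional path between Gaussian noise and a target Mel frame, and it assumes the loss is computed only on masked frames. The page does not state either choice, and the names `cfm_pair`, `masked_mse`, and `SIGMA_MIN` are hypothetical. The pitch tokens and speaker prompt that would condition the velocity predictor are omitted.

```python
import random

SIGMA_MIN = 1e-4  # assumed small constant; not taken from the paper


def cfm_pair(x0, x1, t, sigma_min=SIGMA_MIN):
    """Return (x_t, u_t) for one frame on the straight-line path.

    x0 is a noise frame, x1 is the target Mel frame, t is in [0, 1].
    """
    x_t = [(1 - (1 - sigma_min) * t) * a + t * b for a, b in zip(x0, x1)]
    u_t = [b - (1 - sigma_min) * a for a, b in zip(x0, x1)]
    return x_t, u_t


def masked_mse(pred, target, mask):
    """Mean squared error over frames whose mask entry is 1."""
    total, count = 0.0, 0
    for p_frame, t_frame, m in zip(pred, target, mask):
        if m:
            total += sum((p - q) ** 2 for p, q in zip(p_frame, t_frame))
            count += len(p_frame)
    return total / count if count else 0.0


# Toy usage: 3 frames with 2 Mel bins each (made-up numbers).
random.seed(0)
mel = [[0.5, -1.0], [0.2, 0.3], [1.0, 0.0]]
noise = [[random.gauss(0.0, 1.0) for _ in frame] for frame in mel]
mask = [0, 1, 1]  # frame 0 is visible context; frames 1-2 are to be generated
pairs = [cfm_pair(n, m, t=0.3) for n, m in zip(noise, mel)]
x_t = [p[0] for p in pairs]  # noisy inputs a velocity predictor would see
u_t = [p[1] for p in pairs]  # regression targets for that predictor
loss = masked_mse(u_t, u_t, mask)  # a perfect predictor gives zero loss
```

In a real system, the first argument to `masked_mse` would be the output of a neural network that takes `x_t`, `t`, and the conditioning inputs. Here the target is passed in twice only to show the call shape.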

Model Architecture

Figure 1. The overall architecture of PFlow-VC.


Zero-Shot VC Results

Transcript | Source Speech | Target Speech | Your-TTS | Diff-HierVC | SEF-VC | PFlow-VC
lady larkspur starts suddenly and turns towards
you may be quite sure it's there said captain battleax and that i can so use it as to half obliterate your town within two minutes of my return on board
he shall not leave you day or night whether you are working or playing or sleeping
she wanted a glance of the new books and periodicals and talk of great philanthropies and reforms
in the present day we are well aware that an ancient philosopher is to be interpreted from himself and by the contemporary history of thought

Emotion Voice Style Transfer Evaluation

Transcript | Source Speech | Target Speech | Your-TTS | Diff-HierVC | SEF-VC | PFlow-VC
no one would disturb their little house even if anyone came so far into the thick forest while they were gone
in the light of the moon i saw a knife red with blood and my hand too was also discoloured
but hawkeye who was too much occupied with his own thoughts to note the movement continued speaking more to himself than to his companion
the incident was not violent nor even dramatic no posse was summoned no further effort made and reeder fearing personal violence soon fled in disguise
plato had not the command of his materials which would have enabled him to produce a perfect work of art