Speaker-independent neural formant synthesis
Pablo Pérez Zarazaga, Zofia Malisz, Gustav Eje Henter, Lauri Juvela
Citation information
@inproceedings{perez2023speaker,
author={Pablo {Pérez Zarazaga} and Zofia Malisz and Gustav Eje Henter and Lauri Juvela},
title={Speaker-independent neural formant synthesis},
year=2023,
booktitle={Proc. INTERSPEECH 2023},
pages={5556--5560},
doi={10.21437/Interspeech.2023-1622}
}
Summary
The goal of this work is to develop a speaker-independent speech synthesis system driven by a small set of phonetically meaningful speech parameters.
The system provides a controllable environment in which individual speech parameters can be manipulated to generate a realistic speech signal.
Visual overview
We propose a WaveNet-like model with dilated convolutions that transforms a set of phonetically meaningful parameters (log F0, the formants F1 to F4, spectral tilt, spectral centroid and signal energy), combined with a voicing flag, into a mel-spectrogram. A pre-trained vocoder then generates the output speech signal from the mel-spectrogram.
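As a rough illustration of the data flow described above, the following Python sketch lists the nine per-frame inputs and applies a single dilated 1-D convolution in plain Python. The kernel size, dilation pattern and channel counts are hypothetical placeholders, not the values used in the paper, and the gated activations and residual connections typical of WaveNet-like models are omitted.

```python
# Per-frame inputs named in the text: eight parameters plus a voicing flag.
FEATURE_NAMES = [
    "log_f0", "f1", "f2", "f3", "f4",
    "spectral_tilt", "spectral_centroid", "energy", "voicing_flag",
]


def receptive_field(kernel_size, dilations):
    """Number of input frames seen by one output frame of a dilated stack."""
    return 1 + (kernel_size - 1) * sum(dilations)


def dilated_conv1d(frames, kernel, dilation):
    """Non-causal dilated 1-D convolution with zero padding.

    frames: T x C_in nested list.
    kernel: K x C_out x C_in nested list (K odd, hypothetical weights).
    Returns a T x C_out nested list, so the frame rate is preserved.
    """
    n_frames = len(frames)
    k_size = len(kernel)
    half = k_size // 2
    n_out = len(kernel[0])
    out = []
    for t in range(n_frames):
        row = [0.0] * n_out
        for k in range(k_size):
            src = t + (k - half) * dilation  # tap position, spaced by the dilation
            if 0 <= src < n_frames:          # taps outside the utterance are zeros
                for o, w_row in enumerate(kernel[k]):
                    row[o] += sum(w * x for w, x in zip(w_row, frames[src]))
        out.append(row)
    return out
```

With the hypothetical dilations [1, 2, 4, 8] and a kernel size of 3, `receptive_field(3, [1, 2, 4, 8])` gives 31 frames of context per output frame. A practical implementation would use a deep learning framework and project the final layer to the number of mel bands expected by the vocoder.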
Code
Neural formant network code and pre-trained models will be made available shortly in our GitHub repositories.
Synthesised speech
The following audio samples have been synthesised with the different systems described in our paper.
The first set of samples was generated for the copy-synthesis MUSHRA-like listening test described in the paper. These samples show the effect of combining the proposed neural formant (NF) network with different vocoders (HiFi-GAN and WaveNet). The proposed method does not degrade the quality of the speech signal compared to each corresponding vocoder on its own.
System | Reference | HiFi-GAN | NF + HiFi-GAN | WaveNet | NF + WaveNet |
---|---|---|---|---|---|
Sample 1 | | | | | |
Sample 2 | | | | | |
Sample 3 | | | | | |
In the manipulation task, we apply a constant scaling to a specific speech parameter over a whole synthesised utterance.
In the following audio samples, we have manipulated the fundamental frequency F0 and the formants F1 to F4, using the scaling values [0.7, 0.8, 0.9, 1.1, 1.2, 1.3].
Scale F0 | 0.7 | 0.8 | 0.9 | 1.1 | 1.2 | 1.3 |
---|---|---|---|---|---|---|
NF HiFi-GAN | | | | | | |
PRAAT | | | | | | |
Scale F1 | 0.7 | 0.8 | 0.9 | 1.1 | 1.2 | 1.3 |
---|---|---|---|---|---|---|
NF HiFi-GAN | | | | | | |
PRAAT | | | | | | |
Scale F2 | 0.7 | 0.8 | 0.9 | 1.1 | 1.2 | 1.3 |
---|---|---|---|---|---|---|
NF HiFi-GAN | | | | | | |
PRAAT | | | | | | |
Scale F3 | 0.7 | 0.8 | 0.9 | 1.1 | 1.2 | 1.3 |
---|---|---|---|---|---|---|
NF HiFi-GAN | | | | | | |
PRAAT | | | | | | |
Scale F4 | 0.7 | 0.8 | 0.9 | 1.1 | 1.2 | 1.3 |
---|---|---|---|---|---|---|
NF HiFi-GAN | | | | | | |
PRAAT | | | | | | |
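The constant scaling used in the manipulation task above can be sketched as follows. Because F0 enters the model as log F0, multiplying F0 by a factor corresponds to adding the logarithm of that factor. The function names are hypothetical, and the sketch assumes a natural logarithm, formant tracks in Hz, and that only voiced frames are shifted; none of these details is confirmed by this page.

```python
import math

# Scaling values listed in the tables above.
SCALES = [0.7, 0.8, 0.9, 1.1, 1.2, 1.3]


def scale_log_f0(log_f0, voiced, scale):
    """Scale F0 by a constant factor, working in the log domain.

    Assumption: natural-log F0, and unvoiced frames are left untouched.
    """
    shift = math.log(scale)
    return [v + shift if is_voiced else v
            for v, is_voiced in zip(log_f0, voiced)]


def scale_formant(track_hz, scale):
    """Scale one formant track (assumed to be in Hz) by a constant factor."""
    return [f * scale for f in track_hz]
```

For example, a scale of 1.2 moves a voiced frame at 100 Hz to 120 Hz, while a scale of 0.8 moves a formant at 500 Hz to 400 Hz.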