Speaker-independent neural formant synthesis

Pablo Pérez Zarazaga, Zofia Malisz, Gustav Eje Henter, Lauri Juvela

Citation information

@inproceedings{perez2023speaker,
  author={Pablo {Pérez Zarazaga} and Zofia Malisz and Gustav Eje Henter and Lauri Juvela},
  title={Speaker-independent neural formant synthesis},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
  pages={5556--5560},
  doi={10.21437/Interspeech.2023-1622}
}

Summary

The goal of this work is to develop a speaker-independent speech synthesis system driven by a small set of phonetically meaningful speech parameters.

The system provides a controllable environment in which each speech parameter can be manipulated individually to generate a realistic speech signal.

Visual overview

We propose a WaveNet-like model with dilated convolutions that transforms a set of phonetically meaningful parameters (log F0, the formants F1 to F4, spectral tilt, spectral centroid and signal energy), combined with a voicing flag, into a mel-spectrogram. A pre-trained vocoder then generates the output speech signal from the mel-spectrogram.

Neural formant pipeline
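
As a rough illustration of the input side of this pipeline, the sketch below lists the nine per-frame inputs named above and computes how many input frames one output frame can see through a stack of dilated convolutions. The feature names, the kernel size of 3 and the dilation pattern are assumptions chosen for illustration only; they are not taken from the paper or its code.

```python
# Per-frame inputs named in the text: 8 parameters plus a voicing flag.
# The identifier names are hypothetical.
FEATURES = [
    "log_f0",
    "f1", "f2", "f3", "f4",
    "spectral_tilt",
    "spectral_centroid",
    "energy",
    "voicing",
]


def receptive_field(kernel_size, dilations):
    """Input frames seen by one output frame of stacked dilated 1-D convolutions.

    Each layer with dilation d and kernel size k widens the receptive
    field by (k - 1) * d frames.
    """
    return 1 + (kernel_size - 1) * sum(dilations)


# Assumed configuration: kernel size 3, dilations doubling over 6 layers.
dilations = [2 ** i for i in range(6)]  # [1, 2, 4, 8, 16, 32]
frames_seen = receptive_field(3, dilations)

print(f"{len(FEATURES)} input features per frame")
print(f"receptive field: {frames_seen} frames")
```

Doubling the dilation at each layer makes the temporal context grow exponentially with depth, which is the usual motivation for dilated convolutions in WaveNet-like models.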

Code

Neural formant network code and pre-trained models will be made available shortly in our GitHub repositories.

Synthesised speech

The following audio samples have been synthesised with the different systems described in our paper.

The first set of samples was generated for the MUSHRA-like copy-synthesis listening test described in the paper. The samples show the effect of combining the neural formant (NF) network with two different vocoders (HiFi-GAN and WaveNet). The proposed method does not degrade the quality of the speech signal compared to each corresponding vocoder on its own.

System	Reference	HiFi-GAN	NF + HiFi-GAN	WaveNet	NF + WaveNet
Sample 1
Sample 2
Sample 3

In the manipulation task, we apply a constant scaling to a specific speech parameter over a whole synthesised utterance.

In the following audio samples, we have manipulated the fundamental frequency F0 and the formants F1 to F4 using the scaling factors [0.7, 0.8, 0.9, 1.1, 1.2, 1.3].
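
A minimal sketch of this kind of manipulation is shown below, assuming each frame is a dictionary of parameters with formants in Hz and F0 stored as a natural logarithm. The function name, the dictionary keys and the data layout are hypothetical and are not the paper's implementation. Under that assumption, scaling F0 by a constant factor in Hz corresponds to adding log(factor) to log F0, while the formants are multiplied directly.

```python
import math

# Scaling factors listed in the text.
SCALES = [0.7, 0.8, 0.9, 1.1, 1.2, 1.3]


def scale_parameter(frames, name, factor):
    """Return a copy of `frames` with one parameter scaled by `factor`.

    Assumes "log_f0" holds the natural log of F0, so scaling F0 by
    `factor` means adding log(factor). Other parameters are assumed to
    be linear (e.g. formants in Hz) and are multiplied directly.
    """
    scaled = []
    for frame in frames:
        new_frame = dict(frame)  # leave the input frames untouched
        if name == "log_f0":
            new_frame[name] = frame[name] + math.log(factor)
        else:
            new_frame[name] = frame[name] * factor
        scaled.append(new_frame)
    return scaled


# Toy utterance with two frames (values are made up for illustration).
utterance = [
    {"log_f0": math.log(100.0), "f1": 500.0, "f2": 1500.0},
    {"log_f0": math.log(110.0), "f1": 520.0, "f2": 1450.0},
]

# One manipulated copy of the utterance per scaling factor.
f0_variants = {s: scale_parameter(utterance, "log_f0", s) for s in SCALES}
f1_variants = {s: scale_parameter(utterance, "f1", s) for s in SCALES}
```

Each manipulated parameter track would then be passed through the neural formant network and the vocoder in the same way as the unmodified parameters.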