HiFi-Glot: Neural formant synthesis with differentiable resonant filters

Lauri Juvela, Pablo Pérez Zarazaga, Gustav Eje Henter, Zofia Malisz

Summary

The goal of this work is to develop a speaker-independent speech synthesis system driven by a small set of phonetically meaningful speech parameters.

The system follows the structure of the source-filter model, which lets us inspect and manipulate the spectral envelope and the glottal excitation independently.
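As a rough illustration of the classical source-filter idea mentioned above, the sketch below passes an impulse-train excitation through a cascade of two-pole resonators, one per formant. This is not the HiFi-Glot implementation, which uses neural networks and differentiable filters; the function names, the impulse-train excitation, and the formant and bandwidth values are all assumptions chosen for the example.

```python
import numpy as np


def resonator(x, freq_hz, bandwidth_hz, fs):
    """Two-pole resonator with unity gain at 0 Hz (textbook formant filter)."""
    r = np.exp(-np.pi * bandwidth_hz / fs)   # pole radius set by the bandwidth
    theta = 2.0 * np.pi * freq_hz / fs       # pole angle set by the centre frequency
    a1 = -2.0 * r * np.cos(theta)
    a2 = r * r
    gain = 1.0 + a1 + a2                     # normalises the response at 0 Hz to 1
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = gain * x[n]
        if n >= 1:
            y[n] -= a1 * y[n - 1]
        if n >= 2:
            y[n] -= a2 * y[n - 2]
    return y


def synthesise(f0_hz, formants_hz, bandwidths_hz, fs=16000, duration_s=0.1):
    """Source (impulse train) followed by the filter (resonator cascade)."""
    n_samples = int(fs * duration_s)
    excitation = np.zeros(n_samples)
    period = int(round(fs / f0_hz))
    excitation[::period] = 1.0               # crude stand-in for glottal excitation
    signal = excitation
    for freq, bw in zip(formants_hz, bandwidths_hz):
        signal = resonator(signal, freq, bw, fs)
    return excitation, signal


# Hypothetical vowel-like formant values (Hz), for illustration only.
excitation, speech = synthesise(
    f0_hz=100.0,
    formants_hz=[500.0, 1500.0, 2500.0, 3500.0],
    bandwidths_hz=[80.0, 90.0, 120.0, 150.0],
)
```

Because the excitation and the filter are separate objects here, either one can be changed without touching the other, which is the property the source-filter structure is meant to provide.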

The result is a controllable environment in which individual speech parameters can be manipulated to generate realistic speech signals.

A pre-print of this article is available on arXiv (arXiv:2409.14823).

Visual overview

Proposed Neural Formant Synthesis approach

Code

Source code and pre-trained models are available by following the instructions in our repository.


Synthesised speech

We first present copy-synthesis samples generated with the proposed HiFi-Glot model, compared with our previous work on neural formant synthesis (NFS), an end-to-end implementation of that model (NFS-E2E), and Praat.

System	Reference	HiFi-Glot	NFS-E2E	NFS	Praat
Sample 1
Sample 2
Sample 3

Manipulation samples are created by scaling a single formant frequency (F1 to F4) by a factor between 0.7 and 1.3, in steps of 0.1.

Scale F1	0.7	0.8	0.9	1.0	1.1	1.2	1.3
HiFi-Glot
NFS-E2E
NFS
Praat

Scale F2	0.7	0.8	0.9	1.0	1.1	1.2	1.3
HiFi-Glot
NFS-E2E
NFS
Praat

Scale F3	0.7	0.8	0.9	1.0	1.1	1.2	1.3
HiFi-Glot
NFS-E2E
NFS
Praat

Scale F4	0.7	0.8	0.9	1.0	1.1	1.2	1.3
HiFi-Glot
NFS-E2E
NFS
Praat
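The formant scaling behind the manipulation samples can be sketched as follows. This is a minimal example, assuming the formant tracks are stored as an array with one row per frame and one column per formant (F1 to F4, in Hz); the function name `scale_formant` and the formant values are hypothetical and are not taken from the released code.

```python
import numpy as np

# Scaling factors matching the columns of the tables above.
FACTORS = [0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3]


def scale_formant(formants, index, factor):
    """Return a copy of `formants` (frames x 4, in Hz) with one track scaled.

    `index` is zero-based: 0 selects F1, 3 selects F4.
    """
    scaled = np.array(formants, dtype=float, copy=True)
    if not 0 <= index < scaled.shape[1]:
        raise ValueError("formant index out of range")
    scaled[:, index] *= factor
    return scaled


# Hypothetical formant tracks for two frames, for illustration only.
formants = np.array([
    [500.0, 1500.0, 2500.0, 3500.0],
    [520.0, 1480.0, 2550.0, 3450.0],
])

# One manipulated copy of the tracks per scaling factor, here for F1.
f1_variants = {factor: scale_formant(formants, 0, factor) for factor in FACTORS}
```

Each manipulated set of tracks would then be passed to the synthesiser in place of the original parameters, leaving the other formants unchanged.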

Citation information

@article{juvela2024hifi,
  title={HiFi-Glot: Neural Formant Synthesis with Differentiable Resonant Filters},
  author={Juvela, Lauri and P{\'e}rez Zarazaga, Pablo and Henter, Gustav Eje and Malisz, Zofia},
  journal={arXiv preprint arXiv:2409.14823},
  year={2024}
}