HiFi-Glot: Neural formant synthesis with differentiable resonant filters

Lauri Juvela, Pablo Pérez Zarazaga, Gustav Eje Henter, Zofia Malisz

Summary

The goal of this work is to develop a speaker-independent speech synthesis system driven by a small set of phonetically meaningful speech parameters.

The system is built with a similar structure to the source-filter model, allowing us to independently inspect and manipulate the spectral envelope and glottal excitation.

The system provides a controllable environment where it is possible to manipulate the different individual speech parameters to generate a realistic speech signal.
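To make the source-filter idea concrete, here is a minimal sketch of a classical source-filter chain: an impulse-train excitation passed through a cascade of two-pole resonators, one per formant. This is not the HiFi-Glot implementation, which uses neural networks and differentiable filters. The function names, the formant frequencies and bandwidths, and the sample rate are all illustrative assumptions.

```python
import math

def resonator_coeffs(freq_hz, bandwidth_hz, sample_rate):
    """Denominator coefficients (a1, a2) of a two-pole digital resonator."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius from bandwidth
    theta = 2.0 * math.pi * freq_hz / sample_rate        # pole angle from centre frequency
    return -2.0 * r * math.cos(theta), r * r

def apply_resonator(x, a1, a2):
    """All-pole filter: y[n] = x[n] - a1*y[n-1] - a2*y[n-2]."""
    y = []
    y1 = y2 = 0.0
    for sample in x:
        out = sample - a1 * y1 - a2 * y2
        y.append(out)
        y2, y1 = y1, out
    return y

def impulse_train(f0_hz, sample_rate, n_samples):
    """Crude stand-in for a glottal excitation: one impulse per pitch period."""
    period = int(round(sample_rate / f0_hz))
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

# Illustrative values only (assumed, not taken from the paper).
sample_rate = 16000
excitation = impulse_train(100.0, sample_rate, 1600)

# Cascade one resonator per formant: (frequency in Hz, bandwidth in Hz).
signal = excitation
for freq, bw in [(500.0, 80.0), (1500.0, 120.0)]:
    a1, a2 = resonator_coeffs(freq, bw, sample_rate)
    signal = apply_resonator(signal, a1, a2)
```

Because the excitation and the resonators are separate stages, either one can be changed without touching the other, which is the property the source-filter structure is meant to provide.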

A pre-print of this article is available on arXiv (arXiv:2409.14823).

Visual overview

Proposed Neural Formant Synthesis approach

Code

Source code and pre-trained models can be found by following the instructions in our repository.

Synthesised speech

We first present copy-synthesis samples from the proposed HiFi-Glot model, compared with our previous work on neural formant synthesis (NFS), an end-to-end implementation of this model (NFS-E2E), and Praat.

System Reference HiFi-Glot NFS-E2E NFS Praat
Sample 1
Sample 2
Sample 3

Manipulation samples are created by scaling a single formant frequency (F1 to F4) by a factor between 0.7 and 1.3.

Scale F1 0.7 0.8 0.9 1.0 1.1 1.2 1.3
HiFi-Glot
NFS-E2E
NFS
Praat
Scale F2 0.7 0.8 0.9 1.0 1.1 1.2 1.3
HiFi-Glot
NFS-E2E
NFS
Praat
Scale F3 0.7 0.8 0.9 1.0 1.1 1.2 1.3
HiFi-Glot
NFS-E2E
NFS
Praat
Scale F4 0.7 0.8 0.9 1.0 1.1 1.2 1.3
HiFi-Glot
NFS-E2E
NFS
Praat
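The manipulation used for the samples above can be sketched as follows: a single formant track is multiplied by a constant factor while the other formants are left unchanged. This is a minimal illustration rather than the authors' code. The helper `scale_formant`, the per-frame list layout, and the example formant values are hypothetical.

```python
def scale_formant(frames, index, factor):
    """Return a copy of `frames` with one formant (index 0 = F1) scaled by `factor`."""
    scaled = []
    for frame in frames:
        new_frame = list(frame)            # copy so the input is not modified
        new_frame[index] = frame[index] * factor
        scaled.append(new_frame)
    return scaled

# Hypothetical formant tracks in Hz, one [F1, F2, F3, F4] list per frame.
frames = [
    [500.0, 1500.0, 2500.0, 3500.0],
    [520.0, 1480.0, 2550.0, 3450.0],
]

# The same scale factors as in the tables above, applied here to F1.
factors = [0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3]
variants = {factor: scale_formant(frames, 0, factor) for factor in factors}
```

In this sketch, a factor of 1.0 leaves the tracks unchanged, so that variant corresponds to plain copy synthesis.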

Citation information

@article{juvela2024hifi,
  title={HiFi-Glot: Neural Formant Synthesis with Differentiable Resonant Filters},
  author={Juvela, Lauri and P{\'e}rez Zarazaga, Pablo and Henter, Gustav Eje and Malisz, Zofia},
  journal={arXiv preprint arXiv:2409.14823},
  year={2024}
}