Skip to the content.

Abstract

Although current TTS systems can already achieve human-parity voice quality, they still face a critical challenge: they require sufficient high-quality data of the target speaker, which is particularly scarce for long-form expressive speech synthesis. This paper introduces a novel context-aware voice pretraining model leveraging contrastive learning for expressive Text-to-Speech (TTS) synthesis. The model utilizes a vast amount of contextual voice data without explicit speaker labels, capturing the intricate relationship between context and speech expression. It enables the generation of coherent and expressive speech by integrating cross-modal features of context and speech into the TTS model, particularly beneficial when target speaker data is scarce. For validation, the pretraining model is 1) applied to Context-Speech retrieval tasks for parameter optimization and 2) integrated with a zero-shot TTS system to assess the assistance of learned cross-modal features in expressive TTS modeling. Experiments demonstrate that the pretraining framework is capable of learning effective Context-Speech representations, which significantly contribute to the generation of more expressive speech synthesis.

Contents

Audio Samples

Sentences

TranscriptReferenceNS2NS2+CCSP
They ventured into the temple, facing intricate puzzles and cunning traps.


People praised their bravery and selflessness.


Once upon a time, in a small town nestled amidst rolling hills, lived two friends, Alex and Emily.


Paragraphs

TranscriptNS2NS2+CCSP
Under a bruised and brooding sky, Mara pressed against the cold, unyielding stone of the city wall. She could feel the vibrations of marching feet in her bones. The enemy was at the gates, the siege had begun.

The man chuckled and said, "I can assist you, but on one condition – you must help me complete a task." After a moment's contemplation, they agreed, curious about the man's proposition. He led them through perilous mountain valleys until they arrived at an ancient temple.

The man chuckled and said, "I can assist you, but on one condition – you must help me complete a task." After a moment's contemplation, they agreed, curious about the man's proposition. He led them through perilous mountain valleys until they arrived at an ancient temple.

However, amidst their excitement and greed, the rugged man reappeared, wearing an evil grin. "Thank you for your assistance, but this treasure now belongs to me," he declared. Filled with righteous anger, they gripped the sword and prepared to battle.

FastSpeech2 Samples

TranscriptFS2FS2+CCSP
Lawson also evaluated the security hazards at the trade mart on november thirteen.

It was so used for the date of the abolition of the star chamber in the sixteenth charles the first.

The soviet authorities denied oswald permission.

And therefore approximately one point six seconds after the president was shot in the head.