Phonetic, Pitch, and Tone Tokens
Phonetic tokens represent the sounds of spoken language as discrete units, which lets a language model process and generate speech much as it handles text. Pitch and tone tokens capture what phonetic content alone misses: intonation, speaking style, and emotion are encoded as additional discrete tokens alongside the phonetic ones.
Spirit LM incorporates all three token types so that its generated speech reflects these aspects of human speech. Because the model is trained on a combination of text and speech datasets, it can learn how phonetic, pitch, and tone tokens relate to one another and generate speech that is more expressive and natural-sounding.
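
One way to picture how three token types can feed a single model is to merge them into one sequence ordered by time. The sketch below is illustrative only and is not Spirit LM's actual tokenizer: the names TimedToken, make_stream, and interleave, the bracketed prefixes (Ph, Pi, To), the frame intervals, and the tie-breaking order are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TimedToken:
    start_ms: int  # start time of the token in milliseconds
    kind: str      # "phonetic", "pitch", or "tone"
    unit: int      # discrete unit index


# Assumed tie-break when tokens share a start time: tone, then pitch, then phonetic.
_PRIORITY = {"tone": 0, "pitch": 1, "phonetic": 2}
# Hypothetical text prefixes used to render each token type.
_PREFIX = {"tone": "To", "pitch": "Pi", "phonetic": "Ph"}


def make_stream(units: List[int], kind: str, step_ms: int) -> List[TimedToken]:
    """Place one token every `step_ms` milliseconds, starting at time 0."""
    return [TimedToken(i * step_ms, kind, u) for i, u in enumerate(units)]


def interleave(*streams: List[TimedToken]) -> List[str]:
    """Merge several token streams into one sequence sorted by start time."""
    merged = [tok for stream in streams for tok in stream]
    merged.sort(key=lambda t: (t.start_ms, _PRIORITY[t.kind]))
    return [f"[{_PREFIX[t.kind]}{t.unit}]" for t in merged]


# Made-up unit indices and frame intervals, for illustration only.
phonetic = make_stream([7, 7, 42, 13], "phonetic", step_ms=40)
pitch = make_stream([3, 5], "pitch", step_ms=80)
tone = make_stream([1], "tone", step_ms=1000)

sequence = interleave(tone, pitch, phonetic)
print("".join(sequence))  # [To1][Pi3][Ph7][Ph7][Pi5][Ph42][Ph13]
```

Rendering every token type into one flat sequence means an ordinary next-token language model can be trained on it without architectural changes; the slower pitch and tone streams simply appear less often than the phonetic one.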