Overview
This guide covers techniques and best practices for generating high-quality, natural-sounding speech for your applications using Tabbly TTS.
General Best Practices
1. Pick a Suitable Voice
Different voices suit different applications. Choose a voice that matches the emotional range and expression you’re looking for. For example:
- Meditation app: Select a steady and calm voice
- Fitness coach: Select an expressive and energetic voice
- Customer support: Select a professional and friendly voice
- Educational content: Select a clear and articulate voice
2. Pay Attention to Punctuation
Punctuation matters! Use it effectively to control speech delivery:
- Exclamation points (!): Make the voice more emphatic and excited
- Ellipsis (…): Insert natural pauses
- Dashes (—): Create pauses or breaks in thought
- Periods (.): Natural sentence endings
- Commas (,): Brief pauses between phrases
3. Use Asterisks for Emphasis
You can emphasize specific words by surrounding them with asterisks. This helps clarify tone or intent in nuanced dialogue. Examples:
- We *need* a beach vacation - Emphasizes “need”
- We need a *beach* vacation - Emphasizes “beach”
- *This* is the most important point - Emphasizes “This”
4. Match the Voice to the Text Language
Voices perform best when synthesizing text in the same language as the original voice. While cross-language synthesis is possible, you’ll achieve the best quality, pronunciation, and naturalness by matching the voice’s native language to your text content.
5. Normalize Complex Text
If the model mispronounces complex phrases such as phone numbers or dollar amounts, normalize the text by writing it out in words. This is particularly helpful for non-English languages. Normalization Examples:
Phone Numbers:
- (123)456-7891 → one two three, four five six, seven eight nine one
- +1-555-123-4567 → plus one, five five five, one two three, four five six seven
Dates:
- 5/6/2025 → may sixth twenty twenty five
- 12/25/2024 → december twenty fifth twenty twenty four
Times:
- 12:55 PM → twelve fifty-five PM
- 9:30 AM → nine thirty AM
Email Addresses:
- test@example.com → test at example dot com
- support@tabbly.io → support at tabbly dot io
Currency:
- $5,342.29 → five thousand three hundred and forty two dollars and twenty nine cents
- €1,000 → one thousand euros
Math and Symbols:
- 2+2=4 → two plus two equals four
- 100% → one hundred percent
- #1 → number one
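The phone number and email examples above are regular enough to automate. The following is a minimal sketch of client-side normalization; the function names are hypothetical helpers, not part of Tabbly TTS, and a production normalizer would need to cover dates, times, currency, and other locales as well.

```python
import re

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def spell_digits(digits: str) -> str:
    """Spell a run of digits one digit at a time."""
    return " ".join(DIGIT_WORDS[int(d)] for d in digits)


def normalize_phone(phone: str) -> str:
    """Spell each digit group, separating groups with commas for short pauses."""
    prefix = "plus " if phone.strip().startswith("+") else ""
    groups = re.findall(r"\d+", phone)
    return prefix + ", ".join(spell_digits(group) for group in groups)


def normalize_email(email: str) -> str:
    """Replace '@' and '.' with their spoken forms."""
    return email.replace("@", " at ").replace(".", " dot ")


if __name__ == "__main__":
    print(normalize_phone("(123)456-7891"))
    print(normalize_phone("+1-555-123-4567"))
    print(normalize_email("support@tabbly.io"))
```

Running the normalizer before sending text keeps the spoken form under your control instead of leaving it to the model.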
6. Tune the Temperature
The temperature controls how random the audio output is:
- Higher values: More random outputs, more expressive results
  - Best for: Barks, demo clips, non-real-time use cases
  - Range: 1.0 - 1.5
- Lower values: More deterministic output
  - Best for: Real-time use cases, consistent delivery
  - Range: 0.6 - 1.0 (recommended for real-time)
- Default: 1.1
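One way to apply these ranges consistently is to clamp the temperature before building a request. This is a small sketch under assumptions: `pick_temperature` is a hypothetical helper, and how the value is passed to the API is not shown because it depends on your client.

```python
from typing import Optional

# Ranges and default taken from the guidance above.
REALTIME_RANGE = (0.6, 1.0)
EXPRESSIVE_RANGE = (1.0, 1.5)
DEFAULT_TEMPERATURE = 1.1


def pick_temperature(realtime: bool, requested: Optional[float] = None) -> float:
    """Clamp the requested (or default) temperature to the range for the use case."""
    low, high = REALTIME_RANGE if realtime else EXPRESSIVE_RANGE
    value = DEFAULT_TEMPERATURE if requested is None else requested
    return min(max(value, low), high)


if __name__ == "__main__":
    print(pick_temperature(realtime=True))         # default clamped into the real-time range
    print(pick_temperature(realtime=False, requested=1.3))
```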
Voice Tags
Voice tags provide descriptive metadata about each voice, helping you categorize and filter voices based on their characteristics. Tags describe properties like gender, age group, tone, and style, making it easier to find the right voice for your use case.
Understanding Voice Tags
Each voice includes a tags array with descriptive labels such as:
- Gender: male, female, non-binary
- Age group: young_adult, adult, middle-aged, elderly
- Tone: energetic, calm, professional, friendly, warm
- Style: smooth, clear, expressive, conversational
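Because each voice carries a tags array, selecting candidates can be a simple filter. The sketch below assumes voices are available as dictionaries with a `tags` list; the `voice_id` values and the overall record shape are hypothetical and only serve the illustration.

```python
from typing import Dict, Iterable, List

# Hypothetical voice records; only the "tags" array is described in this guide.
VOICES: List[Dict] = [
    {"voice_id": "voice_a", "tags": ["female", "adult", "calm", "clear"]},
    {"voice_id": "voice_b", "tags": ["male", "young_adult", "energetic", "expressive"]},
    {"voice_id": "voice_c", "tags": ["female", "middle-aged", "professional", "clear"]},
]


def find_voices(voices: Iterable[Dict], required_tags: Iterable[str]) -> List[Dict]:
    """Return the voices whose tags include every required tag."""
    required = set(required_tags)
    return [voice for voice in voices if required.issubset(voice.get("tags", []))]


if __name__ == "__main__":
    for voice in find_voices(VOICES, ["female", "clear"]):
        print(voice["voice_id"])
```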
Using Voice Tags
When selecting a voice for your application, use tags to find voices that match your requirements.
Audio Markups
Audio markups are currently experimental and only support English. They are not recommended for real-time, production use cases.
Emotion and Delivery Style
Emotion and delivery style markups control the way a given text is spoken. These work best when used at the beginning of a text and apply to the text that follows.
Emotions:
- [happy] - Happy, cheerful tone
- [sad] - Sad, melancholic tone
- [angry] - Angry, frustrated tone
- [surprised] - Surprised, shocked tone
- [fearful] - Fearful, anxious tone
- [disgusted] - Disgusted, repulsed tone
Delivery styles:
- [laughing] - Laughing while speaking
- [whispering] - Whispered delivery
Non-verbal Vocalization
Non-verbal vocalization markups add non-verbal sounds based on where they are placed in the text. Available Markups:
- [breathe] - Breathing sound
- [clear_throat] - Throat clearing
- [cough] - Coughing
- [laugh] - Laughing
- [sigh] - Sighing
- [yawn] - Yawning
Best Practices for Audio Markups
- Choose Contextually Appropriate Markups
  - Markups work best when they make sense with the text content
  - Avoid contradictions between markup and text
  - Bad example: an [angry] markup on text that expresses gratitude. The text is grateful, which contradicts the angry markup.
- Avoid Conflicting Markups
  - Ensure multiple markups don’t conflict with each other
  - Bad example: combining [angry] and [yawn] in the same text. Yawning indicates boredom, which rarely occurs alongside anger.
- Break Up the Text
  - Emotion and delivery style markups work best at the beginning of the text, with a single markup per request
  - Break complex text into separate requests, each starting with its own markup
- Repeat Non-verbal Vocalizations if Necessary
  - If a non-verbal vocalization is consistently omitted, repeat the markup
  - This works best for vocalizations where repetition sounds natural
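The "Break Up the Text" advice can be applied mechanically by splitting a script wherever a new emotion or delivery style markup begins, then sending each piece as its own request. The following is a rough sketch; `split_by_emotion` is a hypothetical helper, and the markup set is copied from the lists earlier in this section.

```python
import re
from typing import List

# Emotion and delivery style markups listed in this guide.
EMOTION_MARKUPS = {"happy", "sad", "angry", "surprised",
                   "fearful", "disgusted", "laughing", "whispering"}
MARKUP_RE = re.compile(r"\[(\w+)\]")


def split_by_emotion(text: str) -> List[str]:
    """Split text so that each segment starts with at most one emotion markup.

    Non-verbal markups such as [laugh] or [sigh] stay where they are.
    """
    segments: List[str] = []
    start = 0
    for match in MARKUP_RE.finditer(text):
        if match.group(1) in EMOTION_MARKUPS and match.start() > start:
            piece = text[start:match.start()].strip()
            if piece:
                segments.append(piece)
            start = match.start()
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


if __name__ == "__main__":
    script = "[happy] I got the job! [sad] But I have to move away."
    for segment in split_by_emotion(script):
        print(segment)
```

Each returned segment can then be synthesized separately and the audio concatenated in order.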
Custom Pronunciation
Sometimes you may need to ensure that a word is spoken with a specific pronunciation, especially for uncommon words such as company names, brand names, nicknames, geographic locations, medical terms, or legal terms that may not appear in the model’s training data.
How to Use
Tabbly TTS supports inline IPA phoneme notation for custom pronunciation. Use the International Phonetic Alphabet (IPA) format, wrapped in slashes (/ /).
Example:
Suppose you are building an AI travel agent, and it is recommending the destination Crete, which is pronounced /kriːt/ (“kreet”) in English.
You can ensure the correct pronunciation by passing it inline:
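As an illustration, the sketch below swaps known words for their slash-wrapped IPA form before the text is sent. It assumes the inline form is simply the IPA string between slashes in place of the word, as described above; `apply_pronunciations` and the sample sentence are hypothetical.

```python
import re
from typing import Dict

# Word -> IPA mapping; the Crete entry comes from the example above.
PRONUNCIATIONS: Dict[str, str] = {"Crete": "kriːt"}


def apply_pronunciations(text: str, pronunciations: Dict[str, str]) -> str:
    """Replace whole-word matches with their IPA form wrapped in slashes."""
    for word, ipa in pronunciations.items():
        pattern = r"\b" + re.escape(word) + r"\b"
        # A function replacement avoids backslash handling in the IPA string.
        text = re.sub(pattern, lambda _match, ipa=ipa: f"/{ipa}/", text)
    return text


if __name__ == "__main__":
    print(apply_pronunciations("You should visit Crete this summer.", PRONUNCIATIONS))
```

Keeping the mapping in one place makes it easy to reuse the same pronunciations across every request.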
Finding the Right IPA Phonemes
If you are unsure of the correct phonemes, there are several ways to find them:
- Ask an LLM: Use ChatGPT or a similar service.
- Use Reference Websites: Resources such as Vocabulary.com’s IPA Pronunciation Guide provide tables of symbols with example words.
- Online IPA Converters: Various online tools can help convert words to IPA notation.
Natural, Conversational Speech
Natural human conversation is not perfect. It’s full of filler words, pauses, and other speech patterns that make it sound human. Tabbly TTS models are trained to generate the requested text as is, which produces accurate and consistent output for a wide range of applications but means these patterns must be present in the text you send. To generate natural, conversational speech, you can use the following techniques:
1. Insert Filler Words
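Add filler words like uh, um, well, like, and you know to the text.
For instance, instead of “I think we should go with the second option”, write “Well, um, I think we should go with the second option.”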
2. Use Audio Markups
Use audio markups to add non-verbal vocalizations like [sigh], [breathe], and [clear_throat]. These sounds can make the speech feel more natural.
For instance: “[sigh] I guess we can try again tomorrow.”
Advanced Tips
Text Chunking
For very long texts, consider splitting into smaller chunks for better streaming performance and more natural delivery:
- Optimal chunk size: 50-200 words
- Natural break points: Sentence endings, paragraph breaks
- Benefits: Faster streaming, better quality, more natural pauses
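A simple way to follow these guidelines is to split on sentence endings and pack sentences into chunks up to a word limit. This is a minimal sketch: `chunk_text` is a hypothetical helper, the sentence splitter is a naive regular expression, and a single sentence longer than the limit is kept whole rather than cut mid-sentence.

```python
import re
from typing import List


def chunk_text(text: str, max_words: int = 200) -> List[str]:
    """Group whole sentences into chunks of at most max_words words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for sentence in sentences:
        if not sentence:
            continue
        words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and count + words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    sample = "One two three. Four five six. Seven eight nine."
    for chunk in chunk_text(sample, max_words=6):
        print(chunk)
```

Breaking only at sentence endings keeps each chunk's pauses natural when the audio is played back to back.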
Error Handling
Always implement proper error handling for TTS requests.
Caching
Consider caching frequently used phrases or responses:
- Cache key: Text content + voice_id + model_id
- Cache duration: Based on your use case
- Benefits: Reduced API calls, faster response times, cost savings
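The cache key described above, combined with the retry-style error handling recommended earlier, might be sketched as follows. Everything here is an assumption for illustration: `CachedSynthesizer` is a hypothetical wrapper, the `synthesize` callable stands in for whatever client call you use, and the in-memory dictionary has no expiry.

```python
import hashlib
import time
from typing import Callable, Dict


def cache_key(text: str, voice_id: str, model_id: str) -> str:
    """Derive a stable key from text content + voice_id + model_id."""
    raw = "\x1f".join([model_id, voice_id, text])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class CachedSynthesizer:
    """Wraps a caller-supplied synthesize function with caching and retries."""

    def __init__(self, synthesize: Callable[[str, str, str], bytes],
                 max_retries: int = 3, backoff_seconds: float = 0.5) -> None:
        self._synthesize = synthesize
        self._max_retries = max(1, max_retries)
        self._backoff = backoff_seconds
        self._cache: Dict[str, bytes] = {}

    def get_audio(self, text: str, voice_id: str, model_id: str) -> bytes:
        key = cache_key(text, voice_id, model_id)
        if key in self._cache:
            return self._cache[key]
        for attempt in range(self._max_retries):
            try:
                audio = self._synthesize(text, voice_id, model_id)
                break
            except Exception:  # narrow this to your client's error types
                if attempt == self._max_retries - 1:
                    raise
                time.sleep(self._backoff * 2 ** attempt)  # exponential backoff
        self._cache[key] = audio
        return audio
```

For a real deployment, the dictionary would typically be replaced by a store with an expiry policy that matches the cache duration you choose.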
Performance Optimization
- Reuse HTTP Clients: Don’t create new clients for each request
- Connection Pooling: Use connection pooling for better performance
- Async Processing: Use async/await for non-blocking operations
- Batch Requests: When possible, batch multiple TTS requests
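When no batch endpoint is available, several requests can still be issued concurrently with async/await. The sketch below limits concurrency with a semaphore; `synthesize_all` is a hypothetical helper, and the `synthesize` coroutine is supplied by the caller (ideally closing over one shared HTTP client so connections are reused).

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence


async def synthesize_all(texts: Sequence[str],
                         synthesize: Callable[[str], Awaitable[bytes]],
                         max_concurrency: int = 4) -> List[bytes]:
    """Run synthesize for every text concurrently, preserving input order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(text: str) -> bytes:
        # The semaphore caps how many requests are in flight at once.
        async with semaphore:
            return await synthesize(text)

    return list(await asyncio.gather(*(worker(text) for text in texts)))


if __name__ == "__main__":
    async def fake_synthesize(text: str) -> bytes:
        await asyncio.sleep(0)  # stand-in for a network call
        return text.upper().encode("utf-8")

    print(asyncio.run(synthesize_all(["hello", "world"], fake_synthesize)))
```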
Common Use Cases
Customer Support
- Voice: Professional, friendly, calm
- Style: Clear and articulate
- Punctuation: Use commas for natural pauses
- Example:
Hello, thank you for calling. How can I help you today?
E-learning
- Voice: Clear, articulate, patient
- Style: Educational and engaging
- Pace: Slightly slower for comprehension
- Example:
Today, we're going to learn about...
Entertainment
- Voice: Expressive, energetic
- Style: Dynamic and engaging
- Markups: Use emotion markups for variety
- Example:
[happy] Welcome to the show!
Accessibility
- Voice: Clear, consistent
- Style: Predictable and easy to understand
- Normalization: Always normalize numbers and symbols
- Example:
The time is twelve thirty PM.
Troubleshooting
Mispronunciations
- Normalize text: Convert numbers, dates, symbols to words
- Use custom pronunciation: For brand names and technical terms
- Check language match: Ensure voice language matches text language
Unnatural Speech
- Add punctuation: Use commas, periods, ellipses
- Insert filler words: Add natural speech patterns
- Use audio markups: Add non-verbal vocalizations
- Adjust temperature: Try different temperature values
Performance Issues
- Chunk long texts: Split into smaller pieces
- Cache responses: Store frequently used phrases
- Reuse connections: Don’t create new HTTP clients
- Monitor API usage: Track response times and errors
Summary
Following these best practices will help you generate high-quality, natural-sounding speech with Tabbly TTS:
- ✅ Choose appropriate voices for your use case
- ✅ Use punctuation effectively
- ✅ Normalize complex text (numbers, dates, symbols)
- ✅ Match voice language to text language
- ✅ Tune temperature for your use case
- ✅ Use audio markups for natural speech (experimental)
- ✅ Implement custom pronunciation for special terms
- ✅ Add filler words for conversational speech
- ✅ Handle errors gracefully
- ✅ Optimize for performance