TTS LiveKit Integration

Overview

Tabbly TTS provides a streaming Text-to-Speech API. This guide shows how to use it as the TTS provider in a LiveKit voice agent, with buffered audio delivery designed to avoid clicks, pops, and choppy sound.

Prerequisites

LiveKit Agents Python SDK installed
Tabbly TTS API key
Python 3.11+
httpx library (for HTTP streaming)

API Details

Base URL: https://api.tabbly.io
Streaming Endpoint: POST /tts/stream
Authentication: API key via X-API-Key header
Response Format: HTTP streaming response with WAV-encoded audio chunks (LINEAR16, 48kHz, mono)
Protocol: HTTP streaming with WAV files embedded in the stream
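To exercise the endpoint outside LiveKit, the request described above can be sketched roughly as follows. The JSON fields (text, voice_id, model_id) mirror the payload used in the integration code later in this guide; the helper names build_tts_request and fetch_tts_bytes are illustrative assumptions, not part of any Tabbly SDK.

```python
from typing import Any

BASE_URL = "https://api.tabbly.io"


def build_tts_request(api_key: str, text: str, voice_id: str = "Ashley",
                      model_id: str = "tabbly-tts") -> dict[str, Any]:
    """Assemble the URL, headers, and JSON body for POST /tts/stream."""
    return {
        "url": f"{BASE_URL}/tts/stream",
        "headers": {"Content-Type": "application/json", "X-API-Key": api_key},
        "json": {"text": text, "voice_id": voice_id, "model_id": model_id},
    }


def fetch_tts_bytes(api_key: str, text: str) -> bytes:
    """Stream the response and return every byte received (WAV-encoded audio)."""
    import httpx  # imported lazily so build_tts_request needs no third-party package

    request = build_tts_request(api_key, text)
    received = bytearray()
    with httpx.stream("POST", timeout=60.0, **request) as response:
        response.raise_for_status()
        for chunk in response.iter_bytes():
            received.extend(chunk)
    return bytes(received)
```

Calling fetch_tts_bytes with a valid key should return the raw stream, WAV headers included, which is handy for inspecting what the API actually sends before wiring it into an agent.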

Integration Steps

1. Create the TTS Class

Create a custom TTS class that inherits from livekit.agents.tts.TTS:

from livekit.agents import tts, utils
from livekit.agents.types import DEFAULT_API_CONNECT_OPTIONS
from typing import Optional, Any
import httpx
import logging

logger = logging.getLogger(__name__)

class TabblyTTS(tts.TTS):
    """Custom TTS implementation for Tabbly TTS API."""
    
    def __init__(
        self,
        api_key: str = "your-api-key",
        voice_id: str = "Ashley",
        model_id: str = "tabbly-tts",
        base_url: str = "https://api.tabbly.io",
    ):
        # Initialize TTS base class with capabilities
        # Use ChunkedStream pattern - even though API supports streaming,
        # LiveKit's ChunkedStream handles the chunking for us
        super().__init__(
            capabilities=tts.TTSCapabilities(streaming=False),
            sample_rate=48000,
            num_channels=1,
        )
        self.api_key = api_key
        self.voice_id = voice_id
        self.model_id = model_id
        self.base_url = base_url
        self._http_client = None
    
    @property
    def http_client(self):
        """Get or create HTTP client."""
        if self._http_client is None:
            self._http_client = httpx.AsyncClient(timeout=60.0, follow_redirects=True)
        return self._http_client
    
    def synthesize(
        self, 
        text: str, 
        conn_options: Optional[Any] = DEFAULT_API_CONNECT_OPTIONS,
    ) -> tts.ChunkedStream:
        """Synthesize text to speech and return a chunked stream."""
        return TabblyChunkedStream(
            tts=self,
            input_text=text,
            conn_options=conn_options,
        )

2. Create the ChunkedStream Class

Implement a ChunkedStream that handles the audio streaming with optimized buffering:

class TabblyChunkedStream(tts.ChunkedStream):
    """Chunked stream for Tabbly TTS synthesis with optimized audio delivery.
    
    This improved version eliminates audio clicks/pops by:
    - Using larger chunk sizes (10ms) to reduce overhead
    - Pre-buffering (20ms) to smooth out network jitter
    - Ensuring perfect frame alignment
    - Continuous data flow without gaps
    - Handling embedded WAV files in the stream
    """
    
    def __init__(
        self,
        *,
        tts: TabblyTTS,
        input_text: str,
        conn_options: Any,
    ) -> None:
        super().__init__(tts=tts, input_text=input_text, conn_options=conn_options)
        self._tts: TabblyTTS = tts
        self._input_text = input_text
        self._conn_options = conn_options
    
    async def _run(self, output_emitter: tts.AudioEmitter) -> None:
        """Run the chunked synthesis process with optimized buffering."""
        try:
            # Initialize AudioEmitter
            if not hasattr(output_emitter, '_initialized') or not output_emitter._initialized:
                output_emitter.initialize(
                    request_id=utils.shortuuid(),
                    sample_rate=48000,
                    num_channels=1,
                    mime_type="audio/pcm",
                )
        except RuntimeError as e:
            if "AudioEmitter already started" in str(e):
                logger.warning("AudioEmitter already initialized, continuing with existing instance")
            else:
                raise
        
        # Make the TTS API request
        url = f"{self._tts.base_url}/tts/stream"
        headers = {
            "Content-Type": "application/json",
            "X-API-Key": self._tts.api_key,
        }
        data = {
            "text": self._input_text,
            "voice_id": self._tts.voice_id,
            "model_id": self._tts.model_id,
        }
        
        try:
            async with self._tts.http_client.stream(
                "POST",
                url,
                json=data,
                headers=headers,
            ) as response:
                response.raise_for_status()
                
                # Audio format constants
                SAMPLE_RATE = 48000
                FRAME_SIZE = 2  # 16-bit mono = 2 bytes per frame
                
                # Buffering strategy for smooth playback
                # Use larger chunks (10ms) to reduce overhead and prevent clicks
                # Pre-buffer helps smooth out network jitter
                CHUNK_SIZE = 960  # 10ms at 48kHz (480 samples * 2 bytes)
                PRE_BUFFER_SIZE = 1920  # 20ms pre-buffer (960 samples * 2 bytes)
                
                buffer = bytearray()
                pre_buffer = bytearray()
                header_skipped = False
                header_size = 44  # Standard WAV header size
                
                async for chunk in response.aiter_bytes():
                    if chunk:
                        buffer.extend(chunk)
                        
                        # CRITICAL: Extract PCM data from WAV files sent by the API
                        # The API may send WAV files (with headers) in chunks, not raw PCM
                        # We need to extract the raw PCM data from each WAV chunk
                        
                        # Skip WAV header on first chunk
                        if not header_skipped:
                            if len(buffer) >= header_size:
                                # Check if it's a WAV file
                                if buffer[:4] == b'RIFF' and buffer[8:12] == b'WAVE':
                                    # Find "data" chunk marker
                                    data_start = None
                                    for i in range(12, min(len(buffer), 200)):
                                        if buffer[i:i+4] == b'data':
                                            data_start = i + 8  # Skip "data" (4 bytes) + size (4 bytes)
                                            break
                                    
                                    if data_start:
                                        buffer = buffer[data_start:]
                                        logger.debug(f"Extracted PCM from first WAV header: {len(buffer)} bytes")
                                    else:
                                        # Fallback: skip standard 44-byte header
                                        buffer = buffer[header_size:]
                                    header_skipped = True
                                else:
                                    # Not a WAV file, process as raw PCM
                                    header_skipped = True
                        
                        # Process audio data after header is skipped
                        if header_skipped and len(buffer) > 0:
                            # CRITICAL: Check if subsequent chunks are also WAV files
                            # The API may send multiple WAV files in the stream
                            processed_buffer = bytearray()
                            temp_buffer = buffer
                            
                            while len(temp_buffer) > 0:
                                # Check if this chunk starts with a WAV header
                                if len(temp_buffer) >= 12 and temp_buffer[:4] == b'RIFF' and temp_buffer[8:12] == b'WAVE':
                                    # This is a WAV file - extract PCM data
                                    data_start = None
                                    # Search for "data" chunk (can be anywhere after "WAVE")
                                    for i in range(12, len(temp_buffer)):
                                        if i + 4 <= len(temp_buffer) and temp_buffer[i:i+4] == b'data':
                                            data_start = i + 8  # Skip "data" (4 bytes) + size (4 bytes)
                                            break
                                    
                                    if data_start and data_start < len(temp_buffer):
                                        # Extract PCM data from this WAV chunk
                                        # Find where this WAV file ends (next "RIFF" or end of buffer)
                                        wav_end = len(temp_buffer)
                                        for i in range(data_start, len(temp_buffer) - 4):
                                            if temp_buffer[i:i+4] == b'RIFF':
                                                wav_end = i
                                                break
                                        
                                        # Extract PCM data from this WAV chunk
                                        pcm_data = temp_buffer[data_start:wav_end]
                                        processed_buffer.extend(pcm_data)
                                        temp_buffer = temp_buffer[wav_end:]
                                        logger.debug(f"Extracted PCM from WAV chunk: {len(pcm_data)} bytes")
                                    else:
                                        # Incomplete WAV file, keep in buffer for next iteration
                                        break
                                else:
                                    # Not a WAV file, treat as raw PCM
                                    processed_buffer.extend(temp_buffer)
                                    temp_buffer = bytearray()
                            
                            # Update buffer with remaining unprocessed data
                            buffer = temp_buffer
                            
                            # Add processed PCM data to pre-buffer for smoothing
                            if len(processed_buffer) > 0:
                                pre_buffer.extend(processed_buffer)
                            
                            # Wait until the pre-buffer holds PRE_BUFFER_SIZE bytes (20ms),
                            # then drain it in CHUNK_SIZE pieces. Holding back a little
                            # audio absorbs network jitter; whatever is left over is
                            # pushed after the stream ends.
                            if len(pre_buffer) >= PRE_BUFFER_SIZE:
                                while len(pre_buffer) >= CHUNK_SIZE:
                                    # Extract a sample-aligned chunk (CHUNK_SIZE is even)
                                    chunk_data = bytes(pre_buffer[:CHUNK_SIZE])
                                    pre_buffer = pre_buffer[CHUNK_SIZE:]
                                    output_emitter.push(chunk_data)
                
                # Process any remaining data in pre-buffer
                # Ensure frame alignment before pushing final chunk
                if header_skipped and len(pre_buffer) >= FRAME_SIZE:
                    # Align to frame boundary
                    aligned_size = (len(pre_buffer) // FRAME_SIZE) * FRAME_SIZE
                    if aligned_size > 0:
                        final_chunk = bytes(pre_buffer[:aligned_size])
                        output_emitter.push(final_chunk)
                
                # Process any remaining data in buffer
                if header_skipped and len(buffer) >= FRAME_SIZE:
                    aligned_size = (len(buffer) // FRAME_SIZE) * FRAME_SIZE
                    if aligned_size > 0:
                        final_chunk = bytes(buffer[:aligned_size])
                        output_emitter.push(final_chunk)
                
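        except httpx.HTTPStatusError as e:
            # The body of a streamed response has not been read at this point, so
            # e.response.text would raise; log the status line instead.
            logger.error(f"Tabbly TTS API error: {e.response.status_code} {e.response.reason_phrase}")
            raise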
        except Exception as e:
            logger.error(f"Error in Tabbly TTS synthesis: {e}")
            raise
        finally:
            # Flush at the end so any audio still buffered in the emitter is sent.
            # A failure in flush() is logged rather than raised.
            try:
                output_emitter.flush()
            except Exception as e:
                logger.warning(f"Error flushing audio emitter: {e}")

3. Use in Your Agent

In your LiveKit agent entrypoint:

from livekit import agents
from livekit.agents import AgentSession

async def entrypoint(ctx: agents.JobContext):
    # Create Tabbly TTS instance
    tabbly_tts = TabblyTTS(
        api_key="your-api-key-here",
        voice_id="Ashley",  # Optional, defaults to "Ashley"
        model_id="tabbly-tts",  # Optional, defaults to "tabbly-tts"
        base_url="https://api.tabbly.io",  # Optional, uses default if not provided
    )
    
    # Create agent session with Tabbly TTS
    session = AgentSession(
        tts=tabbly_tts,
        stt=your_stt_provider,
        llm=your_llm_provider,
    )
    
    await ctx.connect()
    await session.start(room=ctx.room, agent=your_agent)

Configuration Options

TabblyTTS Parameters

api_key (string, required): Your Tabbly TTS API key
voice_id (string, default: "Ashley"): Voice ID to use
model_id (string, default: "tabbly-tts"): Model ID to use
base_url (string, default: "https://api.tabbly.io"): Base URL for the API

Audio Configuration

Sample Rate (integer): 48000 Hz (fixed)
Channels (integer): 1 (mono)
Format (string): LINEAR16 PCM (16-bit)
MIME Type (string): audio/pcm

Buffering Strategy

The implementation uses an optimized buffering strategy to eliminate audio artifacts:

Chunk Size: 960 bytes (10ms at 48kHz) - reduces overhead and prevents clicks
Pre-buffer Size: 1920 bytes (20ms) - smooths out network jitter
Frame Alignment: All chunks are aligned to 16-bit sample boundaries (even number of bytes)
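The byte counts above follow directly from the audio format. A minimal sketch of the arithmetic, assuming the 48 kHz, 16-bit, mono format stated in this guide (the function name bytes_for_duration is illustrative):

```python
SAMPLE_RATE = 48_000      # Hz
BYTES_PER_SAMPLE = 2      # 16-bit PCM
NUM_CHANNELS = 1          # mono


def bytes_for_duration(ms: int) -> int:
    """Number of PCM bytes covering `ms` milliseconds of audio."""
    samples = SAMPLE_RATE * ms // 1000
    return samples * BYTES_PER_SAMPLE * NUM_CHANNELS


CHUNK_SIZE = bytes_for_duration(10)       # 480 samples * 2 bytes = 960
PRE_BUFFER_SIZE = bytes_for_duration(20)  # 960 samples * 2 bytes = 1920
```

The same helper gives the value for any other buffer length you want to try, for example bytes_for_duration(30) for a 30ms pre-buffer.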

Key Features

1. WAV File Extraction

The API may send WAV files embedded in the stream. The implementation automatically:

Detects WAV headers (RIFF and WAVE markers)
Extracts raw PCM data from WAV chunks
Handles multiple WAV files in a single stream
Falls back to raw PCM if no WAV headers are detected

2. Optimized Audio Delivery

The implementation eliminates audio clicks, pops, and choppy sounds by:

Using consistent 10ms chunk sizes for steady delivery
Pre-buffering 20ms to smooth network jitter
Ensuring perfect frame alignment (16-bit sample boundaries)
Processing data continuously without gaps
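The chunking and alignment steps listed above can be isolated into a small generator, which makes them easy to test without a network connection. This is a simplified sketch that assumes the input is already raw PCM (headers removed); rechunk is an illustrative name.

```python
from collections.abc import Iterable, Iterator

CHUNK_SIZE = 960   # 10 ms of 48 kHz, 16-bit mono PCM
FRAME_SIZE = 2     # bytes per 16-bit mono sample


def rechunk(pcm_parts: Iterable[bytes], chunk_size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Yield fixed-size chunks, then one sample-aligned remainder."""
    pending = bytearray()
    for part in pcm_parts:
        pending.extend(part)
        while len(pending) >= chunk_size:
            yield bytes(pending[:chunk_size])
            del pending[:chunk_size]
    # Drop a trailing odd byte so the last chunk ends on a sample boundary.
    aligned = len(pending) - (len(pending) % FRAME_SIZE)
    if aligned:
        yield bytes(pending[:aligned])
```

Network chunks of arbitrary size go in and evenly sized, even-length chunks come out, so a sample is never split across two pushes.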

3. Error Handling

The implementation handles:

HTTP errors from the API
WAV header parsing errors
Network timeouts (60 seconds)
AudioEmitter initialization errors
Incomplete WAV files in the stream

Best Practices

HTTP Streaming: The integration uses HTTP streaming rather than WebSocket for better reliability
Error Handling: Always wrap TTS calls in try-except blocks to handle network and API errors gracefully
Monitor API Usage: Track your API usage through Tabbly's dashboard to manage costs and quotas
Voice Selection: Choose a voice_id that suits your use case; voices may differ in characteristics and language
Model Selection: Use the appropriate model_id (default: "tabbly-tts")
WAV Processing: The implementation handles WAV files automatically, so no manual processing is needed
Buffering: The pre-buffer smooths out network delays, so keep it enabled
Frame Alignment: Always emit sample-aligned chunks to prevent audio artifacts

Example: Using with Metadata

You can configure Tabbly TTS via LiveKit job metadata:

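import json

# In your job creation: metadata is passed as a JSON string
metadata = json.dumps({
    "tts_library": "tabbly",
    "use_tts_voice": "Ashley",
    "tts_api_key": "your-api-key"
})

# In your entrypoint
async def entrypoint(ctx: agents.JobContext):
    # ctx.job.metadata is a string, so decode it before calling .get()
    metadata = json.loads(ctx.job.metadata or "{}")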
    
    if metadata.get("tts_library", "").lower() == "tabbly":
        tts_config = TabblyTTS(
            api_key=metadata.get("tts_api_key", "default-key"),
            voice_id=metadata.get("use_tts_voice", "Ashley"),
            model_id="tabbly-tts",
        )
        
        session = AgentSession(
            tts=tts_config,
            stt=your_stt_provider,
            llm=your_llm_provider,
        )
        
        await ctx.connect()
        await session.start(room=ctx.room, agent=your_agent)

Troubleshooting

No Audio Output

Check API key is valid and has wallet balance > 0
Verify network connectivity to API endpoint (https://api.tabbly.io)
Check logs for HTTP connection errors
Verify the response status code is 200
Check if WAV headers are being detected correctly

Audio Quality Issues (Clicks, Pops, Choppy Sound)

This is the main issue this implementation solves!

Ensure the pre-buffer is working (check logs for buffer sizes)
Verify chunk size is 960 bytes (10ms)
Check that frame alignment is correct (even number of bytes)
Monitor network latency - high latency may require larger pre-buffer
Ensure WAV extraction is working (check debug logs)

HTTP Connection Issues

Check firewall/proxy settings for HTTPS connections
Verify API URL is correct (https://api.tabbly.io/tts/stream)
Check for HTTP timeout errors (default 60 seconds)
Ensure API key is properly included in X-API-Key header
Verify request format matches API requirements

Performance Issues

Monitor HTTP connection establishment time
Check for network latency to API endpoint
Monitor audio chunk arrival rate
Check if pre-buffer is filling up (may indicate slow network)
Consider adjusting CHUNK_SIZE or PRE_BUFFER_SIZE for your network conditions

WAV Processing Issues

Check logs for “Extracted PCM from WAV” messages
Verify WAV headers are being detected (RIFF and WAVE markers)
Check if multiple WAV files are being processed correctly
Ensure incomplete WAV files are being handled (kept in buffer)

Audio Artifact Prevention

This implementation specifically addresses audio artifacts (clicks, pops, choppy sounds) through:

Consistent Chunk Sizes: 10ms chunks provide steady delivery rate
Pre-buffering: 20ms buffer smooths out network jitter
Frame Alignment: All chunks aligned to 16-bit sample boundaries
Continuous Processing: No gaps between chunks
WAV Extraction: Properly extracts PCM from embedded WAV files

If you still experience audio artifacts:

Increase PRE_BUFFER_SIZE (try 2880 bytes = 30ms)
Check network latency and stability
Verify the API is sending consistent data
Check for any errors in the logs

Support

For issues or questions:

Check TTS Streaming API documentation
Review LiveKit Agents documentation
Check application logs for detailed error messages
Verify your implementation matches the code provided above

License

This integration follows the same license as your LiveKit Agents project.
