Overview

OpenAI provides two STT service implementations:
  • OpenAISTTService for VAD-segmented speech recognition using OpenAI’s transcription API (HTTP-based), supporting GPT-4o transcription and Whisper models
  • OpenAIRealtimeSTTService for real-time streaming speech-to-text using OpenAI’s Realtime API WebSocket transcription sessions, with support for local VAD and server-side VAD modes

Installation

To use OpenAI services, install the required dependency:
pip install "pipecat-ai[openai]"

Prerequisites

OpenAI Account Setup

Before using OpenAI STT services, you need:
  1. OpenAI Account: Sign up at OpenAI Platform
  2. API Key: Generate an API key from your account dashboard
  3. Model Access: Ensure your account has access to the Whisper and GPT-4o transcription models

Required Environment Variables

  • OPENAI_API_KEY: Your OpenAI API key for authentication
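
For local development, a common pattern is to keep the key in a .env file and load it before constructing the service. A minimal sketch, assuming python-dotenv is installed (not a Pipecat requirement):

import os

from dotenv import load_dotenv

load_dotenv()  # Reads OPENAI_API_KEY from a local .env file into the environment

api_key = os.getenv("OPENAI_API_KEY")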

OpenAISTTService

OpenAISTTService uses VAD-based audio segmentation with HTTP transcription requests. It records speech segments detected by local VAD and sends them to OpenAI’s transcription API.
import os

from pipecat.services.openai.stt import OpenAISTTService

stt = OpenAISTTService(
    api_key=os.getenv("OPENAI_API_KEY"),
    model="gpt-4o-transcribe",
)
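
Because this service relies on local VAD segmentation, the transport feeding it must run a VAD analyzer. A minimal pipeline sketch, assuming a Daily transport and the Silero VAD analyzer (the room URL and bot name are placeholders):

import os

from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.openai.stt import OpenAISTTService
from pipecat.transports.services.daily import DailyParams, DailyTransport

transport = DailyTransport(
    "https://example.daily.co/room",  # Placeholder room URL
    None,
    "Transcription bot",
    DailyParams(
        audio_in_enabled=True,
        vad_analyzer=SileroVADAnalyzer(),  # Local VAD segments speech for the STT service
    ),
)

stt = OpenAISTTService(api_key=os.getenv("OPENAI_API_KEY"))

# Transcribed segments flow downstream as TranscriptionFrames
pipeline = Pipeline([transport.input(), stt])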

OpenAIRealtimeSTTService

OpenAIRealtimeSTTService provides real-time streaming speech-to-text using OpenAI’s Realtime API WebSocket transcription sessions. Audio is streamed continuously over a WebSocket connection for lower latency compared to HTTP-based transcription.

Usage Example

import os

from pipecat.services.openai.stt import OpenAIRealtimeSTTService

# Local VAD mode (default) - use with a VAD processor in the pipeline
stt = OpenAIRealtimeSTTService(
    api_key=os.getenv("OPENAI_API_KEY"),
    model="gpt-4o-transcribe",
    noise_reduction="near_field",
)

# Server-side VAD mode - do NOT use a separate VAD processor
stt = OpenAIRealtimeSTTService(
    api_key=os.getenv("OPENAI_API_KEY"),
    model="gpt-4o-transcribe",
    turn_detection=None,  # None defers turn detection to OpenAI's server-side VAD
)
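
Both services emit transcription results downstream as TranscriptionFrame objects. A minimal sketch of consuming them with a custom frame processor (TranscriptPrinter is a hypothetical helper written for this example, not part of Pipecat):

import os

from pipecat.frames.frames import Frame, TranscriptionFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
from pipecat.services.openai.stt import OpenAIRealtimeSTTService


class TranscriptPrinter(FrameProcessor):
    # Hypothetical helper: prints final transcripts as they arrive
    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        if isinstance(frame, TranscriptionFrame):
            print(f"{frame.user_id}: {frame.text}")
        await self.push_frame(frame, direction)


stt = OpenAIRealtimeSTTService(
    api_key=os.getenv("OPENAI_API_KEY"),
    turn_detection=None,  # Server-side VAD; no local VAD analyzer needed in the transport
)

# transport.input() omitted here; see the OpenAISTTService pipeline sketch above
pipeline = Pipeline([stt, TranscriptPrinter()])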