stt

ElevenLabs speech-to-text service implementation.

This module integrates with ElevenLabs’ Speech-to-Text APIs. ElevenLabsSTTService transcribes segmented audio by uploading each segment as a file and receiving the transcription result directly. ElevenLabsRealtimeSTTService streams audio over the Realtime WebSocket API.

pipecat.services.elevenlabs.stt.language_to_elevenlabs_language(language: Language) → str[source]

Convert a Language enum to ElevenLabs language code.

Source:: https://elevenlabs.io/docs/capabilities/speech-to-text

Parameters:: language – The Language enum value to convert.
Returns:: The corresponding service language code. If language is not in the verified mapping, falls back to the full language code string and logs a warning (via resolve_language(..., use_base_code=False)).
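The fallback described above can be illustrated with a small stand-alone sketch. It does not call pipecat: `VERIFIED_CODES` and `to_service_code` are hypothetical names, plain strings stand in for the Language enum, and the two mapping entries are placeholders rather than the library's actual table.

```python
import logging

logger = logging.getLogger("stt_language_sketch")

# Hypothetical stand-in for the verified mapping; the entries are placeholders.
VERIFIED_CODES = {"en": "eng", "es": "spa"}


def to_service_code(language_code: str) -> str:
    """Return the mapped code, or the full input code with a warning."""
    if language_code in VERIFIED_CODES:
        return VERIFIED_CODES[language_code]
    # Not verified: keep the full code (no reduction to a base code) and warn.
    logger.warning("Language %s is not verified; passing it through", language_code)
    return language_code
```

With this sketch, a verified code is translated and anything else is passed through unchanged, which mirrors the documented use_base_code=False behaviour.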

class pipecat.services.elevenlabs.stt.CommitStrategy(*values)[source]

Bases: StrEnum

Commit strategies for transcript segmentation.

MANUAL = 'manual'

VAD = 'vad'
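Because the class derives from StrEnum, its members compare equal to their string values, which is convenient when the strategy comes from a configuration file. The sketch below re-declares the enum locally as `(str, Enum)` so that it runs without pipecat and on Python versions without StrEnum; `parse_commit_strategy` is a hypothetical helper, not part of the module.

```python
from enum import Enum


class CommitStrategy(str, Enum):
    """Local stand-in mirroring the two documented members."""

    MANUAL = "manual"
    VAD = "vad"


def parse_commit_strategy(raw: str) -> CommitStrategy:
    """Turn a config string such as " VAD " into an enum member.

    Raises ValueError for anything other than the two documented values.
    """
    return CommitStrategy(raw.strip().lower())
```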

class pipecat.services.elevenlabs.stt.ElevenLabsSTTSettings

Bases: STTSettings

Settings for ElevenLabsSTTService.

Parameters:

tag_audio_events – Whether to include audio events like (laughter), (coughing) in the transcription.
keyterms – List of key terms or phrases to bias transcription towards.

tag_audio_events: bool | None | _NotGiven

keyterms: list[str] | None | _NotGiven

class pipecat.services.elevenlabs.stt.ElevenLabsRealtimeSTTSettings

Bases: STTSettings

Settings for ElevenLabsRealtimeSTTService.

See ElevenLabsRealtimeSTTService.InputParams for detailed descriptions.

Parameters:

keyterms – List of key terms or phrases to bias transcription towards.
vad_silence_threshold_secs – Seconds of silence before VAD commits (0.3-3.0).
vad_threshold – VAD sensitivity (0.1-0.9, lower is more sensitive).
min_speech_duration_ms – Minimum speech duration for VAD (50-2000ms).
min_silence_duration_ms – Minimum silence duration for VAD (50-2000ms).

keyterms: list[str] | None | _NotGiven

vad_silence_threshold_secs: float | None | _NotGiven

vad_threshold: float | None | _NotGiven

min_speech_duration_ms: int | None | _NotGiven

min_silence_duration_ms: int | None | _NotGiven
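The documented ranges can be checked before the values are handed to the service. The following is a minimal sketch, assuming the ranges are inclusive and that None means "use the ElevenLabs default"; `VAD_RANGES` and `out_of_range` are hypothetical names, and this page does not say whether the service performs such validation itself.

```python
# Documented ranges for the VAD-related settings (assumed inclusive).
VAD_RANGES = {
    "vad_silence_threshold_secs": (0.3, 3.0),
    "vad_threshold": (0.1, 0.9),
    "min_speech_duration_ms": (50, 2000),
    "min_silence_duration_ms": (50, 2000),
}


def out_of_range(values: dict) -> list:
    """Return the names of settings whose value lies outside its range.

    Missing keys and None values are accepted without complaint.
    """
    bad = []
    for name, (low, high) in VAD_RANGES.items():
        value = values.get(name)
        if value is not None and not (low <= value <= high):
            bad.append(name)
    return bad
```

For example, `out_of_range({"vad_threshold": 0.5, "min_speech_duration_ms": 20})` flags only `min_speech_duration_ms`.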

class pipecat.services.elevenlabs.stt.ElevenLabsSTTService(*, api_key: str, aiohttp_session: ClientSession, base_url: str = 'https://api.elevenlabs.io', model: str | None = None, sample_rate: int | None = None, params: InputParams | None = None, settings: ElevenLabsSTTSettings | None = None, ttfs_p99_latency: float | None = 2.01, **kwargs)[source]

Bases: SegmentedSTTService

Speech-to-text service using ElevenLabs’ file-based API.

This service uses ElevenLabs’ Speech-to-Text API to perform transcription on audio segments. It inherits from SegmentedSTTService to handle audio buffering and speech detection. The service uploads audio files to ElevenLabs and receives transcription results directly.
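A construction sketch using the non-deprecated settings argument might look as follows. The keyword arguments come from the signature on this page; the model ID "scribe_v1", the key terms, and the `ELEVENLABS_API_KEY` environment variable name are assumptions chosen for illustration, and `build_stt_service` is a hypothetical helper.

```python
import os

# Assumed example values; adjust the model ID and key terms for your account.
STT_SETTINGS = {
    "model": "scribe_v1",
    "tag_audio_events": False,
    "keyterms": ["Pipecat", "ElevenLabs"],
}


def build_stt_service(session):
    """Create the file-based STT service from an open aiohttp ClientSession."""
    # Imported lazily so this sketch can be loaded without pipecat installed.
    from pipecat.services.elevenlabs.stt import ElevenLabsSTTService

    return ElevenLabsSTTService(
        api_key=os.environ["ELEVENLABS_API_KEY"],  # assumed variable name
        aiohttp_session=session,
        settings=ElevenLabsSTTService.Settings(**STT_SETTINGS),
    )
```

The caller is expected to own the session, for example by calling `build_stt_service(session)` inside an `async with aiohttp.ClientSession() as session:` block.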

Settings: alias of ElevenLabsSTTSettings

class InputParams(*, language: Language | None = None, tag_audio_events: bool = True)[source]

Bases: BaseModel

Configuration parameters for ElevenLabs STT API.

Deprecated since version 0.0.105: Use settings=ElevenLabsSTTService.Settings(...) instead. Will be removed in 2.0.0.

Parameters:

language – Target language for transcription.
tag_audio_events – Whether to include audio events like (laughter), (coughing) in the transcription.

language: Language | None

tag_audio_events: bool

__init__(*, api_key: str, aiohttp_session: ClientSession, base_url: str = 'https://api.elevenlabs.io', model: str | None = None, sample_rate: int | None = None, params: InputParams | None = None, settings: ElevenLabsSTTSettings | None = None, ttfs_p99_latency: float | None = 2.01, **kwargs)[source]

Initialize the ElevenLabs STT service.

Parameters:

api_key – ElevenLabs API key for authentication.
aiohttp_session – aiohttp ClientSession for HTTP requests.
base_url – Base URL for ElevenLabs API.
model – Model ID for transcription.

Deprecated since version 0.0.105: Use settings=ElevenLabsSTTService.Settings(model=...) instead. Will be removed in 2.0.0.
sample_rate – Audio sample rate in Hz. If not provided, uses the pipeline’s rate.
params – Configuration parameters for the STT service.

Deprecated since version 0.0.105: Use settings=ElevenLabsSTTService.Settings(...) instead. Will be removed in 2.0.0.
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
ttfs_p99_latency – P99 latency from speech end to final transcript in seconds. Override for your deployment. See https://github.com/pipecat-ai/stt-benchmark
**kwargs – Additional arguments passed to SegmentedSTTService.
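The precedence rule for settings versus deprecated arguments can be pictured with a sentinel that distinguishes "not set" from an explicit None, as the `_NotGiven` type in the settings fields suggests. This is only a sketch of one plausible resolution order. The library's actual merge logic is not shown on this page, and `resolve` and the local `_NotGiven` class are hypothetical stand-ins.

```python
class _NotGiven:
    """Local sentinel type: distinguishes "not set" from an explicit None."""

    def __repr__(self) -> str:
        return "NOT_GIVEN"


NOT_GIVEN = _NotGiven()


def resolve(settings_value, deprecated_value, default):
    """Pick a value: settings first, then the deprecated argument, then the default."""
    if not isinstance(settings_value, _NotGiven):
        # Anything set in settings wins, including an explicit None.
        return settings_value
    if deprecated_value is not None:
        return deprecated_value
    return default
```

Under these assumptions, `resolve("new", "old", "fallback")` returns `"new"`, while `resolve(NOT_GIVEN, "old", "fallback")` falls back to `"old"`.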

can_generate_metrics() → bool[source]

Check if the service can generate processing metrics.

Returns:: True, as the ElevenLabs STT service supports metrics generation.

language_to_service_language(language: Language) → str | None[source]

Convert a Language enum to ElevenLabs service-specific language code.

Parameters:: language – The language to convert.
Returns:: The ElevenLabs-specific language code, or None if not supported.

async run_stt(audio: bytes) → AsyncGenerator[Frame | None, None][source]

Transcribe an audio segment using ElevenLabs’ STT API.

Parameters:: audio – Raw audio bytes in WAV format (already converted by base class).
Yields:: Frame – TranscriptionFrame containing the transcribed text, or ErrorFrame on failure.

Note

The audio is already in WAV format from the SegmentedSTTService. Only non-empty transcriptions are yielded.
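The "only non-empty transcriptions are yielded" behaviour can be sketched with a plain async generator. Nothing here talks to ElevenLabs: `fake_transcribe` is a hypothetical stand-in for the HTTP call, strings stand in for TranscriptionFrame objects, and treating whitespace-only text as empty is an assumption.

```python
import asyncio
from typing import AsyncGenerator


async def fake_transcribe(audio: bytes) -> str:
    """Hypothetical stand-in for the upload; "decodes" the bytes as text."""
    return audio.decode("utf-8")


async def run_stt_sketch(audio: bytes) -> AsyncGenerator[str, None]:
    """Yield a transcript only when it contains non-whitespace text."""
    text = (await fake_transcribe(audio)).strip()
    if text:
        yield text


async def collect(segments: list) -> list:
    """Run every segment through the generator and gather what it yields."""
    results = []
    for segment in segments:
        async for text in run_stt_sketch(segment):
            results.append(text)
    return results
```

A segment that produces no text simply yields nothing, so downstream code never sees an empty transcript.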

async push_frame(frame: Frame, direction: FrameDirection = FrameDirection.DOWNSTREAM)

Push a frame, marking TranscriptionFrames as finalized.

Segmented STT services process complete speech segments and return a single TranscriptionFrame per segment, so every transcription is inherently finalized.

Parameters:

frame – The frame to push.
direction – The direction of frame flow in the pipeline.

async start_stt_usage_metrics(usage: STTUsage)

Start STT usage metrics collection.

Parameters:: usage – Usage information for the STT operation.

async stop_ttfb_metrics(*, end_time: float | None = None)

Stop time-to-first-byte metrics collection and push results.

Parameters:: end_time – Optional timestamp to use as the end time. If None, uses the current time.

pipecat.services.elevenlabs.stt.audio_format_from_sample_rate(sample_rate: int) → str[source]

Get the appropriate audio format string for a given sample rate.

Parameters:: sample_rate – The audio sample rate in Hz.
Returns:: The ElevenLabs audio format string.

class pipecat.services.elevenlabs.stt.ElevenLabsRealtimeSTTService(*, api_key: str, base_url: str = 'api.elevenlabs.io', commit_strategy: CommitStrategy = CommitStrategy.MANUAL, model: str | None = None, sample_rate: int | None = None, include_timestamps: bool = False, enable_logging: bool = False, include_language_detection: bool = False, params: InputParams | None = None, settings: ElevenLabsRealtimeSTTSettings | None = None, ttfs_p99_latency: float | None = 0.41, **kwargs)[source]

Bases: WebsocketSTTService

Speech-to-text service using ElevenLabs’ Realtime WebSocket API.

This service uses ElevenLabs’ Realtime Speech-to-Text API to perform transcription with ultra-low latency. It supports both partial (interim) and committed (final) transcripts, and can use either manual commit control or automatic Voice Activity Detection (VAD) for segment boundaries.

By default, the service uses the manual commit strategy, in which Pipecat’s VAD controls when transcript segments are committed. This keeps its behaviour consistent with other STT services.
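For comparison, a sketch of opting into ElevenLabs' own VAD could look like this. The keyword arguments are taken from the constructor signature on this page; the concrete setting values, the `ELEVENLABS_API_KEY` environment variable name, and the helper `build_realtime_stt` are assumptions for illustration.

```python
import os

# Assumed example values, chosen inside the documented ranges.
REALTIME_SETTINGS = {
    "keyterms": ["Pipecat"],
    "vad_silence_threshold_secs": 1.0,
    "vad_threshold": 0.5,
}


def build_realtime_stt():
    """Create the realtime service with ElevenLabs-side VAD commits."""
    # Imported lazily so this sketch can be loaded without pipecat installed.
    from pipecat.services.elevenlabs.stt import (
        CommitStrategy,
        ElevenLabsRealtimeSTTService,
    )

    return ElevenLabsRealtimeSTTService(
        api_key=os.environ["ELEVENLABS_API_KEY"],  # assumed variable name
        commit_strategy=CommitStrategy.VAD,
        include_timestamps=True,
        settings=ElevenLabsRealtimeSTTService.Settings(**REALTIME_SETTINGS),
    )
```

Leaving `commit_strategy` at its default keeps segment boundaries under the control of Pipecat's VAD instead.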

Settings: alias of ElevenLabsRealtimeSTTSettings

class InputParams(*, language_code: str | None = None, commit_strategy: CommitStrategy = CommitStrategy.MANUAL, vad_silence_threshold_secs: float | None = None, vad_threshold: float | None = None, min_speech_duration_ms: int | None = None, min_silence_duration_ms: int | None = None, include_timestamps: bool = False, enable_logging: bool = False, include_language_detection: bool = False)[source]

Bases: BaseModel

Configuration parameters for ElevenLabs Realtime STT API.

Deprecated since version 0.0.105: Use settings=ElevenLabsRealtimeSTTService.Settings(...) instead. Will be removed in 2.0.0.

Parameters:

language_code – ISO-639-1 or ISO-639-3 language code. Leave None for auto-detection.
commit_strategy – How to segment speech - manual (Pipecat VAD) or vad (ElevenLabs VAD).
vad_silence_threshold_secs – Seconds of silence before VAD commits (0.3-3.0). Only used when commit_strategy is VAD. None uses ElevenLabs default.
vad_threshold – VAD sensitivity (0.1-0.9, lower is more sensitive). Only used when commit_strategy is VAD. None uses ElevenLabs default.
min_speech_duration_ms – Minimum speech duration for VAD (50-2000ms). Only used when commit_strategy is VAD. None uses ElevenLabs default.
min_silence_duration_ms – Minimum silence duration for VAD (50-2000ms). Only used when commit_strategy is VAD. None uses ElevenLabs default.
include_timestamps – Whether to include word-level timestamps in transcripts.
enable_logging – Whether to enable logging on ElevenLabs’ side.
include_language_detection – Whether to include language detection in transcripts.

language_code: str | None

commit_strategy: CommitStrategy

vad_silence_threshold_secs: float | None

vad_threshold: float | None

min_speech_duration_ms: int | None

min_silence_duration_ms: int | None

include_timestamps: bool

enable_logging: bool

include_language_detection: bool

__init__(*, api_key: str, base_url: str = 'api.elevenlabs.io', commit_strategy: CommitStrategy = CommitStrategy.MANUAL, model: str | None = None, sample_rate: int | None = None, include_timestamps: bool = False, enable_logging: bool = False, include_language_detection: bool = False, params: InputParams | None = None, settings: ElevenLabsRealtimeSTTSettings | None = None, ttfs_p99_latency: float | None = 0.41, **kwargs)[source]

Initialize the ElevenLabs Realtime STT service.

Parameters:

api_key – ElevenLabs API key for authentication.
base_url – Base URL for ElevenLabs WebSocket API.
commit_strategy – How to segment speech: CommitStrategy.MANUAL (Pipecat VAD) or CommitStrategy.VAD (ElevenLabs VAD). Defaults to CommitStrategy.MANUAL.
model – Model ID for transcription.

Deprecated since version 0.0.105: Use settings=ElevenLabsRealtimeSTTService.Settings(model=...) instead. Will be removed in 2.0.0.
sample_rate – Audio sample rate in Hz. If not provided, uses the pipeline’s rate.
include_timestamps – Whether to include word-level timestamps in transcripts.
enable_logging – Whether to enable logging on ElevenLabs’ side.
include_language_detection – Whether to include language detection in transcripts.
params – Configuration parameters for the STT service.

Deprecated since version 0.0.105: Use settings=ElevenLabsRealtimeSTTService.Settings(...) instead. Will be removed in 2.0.0.
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
ttfs_p99_latency – P99 latency from speech end to final transcript in seconds. Override for your deployment. See https://github.com/pipecat-ai/stt-benchmark
**kwargs – Additional arguments passed to WebsocketSTTService.

can_generate_metrics() → bool[source]

Check if the service can generate processing metrics.

Returns:: True, as the ElevenLabs Realtime STT service supports metrics generation.

async start(frame: StartFrame)[source]

Start the STT service and establish WebSocket connection.

Parameters:: frame – Frame indicating service should start.

async process_frame(frame: Frame, direction: FrameDirection)[source]

Process incoming frames and handle speech events.

Parameters:

frame – The frame to process.
direction – Direction of frame flow in the pipeline.

async run_stt(audio: bytes) → AsyncGenerator[Frame | None, None][source]

Process audio data for speech-to-text transcription.

Parameters:: audio – Raw audio bytes to transcribe.
Yields:: None - transcription results are handled via WebSocket responses.

async push_frame(frame: Frame, direction: FrameDirection = FrameDirection.DOWNSTREAM)

Push a frame downstream, tracking TranscriptionFrame timestamps for TTFB.

Stores the timestamp of each TranscriptionFrame for TTFB calculation. If the frame is marked as finalized (via request_finalize/confirm_finalize), reports TTFB immediately and cancels any pending timeout. Otherwise, TTFB is reported after a timeout.
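The report-now-or-after-a-timeout pattern described above can be sketched with asyncio alone. This is not the service's implementation: `TTFBTracker` is a hypothetical class, it records timestamps rather than computing real TTFB durations, and it assumes that every new transcript cancels a previously armed timeout.

```python
import asyncio


class TTFBTracker:
    """Report the last transcript timestamp, immediately or after a timeout."""

    def __init__(self, timeout_secs: float):
        self.timeout_secs = timeout_secs
        self.reports = []  # timestamps that were reported
        self._last_ts = None
        self._pending = None  # asyncio.Task waiting for the timeout

    def on_transcription(self, timestamp: float, finalized: bool) -> None:
        """Must be called from inside a running event loop."""
        self._last_ts = timestamp
        # Assumption: every new transcript cancels the previously armed timeout.
        if self._pending is not None:
            self._pending.cancel()
            self._pending = None
        if finalized:
            self.reports.append(timestamp)
        else:
            self._pending = asyncio.create_task(self._report_after_timeout())

    async def _report_after_timeout(self) -> None:
        await asyncio.sleep(self.timeout_secs)
        self.reports.append(self._last_ts)
        self._pending = None


async def demo() -> list:
    tracker = TTFBTracker(timeout_secs=0.01)
    tracker.on_transcription(1.0, finalized=False)  # arms the timeout
    tracker.on_transcription(2.0, finalized=True)  # cancels it, reports now
    tracker.on_transcription(3.0, finalized=False)  # arms a new timeout
    await asyncio.sleep(0.05)  # give the timeout time to fire
    return tracker.reports
```

In `demo()`, the first transcript is never reported on its own because the finalized one arrives before the timeout expires.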

Parameters:

frame – The frame to push.
direction – The direction to push the frame.

async start_stt_usage_metrics(usage: STTUsage)

Start STT usage metrics collection.

Parameters:: usage – Usage information for the STT operation.

async stop_ttfb_metrics(*, end_time: float | None = None)

Stop time-to-first-byte metrics collection and push results.

Parameters:: end_time – Optional timestamp to use as the end time. If None, uses the current time.