stt

ElevenLabs speech-to-text service implementation.

This module provides integration with ElevenLabs’ Speech-to-Text API. It offers a file-based service that uploads buffered audio segments and receives transcription results directly, and a Realtime service that streams audio over a WebSocket connection.

pipecat.services.elevenlabs.stt.language_to_elevenlabs_language(language: Language) → str[source]

Convert a Language enum to ElevenLabs language code.

Source:

https://elevenlabs.io/docs/capabilities/speech-to-text

Parameters:

language – The Language enum value to convert.

Returns:

The corresponding service language code. If language is not in the verified mapping, falls back to the full language code string and logs a warning (via resolve_language(..., use_base_code=False)).
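
The fallback behaviour described above can be illustrated with a minimal, self-contained sketch. The `VERIFIED_CODES` table and the `to_service_code` helper are hypothetical stand-ins, not Pipecat's actual mapping or its `resolve_language` implementation, and plain strings are used in place of `Language` members.

```python
import logging

logger = logging.getLogger("stt_language_sketch")

# Hypothetical subset of a "verified" mapping; the real table lives in Pipecat.
VERIFIED_CODES = {"en": "eng", "es": "spa", "fr": "fra"}


def to_service_code(language: str) -> str:
    """Return the mapped code, or fall back to the full input code with a warning."""
    code = VERIFIED_CODES.get(language)
    if code is None:
        logger.warning("Language %r is not in the verified mapping", language)
        return language  # full code is passed through, not shortened to a base code
    return code
```

Calling `to_service_code("en")` takes the mapped path, while an unmapped value such as `"xx-YY"` is returned unchanged after the warning is logged.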

class pipecat.services.elevenlabs.stt.CommitStrategy(*values)[source]

Bases: StrEnum

Commit strategies for transcript segmentation.

MANUAL = 'manual'
VAD = 'vad'
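
Because the members are string-valued, a strategy can be looked up from a plain string and compared against one. The sketch below uses a local stand-in class that mirrors the two documented members; it derives from `str` and `Enum` so that it also runs on Python versions without `StrEnum`, which is an assumption made only for portability.

```python
from enum import Enum


class CommitStrategy(str, Enum):
    """Stand-in mirroring the documented members (the real class derives from StrEnum)."""

    MANUAL = "manual"
    VAD = "vad"


# Members can be looked up by value and compare equal to plain strings.
strategy = CommitStrategy("vad")
is_vad = strategy is CommitStrategy.VAD
matches_plain_string = strategy == "vad"
```
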
class pipecat.services.elevenlabs.stt.ElevenLabsSTTSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any] = <factory>, language: Language | str | None | _NotGiven = <factory>, tag_audio_events: bool | None | _NotGiven = <factory>, keyterms: list[str] | None | _NotGiven = <factory>)[source]

Bases: STTSettings

Settings for ElevenLabsSTTService.

Parameters:
  • tag_audio_events – Whether to include audio events like (laughter), (coughing) in the transcription.

  • keyterms – List of key terms or phrases to bias transcription towards.

tag_audio_events: bool | None | _NotGiven
keyterms: list[str] | None | _NotGiven
class pipecat.services.elevenlabs.stt.ElevenLabsRealtimeSTTSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any] = <factory>, language: Language | str | None | _NotGiven = <factory>, keyterms: list[str] | None | _NotGiven = <factory>, vad_silence_threshold_secs: float | None | _NotGiven = <factory>, vad_threshold: float | None | _NotGiven = <factory>, min_speech_duration_ms: int | None | _NotGiven = <factory>, min_silence_duration_ms: int | None | _NotGiven = <factory>)[source]

Bases: STTSettings

Settings for ElevenLabsRealtimeSTTService.

See ElevenLabsRealtimeSTTService.InputParams for detailed descriptions.

Parameters:
  • keyterms – List of key terms or phrases to bias transcription towards.

  • vad_silence_threshold_secs – Seconds of silence before VAD commits (0.3-3.0).

  • vad_threshold – VAD sensitivity (0.1-0.9, lower is more sensitive).

  • min_speech_duration_ms – Minimum speech duration for VAD (50-2000ms).

  • min_silence_duration_ms – Minimum silence duration for VAD (50-2000ms).

keyterms: list[str] | None | _NotGiven
vad_silence_threshold_secs: float | None | _NotGiven
vad_threshold: float | None | _NotGiven
min_speech_duration_ms: int | None | _NotGiven
min_silence_duration_ms: int | None | _NotGiven
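
The documented ranges can be checked before values are handed to the service. The helper below is a hypothetical sketch, not part of Pipecat; it assumes the ranges are inclusive and skips `None`, which the parameter descriptions say falls back to the ElevenLabs default.

```python
# Documented ranges for the VAD-related settings (assumed inclusive).
VAD_RANGES = {
    "vad_silence_threshold_secs": (0.3, 3.0),
    "vad_threshold": (0.1, 0.9),
    "min_speech_duration_ms": (50, 2000),
    "min_silence_duration_ms": (50, 2000),
}


def out_of_range(**values):
    """Return the names of settings whose values fall outside the documented ranges."""
    bad = []
    for name, value in values.items():
        low, high = VAD_RANGES[name]
        # None means "use the ElevenLabs default", so there is nothing to check.
        if value is not None and not (low <= value <= high):
            bad.append(name)
    return bad
```
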
class pipecat.services.elevenlabs.stt.ElevenLabsSTTService(*, api_key: str, aiohttp_session: ClientSession, base_url: str = 'https://api.elevenlabs.io', model: str | None = None, sample_rate: int | None = None, params: InputParams | None = None, settings: ElevenLabsSTTSettings | None = None, ttfs_p99_latency: float | None = 2.01, **kwargs)[source]

Bases: SegmentedSTTService

Speech-to-text service using ElevenLabs’ file-based API.

This service uses ElevenLabs’ Speech-to-Text API to perform transcription on audio segments. It inherits from SegmentedSTTService to handle audio buffering and speech detection. The service uploads audio files to ElevenLabs and receives transcription results directly.

Settings

alias of ElevenLabsSTTSettings

class InputParams(*, language: Language | None = None, tag_audio_events: bool = True)[source]

Bases: BaseModel

Configuration parameters for ElevenLabs STT API.

Deprecated since version 0.0.105: Use settings=ElevenLabsSTTService.Settings(...) instead.

Parameters:
  • language – Target language for transcription.

  • tag_audio_events – Whether to include audio events like (laughter), (coughing) in the transcription.

language: Language | None
tag_audio_events: bool
__init__(*, api_key: str, aiohttp_session: ClientSession, base_url: str = 'https://api.elevenlabs.io', model: str | None = None, sample_rate: int | None = None, params: InputParams | None = None, settings: ElevenLabsSTTSettings | None = None, ttfs_p99_latency: float | None = 2.01, **kwargs)[source]

Initialize the ElevenLabs STT service.

Parameters:
  • api_key – ElevenLabs API key for authentication.

  • aiohttp_session – aiohttp ClientSession for HTTP requests.

  • base_url – Base URL for ElevenLabs API.

  • model

    Model ID for transcription.

    Deprecated since version 0.0.105: Use settings=ElevenLabsSTTService.Settings(model=...) instead.

  • sample_rate – Audio sample rate in Hz. If not provided, uses the pipeline’s rate.

  • params

    Configuration parameters for the STT service.

    Deprecated since version 0.0.105: Use settings=ElevenLabsSTTService.Settings(...) instead.

  • settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.

  • ttfs_p99_latency – P99 latency from speech end to final transcript in seconds. Override for your deployment. See https://github.com/pipecat-ai/stt-benchmark

  • **kwargs – Additional arguments passed to SegmentedSTTService.
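
A minimal construction sketch using the non-deprecated `settings` path, based only on the signature documented above. The `build_stt_service` helper, the example key term, and the `ELEVENLABS_API_KEY` environment variable name are assumptions made for illustration; the import is placed inside the function so the snippet can be loaded even where Pipecat is not installed.

```python
import os


def build_stt_service(aiohttp_session, api_key=None):
    """Create the file-based service; the caller owns the aiohttp ClientSession."""
    # Lazy import: only needed when the service is actually built.
    from pipecat.services.elevenlabs.stt import ElevenLabsSTTService

    settings = ElevenLabsSTTService.Settings(
        tag_audio_events=False,  # leave out markers such as (laughter)
        keyterms=["Pipecat"],    # example term to bias transcription towards
    )
    return ElevenLabsSTTService(
        # The environment variable name is an assumption, not a Pipecat convention.
        api_key=api_key or os.environ["ELEVENLABS_API_KEY"],
        aiohttp_session=aiohttp_session,
        settings=settings,
    )
```

Passing everything through `settings` avoids the deprecated `model` and `params` arguments.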

can_generate_metrics() → bool[source]

Check if the service can generate processing metrics.

Returns:

True, as the ElevenLabs STT service supports metrics generation.

language_to_service_language(language: Language) → str | None[source]

Convert a Language enum to ElevenLabs service-specific language code.

Parameters:

language – The language to convert.

Returns:

The ElevenLabs-specific language code, or None if not supported.

async run_stt(audio: bytes) → AsyncGenerator[Frame | None, None][source]

Transcribe an audio segment using ElevenLabs’ STT API.

Parameters:

audio – Audio bytes in WAV format (already converted by the base class).

Yields:

Frame – TranscriptionFrame containing the transcribed text, or ErrorFrame on failure.

Note

The audio is already in WAV format from the SegmentedSTTService. Only non-empty transcriptions are yielded.
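
To make "WAV format" concrete, the sketch below wraps raw PCM samples in a WAV container using only the standard library. It is an illustration of the idea, not necessarily how SegmentedSTTService performs the conversion, and it assumes 16-bit mono PCM; `pcm_to_wav` is a hypothetical helper name.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int, channels: int = 1) -> bytes:
    """Wrap raw 16-bit PCM samples in a WAV container held in memory."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buffer.getvalue()


# 100 ms of silence at 16 kHz mono: 1600 samples of 2 bytes each.
wav_bytes = pcm_to_wav(b"\x00\x00" * 1600, sample_rate=16000)
```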

async push_frame(frame: Frame, direction: FrameDirection = FrameDirection.DOWNSTREAM)

Push a frame, marking TranscriptionFrames as finalized.

Segmented STT services process complete speech segments and return a single TranscriptionFrame per segment, so every transcription is inherently finalized.

Parameters:
  • frame – The frame to push.

  • direction – The direction of frame flow in the pipeline.

async stop_ttfb_metrics(*, end_time: float | None = None)

Stop time-to-first-byte metrics collection and push results.

Parameters:

end_time – Optional timestamp to use as the end time. If None, uses the current time.

pipecat.services.elevenlabs.stt.audio_format_from_sample_rate(sample_rate: int) → str[source]

Get the appropriate audio format string for a given sample rate.

Parameters:

sample_rate – The audio sample rate in Hz.

Returns:

The ElevenLabs audio format string.

class pipecat.services.elevenlabs.stt.ElevenLabsRealtimeSTTService(*, api_key: str, base_url: str = 'api.elevenlabs.io', commit_strategy: CommitStrategy = CommitStrategy.MANUAL, model: str | None = None, sample_rate: int | None = None, include_timestamps: bool = False, enable_logging: bool = False, include_language_detection: bool = False, params: InputParams | None = None, settings: ElevenLabsRealtimeSTTSettings | None = None, ttfs_p99_latency: float | None = 0.41, **kwargs)[source]

Bases: WebsocketSTTService

Speech-to-text service using ElevenLabs’ Realtime WebSocket API.

This service uses ElevenLabs’ Realtime Speech-to-Text API to perform transcription with ultra-low latency. It supports both partial (interim) and committed (final) transcripts, and can use either manual commit control or automatic Voice Activity Detection (VAD) for segment boundaries.

By default, the service uses the manual commit strategy, in which Pipecat’s VAD controls when transcript segments are committed. This keeps its behavior consistent with other STT services.

Settings

alias of ElevenLabsRealtimeSTTSettings

class InputParams(*, language_code: str | None = None, commit_strategy: CommitStrategy = CommitStrategy.MANUAL, vad_silence_threshold_secs: float | None = None, vad_threshold: float | None = None, min_speech_duration_ms: int | None = None, min_silence_duration_ms: int | None = None, include_timestamps: bool = False, enable_logging: bool = False, include_language_detection: bool = False)[source]

Bases: BaseModel

Configuration parameters for ElevenLabs Realtime STT API.

Deprecated since version 0.0.105: Use settings=ElevenLabsRealtimeSTTService.Settings(...) instead.

Parameters:
  • language_code – ISO-639-1 or ISO-639-3 language code. Leave None for auto-detection.

  • commit_strategy – How to segment speech - manual (Pipecat VAD) or vad (ElevenLabs VAD).

  • vad_silence_threshold_secs – Seconds of silence before VAD commits (0.3-3.0). Only used when commit_strategy is VAD. None uses ElevenLabs default.

  • vad_threshold – VAD sensitivity (0.1-0.9, lower is more sensitive). Only used when commit_strategy is VAD. None uses ElevenLabs default.

  • min_speech_duration_ms – Minimum speech duration for VAD (50-2000ms). Only used when commit_strategy is VAD. None uses ElevenLabs default.

  • min_silence_duration_ms – Minimum silence duration for VAD (50-2000ms). Only used when commit_strategy is VAD. None uses ElevenLabs default.

  • include_timestamps – Whether to include word-level timestamps in transcripts.

  • enable_logging – Whether to enable logging on ElevenLabs’ side.

  • include_language_detection – Whether to include language detection in transcripts.

language_code: str | None
commit_strategy: CommitStrategy
vad_silence_threshold_secs: float | None
vad_threshold: float | None
min_speech_duration_ms: int | None
min_silence_duration_ms: int | None
include_timestamps: bool
enable_logging: bool
include_language_detection: bool
__init__(*, api_key: str, base_url: str = 'api.elevenlabs.io', commit_strategy: CommitStrategy = CommitStrategy.MANUAL, model: str | None = None, sample_rate: int | None = None, include_timestamps: bool = False, enable_logging: bool = False, include_language_detection: bool = False, params: InputParams | None = None, settings: ElevenLabsRealtimeSTTSettings | None = None, ttfs_p99_latency: float | None = 0.41, **kwargs)[source]

Initialize the ElevenLabs Realtime STT service.

Parameters:
  • api_key – ElevenLabs API key for authentication.

  • base_url – Base URL for ElevenLabs WebSocket API.

  • commit_strategy – How to segment speech: CommitStrategy.MANUAL (Pipecat VAD) or CommitStrategy.VAD (ElevenLabs VAD). Defaults to CommitStrategy.MANUAL.

  • model

    Model ID for transcription.

    Deprecated since version 0.0.105: Use settings=ElevenLabsRealtimeSTTService.Settings(model=...) instead.

  • sample_rate – Audio sample rate in Hz. If not provided, uses the pipeline’s rate.

  • include_timestamps – Whether to include word-level timestamps in transcripts.

  • enable_logging – Whether to enable logging on ElevenLabs’ side.

  • include_language_detection – Whether to include language detection in transcripts.

  • params

    Configuration parameters for the STT service.

    Deprecated since version 0.0.105: Use settings=ElevenLabsRealtimeSTTService.Settings(...) instead.

  • settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.

  • ttfs_p99_latency – P99 latency from speech end to final transcript in seconds. Override for your deployment. See https://github.com/pipecat-ai/stt-benchmark

  • **kwargs – Additional arguments passed to WebsocketSTTService.
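
A minimal construction sketch for the Realtime service with ElevenLabs-side VAD, based only on the signature documented above. The `build_realtime_stt` helper, the chosen VAD values, and the `ELEVENLABS_API_KEY` environment variable name are assumptions made for illustration; the import is placed inside the function so the snippet can be loaded even where Pipecat is not installed.

```python
import os


def build_realtime_stt(api_key=None):
    """Create the Realtime service with ElevenLabs VAD committing segments."""
    # Lazy import: only needed when the service is actually built.
    from pipecat.services.elevenlabs.stt import (
        CommitStrategy,
        ElevenLabsRealtimeSTTService,
    )

    settings = ElevenLabsRealtimeSTTService.Settings(
        vad_silence_threshold_secs=1.0,  # within the documented 0.3-3.0 range
        vad_threshold=0.5,               # within the documented 0.1-0.9 range
    )
    return ElevenLabsRealtimeSTTService(
        # The environment variable name is an assumption, not a Pipecat convention.
        api_key=api_key or os.environ["ELEVENLABS_API_KEY"],
        commit_strategy=CommitStrategy.VAD,
        include_timestamps=True,
        settings=settings,
    )
```

With the default `CommitStrategy.MANUAL`, the VAD-related settings are not used, so they can be left unset.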

can_generate_metrics() → bool[source]

Check if the service can generate processing metrics.

Returns:

True, as the ElevenLabs Realtime STT service supports metrics generation.

async start(frame: StartFrame)[source]

Start the STT service and establish WebSocket connection.

Parameters:

frame – Frame indicating service should start.

async stop(frame: EndFrame)[source]

Stop the STT service and close WebSocket connection.

Parameters:

frame – Frame indicating service should stop.

async cancel(frame: CancelFrame)[source]

Cancel the STT service and close WebSocket connection.

Parameters:

frame – Frame indicating service should be cancelled.

async process_frame(frame: Frame, direction: FrameDirection)[source]

Process incoming frames and handle speech events.

Parameters:
  • frame – The frame to process.

  • direction – Direction of frame flow in the pipeline.

async run_stt(audio: bytes) → AsyncGenerator[Frame | None, None][source]

Process audio data for speech-to-text transcription.

Parameters:

audio – Raw audio bytes to transcribe.

Yields:

None - transcription results are handled via WebSocket responses.

async push_frame(frame: Frame, direction: FrameDirection = FrameDirection.DOWNSTREAM)

Push a frame downstream, tracking TranscriptionFrame timestamps for TTFB.

Stores the timestamp of each TranscriptionFrame for TTFB calculation. If the frame is marked as finalized (via request_finalize/confirm_finalize), reports TTFB immediately and cancels any pending timeout. Otherwise, TTFB is reported after a timeout.

Parameters:
  • frame – The frame to push.

  • direction – The direction to push the frame.
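
The "report immediately when finalized, otherwise report after a timeout" rule described for this method can be sketched with plain asyncio. `TTFBReporter` is a hypothetical stand-in and does not reproduce Pipecat's implementation; in particular, rescheduling the timeout whenever a newer non-finalized transcript arrives is an assumption of this sketch.

```python
import asyncio


class TTFBReporter:
    """Hypothetical stand-in: report immediately when finalized, else after a timeout."""

    def __init__(self, timeout: float):
        self._timeout = timeout
        self._pending = None  # asyncio.Task waiting to report, if any
        self.reports = []     # reason recorded for each report, for inspection

    async def _report_later(self):
        await asyncio.sleep(self._timeout)
        self._pending = None
        self.reports.append("timeout")

    def on_transcription(self, finalized: bool):
        # Assumption: a newer transcript replaces any report still waiting.
        if self._pending is not None:
            self._pending.cancel()
            self._pending = None
        if finalized:
            self.reports.append("finalized")
        else:
            self._pending = asyncio.create_task(self._report_later())


async def demo():
    reporter = TTFBReporter(timeout=0.05)
    reporter.on_transcription(finalized=False)  # schedules a delayed report
    reporter.on_transcription(finalized=True)   # cancels it and reports now
    reporter.on_transcription(finalized=False)  # schedules another delayed report
    await asyncio.sleep(0.2)                    # long enough for the timeout to fire
    return reporter.reports
```

Running `asyncio.run(demo())` records one immediate report for the finalized transcript and one delayed report for the last, non-finalized transcript.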

async stop_ttfb_metrics(*, end_time: float | None = None)

Stop time-to-first-byte metrics collection and push results.

Parameters:

end_time – Optional timestamp to use as the end time. If None, uses the current time.