base_stt

Base class for Whisper-based speech-to-text services.

This module provides common functionality for services implementing the Whisper API interface, including language mapping, metrics generation, and error handling.

class pipecat.services.whisper.base_stt.BaseWhisperSTTSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any] = <factory>, language: Language | str | None | _NotGiven = <factory>, prompt: str | None | _NotGiven = <factory>, temperature: float | None | _NotGiven = <factory>)

Bases: STTSettings

Settings for BaseWhisperSTTService.

Parameters:
  • prompt – Optional text to guide the model’s style or continue a previous segment.

  • temperature – Sampling temperature between 0 and 1.

prompt: str | None | _NotGiven
temperature: float | None | _NotGiven
pipecat.services.whisper.base_stt.language_to_whisper_language(language: Language) → str

Maps a pipecat Language enum value to a Whisper API language code.

For the languages the Whisper API supports, see https://platform.openai.com/docs/guides/speech-to-text#supported-languages

Parameters:

language – A Language enum value representing the input language.

Returns:

The corresponding service language code. If language is not in the verified mapping, the function logs a warning and falls back to the base language code (e.g., en from en-US) via resolve_language(..., use_base_code=True).
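The fallback described above can be illustrated with a minimal, self-contained sketch. It does not call pipecat: VERIFIED_CODES and to_whisper_code are hypothetical names, the mapping is a made-up three-entry subset, and plain strings stand in for Language enum values. It only shows the general idea of returning a verified code when one exists and otherwise stripping the region subtag and logging a warning.

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical subset of a verified mapping (assumption, not pipecat's table).
VERIFIED_CODES = {"en": "en", "es": "es", "fr": "fr"}


def to_whisper_code(language_tag: str) -> str:
    """Return a Whisper-style code, falling back to the base language code."""
    if language_tag in VERIFIED_CODES:
        return VERIFIED_CODES[language_tag]
    # Fallback: keep only the part before the first hyphen ("en-US" -> "en")
    # and warn that the tag was not in the verified mapping.
    base_code = language_tag.split("-")[0].lower()
    logger.warning(
        "%s is not in the verified mapping; using %s", language_tag, base_code
    )
    return base_code
```

With this sketch, to_whisper_code("en-US") returns "en" and emits a warning, while to_whisper_code("fr") returns "fr" silently.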

class pipecat.services.whisper.base_stt.BaseWhisperSTTService(*, model: str | None = None, api_key: str | None = None, base_url: str | None = None, language: Language | None = None, prompt: str | None = None, temperature: float | None = None, include_prob_metrics: bool = False, push_empty_transcripts: bool = False, settings: BaseWhisperSTTSettings | None = None, ttfs_p99_latency: float | None = 1.0, **kwargs)

Bases: SegmentedSTTService

Base class for Whisper-based speech-to-text services.

Provides common functionality for services implementing the Whisper API interface, including metrics generation and error handling.

Settings

alias of BaseWhisperSTTSettings

__init__(*, model: str | None = None, api_key: str | None = None, base_url: str | None = None, language: Language | None = None, prompt: str | None = None, temperature: float | None = None, include_prob_metrics: bool = False, push_empty_transcripts: bool = False, settings: BaseWhisperSTTSettings | None = None, ttfs_p99_latency: float | None = 1.0, **kwargs)

Initialize the Whisper STT service.

Parameters:
  • model

    Name of the Whisper model to use.

    Deprecated since version 0.0.105: Use settings=BaseWhisperSTTService.Settings(model=...) instead.

  • api_key – Service API key. Defaults to None.

  • base_url – Service API base URL. Defaults to None.

  • language

    Language of the audio input.

    Deprecated since version 0.0.105: Use settings=BaseWhisperSTTService.Settings(language=...) instead.

  • prompt

    Optional text to guide the model’s style or continue a previous segment.

    Deprecated since version 0.0.105: Use settings=BaseWhisperSTTService.Settings(prompt=...) instead.

  • temperature

    Sampling temperature between 0 and 1.

    Deprecated since version 0.0.105: Use settings=BaseWhisperSTTService.Settings(temperature=...) instead.

  • include_prob_metrics – If True, enables probability metrics in the API response. Each service implements this differently (see the child classes). Defaults to False.

  • push_empty_transcripts – If True, empty TranscriptionFrames are pushed downstream instead of being discarded. This is intended for cases where VAD fires even though the user did not speak: knowing that nothing was transcribed lets the agent resume speaking instead of waiting longer for a transcription. Defaults to False.

  • settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.

  • ttfs_p99_latency – P99 latency from speech end to final transcript in seconds. Override for your deployment. See https://github.com/pipecat-ai/stt-benchmark

  • **kwargs – Additional arguments passed to SegmentedSTTService.
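The precedence rule for settings versus the deprecated keyword arguments can be sketched as follows. This is an illustration under stated assumptions, not pipecat's implementation: NOT_GIVEN, SketchSettings, and resolve_settings are hypothetical stand-ins that mimic the "not given" sentinel seen in the BaseWhisperSTTSettings signature. A deprecated argument supplies a value only when the settings object leaves that field unset.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


class _NotGivenType:
    """Hypothetical sentinel meaning 'the caller did not set this field'."""

    def __repr__(self) -> str:
        return "NOT_GIVEN"


NOT_GIVEN = _NotGivenType()


@dataclass
class SketchSettings:
    """Stand-in for a settings object; not the pipecat class."""

    model: Any = NOT_GIVEN
    language: Any = NOT_GIVEN
    prompt: Any = NOT_GIVEN
    temperature: Any = NOT_GIVEN


def resolve_settings(
    settings: Optional[SketchSettings] = None, **deprecated: Any
) -> SketchSettings:
    """Merge deprecated keyword arguments with settings; settings win."""
    resolved = SketchSettings()
    for field in fields(SketchSettings):
        # Start from the deprecated keyword argument, if one was passed.
        if deprecated.get(field.name) is not None:
            setattr(resolved, field.name, deprecated[field.name])
        # A value explicitly given in settings overrides it.
        if settings is not None and getattr(settings, field.name) is not NOT_GIVEN:
            setattr(resolved, field.name, getattr(settings, field.name))
    return resolved
```

For example, resolve_settings(SketchSettings(model="new"), model="old", prompt="hi") yields model "new" (from settings) and prompt "hi" (from the deprecated argument), while temperature stays NOT_GIVEN.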

can_generate_metrics() → bool

Whether this service can generate processing metrics.

Returns:

True, as this service supports metric generation.

Return type:

bool

language_to_service_language(language: Language) → str | None

Convert from pipecat Language to service language code.

Parameters:

language – The Language enum value to convert.

Returns:

The corresponding service language code, or None if not supported.

async run_stt(audio: bytes) → AsyncGenerator[Frame, None]

Transcribe audio data to text.

Parameters:

audio – Raw audio data to transcribe.

Yields:

Frame – Either a TranscriptionFrame containing the transcribed text or an ErrorFrame if transcription fails.
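The yield contract above (one transcription frame on success, one error frame on failure) can be sketched without pipecat. Everything here is hypothetical: SketchTranscriptionFrame, SketchErrorFrame, run_stt_sketch, and fake_transcribe are stand-ins, and the point at which empty transcripts are dropped is an assumption based on the push_empty_transcripts description, not a statement about where the real service does it.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncGenerator, Awaitable, Callable, List, Union


@dataclass
class SketchTranscriptionFrame:
    text: str


@dataclass
class SketchErrorFrame:
    error: str


SketchFrame = Union[SketchTranscriptionFrame, SketchErrorFrame]


async def fake_transcribe(audio: bytes) -> str:
    """Stand-in backend: decodes bytes as UTF-8, raising on invalid input."""
    return audio.decode()


async def run_stt_sketch(
    audio: bytes,
    transcribe: Callable[[bytes], Awaitable[str]],
    push_empty_transcripts: bool = False,
) -> AsyncGenerator[SketchFrame, None]:
    """Yield a transcription frame on success or an error frame on failure."""
    try:
        text = (await transcribe(audio)).strip()
    except Exception as exc:
        yield SketchErrorFrame(error=str(exc))
        return
    # Empty results are dropped unless the caller asked to keep them.
    if text or push_empty_transcripts:
        yield SketchTranscriptionFrame(text=text)


async def collect(audio: bytes, transcribe, **options) -> List[SketchFrame]:
    """Drain the async generator into a list for inspection."""
    return [frame async for frame in run_stt_sketch(audio, transcribe, **options)]
```

Running asyncio.run(collect(b"hello", fake_transcribe)) gives a single transcription frame, while invalid bytes such as b"\xff" give a single error frame.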

async push_frame(frame: Frame, direction: FrameDirection = FrameDirection.DOWNSTREAM)

Push a frame, marking TranscriptionFrames as finalized.

Segmented STT services process complete speech segments and return a single TranscriptionFrame per segment, so every transcription is inherently finalized.

Parameters:
  • frame – The frame to push.

  • direction – The direction of frame flow in the pipeline.

async stop_ttfb_metrics(*, end_time: float | None = None)

Stop time-to-first-byte metrics collection and push results.

Parameters:

end_time – Optional timestamp to use as the end time. If None, uses the current time.
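The end_time semantics can be shown with a small timer sketch. TTFBTimerSketch is a hypothetical class, not part of pipecat; the real method is asynchronous and pushes metrics, which this example leaves out. It only illustrates that an explicit end_time is used as given and that the current time is used when it is None.

```python
import time
from typing import Optional


class TTFBTimerSketch:
    """Hypothetical timer illustrating an optional explicit end time."""

    def __init__(self) -> None:
        self._start: Optional[float] = None

    def start(self, start_time: Optional[float] = None) -> None:
        # Use the supplied timestamp, or the current time if none is given.
        self._start = time.time() if start_time is None else start_time

    def stop(self, *, end_time: Optional[float] = None) -> Optional[float]:
        """Return elapsed seconds, or None if the timer was never started."""
        if self._start is None:
            return None
        end = time.time() if end_time is None else end_time
        elapsed = end - self._start
        self._start = None  # A stopped timer must be started again.
        return elapsed
```

Passing end_time is useful when the moment of interest was recorded earlier than the call to stop, for instance a timestamp captured when the response arrived.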