base_stt
Base class for Whisper-based speech-to-text services.
This module provides common functionality for services implementing the Whisper API interface, including language mapping, metrics generation, and error handling.
- class pipecat.services.whisper.base_stt.BaseWhisperSTTSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any]=<factory>, language: Language | str | None | _NotGiven = <factory>, prompt: str | None | _NotGiven = <factory>, temperature: float | None | _NotGiven = <factory>)[source]
Bases: STTSettings
Settings for BaseWhisperSTTService.
- Parameters:
prompt – Optional text to guide the model’s style or continue a previous segment.
temperature – Sampling temperature between 0 and 1.
- prompt: str | None | _NotGiven
- temperature: float | None | _NotGiven
- pipecat.services.whisper.base_stt.language_to_whisper_language(language: Language) → str[source]
Maps pipecat Language enum to Whisper API language codes.
Language support for Whisper API. Docs: https://platform.openai.com/docs/guides/speech-to-text#supported-languages
- Parameters:
language – A Language enum value representing the input language.
- Returns:
The corresponding service language code. If language is not in the verified mapping, falls back to the base language code (e.g., en from en-US) and logs a warning (via resolve_language(..., use_base_code=True)).
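The fallback described above can be sketched in isolation. This is a minimal illustration and not pipecat's implementation: SketchLanguage, VERIFIED_CODES, and to_whisper_code are hypothetical stand-ins for the Language enum, the verified mapping, and the real lookup, and the mapping shown is an invented subset.

```python
import logging
from enum import Enum

logger = logging.getLogger(__name__)


class SketchLanguage(str, Enum):
    """Hypothetical stand-in for pipecat's Language enum."""

    EN = "en"
    EN_US = "en-US"
    FR = "fr"


# Hypothetical subset of a verified mapping; the real table is larger.
VERIFIED_CODES = {SketchLanguage.EN: "en", SketchLanguage.FR: "fr"}


def to_whisper_code(language: SketchLanguage) -> str:
    """Return the verified code, or fall back to the base language code."""
    code = VERIFIED_CODES.get(language)
    if code is not None:
        return code
    # Not verified: strip the region suffix (en-US -> en) and warn.
    base_code = language.value.split("-")[0]
    logger.warning(
        "%s is not in the verified mapping; falling back to %s",
        language.value,
        base_code,
    )
    return base_code
```

With this sketch, a regional variant such as SketchLanguage.EN_US resolves to en and emits a warning, while SketchLanguage.FR resolves directly from the mapping.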
- class pipecat.services.whisper.base_stt.BaseWhisperSTTService(*, model: str | None = None, api_key: str | None = None, base_url: str | None = None, language: Language | None = None, prompt: str | None = None, temperature: float | None = None, include_prob_metrics: bool = False, push_empty_transcripts: bool = False, settings: BaseWhisperSTTSettings | None = None, ttfs_p99_latency: float | None = 1.0, **kwargs)[source]
Bases: SegmentedSTTService
Base class for Whisper-based speech-to-text services.
Provides common functionality for services implementing the Whisper API interface, including metrics generation and error handling.
- Settings
alias of BaseWhisperSTTSettings
- __init__(*, model: str | None = None, api_key: str | None = None, base_url: str | None = None, language: Language | None = None, prompt: str | None = None, temperature: float | None = None, include_prob_metrics: bool = False, push_empty_transcripts: bool = False, settings: BaseWhisperSTTSettings | None = None, ttfs_p99_latency: float | None = 1.0, **kwargs)[source]
Initialize the Whisper STT service.
- Parameters:
model – Name of the Whisper model to use.
Deprecated since version 0.0.105: Use settings=BaseWhisperSTTService.Settings(model=...) instead.
api_key – Service API key. Defaults to None.
base_url – Service API base URL. Defaults to None.
language – Language of the audio input.
Deprecated since version 0.0.105: Use settings=BaseWhisperSTTService.Settings(language=...) instead.
prompt – Optional text to guide the model’s style or continue a previous segment.
Deprecated since version 0.0.105: Use settings=BaseWhisperSTTService.Settings(prompt=...) instead.
temperature – Sampling temperature between 0 and 1.
Deprecated since version 0.0.105: Use settings=BaseWhisperSTTService.Settings(temperature=...) instead.
include_prob_metrics – If True, enables probability metrics in API response. Each service implements this differently (see child classes). Defaults to False.
push_empty_transcripts – If True, empty TranscriptionFrames are pushed downstream instead of being discarded. This is intended for situations where VAD fires even though the user did not speak. In these cases, knowing that nothing was transcribed lets the agent resume speaking instead of waiting longer for a transcription. Defaults to False.
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
ttfs_p99_latency – P99 latency from speech end to final transcript in seconds. Override for your deployment. See https://github.com/pipecat-ai/stt-benchmark
**kwargs – Additional arguments passed to SegmentedSTTService.
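The precedence rule for settings versus the deprecated keyword arguments can be illustrated with a self-contained sketch. Everything here is an assumption made for illustration: SketchSettings, NOT_GIVEN, and resolve_settings are hypothetical names, and the real service may merge values differently. The sketch only shows the stated behaviour that a value given through settings wins over the same value given as a deprecated parameter.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


class _SketchNotGiven:
    """Hypothetical sentinel meaning 'the caller did not set this field'."""

    def __repr__(self) -> str:
        return "NOT_GIVEN"


NOT_GIVEN = _SketchNotGiven()


@dataclass
class SketchSettings:
    """Hypothetical stand-in for BaseWhisperSTTSettings."""

    model: Any = field(default_factory=lambda: NOT_GIVEN)
    language: Any = field(default_factory=lambda: NOT_GIVEN)
    prompt: Any = field(default_factory=lambda: NOT_GIVEN)
    temperature: Any = field(default_factory=lambda: NOT_GIVEN)


def resolve_settings(
    settings: Optional[SketchSettings] = None, **deprecated: Any
) -> Dict[str, Any]:
    """Merge deprecated keyword arguments with settings; settings win."""
    # Start from deprecated arguments the caller actually passed.
    resolved = {name: value for name, value in deprecated.items() if value is not None}
    if settings is not None:
        for name, value in vars(settings).items():
            # Only fields explicitly set on settings override anything.
            if value is not NOT_GIVEN:
                resolved[name] = value
    return resolved
```

For example, resolve_settings(SketchSettings(model="b"), model="a", temperature=0.2) keeps temperature from the deprecated argument but takes model from settings.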
- can_generate_metrics() → bool[source]
Whether this service can generate processing metrics.
- Returns:
True, as this service supports metric generation.
- Return type:
bool
- language_to_service_language(language: Language) → str | None[source]
Convert from pipecat Language to service language code.
- Parameters:
language – The Language enum value to convert.
- Returns:
The corresponding service language code, or None if not supported.
- async run_stt(audio: bytes) → AsyncGenerator[Frame, None][source]
Transcribe audio data to text.
- Parameters:
audio – Raw audio data to transcribe.
- Yields:
Frame – Either a TranscriptionFrame containing the transcribed text or an ErrorFrame if transcription fails.
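The yield contract, together with the push_empty_transcripts flag from the constructor, can be sketched as an async generator. This is not the library's code: the frame classes, run_stt_sketch, collect, and the fake transcribers are hypothetical stand-ins, and stripping whitespace before the emptiness check is an assumption.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncGenerator, Awaitable, Callable, List, Union


@dataclass
class SketchTranscriptionFrame:
    """Hypothetical stand-in for TranscriptionFrame."""

    text: str


@dataclass
class SketchErrorFrame:
    """Hypothetical stand-in for ErrorFrame."""

    error: str


SketchFrame = Union[SketchTranscriptionFrame, SketchErrorFrame]
Transcriber = Callable[[bytes], Awaitable[str]]


async def run_stt_sketch(
    audio: bytes, transcribe: Transcriber, push_empty_transcripts: bool = False
) -> AsyncGenerator[SketchFrame, None]:
    """Yield one transcription frame, or an error frame on failure."""
    try:
        text = (await transcribe(audio)).strip()  # assumption: whitespace is trimmed
    except Exception as exc:
        yield SketchErrorFrame(error=f"transcription failed: {exc}")
        return
    # Empty results are dropped unless the caller opted in.
    if text or push_empty_transcripts:
        yield SketchTranscriptionFrame(text=text)


async def collect(
    audio: bytes, transcribe: Transcriber, push_empty_transcripts: bool = False
) -> List[SketchFrame]:
    """Drain the generator into a list for inspection."""
    return [f async for f in run_stt_sketch(audio, transcribe, push_empty_transcripts)]


# Fake transcribers standing in for a real Whisper API call.
async def hello(audio: bytes) -> str:
    return " hello "


async def silence(audio: bytes) -> str:
    return ""


async def broken(audio: bytes) -> str:
    raise RuntimeError("backend unavailable")
```

Running asyncio.run(collect(b"", silence)) produces no frames, while passing push_empty_transcripts=True produces a single frame with empty text.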
- async push_frame(frame: Frame, direction: FrameDirection = FrameDirection.DOWNSTREAM)
Push a frame, marking TranscriptionFrames as finalized.
Segmented STT services process complete speech segments and return a single TranscriptionFrame per segment, so every transcription is inherently finalized.
- Parameters:
frame – The frame to push.
direction – The direction of frame flow in the pipeline.
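A rough sketch of the finalization step follows. The finalized attribute, the frame classes, and SketchSegmentedSTT are all hypothetical; the real method is async and forwards frames through the pipeline, whereas this synchronous version only records them so the marking logic is easy to see.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SketchFrame:
    """Hypothetical base frame with no fields."""


@dataclass
class SketchTranscriptionFrame(SketchFrame):
    """Hypothetical transcription frame; 'finalized' is an assumed attribute."""

    text: str = ""
    finalized: bool = False


class SketchSegmentedSTT:
    """Records pushed frames, marking every transcription as finalized."""

    def __init__(self) -> None:
        self.pushed: List[SketchFrame] = []

    def push_frame(self, frame: SketchFrame) -> None:
        # One transcription per complete segment, so each one is final.
        if isinstance(frame, SketchTranscriptionFrame):
            frame.finalized = True
        self.pushed.append(frame)
```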
- async stop_ttfb_metrics(*, end_time: float | None = None)
Stop time-to-first-byte metrics collection and push results.
- Parameters:
end_time – Optional timestamp to use as the end time. If None, uses the current time.
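The end_time default can be illustrated with a small timer. SketchTTFBTimer is a hypothetical helper, not part of pipecat, and it returns the measurement instead of pushing a metrics frame; it only shows how an explicit end_time replaces the current time when computing the interval.

```python
import time
from typing import Optional


class SketchTTFBTimer:
    """Hypothetical timer mirroring the end_time behaviour described above."""

    def __init__(self) -> None:
        self._start: Optional[float] = None

    def start(self, *, start_time: Optional[float] = None) -> None:
        self._start = start_time if start_time is not None else time.time()

    def stop(self, *, end_time: Optional[float] = None) -> Optional[float]:
        """Return elapsed seconds, or None if the timer was not running."""
        if self._start is None:
            return None
        # Use the caller's timestamp when given, otherwise the current time.
        end = end_time if end_time is not None else time.time()
        elapsed = end - self._start
        self._start = None
        return elapsed
```

Passing an explicit end_time is useful when the moment the first result arrived was recorded earlier than the call to stop.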