stt
xAI speech-to-text service implementation.
This module provides integration with xAI’s real-time speech-to-text WebSocket API documented at https://docs.x.ai/developers/rest-api-reference/inference/voice.
- pipecat.services.xai.stt.language_to_xai_stt_language(language: Language) → str
Convert a Language enum to the xAI STT language code.
xAI STT accepts two-letter language codes (e.g. en, fr, de, ja). When set, the server applies Inverse Text Normalization.
- Parameters:
language – The Language enum value to convert.
- Returns:
The corresponding service language code. If language is not in the verified mapping, falls back to the base language code (e.g., en from en-US) and logs a warning (via resolve_language(..., use_base_code=True)).
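The fallback described above can be sketched roughly as follows. This is a standalone illustration, not pipecat's implementation: the function name to_service_code and the VERIFIED_CODES table are hypothetical, and the real mapping and resolve_language helper live in the library.

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical subset of verified codes, for illustration only.
VERIFIED_CODES = {"en": "en", "fr": "fr", "de": "de", "ja": "ja"}


def to_service_code(tag: str, verified: dict[str, str] = VERIFIED_CODES) -> str:
    """Return the mapped code, or fall back to the base language with a warning."""
    if tag in verified:
        return verified[tag]
    # Fall back to the part before the region subtag, e.g. "en" from "en-US".
    base = tag.split("-")[0].lower()
    logger.warning("Language %s is not in the verified mapping; using %s", tag, base)
    return base
```

With this sketch, to_service_code("en-US") yields "en" and logs a warning, while to_service_code("fr") returns "fr" silently.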
- class pipecat.services.xai.stt.XAISTTSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any] = <factory>, language: Language | str | None | _NotGiven = <factory>, interim_results: bool | _NotGiven = <factory>, endpointing: int | None | _NotGiven = <factory>, multichannel: bool | None | _NotGiven = <factory>, channels: int | None | _NotGiven = <factory>, diarize: bool | None | _NotGiven = <factory>)
Bases: STTSettings
Settings for XAISTTService.
- Parameters:
interim_results – When True, partial transcripts are emitted approximately every 500ms.
endpointing – Silence duration in milliseconds that triggers a speech-final event. Range 0-5000. Server default is 10ms.
multichannel – When True, transcribes each interleaved channel independently. Requires channels >= 2.
channels – Number of interleaved channels (2-8). Required when multichannel is True.
diarize – When True, the server attaches a speaker field to each word identifying the detected speaker.
- interim_results: bool | _NotGiven
- endpointing: int | None | _NotGiven
- multichannel: bool | None | _NotGiven
- channels: int | None | _NotGiven
- diarize: bool | None | _NotGiven
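The constraints listed for these settings (endpointing range, channel count, and the multichannel/channels dependency) can be checked before connecting. The helper below is a hypothetical sketch that mirrors only the rules stated above. It is not part of pipecat, and the library or server may validate differently.

```python
from typing import Optional


def validate_stt_options(
    endpointing: Optional[int] = None,
    multichannel: Optional[bool] = None,
    channels: Optional[int] = None,
) -> list[str]:
    """Return a list of problems; an empty list means the options are consistent."""
    problems = []
    # endpointing is a silence duration in milliseconds, documented range 0-5000.
    if endpointing is not None and not 0 <= endpointing <= 5000:
        problems.append("endpointing must be within 0-5000 ms")
    # multichannel transcription needs an explicit channel count.
    if multichannel and channels is None:
        problems.append("channels is required when multichannel is True")
    # The documented channel range is 2-8.
    if channels is not None and not 2 <= channels <= 8:
        problems.append("channels must be within 2-8")
    return problems
```

For example, validate_stt_options(multichannel=True) reports the missing channels value, while validate_stt_options(multichannel=True, channels=2) returns an empty list.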
- class pipecat.services.xai.stt.XAISTTService(*, api_key: str, ws_url: str = 'wss://api.x.ai/v1/stt', sample_rate: int = 16000, encoding: str = 'pcm', settings: XAISTTSettings | None = None, ttfs_p99_latency: float | None = 2.14, **kwargs)
Bases: WebsocketSTTService
xAI real-time speech-to-text service.
Streams audio to xAI’s WebSocket STT endpoint and emits interim and final transcription frames. The XAI_API_KEY is passed directly as a Bearer token on the WebSocket handshake.
The connection is persistent: audio is streamed continuously and the server emits transcript.partial events with is_final and speech_final flags to mark utterance boundaries. If the connection drops mid-session, the base class reconnects automatically.
- Settings
alias of XAISTTSettings
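To illustrate how the is_final and speech_final flags on transcript.partial events might be used to assemble utterances, here is a minimal standalone sketch. The event shape is an assumption: the "type" and "text" keys, and the reading of is_final as "this segment is stable" and speech_final as "the utterance has ended", are not confirmed by this page. The UtteranceAssembler class is hypothetical and is not how XAISTTService is implemented.

```python
from typing import Optional


class UtteranceAssembler:
    """Collects finalized segments until the server marks the end of speech."""

    def __init__(self) -> None:
        self._segments: list[str] = []

    def feed(self, event: dict) -> Optional[str]:
        """Return the full utterance when speech_final arrives, else None."""
        if event.get("type") != "transcript.partial":
            return None
        text = event.get("text", "").strip()  # "text" key is an assumption
        # Keep only segments the server has marked as final; skip interim text.
        if event.get("is_final") and text:
            self._segments.append(text)
        # speech_final closes the utterance and resets the buffer.
        if event.get("speech_final"):
            utterance = " ".join(self._segments)
            self._segments.clear()
            return utterance or None
        return None
```

Interim events (is_final false) are ignored by this sketch. A real pipeline would typically surface them as interim transcription frames instead.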
- __init__(*, api_key: str, ws_url: str = 'wss://api.x.ai/v1/stt', sample_rate: int = 16000, encoding: str = 'pcm', settings: XAISTTSettings | None = None, ttfs_p99_latency: float | None = 2.14, **kwargs)
Initialize the xAI STT service.
- Parameters:
api_key – xAI API key (used as Bearer for the WebSocket handshake).
ws_url – WebSocket endpoint URL. Defaults to wss://api.x.ai/v1/stt.
sample_rate – Audio sample rate in Hz. Supported values: 8000, 16000, 22050, 24000, 44100, 48000. Defaults to 16000.
encoding – Audio encoding. One of "pcm" (signed 16-bit LE), "mulaw", or "alaw". Defaults to "pcm".
settings – Runtime-updatable settings overriding defaults.
ttfs_p99_latency – P99 latency from speech end to final transcript in seconds. See https://github.com/pipecat-ai/stt-benchmark.
**kwargs – Additional arguments passed to WebsocketSTTService.
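The sample_rate and encoding values determine how many bytes each audio chunk carries: 16-bit PCM uses 2 bytes per sample, while mu-law and A-law use 1 byte per sample. The helper below is a hypothetical sketch for sizing chunks. The 20 ms default chunk duration is an illustrative assumption, not something this page specifies.

```python
SUPPORTED_SAMPLE_RATES = {8000, 16000, 22050, 24000, 44100, 48000}
BYTES_PER_SAMPLE = {"pcm": 2, "mulaw": 1, "alaw": 1}


def chunk_size_bytes(
    sample_rate: int, encoding: str, chunk_ms: int = 20, channels: int = 1
) -> int:
    """Return the byte length of one audio chunk of chunk_ms milliseconds."""
    if sample_rate not in SUPPORTED_SAMPLE_RATES:
        raise ValueError(f"unsupported sample rate: {sample_rate}")
    if encoding not in BYTES_PER_SAMPLE:
        raise ValueError(f"unsupported encoding: {encoding}")
    # Samples per channel in the chunk, rounded down to a whole sample.
    samples = sample_rate * chunk_ms // 1000
    return samples * BYTES_PER_SAMPLE[encoding] * channels
```

With the defaults (16000 Hz, "pcm"), a 20 ms mono chunk is 320 samples, or 640 bytes.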
- can_generate_metrics() → bool
Check if the service can generate metrics.
- Returns:
True if metrics generation is supported.
- language_to_service_language(language: Language) → str | None
Convert a Language enum to the xAI STT language code.
- async start(frame: StartFrame)
Start the speech-to-text service.
- async cancel(frame: CancelFrame)
Cancel the speech-to-text service.
- async run_stt(audio: bytes) → AsyncGenerator[Frame | None, None]
Forward raw audio bytes to the xAI STT WebSocket.
Transcription frames are pushed from the receive task, not yielded from this coroutine.
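The split between a send-only run_stt and a separate receive task can be sketched with plain asyncio. Everything here is a stand-in: FakeSocket and SketchSTT are hypothetical classes that mimic the pattern, not the WebSocket client or the service class from pipecat.

```python
import asyncio
from typing import AsyncGenerator, Optional


class FakeSocket:
    """Stand-in for a WebSocket: queues one fake transcript per chunk sent."""

    def __init__(self) -> None:
        self._inbox: asyncio.Queue = asyncio.Queue()

    async def send(self, audio: bytes) -> None:
        await self._inbox.put(f"transcript for {len(audio)} bytes")

    async def recv(self) -> str:
        return await self._inbox.get()


class SketchSTT:
    def __init__(self, socket: FakeSocket) -> None:
        self._socket = socket
        self.pushed: list[str] = []

    async def run_stt(self, audio: bytes) -> AsyncGenerator[Optional[str], None]:
        # Only forward audio; no transcript is produced here.
        await self._socket.send(audio)
        yield None

    async def receive_loop(self) -> None:
        # Runs as its own task and pushes results as they arrive.
        while True:
            self.pushed.append(await self._socket.recv())


async def demo() -> list[str]:
    stt = SketchSTT(FakeSocket())
    receiver = asyncio.create_task(stt.receive_loop())
    for chunk in (b"\x00" * 640, b"\x00" * 320):
        async for item in stt.run_stt(chunk):
            assert item is None  # run_stt never yields transcripts itself
    await asyncio.sleep(0.01)  # give the receive task a chance to drain
    receiver.cancel()
    try:
        await receiver
    except asyncio.CancelledError:
        pass
    return stt.pushed
```

Running asyncio.run(demo()) returns the two transcripts collected by the receive task, while every value yielded by run_stt is None.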
- async push_frame(frame: Frame, direction: FrameDirection = FrameDirection.DOWNSTREAM)
Push a frame downstream, tracking TranscriptionFrame timestamps for TTFB.
Stores the timestamp of each TranscriptionFrame for TTFB calculation. If the frame is marked as finalized (via request_finalize/confirm_finalize), reports TTFB immediately and cancels any pending timeout. Otherwise, TTFB is reported after a timeout.
- Parameters:
frame – The frame to push.
direction – The direction to push the frame.
- async stop_ttfb_metrics(*, end_time: float | None = None)
Stop time-to-first-byte metrics collection and push results.
- Parameters:
end_time – Optional timestamp to use as the end time. If None, uses the current time.