stt
Cartesia Ink-2 Streaming ASR (v2 turn-based) speech-to-text service.
- class pipecat.services.cartesia.turns.stt.CartesiaTurnsSTTSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any]=<factory>, language: Language | str | None | _NotGiven = <factory>)[source]
Bases: STTSettings
Settings for CartesiaTurnsSTTService.
The ink-2 model family is English-only and does not support runtime model or language switching, so no fields are added beyond the inherited STTSettings.
- class pipecat.services.cartesia.turns.stt.CartesiaTurnsSTTService(*, api_key: str, url: str = 'wss://api.cartesia.ai/stt/turns/websocket', sample_rate: int | None = None, should_interrupt: bool = True, watchdog_min_timeout: float = 0.5, extra_headers: dict[str, str] | None = None, settings: CartesiaTurnsSTTSettings | None = None, **kwargs)[source]
Bases: WebsocketSTTService
Speech-to-text service using the Cartesia Streaming ASR v2 (Ink-2) API.
Speaks the v2 turn-based wire protocol exposed by /stt/turns/websocket. The server drives the conversation:
connected -> turn.start -> turn.update* -> (turn.eager_end -> turn.resume?)* -> turn.end -> ...
Transcripts are cumulative per turn; there is no is_final flag and no finalize command. Closing the socket ends the session.
Each turn.start pushes a UserStartedSpeakingFrame; each turn.update pushes an InterimTranscriptionFrame; turn.end pushes a final TranscriptionFrame followed by a UserStoppedSpeakingFrame. turn.eager_end and turn.resume are surfaced only via their respective event handlers.
Event handlers available (in addition to the base on_connected / on_disconnected / on_connection_error):
on_turn_start(service, transcript): server detected start of a turn
on_turn_update(service, transcript): incremental transcript update
on_turn_eager_end(service, transcript): server eagerly predicted end of turn
on_turn_resume(service): user resumed speaking after an eager end
on_turn_end(service, transcript): final transcript for the completed turn
Example:
@stt.event_handler("on_turn_end")
async def on_turn_end(service, transcript):
    ...
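The event-to-frame behaviour described above can be sketched as a small lookup table. This is only an illustration of the documented mapping, not the service's actual implementation: the names FRAMES_BY_EVENT and frames_for_events are hypothetical, and frames are represented by their class names as plain strings so the sketch runs without pipecat installed.

```python
from typing import Iterable

# Frame names come from the description above. The table itself is a
# hypothetical stand-in for illustration, not pipecat's internal code.
FRAMES_BY_EVENT = {
    "turn.start": ["UserStartedSpeakingFrame"],
    "turn.update": ["InterimTranscriptionFrame"],
    "turn.end": ["TranscriptionFrame", "UserStoppedSpeakingFrame"],
    # These two events only trigger event handlers; no frames are pushed.
    "turn.eager_end": [],
    "turn.resume": [],
}


def frames_for_events(events: Iterable[str]) -> list[str]:
    """Return the frame class names pushed for a sequence of server events."""
    pushed: list[str] = []
    for event in events:
        # Events without an entry (such as "connected") push no frames.
        pushed.extend(FRAMES_BY_EVENT.get(event, []))
    return pushed


# One turn in which the server predicts an early end and the user resumes.
turn = [
    "connected",
    "turn.start",
    "turn.update",
    "turn.eager_end",
    "turn.resume",
    "turn.update",
    "turn.end",
]
print(frames_for_events(turn))
```

Because transcripts are cumulative per turn, a consumer of the interim frames would normally replace its displayed text with each update rather than append to it.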
- Settings
alias of CartesiaTurnsSTTSettings
- __init__(*, api_key: str, url: str = 'wss://api.cartesia.ai/stt/turns/websocket', sample_rate: int | None = None, should_interrupt: bool = True, watchdog_min_timeout: float = 0.5, extra_headers: dict[str, str] | None = None, settings: CartesiaTurnsSTTSettings | None = None, **kwargs)[source]
Initialize the Cartesia Ink-2 STT service.
- Parameters:
api_key – Cartesia API key.
url – WebSocket URL for the Cartesia Streaming ASR v2 endpoint.
sample_rate – Audio sample rate in Hz. If None, uses the pipeline sample rate.
should_interrupt – Whether to broadcast an interruption when the server signals the start of a new turn.
watchdog_min_timeout – Minimum idle timeout before sending silence to prevent dangling turns. The actual threshold is max(chunk_duration * 2, watchdog_min_timeout). Defaults to 0.5.
extra_headers – Optional additional HTTP headers to send with the WebSocket handshake.
settings – Runtime-updatable settings. The ink-2 family does not support runtime model or language switching; attempts to update either field will be reported as unhandled.
**kwargs – Additional arguments passed to the parent WebsocketSTTService.
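To make the watchdog threshold formula concrete, here is a minimal sketch of max(chunk_duration * 2, watchdog_min_timeout). The function name watchdog_threshold is hypothetical, and deriving chunk_duration from the byte count assumes mono 16-bit PCM (2 bytes per sample); the service may compute the chunk duration differently.

```python
def watchdog_threshold(
    chunk_bytes: int,
    sample_rate: int,
    watchdog_min_timeout: float = 0.5,
) -> float:
    """Idle threshold following the documented formula."""
    # Assumption: mono 16-bit PCM, so each sample occupies 2 bytes.
    chunk_duration = chunk_bytes / (sample_rate * 2)
    return max(chunk_duration * 2, watchdog_min_timeout)


# A 640-byte chunk at 16 kHz lasts 0.02 of a time unit, so the minimum applies.
print(watchdog_threshold(640, 16000))
# A 32000-byte chunk at 16 kHz lasts 1.0, so twice the chunk duration applies.
print(watchdog_threshold(32000, 16000))
```

With small audio chunks the floor set by watchdog_min_timeout dominates; the chunk-based term only matters when chunks are longer than half the minimum timeout.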
- can_generate_metrics() -> bool[source]
Check if this service can generate processing metrics.
- Returns:
True, as the Cartesia Ink-2 service supports metrics generation.
- property supports_ttfs: bool
TTFS doesn’t apply: the server defines turn boundaries directly.
- async start(frame: StartFrame)[source]
Start the STT service and establish the WebSocket connection.
- Parameters:
frame – The start frame containing initialization parameters.
- async stop(frame: EndFrame)[source]
Stop the STT service and close the WebSocket connection.
- Parameters:
frame – The end frame.
- async cancel(frame: CancelFrame)[source]
Cancel the STT service and close the WebSocket connection.
- Parameters:
frame – The cancel frame.
- async run_stt(audio: bytes) -> AsyncGenerator[Frame | None, None][source]
Forward raw PCM audio to the server.
Transcription results are delivered asynchronously via the receive task and are not yielded from this method.
- Parameters:
audio – Raw 16-bit signed little-endian PCM audio bytes.
- Yields:
Frame – None (transcription results are pushed by the receive task), or ErrorFrame on send failure.
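Since run_stt expects raw 16-bit signed little-endian PCM, audio held as floating-point samples needs converting first. The following is a rough sketch using only the standard library; floats_to_pcm16le is a hypothetical helper, not part of pipecat, and it assumes input samples in the range -1.0 to 1.0.

```python
import struct
from typing import Sequence


def floats_to_pcm16le(samples: Sequence[float]) -> bytes:
    """Encode float samples as 16-bit signed little-endian PCM bytes."""
    # Clip to the valid range so out-of-range input cannot overflow int16.
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    # "<" selects little-endian byte order, "h" a signed 16-bit integer.
    return struct.pack(f"<{len(ints)}h", *ints)


pcm = floats_to_pcm16le([0.0, 1.0, -1.0])
print(pcm.hex())
```

Each sample becomes exactly two bytes, so the length of the resulting buffer is twice the number of input samples.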
- async push_frame(frame: Frame, direction: FrameDirection = FrameDirection.DOWNSTREAM)
Push a frame downstream, tracking TranscriptionFrame timestamps for TTFB.
Stores the timestamp of each TranscriptionFrame for TTFB calculation. If the frame is marked as finalized (via request_finalize/confirm_finalize), reports TTFB immediately and cancels any pending timeout. Otherwise, TTFB is reported after a timeout.
- Parameters:
frame – The frame to push.
direction – The direction to push the frame.
- async stop_ttfb_metrics(*, end_time: float | None = None)
Stop time-to-first-byte metrics collection and push results.
- Parameters:
end_time – Optional timestamp to use as the end time. If None, uses the current time.