stt

Cartesia Ink-2 Streaming ASR (v2 turn-based) speech-to-text service.

Bases: STTSettings

Settings for CartesiaTurnsSTTService.

The ink-2 model family is English-only and does not support runtime model or language switching, so no fields are added beyond the inherited STTSettings.

class pipecat.services.cartesia.turns.stt.CartesiaTurnsSTTService(*, api_key: str, url: str = 'wss://api.cartesia.ai/stt/turns/websocket', sample_rate: int | None = None, should_interrupt: bool = True, watchdog_min_timeout: float = 0.5, extra_headers: dict[str, str] | None = None, settings: CartesiaTurnsSTTSettings | None = None, **kwargs)[source]

Bases: WebsocketSTTService

Speech-to-text service using the Cartesia Streaming ASR v2 (Ink-2) API.

Speaks the v2 turn-based wire protocol exposed by /stt/turns/websocket. The server drives the conversation:

connected -> turn.start -> turn.update* -> (turn.eager_end -> turn.resume?)*
                        -> turn.end -> ...

Transcripts are cumulative per turn; there is no is_final flag and no finalize command — closing the socket ends the session.
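The event order above can be checked mechanically. The following is a minimal sketch that takes the diagram literally and treats a session as `connected` followed by zero or more complete turns. The function name `is_well_formed` and the regular expression are illustrative assumptions, not part of pipecat or the Cartesia API, and the real server may allow orderings the diagram does not show.

```python
import re

# Literal reading of the documented event order (an assumption, not a formal grammar):
# turn.start, any number of turn.update, any number of (turn.eager_end, optional
# turn.resume) pairs, then turn.end.
_TURN = r"turn\.start( turn\.update)*( turn\.eager_end( turn\.resume)?)* turn\.end"
_SESSION = re.compile(rf"connected( {_TURN})*")


def is_well_formed(events: list[str]) -> bool:
    """Return True if events are 'connected' followed by zero or more complete turns."""
    return _SESSION.fullmatch(" ".join(events)) is not None
```

A session that is closed mid-turn would fail this check, which is expected: closing the socket is the only way to end a session, so a trailing partial turn is normal in practice.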

Each turn.start pushes a UserStartedSpeakingFrame; each turn.update pushes an InterimTranscriptionFrame; turn.end pushes a final TranscriptionFrame followed by a UserStoppedSpeakingFrame. turn.eager_end and turn.resume are surfaced only via their respective event handlers.
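The mapping from server events to pushed frames can be sketched as a lookup table. This is an illustration only: frame types are represented by their names as plain strings so the sketch runs without pipecat installed, and `EVENT_TO_FRAMES` and `frames_for_turn` are hypothetical names, not the service's internals. It also assumes each transcript frame carries the transcript of the event that produced it.

```python
# Documented event -> frame mapping; turn.eager_end and turn.resume push no frames.
EVENT_TO_FRAMES = {
    "turn.start": ["UserStartedSpeakingFrame"],
    "turn.update": ["InterimTranscriptionFrame"],
    "turn.end": ["TranscriptionFrame", "UserStoppedSpeakingFrame"],
}


def frames_for_turn(events):
    """Map (event_type, transcript) pairs to (frame_name, text) pairs.

    Transcripts are cumulative per turn, so each text is used as-is and
    never concatenated with earlier updates.
    """
    out = []
    for event_type, transcript in events:
        for name in EVENT_TO_FRAMES.get(event_type, []):
            carries_text = name.endswith("TranscriptionFrame")
            out.append((name, transcript if carries_text else None))
    return out
```

Because transcripts are cumulative, a consumer that wants the latest text should overwrite its buffer on every interim frame rather than append to it.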

Event handlers available (in addition to the base on_connected / on_disconnected / on_connection_error):

on_turn_start(service, transcript): server detected start of a turn
on_turn_update(service, transcript): incremental transcript update
on_turn_eager_end(service, transcript): server eagerly predicted end of turn
on_turn_resume(service): user resumed speaking after an eager end
on_turn_end(service, transcript): final transcript for the completed turn

Example:

@stt.event_handler("on_turn_end")
async def on_turn_end(service, transcript):
    ...
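One possible use of the eager-end and resume handlers is speculative work: start preparing a reply when the server predicts the end of a turn, and discard it if the user resumes. The sketch below is a hedged illustration under that assumption. `SpeculativeReply` and its `respond` callback are hypothetical and not part of pipecat; only the handler signatures follow the list above.

```python
import asyncio


class SpeculativeReply:
    """Hypothetical helper: start work on eager end, cancel it on resume."""

    def __init__(self, respond):
        self._respond = respond  # async callable that takes the transcript
        self._task = None

    async def on_turn_eager_end(self, service, transcript):
        # A newer prediction replaces any work still in flight.
        self._cancel()
        self._task = asyncio.create_task(self._respond(transcript))

    async def on_turn_resume(self, service):
        # The user kept talking, so the speculative result is stale.
        self._cancel()

    def _cancel(self):
        if self._task is not None and not self._task.done():
            self._task.cancel()
        self._task = None
```

Since the registration decorator shown above takes the event name, the methods could presumably be registered with `stt.event_handler("on_turn_eager_end")(helper.on_turn_eager_end)` and likewise for `on_turn_resume`; treat that wiring as an assumption to verify against the installed version.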

Settings: alias of CartesiaTurnsSTTSettings

__init__(*, api_key: str, url: str = 'wss://api.cartesia.ai/stt/turns/websocket', sample_rate: int | None = None, should_interrupt: bool = True, watchdog_min_timeout: float = 0.5, extra_headers: dict[str, str] | None = None, settings: CartesiaTurnsSTTSettings | None = None, **kwargs)[source]

Initialize the Cartesia Ink-2 STT service.

Parameters:

api_key – Cartesia API key.
url – WebSocket URL for the Cartesia Streaming ASR v2 endpoint.
sample_rate – Audio sample rate in Hz. If None, uses the pipeline sample rate.
should_interrupt – Whether to broadcast an interruption when the server signals the start of a new turn.
watchdog_min_timeout – Minimum idle timeout before sending silence to prevent dangling turns. The actual threshold is max(chunk_duration * 2, watchdog_min_timeout). Defaults to 0.5.
extra_headers – Optional additional HTTP headers to send with the WebSocket handshake.
settings – Runtime-updatable settings. The ink-2 family does not support runtime model or language switching; attempts to update either field will be reported as unhandled.
**kwargs – Additional arguments passed to the parent WebsocketSTTService.
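The watchdog threshold formula given for watchdog_min_timeout can be worked through with a small sketch. The helper names below are illustrative, and the way chunk duration is derived (16-bit mono PCM, durations in seconds) is an assumption rather than the service's actual code.

```python
def chunk_duration_seconds(num_bytes: int, sample_rate: int,
                           bytes_per_sample: int = 2, channels: int = 1) -> float:
    """Duration of one audio chunk, assuming 16-bit mono PCM by default."""
    return num_bytes / (sample_rate * bytes_per_sample * channels)


def watchdog_threshold(chunk_duration: float, watchdog_min_timeout: float = 0.5) -> float:
    """Idle threshold as documented: max(chunk_duration * 2, watchdog_min_timeout)."""
    return max(chunk_duration * 2, watchdog_min_timeout)
```

Under these assumptions, 640-byte chunks at 16 kHz last 0.02 s, so twice the chunk duration (0.04) is below the default minimum and the threshold stays at 0.5. With 0.4 s chunks, twice the chunk duration (0.8) exceeds the minimum and becomes the threshold.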

can_generate_metrics() → bool[source]

Check if this service can generate processing metrics.

Returns: True, as the Cartesia Ink-2 service supports metrics generation.

property supports_ttfs: bool

TTFS doesn’t apply: the server defines turn boundaries directly.

service_metadata_frame() → STTMetadataFrame[source]

Recommend external turn strategies: this service detects turns server-side.

Cartesia’s turn-detection STT defines turn boundaries on the server and emits UserStarted/StoppedSpeakingFrame, so the user aggregator defers to those rather than running local VAD/smart-turn. Applied unless the user passed their own user_turn_strategies.

async start(frame: StartFrame)[source]

Start the STT service and establish the WebSocket connection.

Parameters:

frame – The start frame containing initialization parameters.

async stop(frame: EndFrame)[source]

Stop the STT service and close the WebSocket connection.

Parameters:

frame – The end frame.

async cancel(frame: CancelFrame)[source]

Cancel the STT service and close the WebSocket connection.

Parameters:

frame – The cancel frame.

async run_stt(audio: bytes) → AsyncGenerator[Frame | None, None][source]

Forward raw PCM audio to the server.

Transcription results are delivered asynchronously via the receive task and are not yielded from this method.

Parameters:

audio – Raw 16-bit signed little-endian PCM audio bytes.

Yields:

Frame – None (transcription results are pushed by the receive task), or ErrorFrame on send failure.
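For reference, the audio format expected by run_stt (raw 16-bit signed little-endian PCM) can be produced with the standard library alone. The sketch below generates a mono sine tone in that format. The function name `sine_pcm16le` and its defaults are illustrative assumptions, not part of pipecat.

```python
import math
import struct


def sine_pcm16le(freq_hz: float, duration_s: float,
                 sample_rate: int = 16000, amplitude: float = 0.5) -> bytes:
    """Return a mono sine tone as raw 16-bit signed little-endian PCM bytes."""
    n = round(duration_s * sample_rate)
    samples = (
        int(amplitude * 32767 * math.sin(2 * math.pi * freq_hz * i / sample_rate))
        for i in range(n)
    )
    # "<" selects little-endian byte order, "h" a signed 16-bit integer.
    return struct.pack(f"<{n}h", *samples)
```

Each sample occupies two bytes, so a 20 ms chunk at 16 kHz is 320 samples or 640 bytes. Silence in the same format is simply a run of zero bytes, for example `bytes(640)`.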

async push_frame(frame: Frame, direction: FrameDirection = FrameDirection.DOWNSTREAM)

Push a frame downstream, tracking TranscriptionFrame timestamps for TTFB.

Stores the timestamp of each TranscriptionFrame for TTFB calculation. If the frame is marked as finalized (via request_finalize/confirm_finalize), reports TTFB immediately and cancels any pending timeout. Otherwise, TTFB is reported after a timeout.

Parameters:

frame – The frame to push.
direction – The direction to push the frame.

async stop_ttfb_metrics(*, end_time: float | None = None)

Stop time-to-first-byte metrics collection and push results.

Parameters:

end_time – Optional timestamp to use as the end time. If None, uses the current time.