stt

NVIDIA Nemotron ASR STT service backed by an AWS SageMaker bidirectional-stream endpoint.

Uses SageMaker’s HTTP/2 bidi-stream API to maintain a persistent connection to the wrapper’s /invocations-bidirectional-stream endpoint, which proxies to NIM’s realtime WebSocket.

Audio is streamed as base64-encoded PCM16 chunks via input_audio_buffer.append events. Transcription deltas arrive as InterimTranscriptionFrames and final results as TranscriptionFrames.

When the VAD detects the user has stopped speaking, input_audio_buffer.commit is sent to trigger NIM to finalise the current utterance.
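The event flow described above can be sketched roughly as follows. This is an illustration, not the service's actual implementation: the helper names make_append_event and make_commit_event are hypothetical, and the JSON shape (a "type" field plus a base64 "audio" field) is an assumption modeled on common realtime-API conventions. The exact payload NIM expects may differ.

```python
import base64
import json


def make_append_event(pcm16_chunk: bytes) -> str:
    """Wrap raw PCM16 bytes in a hypothetical input_audio_buffer.append event."""
    return json.dumps(
        {
            "type": "input_audio_buffer.append",
            # The field name "audio" is an assumption, not confirmed by the docs.
            "audio": base64.b64encode(pcm16_chunk).decode("ascii"),
        }
    )


def make_commit_event() -> str:
    """Build a hypothetical input_audio_buffer.commit event."""
    return json.dumps({"type": "input_audio_buffer.commit"})


# 10 ms of silence at 16 kHz, mono, 16-bit: 160 samples * 2 bytes each.
chunk = b"\x00\x00" * 160
append_event = make_append_event(chunk)
commit_event = make_commit_event()
```

Base64 keeps the binary audio safe inside a JSON text event, at the cost of roughly a third more bytes on the wire.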

class pipecat.services.nvidia.sagemaker.stt.NvidiaSageMakerSTTSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any]=<factory>, language: Language | str | None | _NotGiven = <factory>)[source]

Bases: STTSettings

Settings for NvidiaSageMakerSTTService.

Parameters:

language – Language code passed to NIM, for example en-US (an ISO 639-1 language code with a region subtag).

class pipecat.services.nvidia.sagemaker.stt.NvidiaSageMakerSTTService(*, endpoint_name: str, region: str = 'us-west-2', sample_rate: int | None = None, settings: NvidiaSageMakerSTTSettings | None = None, ttfs_p99_latency: float | None = 1.5, **kwargs)[source]

Bases: STTService

NVIDIA Nemotron ASR STT service using SageMaker bidirectional streaming.

Maintains a persistent HTTP/2 bidi-stream session to the SageMaker endpoint for the lifetime of the pipeline. Audio chunks are forwarded as base64-encoded PCM16 via NIM realtime events; transcription results arrive asynchronously and are pushed as InterimTranscriptionFrame and TranscriptionFrame frames.

Example:

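import os

from pipecat.services.nvidia.sagemaker.stt import NvidiaSageMakerSTTService

# Endpoint name and region are read from the environment.
stt = NvidiaSageMakerSTTService(
    endpoint_name=os.getenv("SAGEMAKER_ASR_ENDPOINT_NAME"),
    region=os.getenv("AWS_REGION", "us-west-2"),
    settings=NvidiaSageMakerSTTService.Settings(
        language="en-US",
    ),
)

Settings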

alias of NvidiaSageMakerSTTSettings

__init__(*, endpoint_name: str, region: str = 'us-west-2', sample_rate: int | None = None, settings: NvidiaSageMakerSTTSettings | None = None, ttfs_p99_latency: float | None = 1.5, **kwargs)[source]

Initialize the SageMaker bidirectional-stream STT service.

Parameters:
  • endpoint_name – Name of the deployed SageMaker endpoint.

  • region – AWS region where the endpoint lives.

  • sample_rate – Input sample rate in Hz. If None, the pipeline's sample rate is used.

  • settings – Runtime-updatable settings (language, model).

  • ttfs_p99_latency – Expected p99 time-to-first-segment latency in seconds.

  • **kwargs – Forwarded to STTService.

can_generate_metrics() bool[source]

Check if this service can generate processing metrics.

Returns:

True, as this service supports metrics generation.

async start(frame: StartFrame)[source]

Start the STT service and connect to the SageMaker endpoint.

Parameters:

frame – The start frame containing initialization parameters.

async stop(frame: EndFrame)[source]

Stop the STT service and disconnect from the SageMaker endpoint.

Parameters:

frame – The end frame.

async cancel(frame: CancelFrame)[source]

Cancel the STT service and disconnect from the SageMaker endpoint.

Parameters:

frame – The cancel frame.

async run_stt(audio: bytes) AsyncGenerator[Frame | None, None][source]

Send an audio chunk to NIM; transcription results arrive asynchronously.

Each chunk is appended and immediately committed, matching the NVIDIA reference client pattern for continuous streaming transcription.

async process_frame(frame: Frame, direction: FrameDirection)[source]

Process frames with VAD-specific handling for metrics lifecycle.

Parameters:
  • frame – The frame to process.

  • direction – The direction of frame processing.

async push_frame(frame: Frame, direction: FrameDirection = FrameDirection.DOWNSTREAM)

Push a frame downstream, tracking TranscriptionFrame timestamps for TTFB.

Stores the timestamp of each TranscriptionFrame for TTFB calculation. If the frame is marked as finalized (via request_finalize/confirm_finalize), reports TTFB immediately and cancels any pending timeout. Otherwise, TTFB is reported after a timeout.

Parameters:
  • frame – The frame to push.

  • direction – The direction to push the frame.
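
The finalize-or-timeout behaviour of push_frame described above can be illustrated with a small standalone sketch. The TTFBTracker class and all its method names are hypothetical and do not exist in pipecat. Measuring TTFB from a start() call, and restarting the timeout on each non-final transcription, are both assumptions made for the sake of the example.

```python
import asyncio
import time
from typing import List, Optional


class TTFBTracker:
    """Hypothetical illustration of finalize-or-timeout TTFB reporting."""

    def __init__(self, timeout_s: float):
        self._timeout_s = timeout_s
        self._start_time: Optional[float] = None
        self._last_transcript_time: Optional[float] = None
        self._timeout_task: Optional[asyncio.Task] = None
        self.reported: List[float] = []

    def start(self) -> None:
        """Mark the start of the measurement window."""
        self._start_time = time.monotonic()

    def on_transcription(self, finalized: bool) -> None:
        """Record the transcript time, then report now or after a timeout."""
        self._last_transcript_time = time.monotonic()
        # Cancel any pending timeout before deciding how to report.
        if self._timeout_task is not None:
            self._timeout_task.cancel()
            self._timeout_task = None
        if finalized:
            self._report()
        else:
            self._timeout_task = asyncio.create_task(self._report_after_timeout())

    async def _report_after_timeout(self) -> None:
        await asyncio.sleep(self._timeout_s)
        self._timeout_task = None
        self._report()

    def _report(self) -> None:
        if self._start_time is None or self._last_transcript_time is None:
            return
        self.reported.append(self._last_transcript_time - self._start_time)
        self._start_time = None  # one report per measurement window


async def demo() -> List[float]:
    tracker = TTFBTracker(timeout_s=0.05)
    tracker.start()
    tracker.on_transcription(finalized=True)   # reported immediately
    tracker.start()
    tracker.on_transcription(finalized=False)  # reported once the timeout fires
    await asyncio.sleep(0.1)
    return tracker.reported


reported = asyncio.run(demo())
```

The timeout path exists so that a metric is still emitted when no finalization confirmation ever arrives for an utterance.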

async stop_ttfb_metrics(*, end_time: float | None = None)

Stop time-to-first-byte metrics collection and push results.

Parameters:

end_time – Optional timestamp to use as the end time. If None, uses the current time.