stt
NVIDIA Nemotron ASR STT service backed by an AWS SageMaker bidirectional-stream endpoint.
Uses SageMaker’s HTTP/2 bidi-stream API to maintain a persistent connection to the wrapper’s /invocations-bidirectional-stream endpoint, which proxies to NIM’s realtime WebSocket.
Audio is streamed as base64-encoded PCM16 chunks via input_audio_buffer.append events. Transcription deltas arrive as InterimTranscriptionFrames and final results as TranscriptionFrames.
When the VAD detects the user has stopped speaking, input_audio_buffer.commit is sent to trigger NIM to finalise the current utterance.
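The event flow described above can be sketched as follows. This is a minimal illustration, not the service's actual code: the event type names come from the text, but the JSON layout (in particular the "audio" field name) and the helper names build_append_event and build_commit_event are assumptions made for the example.

```python
import base64
import json
import struct


def build_append_event(pcm16: bytes) -> str:
    """Wrap raw PCM16 bytes in an input_audio_buffer.append event (assumed JSON shape)."""
    if len(pcm16) % 2 != 0:
        raise ValueError("PCM16 audio must contain an even number of bytes")
    return json.dumps(
        {
            "type": "input_audio_buffer.append",
            # The "audio" field name is an assumption, not confirmed by the source.
            "audio": base64.b64encode(pcm16).decode("ascii"),
        }
    )


def build_commit_event() -> str:
    """Build the input_audio_buffer.commit event that asks NIM to finalise the utterance."""
    return json.dumps({"type": "input_audio_buffer.commit"})


# Four 16-bit little-endian samples standing in for a real audio chunk.
samples = struct.pack("<4h", 0, 1000, -1000, 32767)
append_event = build_append_event(samples)
commit_event = build_commit_event()
```

Base64 keeps the binary audio safe to embed in a text event, at the cost of roughly one third more bytes on the wire.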
- class pipecat.services.nvidia.sagemaker.stt.NvidiaSageMakerSTTSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any]=<factory>, language: Language | str | None | _NotGiven = <factory>)[source]
Bases: STTSettings
Settings for NvidiaSageMakerSTTService.
- Parameters:
language – Language code passed to NIM: an ISO-639-1 language code, optionally with a region subtag (e.g. en-US).
- class pipecat.services.nvidia.sagemaker.stt.NvidiaSageMakerSTTService(*, endpoint_name: str, region: str = 'us-west-2', sample_rate: int | None = None, settings: NvidiaSageMakerSTTSettings | None = None, ttfs_p99_latency: float | None = 1.5, **kwargs)[source]
Bases: STTService
NVIDIA Nemotron ASR STT service using SageMaker bidirectional streaming.
Maintains a persistent HTTP/2 bidi-stream session to the SageMaker endpoint for the lifetime of the pipeline. Audio chunks are forwarded as base64-encoded PCM16 via NIM realtime events; transcription results arrive asynchronously and are pushed as InterimTranscriptionFrame and TranscriptionFrame frames.
Example:
    import os

    from pipecat.services.nvidia.sagemaker.stt import NvidiaSageMakerSTTService

    stt = NvidiaSageMakerSTTService(
        endpoint_name=os.getenv("SAGEMAKER_ASR_ENDPOINT_NAME"),
        region=os.getenv("AWS_REGION", "us-west-2"),
        settings=NvidiaSageMakerSTTService.Settings(
            language="en-US",
        ),
    )
- Settings
alias of NvidiaSageMakerSTTSettings
- __init__(*, endpoint_name: str, region: str = 'us-west-2', sample_rate: int | None = None, settings: NvidiaSageMakerSTTSettings | None = None, ttfs_p99_latency: float | None = 1.5, **kwargs)[source]
Initialize the SageMaker bidirectional-stream STT service.
- Parameters:
endpoint_name – Name of the deployed SageMaker endpoint.
region – AWS region where the endpoint lives.
sample_rate – Input sample rate in Hz. Defaults to the pipeline's sample rate.
settings – Runtime-updatable settings (language, model).
ttfs_p99_latency – Expected p99 time-to-first-segment latency in seconds.
**kwargs – Forwarded to STTService.
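One way to assemble these constructor arguments is from environment variables. The sketch below is illustrative only: SAGEMAKER_ASR_ENDPOINT_NAME and AWS_REGION appear in the class example, while ASR_SAMPLE_RATE and the helper stt_kwargs_from_env are hypothetical names introduced here.

```python
import os
from typing import Any, Dict, Mapping, Optional


def stt_kwargs_from_env(env: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Collect NvidiaSageMakerSTTService keyword arguments from environment variables."""
    env = os.environ if env is None else env

    endpoint_name = env.get("SAGEMAKER_ASR_ENDPOINT_NAME")
    if not endpoint_name:
        # endpoint_name is the only required argument, so fail early.
        raise ValueError("SAGEMAKER_ASR_ENDPOINT_NAME must be set")

    # ASR_SAMPLE_RATE is a hypothetical variable; leaving it unset keeps
    # sample_rate as None so the pipeline's rate is used.
    raw_rate = env.get("ASR_SAMPLE_RATE")

    return {
        "endpoint_name": endpoint_name,
        "region": env.get("AWS_REGION", "us-west-2"),
        "sample_rate": int(raw_rate) if raw_rate else None,
    }
```

The resulting dictionary could then be unpacked into the constructor, for example NvidiaSageMakerSTTService(**stt_kwargs_from_env()).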
- can_generate_metrics() → bool[source]
Check if this service can generate processing metrics.
- Returns:
True, as this service supports metrics generation.
- async start(frame: StartFrame)[source]
Start the STT service and connect to the SageMaker endpoint.
- Parameters:
frame – The start frame containing initialization parameters.
- async stop(frame: EndFrame)[source]
Stop the STT service and disconnect from the SageMaker endpoint.
- Parameters:
frame – The end frame.
- async cancel(frame: CancelFrame)[source]
Cancel the STT service and disconnect from the SageMaker endpoint.
- Parameters:
frame – The cancel frame.
- async run_stt(audio: bytes) → AsyncGenerator[Frame | None, None][source]
Send an audio chunk to NIM; transcription results arrive asynchronously.
Each chunk is appended and immediately committed, matching the NVIDIA reference client pattern for continuous streaming transcription.
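The send-now, receive-later pattern described for run_stt can be sketched with plain asyncio. Everything here is a stand-in: FakeStream, receive_loop, and the string events "append" and "commit" are hypothetical and only mimic the shape of the behaviour, not the service's real stream or frame types.

```python
import asyncio
from typing import AsyncGenerator, List, Optional


class FakeStream:
    """In-memory stand-in for the bidi stream; emits one transcript per commit."""

    def __init__(self) -> None:
        self.sent: List[str] = []
        self.responses: asyncio.Queue = asyncio.Queue()

    async def send(self, event: str) -> None:
        self.sent.append(event)
        if event == "commit":
            await self.responses.put(f"transcript #{self.sent.count('commit')}")


async def run_stt(stream: FakeStream, audio: bytes) -> AsyncGenerator[Optional[str], None]:
    """Append then commit the chunk; yield None because results arrive elsewhere."""
    await stream.send("append")
    await stream.send("commit")
    yield None


async def receive_loop(stream: FakeStream, out: List[str], expected: int) -> None:
    """Collect transcripts as they arrive, independently of run_stt."""
    while len(out) < expected:
        out.append(await stream.responses.get())


async def stt_demo() -> List[str]:
    stream = FakeStream()
    transcripts: List[str] = []
    receiver = asyncio.create_task(receive_loop(stream, transcripts, expected=2))
    for chunk in (b"\x00\x00", b"\x01\x00"):
        async for _ in run_stt(stream, chunk):
            pass  # nothing to handle here: run_stt yields only None
    await receiver
    return transcripts


result = asyncio.run(stt_demo())
```

Keeping the receive side in its own task means a slow transcription never blocks the next audio chunk from being sent.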
- async process_frame(frame: Frame, direction: FrameDirection)[source]
Process frames with VAD-specific handling for metrics lifecycle.
- Parameters:
frame – The frame to process.
direction – The direction of frame processing.
- async push_frame(frame: Frame, direction: FrameDirection = FrameDirection.DOWNSTREAM)
Push a frame downstream, tracking TranscriptionFrame timestamps for TTFB.
Stores the timestamp of each TranscriptionFrame for TTFB calculation. If the frame is marked as finalized (via request_finalize/confirm_finalize), reports TTFB immediately and cancels any pending timeout. Otherwise, TTFB is reported after a timeout.
- Parameters:
frame – The frame to push.
direction – The direction to push the frame.
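The TTFB reporting rule described for push_frame (report immediately for a finalized transcript, otherwise after a timeout) can be illustrated with a small standalone tracker. This is a sketch under assumptions, not Pipecat's implementation: the TTFBTracker class, its method names, and the timeout values are all hypothetical.

```python
import asyncio
import time
from typing import List, Optional


class TTFBTracker:
    """Report TTFB at once for finalized transcripts, otherwise after a timeout."""

    def __init__(self, timeout: float) -> None:
        self._timeout = timeout
        self._start: Optional[float] = None
        self._last_transcript_time: Optional[float] = None
        self._pending: Optional[asyncio.Task] = None
        self.reports: List[float] = []

    def start(self) -> None:
        self._start = time.monotonic()

    def on_transcription(self, finalized: bool) -> None:
        # Always remember when the latest transcript arrived.
        self._last_transcript_time = time.monotonic()
        if finalized:
            self._cancel_pending()
            self._report()
        elif self._pending is None:
            self._pending = asyncio.create_task(self._report_later())

    async def _report_later(self) -> None:
        await asyncio.sleep(self._timeout)
        self._pending = None
        self._report()

    def _cancel_pending(self) -> None:
        if self._pending is not None:
            self._pending.cancel()
            self._pending = None

    def _report(self) -> None:
        if self._start is not None and self._last_transcript_time is not None:
            self.reports.append(self._last_transcript_time - self._start)
            self._start = None  # report at most once per measurement


async def ttfb_demo() -> List[float]:
    tracker = TTFBTracker(timeout=0.05)
    tracker.start()
    tracker.on_transcription(finalized=False)  # schedules a delayed report
    tracker.on_transcription(finalized=True)   # reports now and cancels the timer
    await asyncio.sleep(0.1)                   # the cancelled timer must not fire
    return tracker.reports


reports = asyncio.run(ttfb_demo())
```

The timeout path exists because a non-finalized transcript may still be followed by more text, so the tracker waits before treating the last timestamp as the end of the measurement.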
- async stop_ttfb_metrics(*, end_time: float | None = None)
Stop time-to-first-byte metrics collection and push results.
- Parameters:
end_time – Optional timestamp to use as the end time. If None, uses the current time.