stt
Moonshine speech-to-text service with locally-downloaded ONNX models.
Moonshine is a small, fast ASR family that runs on the CPU via ONNX Runtime – no GPU and no API key. This module transcribes audio segments with a locally-downloaded Moonshine model (downloaded once on first use and cached).
- class pipecat.services.moonshine.stt.Model(*values)[source]
Bases: StrEnum
Well-known Moonshine model architectures.
Pass a member (or the equivalent string) as MoonshineSTTService.Settings’s model. The larger models (SMALL_STREAMING, MEDIUM_STREAMING) ship only in streaming form, but transcribe a whole segment in batch just the same.
- Parameters:
TINY – Smallest and fastest, lowest accuracy.
BASE – Good size/accuracy balance.
TINY_STREAMING – Streaming-capable tiny.
BASE_STREAMING – Streaming-capable base (not available for every language).
SMALL_STREAMING – Larger and more accurate than base (the default).
MEDIUM_STREAMING – Largest, most accurate.
- TINY = 'tiny'
- BASE = 'base'
- TINY_STREAMING = 'tiny-streaming'
- BASE_STREAMING = 'base-streaming'
- SMALL_STREAMING = 'small-streaming'
- MEDIUM_STREAMING = 'medium-streaming'
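Because Model is a string-valued enum, each member compares equal to its string value, which is why a member and its plain string are interchangeable in the settings. The sketch below illustrates this with a local stand-in enum that mirrors the values listed above; it is not the pipecat class itself, and resolve_model is a hypothetical helper introduced only for illustration.

```python
from enum import Enum


class Model(str, Enum):
    """Local stand-in mirroring the documented Model values (not the pipecat class)."""

    TINY = "tiny"
    BASE = "base"
    TINY_STREAMING = "tiny-streaming"
    BASE_STREAMING = "base-streaming"
    SMALL_STREAMING = "small-streaming"
    MEDIUM_STREAMING = "medium-streaming"


def resolve_model(value):
    """Hypothetical helper: accept a member or its string form, return the member."""
    return Model(value)  # raises ValueError for an unknown name


# Members whose value marks them as streaming-capable.
streaming_capable = [m.value for m in Model if m.value.endswith("-streaming")]
```

Looking a value up through the enum also gives early validation: a misspelled model name fails with ValueError before any model download is attempted.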
- class pipecat.services.moonshine.stt.MoonshineSTTSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any]=<factory>, language: Language | str | None | _NotGiven = <factory>)[source]
Bases: STTSettings
Settings for MoonshineSTTService.
- Parameters:
model – Moonshine model architecture, as a Model or the equivalent string (e.g. Model.SMALL_STREAMING or "small-streaming"). Defaults to Model.SMALL_STREAMING.
language – Language for transcription. Moonshine supports a handful of languages (English, Spanish, …); the base code is used.
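The remark that "the base code is used" suggests that any region subtag is dropped before the language reaches the model. A minimal sketch of that reduction follows. It assumes tags written like en-US or es_MX, and base_language_code is a hypothetical helper, not part of pipecat.

```python
def base_language_code(tag: str) -> str:
    """Return the primary subtag of a language tag, lowercased.

    Hypothetical helper for illustration; pipecat may normalise differently.
    """
    normalized = tag.strip().replace("_", "-")
    return normalized.split("-", 1)[0].lower()
```

Under this assumption, en-US and en-GB would both select the same English model.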
- class pipecat.services.moonshine.stt.MoonshineSTTService(*, settings: MoonshineSTTSettings | None = None, **kwargs)[source]
Bases: SegmentedSTTService
Transcribe audio with a locally-downloaded Moonshine ONNX model.
Runs on the CPU via ONNX Runtime, so it needs no GPU and no API key. The model downloads once on first use and is cached. Each VAD-segmented utterance is transcribed in a single batch call (transcribe_without_streaming); any model works, including the streaming-capable ones. Audio is expected as 16-bit mono PCM at 16 kHz.
- Settings
alias of MoonshineSTTSettings
- __init__(*, settings: MoonshineSTTSettings | None = None, **kwargs)[source]
Initialize the Moonshine STT service.
- Parameters:
settings – Runtime-updatable settings (model, language).
**kwargs – Additional arguments passed to SegmentedSTTService.
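Constructing the service might look like the sketch below, which uses only the names and keyword arguments shown in the signatures above. The imports sit inside the function so that the snippet can be loaded where pipecat and its Moonshine dependencies are not installed. build_moonshine_stt is a hypothetical wrapper, and passing the language as the plain string "en" is an assumption based on the str type in the settings signature.

```python
def build_moonshine_stt():
    """Hypothetical wrapper: build a MoonshineSTTService with explicit settings."""
    # Local import: requires pipecat with its Moonshine dependencies installed.
    from pipecat.services.moonshine.stt import (
        Model,
        MoonshineSTTService,
        MoonshineSTTSettings,
    )

    settings = MoonshineSTTSettings(
        model=Model.BASE,  # or the equivalent string "base"
        language="en",     # assumption: a base-code string is accepted as-is
    )
    return MoonshineSTTService(settings=settings)
```

Calling the constructor with no settings at all should fall back to the documented default model, Model.SMALL_STREAMING.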
- can_generate_metrics() -> bool[source]
Indicate whether this service can generate metrics.
- Returns:
True, as this service supports metric generation.
- async run_stt(audio: bytes) -> AsyncGenerator[Frame, None][source]
Transcribe audio data using Moonshine.
- Parameters:
audio – Raw 16-bit signed PCM mono audio at 16 kHz.
- Yields:
Frame – A TranscriptionFrame with the transcribed text.
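run_stt expects raw 16-bit signed mono PCM at 16 kHz. If audio is held as float samples in the range [-1.0, 1.0], a conversion along these lines produces bytes in that layout. floats_to_pcm16 is a hypothetical helper, little-endian byte order is an assumption, and the sketch does no resampling, so the input must already be at 16 kHz.

```python
import math
import struct

SAMPLE_RATE = 16_000  # Hz, the rate documented for run_stt


def floats_to_pcm16(samples):
    """Pack float samples in [-1.0, 1.0] as little-endian signed 16-bit PCM."""
    # Clip out-of-range values so they saturate instead of overflowing.
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)


# One second of a 440 Hz tone as a test signal (mono).
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
pcm = floats_to_pcm16(tone)
```

Each sample occupies two bytes, so one second of mono audio at 16 kHz is 32,000 bytes.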
- async push_frame(frame: Frame, direction: FrameDirection = FrameDirection.DOWNSTREAM)
Push a frame, marking TranscriptionFrames as finalized.
Segmented STT services process complete speech segments and return a single TranscriptionFrame per segment, so every transcription is inherently finalized.
- Parameters:
frame – The frame to push.
direction – The direction of frame flow in the pipeline.
- async stop_ttfb_metrics(*, end_time: float | None = None)
Stop time-to-first-byte metrics collection and push results.
- Parameters:
end_time – Optional timestamp to use as the end time. If None, uses the current time.
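The end_time behaviour described above (use the supplied timestamp, otherwise the current time) can be illustrated with a small stand-in timer. TTFBTimer and its start_time parameter are hypothetical and are not pipecat's metrics implementation; the sketch only shows how an optional end timestamp feeds the elapsed-time calculation.

```python
import time


class TTFBTimer:
    """Illustrative stand-in for a time-to-first-byte timer (not pipecat's)."""

    def __init__(self):
        self._start = None

    def start(self, *, start_time=None):
        # Use the supplied timestamp, or the current time when none is given.
        self._start = time.time() if start_time is None else start_time

    def stop(self, *, end_time=None):
        """Return elapsed seconds, or None if the timer was never started."""
        if self._start is None:
            return None
        end = time.time() if end_time is None else end_time
        elapsed = end - self._start
        self._start = None  # a second stop() without start() reports nothing
        return elapsed
```

Accepting an explicit end time lets a caller record the moment a result actually arrived, rather than the later moment at which the metric happens to be reported.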