stt

Moonshine speech-to-text service with locally-downloaded ONNX models.

Moonshine is a small, fast ASR model family that runs on the CPU via ONNX Runtime, with no GPU and no API key required. This module transcribes audio segments with a Moonshine model that is downloaded on first use and cached locally.

class pipecat.services.moonshine.stt.Model(*values)[source]

Bases: StrEnum

Well-known Moonshine model architectures.

Pass a member (or the equivalent string) as the model field of MoonshineSTTService.Settings. The larger models (SMALL_STREAMING, MEDIUM_STREAMING) ship only in streaming form, but they can still transcribe a whole segment in a single batch call.

Parameters:
  • TINY – Smallest and fastest, lowest accuracy.

  • BASE – Good size/accuracy balance.

  • TINY_STREAMING – Streaming-capable tiny.

  • BASE_STREAMING – Streaming-capable base (not available for every language).

  • SMALL_STREAMING – Larger and more accurate than base (the default).

  • MEDIUM_STREAMING – Largest, most accurate.

TINY = 'tiny'
BASE = 'base'
TINY_STREAMING = 'tiny-streaming'
BASE_STREAMING = 'base-streaming'
SMALL_STREAMING = 'small-streaming'
MEDIUM_STREAMING = 'medium-streaming'
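Because Model derives from StrEnum, each member is also a plain string, which is why a member and its string value are interchangeable in settings. The sketch below illustrates this with a local stand-in enum that copies the values listed above instead of importing pipecat. The fallback for Python versions before 3.11 is an addition for this example only.

```python
try:
    from enum import StrEnum  # available in Python 3.11+
except ImportError:  # minimal fallback for older interpreters
    from enum import Enum

    class StrEnum(str, Enum):
        def __str__(self) -> str:
            return str(self.value)


class Model(StrEnum):
    """Local stand-in mirroring the documented values (not the pipecat class)."""

    TINY = "tiny"
    BASE = "base"
    TINY_STREAMING = "tiny-streaming"
    BASE_STREAMING = "base-streaming"
    SMALL_STREAMING = "small-streaming"
    MEDIUM_STREAMING = "medium-streaming"


# A member compares equal to its string value...
member_equals_string = Model.SMALL_STREAMING == "small-streaming"

# ...and a string can be turned back into a member by value lookup.
looked_up = Model("base-streaming")

# str() of a member is the bare value, not "Model.BASE".
as_text = str(Model.BASE)
```

An unknown string such as Model("huge") raises ValueError, so validating user input through the enum catches misspelled model names early.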
class pipecat.services.moonshine.stt.MoonshineSTTSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any] = <factory>, language: Language | str | None | _NotGiven = <factory>)[source]

Bases: STTSettings

Settings for MoonshineSTTService.

Parameters:
  • model – Moonshine model architecture, as a Model or the equivalent string (e.g. Model.SMALL_STREAMING or "small-streaming"). Defaults to Model.SMALL_STREAMING.

  • language – Language for transcription. Moonshine supports a handful of languages (English, Spanish, …); only the base language code is used.
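One plausible reading of "the base code is used" is that a regional tag such as en-US is reduced to its primary subtag (en) before a model is selected. The helper below is a hypothetical illustration of that reduction, not pipecat's implementation; the name base_language_code is invented for this sketch.

```python
def base_language_code(tag: str) -> str:
    """Return the primary subtag of a language tag, lowercased.

    Hypothetical helper for illustration only; not part of pipecat.
    """
    # Accept both "en-US" and "en_US" spellings.
    primary = tag.replace("_", "-").split("-", 1)[0]
    return primary.strip().lower()


english = base_language_code("en-US")
spanish = base_language_code("es")
```

Under this assumption, passing a regional variant and passing the bare language would select the same Moonshine model.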

class pipecat.services.moonshine.stt.MoonshineSTTService(*, settings: MoonshineSTTSettings | None = None, **kwargs)[source]

Bases: SegmentedSTTService

Transcribe audio with a locally-downloaded Moonshine ONNX model.

Runs on the CPU via ONNX Runtime, so it needs no GPU and no API key. The model downloads once on first use and is cached. Each VAD-segmented utterance is transcribed in a single batch call (transcribe_without_streaming); any model works, including the streaming-capable ones. Audio is expected as 16-bit mono PCM at 16 kHz.

Settings

alias of MoonshineSTTSettings

__init__(*, settings: MoonshineSTTSettings | None = None, **kwargs)[source]

Initialize the Moonshine STT service.

Parameters:
  • settings – Runtime-updatable settings (model, language).

  • **kwargs – Additional arguments passed to SegmentedSTTService.

can_generate_metrics() → bool[source]

Indicate whether this service can generate metrics.

Returns:

True, as this service supports metric generation.

async run_stt(audio: bytes) → AsyncGenerator[Frame, None][source]

Transcribe audio data using Moonshine.

Parameters:

audio – Raw 16-bit signed PCM mono audio at 16 kHz.

Yields:

Frame – A TranscriptionFrame with the transcribed text.
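Since run_stt receives raw bytes, audio from another source has to match the expected layout: signed 16-bit samples, one channel, 16,000 samples per second. The sketch below uses only the standard library to show how such a buffer relates to duration and to normalized float samples. The function names are illustrative, little-endian byte order is assumed, and whether the service normalizes audio this way internally is also an assumption.

```python
import array
import sys

SAMPLE_RATE = 16_000   # Hz, as stated for run_stt
BYTES_PER_SAMPLE = 2   # 16-bit signed PCM, mono


def pcm16_duration_seconds(audio: bytes) -> float:
    """Duration of a mono 16-bit PCM buffer at 16 kHz."""
    return len(audio) / (SAMPLE_RATE * BYTES_PER_SAMPLE)


def pcm16_to_floats(audio: bytes) -> list[float]:
    """Decode little-endian 16-bit PCM into floats in [-1.0, 1.0)."""
    samples = array.array("h")  # "h" = signed 16-bit integer
    samples.frombytes(audio)    # raises ValueError on a trailing odd byte
    if sys.byteorder == "big":
        samples.byteswap()      # the buffer is assumed to be little-endian
    return [s / 32768.0 for s in samples]


# One second of silence occupies 32,000 bytes.
one_second = bytes(SAMPLE_RATE * BYTES_PER_SAMPLE)
```

Audio captured at a different sample rate or with two channels would need resampling or downmixing before it matches this layout.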

async push_frame(frame: Frame, direction: FrameDirection = FrameDirection.DOWNSTREAM)

Push a frame, marking TranscriptionFrames as finalized.

Segmented STT services process complete speech segments and return a single TranscriptionFrame per segment, so every transcription is inherently finalized.

Parameters:
  • frame – The frame to push.

  • direction – The direction of frame flow in the pipeline.

async stop_ttfb_metrics(*, end_time: float | None = None)

Stop time-to-first-byte metrics collection and push results.

Parameters:

end_time – Optional timestamp to use as the end time. If None, uses the current time.
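The end_time fallback described above can be pictured with a small timer. This is a hypothetical stand-in, not pipecat's metrics code: the class name TTFBTimer, its start method, and the use of time.time() as the clock are all assumptions made for the sketch.

```python
from __future__ import annotations

import time


class TTFBTimer:
    """Hypothetical stand-in for a time-to-first-byte measurement."""

    def __init__(self) -> None:
        self._start: float | None = None

    def start(self, start_time: float | None = None) -> None:
        # Use the caller's timestamp if given, otherwise the current time.
        self._start = time.time() if start_time is None else start_time

    def stop(self, *, end_time: float | None = None) -> float | None:
        if self._start is None:
            return None  # nothing is being measured
        # Same fallback as documented: None means "now".
        end = time.time() if end_time is None else end_time
        elapsed = end - self._start
        self._start = None  # a measurement is reported only once
        return elapsed


timer = TTFBTimer()
timer.start(start_time=10.0)
elapsed = timer.stop(end_time=10.25)
```

Passing an explicit end_time lets a caller report the moment a result actually became available, rather than the later moment at which the stop call happens to run.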