aggregated_frame_sequencer

Ordered sequencer for AggregatedTextFrame slots through TTS processing.

class pipecat.utils.context.aggregated_frame_sequencer.AggregatedFrameSequencer(name: str = 'AggregatedFrameSequencer', streaming: bool = False)

Bases: object

Sequences AggregatedTextFrame slots to preserve TTS context ordering.

Manages an ordered queue of spoken and skipped TTS slots. Spoken slots are tracked via a WordCompletionTracker; skipped slots (e.g. code blocks excluded from TTS synthesis) wait in-place until all preceding spoken slots are complete, then are flushed downstream with append_to_context=True.

Most methods are synchronous and return lists of frames the caller should push downstream, which makes the sequencer easy to test. The exceptions are register_spoken(), register_skipped(), and finalize(). These are async because, when the sequencer is built with streaming=True, they drive an async _ParallelSentenceAggregator that groups streamed tokens into sentences.

Example:

sequencer = AggregatedFrameSequencer()
await sequencer.register_spoken(frame, ctx_id, tts_text, append_to_context=True)
for f in sequencer.process_word("hello", pts=1000, context_id=ctx_id):
    await self.push_frame(f)

__init__(name: str = 'AggregatedFrameSequencer', streaming: bool = False)

Initialize the sequencer.

Parameters:

name – Label used in log messages (typically the owning TTS service name).
streaming – True when tokens are dispatched to the TTS individually (TextAggregationMode.TOKEN). Each register_spoken() call then represents one token rather than a complete unit, so tokens are fed to a _ParallelSentenceAggregator and only turned into a real slot once a sentence boundary is detected (or forced via register_skipped()/finalize()). Fixed for the life of the sequencer, because a TTS service’s aggregation mode never changes at runtime.

Requires the owning TTS service to reuse one context ID for the whole turn (reuse_context_id_within_turn=True, the default): a promoted sentence built from several tokens is registered under a single context ID, and word-timestamp events for all of its tokens must arrive tagged with that same ID. A per-token context ID would leave every token but the last unmatched, and its words dropped as stale.

async register_spoken(frame: AggregatedTextFrame, context_id: str, tts_text: str, append_to_context: bool, build_tracker: bool = True, includes_inter_frame_spaces: bool = False) → list[Frame]

Register a spoken AggregatedTextFrame slot.

Called from _push_tts_frames for every frame sent to the TTS service (one call per token when streaming=True, one call per complete unit otherwise). Builds the WordCompletionTracker internally when build_tracker is set — callers never construct one themselves. A registered slot is marked complete either via process_word() (word-timestamp services) or complete_spoken_slot() (push_text_frames=True services).

When the sequencer is non-streaming, or streaming without a tracker (push_text_frames=True providers), this registers a slot immediately. When streaming with a tracker, the call instead feeds this token to the _ParallelSentenceAggregator and only registers a real slot once a sentence boundary is confirmed there.

Parameters:

frame – The AggregatedTextFrame being spoken (one token when streaming).
context_id – The TTS context ID assigned to this frame.
tts_text – The text actually sent to the TTS for this call (may differ from frame.text after filters/transforms).
append_to_context – Whether word frames built for this context should carry append_to_context=True.
build_tracker – Whether to track word completion at all. False for push_text_frames=True services, which complete via complete_spoken_slot instead of word-timestamp matching.
includes_inter_frame_spaces – When True, every TTSTextFrame emitted for this slot carries includes_inter_frame_spaces=True so downstream consumers do not inject extra spaces between consecutive frames. Not used on the streaming path — there, CJK spacing is driven solely by process_word()’s per-call flag.

Returns:

Frames unblocked by this call (buffered words replayed once a pending sentence promotes). Always empty for the non-streaming and no-tracker cases.
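The streaming path can be pictured with a small stand-alone model. This is only a sketch: ToySentenceBuffer, feed() and its end-of-punctuation rule are hypothetical simplifications, not the library's _ParallelSentenceAggregator, whose actual boundary detection is not described in this reference. It shows tokens accumulating until a sentence boundary promotes them into a slot, and a forced promotion for trailing text without terminal punctuation.

```python
class ToySentenceBuffer:
    """Hypothetical stand-in for the sentence aggregator (illustration only)."""

    def __init__(self) -> None:
        self.pending = ""            # tokens received since the last promotion
        self.slots: list[str] = []   # sentences promoted into "real" slots

    def feed(self, token: str) -> bool:
        """Buffer one token; promote if the buffer now ends a sentence."""
        self.pending += token
        # Assumption: a sentence ends at '.', '!' or '?'. The real rules may differ.
        if self.pending.rstrip().endswith((".", "!", "?")):
            return self._promote()
        return False

    def finalize(self) -> bool:
        """Force-promote whatever is pending (end of turn, or a skipped frame)."""
        return self._promote()

    def _promote(self) -> bool:
        text = self.pending.strip()
        self.pending = ""
        if not text:
            return False  # nothing pending: no-op
        self.slots.append(text)
        return True


buf = ToySentenceBuffer()
promoted = [buf.feed(t) for t in ["Hello", " there.", " How", " are", " you"]]
slots_before_finalize = list(buf.slots)
buf.finalize()  # the response ended without terminal punctuation
```

In this model only the second token triggers a promotion; the trailing "How are you" becomes a slot only when finalize() is called, which mirrors the role finalize() and register_skipped() play for the real sequencer.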

async register_skipped(frame: AggregatedTextFrame, context_id: str, transport_destination: str | None) → list[Frame]

Register a skipped AggregatedTextFrame and attempt an immediate flush.

Any sentence still pending in the parallel aggregator is finalized first, so a real spoken slot exists immediately before the skipped slot in the queue — flush()’s “stop at first incomplete spoken slot” logic then blocks this skipped frame correctly until that sentence is actually spoken.

The frame is appended as a skipped slot. If no incomplete spoken slot precedes it, the frame is returned right away; otherwise it waits until a later flush() unblocks it.

Parameters:

frame – The skipped AggregatedTextFrame (e.g. a code block).
context_id – The context ID assigned in _push_tts_frames.
transport_destination – Transport routing value to attach at flush time.

Returns:

Frames to push downstream: any sentence promoted by the initial finalize() (streaming mode), followed by this skipped frame once it is unblocked. The skipped frame itself is absent while a preceding spoken slot is still incomplete. The promoted-sentence frame can still be returned in that case, so the list is not necessarily empty when blocked.

async finalize() → list[Frame]

Force-promote any still-pending sentence into a real slot.

Called at true end-of-turn (no more tokens are coming), to handle a response that ends with no terminal punctuation. A no-op when nothing is pending (or the sequencer is not streaming).

Returns:

Frames unblocked by finalizing (e.g. buffered words that can now be replayed against the newly-registered slot).

process_word(word: str, pts: int, context_id: str | None, includes_inter_frame_spaces: bool = False) → list[Frame]

Process one word-timestamp event and return frames to push downstream.

Locates the active (first incomplete spoken) slot with a tracker, advances it by the incoming word, and builds a TTSTextFrame. Handles:

Words from a context that was never registered or was wiped by clear() on interruption: dropped as stale (returns an empty list).
Normal words that fit entirely within the active slot.
Overflow words straddling the boundary between two slots.
Force-complete when the TTS drops an event (word belongs to the next slot).
Passthrough for words not recognised by any slot (buffered instead, when streaming, since the slot they belong to may simply not be promoted yet).
Flushes any skipped slots unblocked by slot completion.

Parameters:

word – A word token from the TTS service word-timestamp stream.
pts – Presentation timestamp (nanoseconds) to assign to the frame.
context_id – TTS context ID from the word-timestamp event.
includes_inter_frame_spaces – Stamped onto the emitted TTSTextFrame so downstream consumers know not to inject extra spaces between frames.

Returns:

Ordered list of frames (TTSTextFrame and/or AggregatedTextFrame) to push.
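Three of the cases listed above (stale words, normal words, and force-complete after a dropped event) can be sketched with a toy model. Everything here is hypothetical: ToyTracker and this process_word() function are illustrative stand-ins that match whole words, and the library's WordCompletionTracker may match text differently. Overflow words and streaming-mode buffering are left out.

```python
from dataclasses import dataclass


@dataclass
class ToyTracker:
    """Hypothetical stand-in for a spoken slot: the words it still expects."""
    context_id: str
    remaining: list[str]

    @property
    def complete(self) -> bool:
        return not self.remaining


def process_word(slots: list[ToyTracker], word: str, context_id: str) -> list[str]:
    """Return the text to emit for one word-timestamp event (toy version)."""
    if context_id not in {s.context_id for s in slots}:
        return []  # stale: context never registered, or already cleared
    pending = [s for s in slots if not s.complete]
    if not pending:
        return [word]  # no slot recognises the word: pass it through
    active = pending[0]
    emitted: list[str] = []
    word_is_next_slots = len(pending) > 1 and pending[1].remaining[0] == word
    if active.remaining[0] != word and word_is_next_slots:
        # The TTS dropped an event: emit the leftover text and move on.
        emitted.append(" ".join(active.remaining))
        active.remaining.clear()
        active = pending[1]
    if active.remaining and active.remaining[0] == word:
        active.remaining.pop(0)  # normal case: advance the active slot
    emitted.append(word)
    return emitted


slots = [
    ToyTracker("ctx", ["hello", "world"]),
    ToyTracker("ctx", ["good", "bye"]),
]
stale = process_word(slots, "hi", "other-ctx")
normal = process_word(slots, "hello", "ctx")
forced = process_word(slots, "good", "ctx")  # "world" never arrived
last = process_word(slots, "bye", "ctx")
```

In the third call the first slot is force-completed with its leftover text before the incoming word is credited to the second slot, so downstream ordering is preserved even though the provider skipped an event.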

complete_spoken_slot() → list[Frame]

Mark the first pending spoken slot complete and flush unblocked skipped frames.

Used by push_text_frames=True services: after the TTSTextFrame has been appended to the audio context, this marks the spoken slot done and releases any skipped frames waiting behind it.

Returns:

AggregatedTextFrame(s) that are now unblocked and should be pushed.

flush(last_word_pts: int | None = None) → list[Frame]

Walk the slot queue and return all skipped frames that are now unblocked.

Removes complete spoken slots from the head of the queue, then emits (and removes) skipped slots whose preceding spoken slots are all done. Stops at the first incomplete spoken slot.

Parameters:

last_word_pts – When provided, skipped frames receive this PTS so they appear immediately after the last spoken word in the timeline.

Returns:

AggregatedTextFrame(s) ready to be pushed downstream.
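The queue walk described for flush() can be illustrated with a minimal sketch. The Slot dataclass and this free-standing flush() function are hypothetical and exist only for this example; the sequencer's internal slot type is not shown in this reference.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Slot:
    """Hypothetical queue entry (illustration only)."""
    text: str
    spoken: bool                 # True for a spoken slot, False for a skipped one
    complete: bool = False       # only meaningful for spoken slots
    pts: Optional[int] = None    # assigned to skipped slots at flush time


def flush(queue: list[Slot], last_word_pts: Optional[int] = None) -> list[Slot]:
    """Drop finished spoken slots and release skipped slots until blocked."""
    released: list[Slot] = []
    while queue:
        head = queue[0]
        if head.spoken:
            if not head.complete:
                break  # first incomplete spoken slot blocks everything behind it
            queue.pop(0)  # complete spoken slot: remove it, nothing to emit
        else:
            queue.pop(0)
            if last_word_pts is not None:
                head.pts = last_word_pts  # place it right after the last word
            released.append(head)
    return released


queue = [
    Slot("Here is the code.", spoken=True, complete=True),
    Slot("print('hi')", spoken=False),
    Slot("That is all.", spoken=True),
    Slot("x = 1", spoken=False),
]
first = flush(queue, last_word_pts=1000)   # releases only the first code block
queue[0].complete = True                   # "That is all." finishes later
second = flush(queue, last_word_pts=2000)  # now the second block is released
```

The second skipped slot stays queued until the spoken slot ahead of it completes, which is the ordering guarantee the sequencer exists to provide.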

force_complete(last_word_pts: int) → list[Frame]

Force-complete all incomplete spoken slots and flush skipped frames.

Called at the end of an audio context to handle TTS providers that silently drop word-timestamp events. Emits a TTSTextFrame for any remaining unspoken text in each incomplete slot, marks it complete, then flushes all now-unblocked skipped frames.

Parameters:

last_word_pts – PTS of the last received word frame, used as the PTS for force-completed frames and forwarded to flush().

Returns:

Combined list of TTSTextFrames (for incomplete spoken slots) and AggregatedTextFrames (skipped slots now unblocked), in emission order.
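As a rough illustration of force-completion, the sketch below emits whatever text each incomplete slot has left and stamps it with the last known PTS. SpokenSlot, its spoken_chars counter, and the (text, pts) tuples are hypothetical; the real method returns TTSTextFrame objects and also flushes skipped slots, which is omitted here.

```python
from dataclasses import dataclass


@dataclass
class SpokenSlot:
    """Hypothetical spoken slot: full text plus how much has been confirmed."""
    text: str
    spoken_chars: int = 0


def force_complete(slots: list[SpokenSlot], last_word_pts: int) -> list[tuple[str, int]]:
    """Emit the unspoken remainder of every incomplete slot, then mark it done."""
    frames: list[tuple[str, int]] = []
    for slot in slots:
        rest = slot.text[slot.spoken_chars:].strip()
        if rest:
            frames.append((rest, last_word_pts))  # reuse the last word's PTS
        slot.spoken_chars = len(slot.text)        # slot is now complete
    return frames


slots = [
    SpokenSlot("Hello world", spoken_chars=5),  # "world" was never reported
    SpokenSlot("Done", spoken_chars=4),         # already complete
    SpokenSlot("Bye now"),                      # no word events at all
]
frames = force_complete(slots, last_word_pts=5000)
```

Already-complete slots contribute nothing, so calling this at the end of every audio context is harmless when the provider reported every word.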

clear() → None

Clear all slots and context metadata (called on interruption/reset).