word_completion_tracker

Word completion tracker for TTS context ordering.

class pipecat.utils.context.word_completion_tracker.WordCompletionTracker(tts_text: str, llm_text: str | None = None, user_facing_text: str | None = None)

Bases: object

Tracks whether all words from a source AggregatedTextFrame have been spoken.

Delegates completion tracking and cursor advancement to a TextSegmentMap built from tts_text (which may include TTS-specific SSML tags, e.g. <spell>...</spell> returned by some TTS providers in word-timestamp events). The map matches each incoming word against the remaining TTS text and reports when the frame is fully spoken, and it is robust to punctuation, spacing, and markup. This tracker’s own bookkeeping is limited to overriding cursors when a slot is force-completed (see add_word_and_check_complete).
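The matching idea can be illustrated with a minimal, self-contained sketch. CursorSketch and speakable are hypothetical names invented for this illustration; they are not the real TextSegmentMap, whose internals are not described here. The sketch assumes tags look like <...> and that only alphanumeric characters count as spoken content. It does not handle words that overshoot the frame.

```python
import re

TAG = re.compile(r"<[^>]*>")


def speakable(text: str) -> str:
    """Drop tags, punctuation, and spacing; lowercase what is left."""
    return "".join(ch.lower() for ch in TAG.sub("", text) if ch.isalnum())


class CursorSketch:
    """Hypothetical stand-in for the matching idea, not pipecat's TextSegmentMap."""

    def __init__(self, tts_text: str) -> None:
        self.text = tts_text
        self.pos = 0  # index of the first unconsumed character

    def is_complete(self) -> bool:
        # Complete once nothing speakable is left after the cursor.
        return speakable(self.text[self.pos:]) == ""

    def add_word(self, word: str) -> bool:
        target = speakable(word)
        i, matched = self.pos, 0
        while i < len(self.text) and matched < len(target):
            ch = self.text[i]
            if ch == "<":
                # Skip a whole tag so its letters are never matched as speech.
                close = self.text.find(">", i)
                i = len(self.text) if close == -1 else close + 1
                continue
            if ch.isalnum():
                if ch.lower() != target[matched]:
                    return self.is_complete()  # mismatch: leave the cursor alone
                matched += 1
            i += 1
        if matched == len(target):
            self.pos = i  # advance only when the whole word was found
        return self.is_complete()
```

With this sketch, CursorSketch("Hello, world!") reports False after "Hello" and True after "world", because the trailing "!" carries no speakable content.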

When llm_text is provided (e.g. the original pattern-matched text including delimiters like <card>4111 1111 1111 1111</card>), the tracker additionally maps each spoken word back to its corresponding span in that LLM text. This lets callers attach the original text to TTSTextFrame entries so the conversation context receives properly-tagged content rather than the cleaned words received from the TTS provider.

For unchanged segments (no text transforms applied) both cursors advance proportionally word-by-word; for transformed segments (e.g. "$42.50" → "forty two dollars and fifty cents") both cursors are held until the entire TTS segment is consumed, then jump to the end of the original span in one step.
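The two cursor behaviours can be sketched as follows. Segment and attribute_words are hypothetical names used only for illustration, and the sketch assumes whitespace-separated words; the real tracker works on character cursors rather than word lists.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Segment:
    original: str  # text as the LLM produced it
    spoken: str    # text as sent to TTS


def attribute_words(segments: List[Segment]) -> List[Tuple[str, Optional[str]]]:
    """Pair each spoken word with the original span it releases, if any."""
    result: List[Tuple[str, Optional[str]]] = []
    for segment in segments:
        words = segment.spoken.split()
        if segment.original == segment.spoken:
            # Unchanged segment: every word releases its own text.
            result.extend((word, word) for word in words)
            continue
        # Transformed segment: hold the span until the last spoken word.
        for index, word in enumerate(words):
            is_last = index == len(words) - 1
            result.append((word, segment.original if is_last else None))
    return result


pairs = attribute_words([
    Segment("That costs", "That costs"),
    Segment("$42.50", "forty two dollars and fifty cents"),
])
```

In pairs, "That" and "costs" each carry their own text, the words "forty" through "fifty" carry None, and only "cents" carries "$42.50".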

Background: TTS providers apply their own SSML tags to the text before synthesis and return word-timestamp events containing the raw spoken words (e.g. "4111", "1111"). Without LLM-text tracking, the conversation context would only see those cleaned words and lose the original structure (e.g. <card>4111 1111 1111 1111</card>). By mapping consumed spans back to positions in llm_text, each TTSTextFrame can carry the exact span of original text it represents.

Overflow handling: TTS providers sometimes return a single word token that spans the boundary between two AggregatedTextFrames (e.g. "1111</spell>And" when one frame ends with 1111</card> and the next begins with And). The tracker detects this and exposes the raw overflow suffix via get_overflow_word(), so callers can feed the remainder into the next frame’s tracker and emit a correctly-attributed TTSTextFrame for each part.
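A rough sketch of splitting a straddling token is shown below. split_straddling_word is a hypothetical helper, not part of the tracker's API. It assumes the token really does start with the frame's remaining content and simply counts alphanumeric characters instead of comparing them.

```python
from typing import Optional, Tuple


def split_straddling_word(remaining: str, word: str) -> Tuple[str, Optional[str]]:
    """Split word into (part for this frame, raw overflow suffix or None)."""
    # Number of speakable characters this frame still expects.
    budget = sum(ch.isalnum() for ch in remaining)
    used = 0
    for index, ch in enumerate(word):
        if not ch.isalnum():
            continue
        if used == budget:
            # The frame is exhausted: everything from here on is overflow.
            # The suffix is returned raw, so casing and symbols survive.
            return word[:index].rstrip(), word[index:]
        used += 1
    return word, None
```

For example, split_straddling_word("1111", "1111 And") yields ("1111", "And"), and the suffix "And" would then be fed to the next frame's tracker.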

Example:

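from pipecat.utils.context.word_completion_tracker import WordCompletionTracker

tracker = WordCompletionTracker("Hello, world!")
tracker.add_word_and_check_complete("Hello")   # False: "world" is still unspoken
tracker.add_word_and_check_complete("world")   # True: all TTS text consumed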

__init__(tts_text: str, llm_text: str | None = None, user_facing_text: str | None = None)

Initialize the tracker with the text of the frame being spoken.

Parameters:

tts_text – Full text of the AggregatedTextFrame sent to TTS (may include TTS-specific SSML tags). Used as the cursor reference for the TTS word stream.
llm_text – Original LLM-produced text including pattern delimiters (e.g. <card>4111 1111 1111 1111</card>). When provided, each add_word_and_check_complete call also returns the corresponding LLM span via get_llm_consumed().
user_facing_text – The original text of the AggregatedTextFrame as shown to the user (e.g. via RTVI). Unlike tts_text, this text has no TTS-specific tags or transformations. The tracker maintains a cursor into it so callers can retrieve the spoken and unspoken portions in terms of user-visible text via get_accumulated_user_facing_text() and get_remaining_user_facing_text(). When not provided, defaults to tts_text with markup stripped, because user-facing text should never carry synthesis tags.
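The default for user_facing_text can be pictured with the sketch below. Both function names are hypothetical, and the regular expression is an assumption about what "markup stripped" means; the library's actual stripping rules are not specified here.

```python
import re
from typing import Optional


def strip_markup(tts_text: str) -> str:
    """Remove anything that looks like an <...> tag (an assumed rule)."""
    return re.sub(r"<[^>]+>", "", tts_text)


def resolve_user_facing_text(tts_text: str, user_facing_text: Optional[str] = None) -> str:
    """Use the explicit user-facing text, else fall back to tag-free TTS text."""
    if user_facing_text is not None:
        return user_facing_text
    return strip_markup(tts_text)
```

Under these assumptions, "Card <spell>4111</spell> ok" falls back to "Card 4111 ok" when no explicit user-facing text is supplied.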

add_word_and_check_complete(word: str) → bool

Record a spoken word from a word-timestamp event.

Before advancing, checks whether the word belongs to this frame via word_belongs_here. If it does not (e.g. the TTS provider silently dropped a word-timestamp), the slot is force-completed: the remaining unspoken text from tts_text is stored in _frame_word so a TTSTextFrame can still be emitted for the dropped portion, all remaining llm_text is consumed, and the entire incoming word is set as overflow so the caller’s overflow path routes it to the next slot unchanged.

Otherwise the word is handed to the segment map, which matches it against the remaining TTS text and advances its own cursors. If llm_text was provided at construction time, also stores the corresponding LLM span in _llm_consumed. When this word completes the frame, the entire remaining LLM text (including any closing tags) is consumed so nothing is lost.

If the word overshoots the expected length (overflow – it spans the boundary into the next AggregatedTextFrame), the raw suffix of the word is stored in _overflow_word, so the caller can attribute it to the next frame.

Parameters:

word – A single word token returned by the TTS service. TTS services that emit spaces and punctuation as separate tokens (e.g. Inworld) must pre-merge those tokens into the preceding word before calling this method (see TTSService._merge_punct_tokens). May also be a fragment of a still-open SSML tag; the segment map matches such fragments against the remaining TTS text without needing to parse them as markup.

Returns:

True when all expected content has been covered.

word_belongs_here(word: str) → bool

Return True if this word plausibly belongs to the remaining TTS text.

Delegates entirely to the segment map, which owns the remaining-text matching needed to decide.

Used to detect when the TTS provider silently dropped a word-timestamp event: if the incoming word does not match this slot’s remaining content, the caller should force-complete this slot and route the word to the next.
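The caller-side routing this enables can be sketched without the real classes. route_words and normalize are hypothetical helpers; the sketch assumes one slot per frame, whitespace-separated words, and an exact match on normalized words, which is stricter than the "plausibly belongs" test described above.

```python
from typing import List, Tuple


def normalize(word: str) -> str:
    """Compare words on their lowercase alphanumeric content only."""
    return "".join(ch.lower() for ch in word if ch.isalnum())


def route_words(slot_texts: List[str], spoken: List[str]) -> List[Tuple[int, str]]:
    """Return (slot index, emitted text) pairs for a stream of spoken words."""
    pending = [text.split() for text in slot_texts]  # unspoken words per slot
    emitted: List[Tuple[int, str]] = []
    current = 0
    for word in spoken:
        while current < len(pending):
            queue = pending[current]
            if queue and normalize(queue[0]) == normalize(word):
                queue.pop(0)
                emitted.append((current, word))  # the word belongs here
                break
            if queue:
                # Force-complete: emit whatever this slot never heard.
                emitted.append((current, " ".join(queue)))
                queue.clear()
            current += 1  # route the same word to the next slot
    return emitted


routed = route_words(
    ["Hello there friend", "How are you?"],
    ["Hello", "there", "How", "are", "you"],  # "friend" was dropped
)
```

Here the word "How" does not match the first slot's remaining "friend", so the first slot is force-completed with "friend" and "How" is routed to the second slot.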

suppress_in_context() → bool

Return True when the last added word falls inside a transformed segment that has not yet been fully consumed.

When True, the sequencer sets append_to_context=False on the emitted TTSTextFrame so intermediate TTS words (e.g. "forty", "two") are not written to the conversation context. Only the completing word of the segment carries raw_text with the original text (e.g. "$42.50").
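The effect on the conversation context can be sketched with a stand-in frame type. FrameSketch and build_context are hypothetical; only the field names append_to_context and raw_text come from the description above, and the real TTSTextFrame and context aggregation may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FrameSketch:
    """Stand-in for a TTSTextFrame, reduced to the fields discussed here."""

    text: str
    append_to_context: bool = True
    raw_text: Optional[str] = None


def build_context(frames: List[FrameSketch]) -> str:
    """Join what reaches the context, preferring raw_text when it is set."""
    kept = [frame.raw_text or frame.text for frame in frames if frame.append_to_context]
    return " ".join(kept)


frames = [
    FrameSketch("It"),
    FrameSketch("costs"),
    FrameSketch("forty", append_to_context=False),  # suppressed mid-segment
    FrameSketch("two", append_to_context=False),
    FrameSketch("dollars", append_to_context=False),
    FrameSketch("and", append_to_context=False),
    FrameSketch("fifty", append_to_context=False),
    FrameSketch("cents", raw_text="$42.50"),  # completing word carries the original
]
context_text = build_context(frames)
```

With these frames, context_text contains the original "$42.50" once and none of the intermediate spoken words.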

get_word_for_frame() → str | None

Return the portion of the last word that belongs to this frame.

Normal word (no overflow): the full word.
Straddling word: the prefix up to the frame boundary (e.g. "1111" from "1111 And").
Force-completed (word didn’t belong): the remaining unspoken text from tts_text so a TTSTextFrame can still be emitted for the dropped portion. The incoming word is routed as overflow to the next slot.

get_overflow_word() → str | None

Return the raw suffix of the last word that overflows into the next frame.

Preserves the original casing and any non-alphanumeric characters so the overflow TTSTextFrame has natural word text. Returns None when there is no overflow (the word fit entirely within this frame).

get_llm_consumed() → str | None

Return the LLM text span consumed for the last added word.

Returns None if no llm_text was provided at construction time.

get_accumulated_user_facing_text() → str

Return all consumed text from user_facing_text up to the current cursor position.

get_remaining_user_facing_text(strip: bool = True) → str

Return the unspoken portion of user_facing_text.

Parameters:

strip – When True (default), leading/trailing whitespace is removed. Set to False to preserve leading whitespace so that get_accumulated_user_facing_text() + get_remaining_user_facing_text(strip=False) reconstructs the original text exactly.
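The reconstruction property of strip=False can be shown with a plain string cursor. split_at_cursor is a hypothetical helper that assumes the accumulated and remaining texts are simple slices around one cursor index, which may not match the tracker's internals.

```python
from typing import Tuple


def split_at_cursor(text: str, cursor: int, strip: bool = True) -> Tuple[str, str]:
    """Return (accumulated, remaining) around a character cursor."""
    accumulated = text[:cursor]
    remaining = text[cursor:]
    if strip:
        remaining = remaining.strip()  # convenient for display, but lossy
    return accumulated, remaining


text = "Hello, world!"
spoken, unspoken_raw = split_at_cursor(text, 6, strip=False)
_, unspoken_clean = split_at_cursor(text, 6)
```

In this sketch spoken + unspoken_raw equals the original text, while unspoken_clean has lost the leading space and so cannot be used to rebuild it.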

get_accumulated_tts_text() → str

Return all consumed text from tts_text up to the current cursor position.

Unlike get_word_for_frame() (which reflects only the last word), this returns everything that has been consumed since construction or the last reset().

get_accumulated_llm_text() → str | None

Return all consumed text from llm_text up to the current cursor position.

Unlike get_llm_consumed() (which reflects only the last word), this returns everything that has been consumed since construction or the last reset(). Returns None if no llm_text was provided at construction time.

get_remaining_tts_text(strip: bool = True) → str

Return the unspoken portion of tts_text.

Parameters:

strip – When True (default), leading/trailing whitespace is removed. Set to False to preserve leading whitespace so that get_accumulated_tts_text() + get_remaining_tts_text(strip=False) reconstructs the original text exactly.

get_remaining_llm_text() → str | None

Return the unspoken portion of llm_text, stripped of leading/trailing whitespace.

Returns None if no llm_text was provided at construction time. Like get_remaining_tts_text(), intended for force-completing a slot so that the conversation context receives the full original text.

property is_complete: bool

True when this frame’s TTS text has been fully accounted for.

reset()

Reset all cursors and per-call outputs without changing the expected texts.