word_completion_tracker

Word completion tracker for TTS context ordering.

class pipecat.utils.context.word_completion_tracker.WordCompletionTracker(tts_text: str, llm_text: str | None = None)

Bases: object

Tracks whether all words from a source AggregatedTextFrame have been spoken.

Compares normalized alphanumeric character counts between the TTS text and accumulated spoken words, making the check robust to punctuation, spacing, and XML/HTML tags (e.g. SSML tags like <spell>...</spell> returned by some TTS providers in word-timestamp events).
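The comparison described above can be sketched in a few lines. This is a minimal illustration, not the library's code: the helper names `normalize` and `all_words_spoken` and the tag regex are assumptions made for the example, and the real tracker may normalize differently.

```python
import re

# Assumed tag pattern for this sketch; the real tracker may use a different one.
_TAG_RE = re.compile(r"<[^>]*>")


def normalize(text: str) -> str:
    """Drop XML/HTML-style tags, then keep only lowercased alphanumeric chars."""
    without_tags = _TAG_RE.sub("", text)
    return "".join(ch.lower() for ch in without_tags if ch.isalnum())


def all_words_spoken(tts_text: str, spoken_words: list[str]) -> bool:
    """Compare normalized character counts rather than raw strings."""
    expected = len(normalize(tts_text))
    spoken = sum(len(normalize(word)) for word in spoken_words)
    return spoken >= expected
```

Because only counts of alphanumeric characters are compared, "Hello, world!" is covered by the words "Hello" and "world" even though the comma, space, and exclamation mark never appear in the word stream.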

When llm_text is provided (e.g. the original pattern-matched text including delimiters like <card>4111 1111 1111 1111</card>), the tracker additionally maps each spoken word back to its corresponding span in that LLM text. This lets callers attach the original text to TTSTextFrame entries so the conversation context receives properly-tagged content rather than the cleaned words received from the TTS provider.

Background: TTS providers apply their own SSML tags to the text before synthesis and return word-timestamp events containing the raw spoken words (e.g. "4111", "1111"). Without LLM-text tracking, the conversation context would only see those cleaned words and lose the original structure (e.g. <card>4111 1111 1111 1111</card>). By mapping normalized char counts back to positions in llm_text, each TTSTextFrame can carry the exact span of original text it represents.
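One way to picture the mapping from spoken words back to LLM text is the sketch below. It is a simplified stand-alone illustration under stated assumptions: `llm_spans`, `advance`, and `alnum_count` are hypothetical helpers, tags are matched with a simple regex, and the last word consumes all remaining text, as the description above says happens when a frame completes.

```python
import re

_TAG_RE = re.compile(r"<[^>]*>")  # assumed tag pattern for this sketch


def alnum_count(text: str) -> int:
    """Count alphanumeric chars outside of tags."""
    return sum(ch.isalnum() for ch in _TAG_RE.sub("", text))


def advance(text: str, pos: int, n_chars: int) -> int:
    """Return the index just past the n-th alphanumeric char at or after pos."""
    i = pos
    while i < len(text) and n_chars > 0:
        tag = _TAG_RE.match(text, i)
        if tag:  # tags are skipped without being counted
            i = tag.end()
            continue
        if text[i].isalnum():
            n_chars -= 1
        i += 1
    return i


def llm_spans(llm_text: str, words: list[str]) -> list[str]:
    """Map each spoken word to the span of llm_text it covers."""
    total = alnum_count(llm_text)
    spans, pos, seen = [], 0, 0
    for word in words:
        n = alnum_count(word)
        seen += n
        # The completing word takes everything left, including closing tags.
        end = len(llm_text) if seen >= total else advance(llm_text, pos, n)
        spans.append(llm_text[pos:end])
        pos = end
    return spans
```

With `llm_text = "<card>4111 1111 1111 1111</card>"` and four spoken words, the first span carries the opening tag and the last span carries the closing tag, so concatenating the spans reproduces the original text.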

Overflow handling: TTS providers sometimes return a single word token that spans the boundary between two AggregatedTextFrames (e.g. "1111</spell>And" when one frame ends with 1111</card> and the next begins with And). The tracker detects this and exposes the raw overflow suffix via get_overflow_word(), so callers can feed the remainder into the next frame’s tracker and emit a correctly-attributed TTSTextFrame for each part.
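The boundary split can be sketched as follows. This is an assumption-laden illustration rather than the tracker's actual logic: `split_at_boundary` is a hypothetical helper that takes the number of alphanumeric characters still expected by the current frame and returns the part of the word that fits plus the raw overflow suffix.

```python
from __future__ import annotations

import re

_TAG_RE = re.compile(r"<[^>]*>")  # assumed tag pattern for this sketch


def split_at_boundary(word: str, remaining: int) -> tuple[str, str | None]:
    """Split word after `remaining` alphanumeric chars; tags are not counted."""
    i = 0
    while i < len(word) and remaining > 0:
        tag = _TAG_RE.match(word, i)
        if tag:
            i = tag.end()
            continue
        if word[i].isalnum():
            remaining -= 1
        i += 1
    suffix = word[i:]
    # Only a suffix that still contains alphanumeric chars counts as overflow;
    # trailing punctuation stays with the current frame.
    has_overflow = any(ch.isalnum() for ch in _TAG_RE.sub("", suffix))
    return (word[:i], suffix) if has_overflow else (word, None)
```

In this sketch the suffix keeps its original casing and any non-alphanumeric characters, so `split_at_boundary("1111 And", 4)` yields `("1111", " And")`.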

Example:

tracker = WordCompletionTracker("Hello, world!")
tracker.add_word_and_check_complete("Hello")   # False
tracker.add_word_and_check_complete("world")   # True: normalized "helloworld" >= "helloworld"

__init__(tts_text: str, llm_text: str | None = None)

Initialize the tracker with the text of the frame being spoken.

Parameters:
  • tts_text – Full text of the AggregatedTextFrame sent to TTS (may include TTS-specific SSML tags). Used for normalized char-count completion tracking and as the cursor reference for the TTS word stream.

  • llm_text – Original LLM-produced text including pattern delimiters (e.g. <card>4111 1111 1111 1111</card>). When provided, the LLM span corresponding to each add_word_and_check_complete call is available via get_llm_consumed(). Both texts normalize to the same alphanumeric sequence, so the same char-count cursor drives position tracking in both.

add_word_and_check_complete(word: str) → bool

Record a spoken word from a word-timestamp event.

Normalizes word, appends it to the running total, and checks whether all expected alphanumeric characters have been covered.

Before advancing, checks whether the word belongs to this frame via word_belongs_here. If it does not (e.g. the TTS provider silently dropped a word-timestamp), the slot is force-completed: the remaining unspoken text from tts_text is stored in _frame_word so a TTSTextFrame can still be emitted for the dropped portion, all remaining llm_text is consumed, and the entire incoming word is set as overflow so the caller’s overflow path routes it to the next slot unchanged.

If llm_text was provided at construction time, also advances the LLM cursor by the same number of alphanumeric chars consumed from this word and stores the corresponding LLM span in _llm_consumed. When this word completes the frame, the entire remaining LLM text (including any closing tags) is consumed so nothing is lost.

If the word overshoots the expected length (overflow), the raw suffix of the word (everything after the last char belonging to this frame) is stored in _overflow_word, so the caller can attribute it to the next AggregatedTextFrame.

Parameters:

word – A single word token returned by the TTS service. TTS services that emit spaces and punctuation as separate tokens (e.g. Inworld) must pre-merge those tokens into the preceding word before calling this method (see TTSService._merge_punct_tokens).

Returns:

True when all expected content has been covered.

word_belongs_here(word: str) → bool

Return True if this word plausibly belongs to the remaining TTS text.

Dispatches to one of two checks depending on whether the word contains any alphanumeric characters after normalization:

  • Alnum words: prefix-match against the remaining expected chars.

  • Symbol/punctuation words (empty after normalization): literal substring search in the remaining raw TTS text, with a fallback for TTS providers that substitute Unicode symbols with ASCII punctuation.

Used to detect when the TTS provider silently dropped a word-timestamp event: if the incoming word does not match this slot’s remaining content, the caller should force-complete this slot and route the word to the next.
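The two-way dispatch can be sketched as below. Everything here is hypothetical: `word_plausibly_belongs` is an illustrative helper, the prefix check compares only the overlapping length so that a straddling word still matches (an assumption about how such words are handled), and the Unicode-to-ASCII fallback is omitted.

```python
import re

_TAG_RE = re.compile(r"<[^>]*>")  # assumed tag pattern for this sketch


def normalize(text: str) -> str:
    """Strip tags and keep lowercased alphanumeric chars only."""
    return "".join(ch.lower() for ch in _TAG_RE.sub("", text) if ch.isalnum())


def word_plausibly_belongs(word: str, remaining_tts_text: str) -> bool:
    norm_word = normalize(word)
    if norm_word:
        # Alnum word: prefix-match against the remaining expected chars.
        remaining = normalize(remaining_tts_text)
        n = min(len(norm_word), len(remaining))
        return n > 0 and norm_word[:n] == remaining[:n]
    # Symbol/punctuation word: literal substring search in the raw text.
    symbol = word.strip()
    return bool(symbol) and symbol in remaining_tts_text
```

A caller could use a `False` result as the signal that a word-timestamp event was dropped and that the current slot should be force-completed.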

get_word_for_frame() → str | None

Return the portion of the last word that belongs to this frame.

  • Normal word (no overflow): the full word.

  • Straddling word: the prefix up to the frame boundary (e.g. "1111" from "1111 And").

  • Force-completed (word didn’t belong): the remaining unspoken text from tts_text so a TTSTextFrame can still be emitted for the dropped portion. The incoming word is routed as overflow to the next slot.

get_overflow_word() → str | None

Return the raw suffix of the last word that overflows into the next frame.

Preserves the original casing and any non-alphanumeric characters so the overflow TTSTextFrame has natural word text. Returns None when there is no overflow (the word fit entirely within this frame).
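A caller-side loop that routes overflow into the next frame's tracker might look like the sketch below. To stay self-contained it uses `StandInTracker`, a deliberately simplified stand-in that mirrors only three of the documented method names; it is not pipecat's implementation, and `route_words` is a hypothetical helper.

```python
from __future__ import annotations

import re

_TAG_RE = re.compile(r"<[^>]*>")  # assumed tag pattern for this sketch


def _has_alnum(text: str) -> bool:
    return any(ch.isalnum() for ch in _TAG_RE.sub("", text))


class StandInTracker:
    """Simplified stand-in for illustration; NOT pipecat's implementation."""

    def __init__(self, tts_text: str) -> None:
        # Alphanumeric chars still expected for this frame.
        self._left = sum(ch.isalnum() for ch in _TAG_RE.sub("", tts_text))
        self._frame_word: str | None = None
        self._overflow_word: str | None = None

    def add_word_and_check_complete(self, word: str) -> bool:
        i = 0
        while i < len(word) and self._left > 0:
            tag = _TAG_RE.match(word, i)
            if tag:
                i = tag.end()
                continue
            if word[i].isalnum():
                self._left -= 1
            i += 1
        suffix = word[i:]
        if _has_alnum(suffix):  # the word straddles the frame boundary
            self._frame_word, self._overflow_word = word[:i], suffix
        else:
            self._frame_word, self._overflow_word = word, None
        return self._left == 0

    def get_word_for_frame(self) -> str | None:
        return self._frame_word

    def get_overflow_word(self) -> str | None:
        return self._overflow_word


def route_words(frame_texts: list[str], words: list[str]) -> list[tuple[int, str | None]]:
    """Attribute each word, or each part of a straddling word, to a frame index."""
    trackers = [StandInTracker(text) for text in frame_texts]
    routed, slot = [], 0
    for word in words:
        pending: str | None = word
        while pending is not None and slot < len(trackers):
            tracker = trackers[slot]
            done = tracker.add_word_and_check_complete(pending)
            routed.append((slot, tracker.get_word_for_frame()))
            pending = tracker.get_overflow_word()  # remainder goes to the next slot
            if done:
                slot += 1
    return routed
```

For frames `["4111 1111", "And done."]` and the word stream `["4111", "1111And", "done."]`, the straddling token is split so that "1111" is attributed to the first frame and "And" to the second.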

get_llm_consumed() → str | None

Return the LLM text span consumed for the last added word.

Returns None if no llm_text was provided at construction time.

get_accumulated_tts_text() → str

Return all consumed text from tts_text up to the current cursor position.

Unlike get_word_for_frame() (which reflects only the last word), this returns everything that has been consumed since construction or the last reset().

get_accumulated_llm_text() → str | None

Return all consumed text from llm_text up to the current cursor position.

Unlike get_llm_consumed() (which reflects only the last word), this returns everything that has been consumed since construction or the last reset(). Returns None if no llm_text was provided at construction time.

get_remaining_tts_text() → str

Return the unspoken portion of tts_text, stripped of leading/trailing whitespace.

This is the text that the TTS provider has not yet confirmed via word-timestamp events. Useful for force-completing a slot when the audio context ends before all word-timestamp events have arrived.

get_remaining_llm_text() → str | None

Return the unspoken portion of llm_text, stripped of leading/trailing whitespace.

Returns None if no llm_text was provided at construction time. Like get_remaining_tts_text(), intended for force-completing a slot so that the conversation context receives the full original text.

property is_complete: bool

True when accumulated normalized chars >= expected normalized chars.

reset()

Reset received word accumulation without changing the expected text.