word_completion_tracker
Word completion tracker for TTS context ordering.
- class pipecat.utils.context.word_completion_tracker.WordCompletionTracker(tts_text: str, llm_text: str | None = None)
Bases: object
Tracks whether all words from a source AggregatedTextFrame have been spoken.
Compares normalized alphanumeric character counts between the TTS text and the accumulated spoken words, making the check robust to punctuation, spacing, and XML/HTML tags (e.g. SSML tags like <spell>...</spell> returned by some TTS providers in word-timestamp events).
When llm_text is provided (e.g. the original pattern-matched text including delimiters like <card>4111 1111 1111 1111</card>), the tracker additionally maps each spoken word back to its corresponding span in that LLM text. This lets callers attach the original text to TTSTextFrame entries so the conversation context receives properly tagged content rather than the cleaned words received from the TTS provider.
Background: TTS providers apply their own SSML tags to the text before synthesis and return word-timestamp events containing the raw spoken words (e.g. "4111", "1111"). Without LLM-text tracking, the conversation context would only see those cleaned words and lose the original structure (e.g. <card>4111 1111 1111 1111</card>). By mapping normalized char counts back to positions in llm_text, each TTSTextFrame can carry the exact span of original text it represents.
Overflow handling: TTS providers sometimes return a single word token that spans the boundary between two AggregatedTextFrames (e.g. "1111</spell>And" when one frame ends with 1111</card> and the next begins with And). The tracker detects this and exposes the raw overflow suffix via get_overflow_word(), so callers can feed the remainder into the next frame’s tracker and emit a correctly attributed TTSTextFrame for each part.
Example:
tracker = WordCompletionTracker("Hello, world!")
tracker.add_word_and_check_complete("Hello")  # False
tracker.add_word_and_check_complete("world")  # True: normalized "helloworld" >= "helloworld"
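The normalized comparison described above can be illustrated with a minimal standalone sketch. It is not the library's implementation: the helper names `normalize` and `is_complete`, the tag regex, and the lowercasing step are assumptions made for illustration.

```python
import re

# Assumption: anything between "<" and ">" is treated as an XML/HTML tag.
_TAG = re.compile(r"<[^>]+>")


def normalize(text: str) -> str:
    """Drop tags, then keep only lowercased alphanumeric characters."""
    without_tags = _TAG.sub("", text)
    return "".join(ch.lower() for ch in without_tags if ch.isalnum())


def is_complete(expected_text: str, spoken_words: list[str]) -> bool:
    """True once the spoken words cover at least as many normalized
    characters as the expected text contains."""
    spoken = sum(len(normalize(word)) for word in spoken_words)
    return spoken >= len(normalize(expected_text))


print(normalize("Hello, world!"))
print(is_complete("Hello, world!", ["Hello"]))
print(is_complete("Hello, world!", ["Hello", "world"]))
```

Because only counts are compared, punctuation, spacing, and provider-specific tags in either string do not affect the result.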
- __init__(tts_text: str, llm_text: str | None = None)
Initialize the tracker with the text of the frame being spoken.
- Parameters:
tts_text – Full text of the AggregatedTextFrame sent to TTS (may include TTS-specific SSML tags). Used for normalized char-count completion tracking and as the cursor reference for the TTS word stream.
llm_text – Original LLM-produced text including pattern delimiters (e.g. <card>4111 1111 1111 1111</card>). When provided, each add_word_and_check_complete call also records the corresponding LLM span, which can be retrieved via get_llm_consumed(). Both texts normalize to the same alphanumeric sequence, so the same char-count cursor drives position tracking in both.
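To make the shared char-count cursor concrete, here is a hedged sketch of how spoken words could be mapped back to spans of the LLM text. The helpers `alnum_positions` and `llm_spans` are hypothetical, and the choice to attach trailing whitespace to the preceding word is an assumption of this sketch, not documented behavior.

```python
import re

# Assumption: anything between "<" and ">" is treated as a tag.
_TAG = re.compile(r"<[^>]+>")


def alnum_positions(text: str) -> list[int]:
    """Indices of alphanumeric characters that lie outside tags."""
    # Blank out tags with same-length padding so indices stay aligned.
    masked = _TAG.sub(lambda m: " " * len(m.group()), text)
    return [i for i, ch in enumerate(masked) if ch.isalnum()]


def llm_spans(llm_text: str, words: list[str]) -> list[str]:
    """Return, for each spoken word, the slice of llm_text it covers."""
    positions = alnum_positions(llm_text)
    spans = []
    raw_cursor = 0  # index into llm_text
    consumed = 0    # alphanumeric chars consumed so far
    for word in words:
        consumed += len(alnum_positions(word))
        if consumed >= len(positions):
            end = len(llm_text)        # final word: take closing tags too
        else:
            end = positions[consumed]  # stop before the next unconsumed char
        spans.append(llm_text[raw_cursor:end])
        raw_cursor = end
    return spans


spans = llm_spans("<card>4111 1111 1111 1111</card>", ["4111", "1111", "1111", "1111"])
print(spans)
```

Concatenating the spans reproduces the original LLM text, so no delimiter is lost.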
- add_word_and_check_complete(word: str) -> bool
Record a spoken word from a word-timestamp event.
Normalizes word, appends it to the running total, and checks whether all expected alphanumeric characters have been covered.
Before advancing, checks whether the word belongs to this frame via word_belongs_here. If it does not (e.g. the TTS provider silently dropped a word-timestamp), the slot is force-completed: the remaining unspoken text from tts_text is stored in _frame_word so a TTSTextFrame can still be emitted for the dropped portion, all remaining llm_text is consumed, and the entire incoming word is set as overflow so the caller’s overflow path routes it to the next slot unchanged.
If llm_text was provided at construction time, also advances the LLM cursor by the same number of alphanumeric chars consumed from this word and stores the corresponding LLM span in _llm_consumed. When this word completes the frame, the entire remaining LLM text (including any closing tags) is consumed so nothing is lost.
If the word overshoots the expected length (overflow), the raw suffix of the word (everything after the last char belonging to this frame) is stored in _overflow_word, so the caller can attribute it to the next AggregatedTextFrame.
- Parameters:
word – A single word token returned by the TTS service. TTS services that emit spaces and punctuation as separate tokens (e.g. Inworld) must pre-merge those tokens into the preceding word before calling this method (see TTSService._merge_punct_tokens).
- Returns:
True when all expected content has been covered.
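The overflow case can be sketched as a split of the incoming word at the frame boundary. The function `split_overflow` below is hypothetical, and two details are assumptions of this sketch rather than documented behavior: a closing tag is kept with the part that belongs to this frame, and trailing whitespace is trimmed from that part.

```python
from __future__ import annotations

import re

# Assumption: anything between "<" and ">" is treated as a tag.
_TAG = re.compile(r"<[^>]+>")


def split_overflow(word: str, remaining: int) -> tuple[str, str | None]:
    """Split `word` once `remaining` alphanumeric chars have been used.

    Returns (part for this frame, raw overflow suffix or None).
    """
    # Blank out tags with same-length padding so indices stay aligned.
    masked = _TAG.sub(lambda m: " " * len(m.group()), word)
    positions = [i for i, ch in enumerate(masked) if ch.isalnum()]
    if len(positions) <= remaining:
        return word, None          # the word fits entirely in this frame
    cut = positions[remaining]     # first char that belongs to the next frame
    return word[:cut].rstrip(), word[cut:]


print(split_overflow("1111 And", 4))
print(split_overflow("world", 5))
```

The suffix keeps its original casing, so a caller could pass it straight to the next frame's tracker.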
- word_belongs_here(word: str) -> bool
Return True if this word plausibly belongs to the remaining TTS text.
Dispatches to one of two checks depending on whether the word contains any alphanumeric characters after normalization:
Alnum words: prefix-match against the remaining expected chars.
Symbol/punctuation words (empty after normalization): literal substring search in the remaining raw TTS text, with a fallback for TTS providers that substitute Unicode symbols with ASCII punctuation.
Used to detect when the TTS provider silently dropped a word-timestamp event: if the incoming word does not match this slot’s remaining content, the caller should force-complete this slot and route the word to the next.
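The two-way dispatch described above might look like the following sketch. The function `belongs_here` is hypothetical. Accepting a word that extends past the remaining text (the straddling case) is an assumption, and the Unicode-to-ASCII fallback is deliberately left out.

```python
import re

# Assumption: anything between "<" and ">" is treated as a tag.
_TAG = re.compile(r"<[^>]+>")


def normalize(text: str) -> str:
    """Drop tags, then keep only lowercased alphanumeric characters."""
    return "".join(ch.lower() for ch in _TAG.sub("", text) if ch.isalnum())


def belongs_here(word: str, remaining_raw: str) -> bool:
    """Plausibility check for `word` against the frame's unspoken raw text."""
    word_norm = normalize(word)
    if word_norm:
        remaining_norm = normalize(remaining_raw)
        # Alnum path: the word is a prefix of what is left, or it covers
        # everything that is left and spills into the next frame.
        return bool(remaining_norm) and (
            remaining_norm.startswith(word_norm)
            or word_norm.startswith(remaining_norm)
        )
    # Symbol path: look for the token literally in the raw remainder.
    token = word.strip()
    return bool(token) and token in remaining_raw


print(belongs_here("world", "world!"))
print(belongs_here("Goodbye", "world!"))
```

A False result is the signal that a word-timestamp event was probably dropped and the slot should be force-completed.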
- get_word_for_frame() -> str | None
Return the portion of the last word that belongs to this frame.
Normal word (no overflow): the full word.
Straddling word: the prefix up to the frame boundary (e.g. "1111" from "1111 And").
Force-completed (word didn’t belong): the remaining unspoken text from tts_text, so a TTSTextFrame can still be emitted for the dropped portion. The incoming word is routed as overflow to the next slot.
- get_overflow_word() -> str | None
Return the raw suffix of the last word that overflows into the next frame.
Preserves the original casing and any non-alphanumeric characters so the overflow TTSTextFrame has natural word text. Returns None when there is no overflow (the word fit entirely within this frame).
- get_llm_consumed() -> str | None
Return the LLM text span consumed for the last added word.
Returns None if no llm_text was provided at construction time.
- get_accumulated_tts_text() -> str
Return all consumed text from tts_text up to the current cursor position.
Unlike get_word_for_frame() (which reflects only the last word), this returns everything that has been consumed since construction or the last reset().
- get_accumulated_llm_text() -> str | None
Return all consumed text from llm_text up to the current cursor position.
Unlike get_llm_consumed() (which reflects only the last word), this returns everything that has been consumed since construction or the last reset(). Returns None if no llm_text was provided at construction time.
- get_remaining_tts_text() -> str
Return the unspoken portion of tts_text, stripped of leading/trailing whitespace.
This is the text that the TTS provider has not yet confirmed via word-timestamp events. Useful for force-completing a slot when the audio context ends before all word-timestamp events have arrived.
- get_remaining_llm_text() -> str | None
Return the unspoken portion of llm_text, stripped of leading/trailing whitespace.
Returns None if no llm_text was provided at construction time. Like get_remaining_tts_text(), intended for force-completing a slot so that the conversation context receives the full original text.
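The relationship between the accumulated and remaining getters can be pictured as one split of the raw text at the cursor. The helper `split_at_cursor` is hypothetical, and leaving the separating whitespace on the accumulated side is an assumption of this sketch.

```python
import re

# Assumption: anything between "<" and ">" is treated as a tag.
_TAG = re.compile(r"<[^>]+>")


def split_at_cursor(text: str, consumed_alnum: int) -> tuple[str, str]:
    """Split raw `text` after `consumed_alnum` alphanumeric characters.

    Returns (accumulated text, remaining text stripped of outer whitespace).
    """
    # Blank out tags with same-length padding so indices stay aligned.
    masked = _TAG.sub(lambda m: " " * len(m.group()), text)
    positions = [i for i, ch in enumerate(masked) if ch.isalnum()]
    if consumed_alnum >= len(positions):
        return text, ""  # everything has been spoken
    cut = positions[consumed_alnum]
    return text[:cut], text[cut:].strip()


accumulated, remaining = split_at_cursor("Hello, world!", 5)
print(repr(accumulated), repr(remaining))
```

When the audio context ends early, the remaining part is what a caller would emit to force-complete the slot.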
- property is_complete: bool
True when accumulated normalized chars >= expected normalized chars.