word_timestamp_utils

Utilities for normalizing word-timestamp streams from TTS services.

pipecat.utils.text.word_timestamp_utils.merge_punct_tokens(word_times: list[tuple[str, float]]) → list[tuple[str, float]]

Merge punctuation/space-only tokens into the preceding word.

Some TTS services (e.g. Inworld) emit spaces and punctuation as separate word-timestamp tokens rather than attaching them to the adjacent word. This function collapses those tokens so downstream consumers always receive words with trailing punctuation already attached, the same format that ElevenLabs and Cartesia produce.

A token is considered punctuation/space-only when its text contains no alphanumeric characters after XML/HTML tags are stripped. Such a token is appended to the preceding word's text and its timestamp is discarded (the preceding word's timestamp is kept). Leading punctuation/space-only tokens with no preceding word are silently discarded. Every output token is stripped of leading and trailing whitespace (spaces, tabs, newlines).

Parameters: word_times – Raw list of (word, timestamp) pairs from the TTS service.
Returns: Merged list where every entry contains at least one alphanumeric character and has no leading or trailing whitespace.

Example:

merge_punct_tokens([("questions", 1.0), (", ", 1.2), ("explain", 1.4)])
# → [("questions,", 1.0), ("explain", 1.4)]