ui

UI worker: an LLM worker that observes and drives a client GUI over RTVI.

Combines the RTVI UI wire protocol (client events, accessibility snapshots, server UI commands) with ReplyToolMixin, an opt-in mixin that provides the bundled reply tool. Whenever RTVI is enabled, PipelineWorker connects a UIWorker to the client automatically; there is no decorator or separate component to wire up.

class pipecat.workers.ui.BusUICommandMessage(command_name: str = '', payload: Any = None, *, source: str, target: str | None = None)

Bases: BusUIDataMessage

A UI command sent from a server-side worker to the client.

Published by UIWorker.send_command(name, payload). PipelineWorker (in on_bus_message) translates this to an RTVIUICommandFrame(command=command_name, payload=payload) and pushes it through the pipeline.

Parameters:
  • command_name – App-defined command name.

  • payload – App-defined payload (already a plain dict by the time it lands on the bus).

command_name: str = ''
payload: Any = None
class pipecat.workers.ui.BusUIEventMessage(event_name: str = '', payload: Any = None, *, source: str, target: str | None = None)

Bases: BusUIDataMessage

A UI event sent from the client to a server-side worker.

Emitted by PipelineWorker when the client dispatches an event via PipecatClient.sendUIEvent(event, payload). UIWorker subclasses dispatch these to @ui_event(name) handlers.

Parameters:
  • event_name – App-defined event name.

  • payload – App-defined payload. Schemaless by design.

event_name: str = ''
payload: Any = None
class pipecat.workers.ui.BusUIJobCompletedMessage(job_id: str = '', worker_name: str = '', status: str = '', response: Any = None, at: int = 0, *, source: str, target: str | None = None)

Bases: BusUIDataMessage

A worker in a user-facing job group has completed.

Forwarded by the UIWorker whenever a worker’s BusJobResponseMessage arrives for a registered user job group. PipelineWorker forwards it to the client as a ui-job-group envelope with kind = "job_completed".

Parameters:
  • job_id – The shared job-group identifier.

  • worker_name – The worker that produced the response.

  • status – Completion status as a string (JobStatus value).

  • response – The worker’s response payload.

  • at – Epoch milliseconds when the response was received.

job_id: str = ''
worker_name: str = ''
status: str = ''
response: Any = None
at: int = 0
class pipecat.workers.ui.BusUIJobGroupCompletedMessage(job_id: str = '', at: int = 0, *, source: str, target: str | None = None)

Bases: BusUIDataMessage

A user-facing job group has completed.

Published when UIWorker.ui_job_group(...) exits, after every worker has responded (or the group has been cancelled). PipelineWorker forwards it to the client as a ui-job-group envelope with kind = "group_completed".

Parameters:
  • job_id – The shared job-group identifier.

  • at – Epoch milliseconds when the group completed.

job_id: str = ''
at: int = 0
class pipecat.workers.ui.BusUIJobGroupStartedMessage(job_id: str = '', workers: list[str] | None = None, label: str | None = None, cancellable: bool = True, at: int = 0, *, source: str, target: str | None = None)

Bases: BusUIDataMessage

A user-facing job group has been dispatched.

Published by UIWorker.ui_job_group(...) on entry. PipelineWorker forwards it to the client as a ui-job-group envelope with kind = "group_started".

Parameters:
  • job_id – Shared job-group identifier for the group.

  • workers – Names of the workers the work was dispatched to.

  • label – Optional human-readable label for the group.

  • cancellable – Whether the client may request cancellation.

  • at – Epoch milliseconds when the group started.

job_id: str = ''
workers: list[str] | None = None
label: str | None = None
cancellable: bool = True
at: int = 0
class pipecat.workers.ui.BusUIJobUpdateMessage(job_id: str = '', worker_name: str = '', data: Any = None, at: int = 0, *, source: str, target: str | None = None)

Bases: BusUIDataMessage

Per-worker progress for a user-facing job group.

Forwarded by the UIWorker whenever a worker emits a BusJobUpdateMessage whose job_id matches a registered user job group. PipelineWorker forwards it to the client as a ui-job-group envelope with kind = "job_update".

Parameters:
  • job_id – The shared job-group identifier.

  • worker_name – The worker that produced the update.

  • data – The worker’s update payload, forwarded verbatim.

  • at – Epoch milliseconds when the update was emitted on the bus.

job_id: str = ''
worker_name: str = ''
data: Any = None
at: int = 0
class pipecat.workers.ui.ReplyToolMixin

Bases: object

Expose a reply tool covering the full standard action set.

A single bundled LLM tool with a required spoken answer plus optional visual and state-changing actions. The model makes one tool call per turn, with no chaining; because the API schema marks answer as required, the model cannot omit the turn’s terminating reply.

Compose alongside UIWorker:

from pipecat.workers.ui import ReplyToolMixin, UIWorker

class MyUIWorker(ReplyToolMixin, UIWorker):  # gains the bundled reply @tool
    ...

Covers pointing apps (scroll_to + highlight), reading apps (scroll_to + select_text), form apps (fills + click), and any blend of them (e.g. a document review that combines selection-based deixis with voice-driven note-taking). On each turn the LLM fills only the fields that fit the user’s request; unused fields stay null and have no effect.

The mixin delivers answer as verbatim TTS (respond_to_job(answer, tts_speak=True)), so the user hears the exact phrase. Apps that want a minimal schema (only the fields they use, or app-specific commands), or that want the requester’s voice LLM to phrase the reply instead, should write their own @tool reply directly on the UIWorker subclass, using the UIWorker helper methods plus send_command to dispatch the underlying UI commands.

The host class must provide scroll_to, highlight, select_text, click, set_input_value, and respond_to_job (UIWorker does) and must be the target of @tool discovery on the LLM pipeline.

async reply(params: FunctionCallParams, answer: str, scroll_to: str | None = None, highlight: list[str] | None = None, select_text: str | None = None, fills: list[dict] | None = None, click: list[str] | None = None)

Reply to the user. Optionally point at content and act on inputs.

Always called exactly once per turn. answer is required; the action fields are optional and may be combined.

Visual / pointing actions (draw the user’s attention):

  • scroll_to brings an element into view (single ref).

  • highlight flashes elements briefly (list of refs). Best for short emphasis like a button or a fact.

  • select_text puts the page’s text selection on an element (single ref). Best for “this paragraph” / “the section about X” so the user sees exactly what was meant. Persists until the user clicks elsewhere.

State-changing actions (modify form / app state):

  • fills writes values into inputs (list of {"ref", "value"} objects, multi-fill in one turn).

  • click clicks elements (list of refs in order). Use for checkboxes, radios, submit buttons.

Order of dispatch within a turn: scroll_to, then highlight, then select_text, then fills, then click, then speak the answer.

Parameters:
  • params – Framework-provided tool invocation context.

  • answer – The spoken reply in plain language. One short sentence. No markdown, no symbols.

  • scroll_to – Optional snapshot ref. Scrolls the element into view before speaking.

  • highlight – Optional list of snapshot refs. Visually pulses each element.

  • select_text – Optional snapshot ref. Places the page’s text selection on that element.

  • fills – Optional list of {"ref": "eN", "value": "..."} objects. Writes each value into the input at ref.

  • click – Optional list of snapshot refs to click in order.
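The dispatch order above can be illustrated with a stub host. This is a minimal sketch, not ReplyToolMixin’s source: RecordingHost and dispatch_reply are hypothetical names, and applying the list-valued fields one ref at a time through the single-ref helpers (highlight(ref), set_input_value(ref, value), click(ref)) is an assumption based on those helpers’ signatures on UIWorker.

```python
import asyncio


class RecordingHost:
    """Hypothetical stand-in for a UIWorker that records calls instead of sending commands."""

    def __init__(self):
        self.calls = []

    async def scroll_to(self, ref):
        self.calls.append(("scroll_to", ref))

    async def highlight(self, ref):
        self.calls.append(("highlight", ref))

    async def select_text(self, ref):
        self.calls.append(("select_text", ref))

    async def set_input_value(self, ref, value):
        self.calls.append(("set_input_value", ref, value))

    async def click(self, ref):
        self.calls.append(("click", ref))

    async def respond_to_job(self, answer, *, tts_speak=False):
        self.calls.append(("respond_to_job", answer, tts_speak))


async def dispatch_reply(host, answer, scroll_to=None, highlight=None,
                         select_text=None, fills=None, click=None):
    # Documented order: scroll_to, highlight, select_text, fills, click, then the answer.
    if scroll_to:
        await host.scroll_to(scroll_to)
    for ref in highlight or []:
        await host.highlight(ref)
    if select_text:
        await host.select_text(select_text)
    for fill in fills or []:
        await host.set_input_value(fill["ref"], fill["value"])
    for ref in click or []:
        await host.click(ref)
    # The mixin speaks the answer verbatim.
    await host.respond_to_job(answer, tts_speak=True)


host = RecordingHost()
asyncio.run(dispatch_reply(
    host,
    "Done, I filled in your email and checked the box.",
    scroll_to="e12",
    fills=[{"ref": "e14", "value": "ada@example.com"}],
    click=["e15"],
))
print([call[0] for call in host.calls])
```

Recording calls instead of sending commands keeps the order visible without a connected client.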

class pipecat.workers.ui.UIWorker(name: str, *, llm: LLMService[Any], context: LLMContext | None = None, assistant_params: LLMAssistantAggregatorParams | None = None, inject_events: bool = True, auto_inject_ui_state: bool = True, keep_history: bool = False, prompt_guide: str | None = '## UI context\n\nYour developer context includes two kinds of SDK-managed messages:\n\n- ``<ui_event name="..." >payload</ui_event>``: an event the user just triggered on the client (click, tab switch, navigation, etc.). The payload is JSON for that event.\n- ``<ui_state>...</ui_state>``: an accessibility snapshot of the current screen, injected at the start of every turn. Indented tree in Playwright-MCP style. Each line is ``- role "name" [state] [ref=eN]`` with children nested one level deeper. A line can also carry ``= "value"`` (an element\'s current value, e.g. text already typed into an input) and ``[level=N]`` (heading depth).\n\nState tags include ``[focused]``, ``[selected]``, ``[disabled]``, and ``[offscreen]``. A node tagged ``[offscreen]`` exists on the page but is not currently in the user\'s viewport; only visible (non-offscreen) nodes count for position-based references.\n\nGrids carry a ``[cols=N]`` tag. Their cells are listed in reading order (left-to-right, top-to-bottom); with N columns, cell K sits at row ``ceil(K/N)``, column ``((K-1) mod N) + 1``. Example with ``[cols=8]`` and 16 children: "top right" is cell 8, "bottom left" is cell 9.\n\nResolve position references ("top right", "the first one", "the third new release") against the most recent ``<ui_state>`` tree. Sibling order matches reading order on screen (top-to-bottom, left-to-right within each region).\n\nWhen the user has text selected on the page, the snapshot ends with a ``<selection ref="eN">selected text</selection>`` block inside ``<ui_state>``. Treat the selection as the deictic referent for "this", "that", "what I selected", and similar phrases. 
The ``ref`` identifies the closest enclosing element that has a ref in the tree; the inner text is the actual selected content (truncated if very long). Text inside ``<input>`` or ``<textarea>`` selections is faithful to ``selectionStart``/``selectionEnd`` on the element.\n\nRefs (``e42``) are stable handles for acting on elements: pass the ``ref`` from the most recent ``<ui_state>`` to any tool that operates on a node. The same element keeps its ref across snapshots while it stays on the page, so you can refer back to it across turns. Always resolve refs against the latest snapshot, and bring an ``[offscreen]`` element into view before acting on it.')

Bases: LLMContextWorker

LLM worker that reads and drives a client GUI over the RTVI UI channel.

A UIWorker connects an LLM to whatever the user is looking at: it sees the screen as accessibility snapshots, reacts to the user’s UI events, and acts on the page by sending commands to the client. It is the delegate side of a voice/UI split – a voice layer (the main pipeline’s LLM, or a separate LLMWorker) handles speech and hands screen-relevant work to this worker.

Capabilities:

  • See the screen. The latest accessibility snapshot is rendered as <ui_state> and auto-injected into the LLM context before each inference.

  • React to UI events, dispatched to @ui_event(name) handlers.

  • Drive the UI with send_command and the scroll_to / highlight / select_text / click / set_input_value helpers.

  • Answer as a delegate. The built-in single-flight respond job runs one screen-grounded LLM turn, which a @tool ends by calling respond_to_job; its tts_speak argument decides how the answer reaches the user.

  • Surface long work. ui_job_group / start_ui_job_group fan work out to peer workers and show each group on the client as a job-group card (cancellable by default).

PipelineWorker connects a UIWorker to the client automatically when RTVI is enabled – no extra wiring. A working subclass needs only an LLM and a @tool that calls respond_to_job; override render_query to read a non-default job payload.

Example:

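from pipecat.services.openai.llm import OpenAILLMService
from pipecat.workers.ui import UIWorker, ui_event
# The @tool decorator used below must also be imported (its module is not shown on this page).

class MyUIWorker(UIWorker):
    @ui_event("nav_click")
    async def on_nav(self, message):
        # message is a BusUIEventMessage; its payload is app-defined.
        view = message.payload.get("view")
        ...

    @tool
    async def answer(self, params, text: str):
        # Default mode: the requester's voice LLM phrases the answer.
        await self.respond_to_job(text)
        await params.result_callback(None)

worker = MyUIWorker("ui", llm=OpenAILLMService(api_key="..."))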

Note

When the client’s trackViewport option is on (the default), off-screen nodes carry [offscreen] in <ui_state>; call scroll_to on such a node before acting on it.

__init__(name: str, *, llm: LLMService[Any], context: LLMContext | None = None, assistant_params: LLMAssistantAggregatorParams | None = None, inject_events: bool = True, auto_inject_ui_state: bool = True, keep_history: bool = False, prompt_guide: str | None = UI_STATE_PROMPT_GUIDE)

Initialize the UIWorker.

Parameters:
  • name – Unique name for this worker.

  • llm – The LLM service.

  • context – Optional pre-built LLMContext. Seeded messages are part of the mutable history and are cleared on each keep_history=False reset; put durable instructions in the LLM’s system_instruction instead.

  • assistant_params – Optional assistant-aggregator parameters, e.g. to enable context summarization for keep_history=True workers.

  • inject_events – When True (the default), append each UI event to the context as a <ui_event> developer message. Override render_ui_event to change the content, or set False to disable.

  • auto_inject_ui_state – When True (the default), append the latest <ui_state> snapshot to the context before every inference (via the LLM’s on_before_process_frame hook). Set False to inject manually with inject_ui_state().

  • keep_history – When False (the default), the context is cleared at the start of every job, so each turn sees only the current <ui_state> and query – best for the stateless-delegate role. When True, history accumulates across jobs so the LLM can resolve multi-turn references (“the next one”, “the Pro version”), at the cost of more tokens and possible confusion from stale <ui_state> blocks. Use context summarization to prune the history when it gets too large.

  • prompt_guide – Wire-format guide appended to the LLM’s system_instruction so it can parse the <ui_state> / <ui_event> messages. Defaults to UI_STATE_PROMPT_GUIDE; pass a string to override or None to disable. Living in system_instruction, it survives context resets.
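The default prompt guide (UI_STATE_PROMPT_GUIDE) tells the LLM to locate grid cells with row ceil(K/N) and column ((K-1) mod N) + 1. The rule lives in the prompt, so it is the LLM that applies it; the hypothetical grid_position helper below only checks the arithmetic against the guide’s own [cols=8] example.

```python
import math


def grid_position(k: int, cols: int) -> tuple[int, int]:
    """Return the 1-based (row, column) of cell K in a grid with `cols` columns."""
    row = math.ceil(k / cols)
    col = ((k - 1) % cols) + 1
    return row, col


# The guide's example: [cols=8] with 16 children.
print(grid_position(8, 8))  # top right   → (1, 8)
print(grid_position(9, 8))  # bottom left → (2, 1)
```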

async send_command(name: str, payload: Any = None) → None

Send a named UI command to the client.

Publishes a BusUICommandMessage; when RTVI is enabled, PipelineWorker translates it into an RTVIUICommandFrame on the pipeline. Client-side handlers subscribed to RTVIEvent.UICommand (or registered with the React client’s useUICommandHandler hook) dispatch on the command name.

Parameters:
  • name – App-defined command name (e.g. "toast", "navigate", or any app-specific name).

  • payload

    One of:

    • A pydantic BaseModel instance (including the built-in command models in pipecat.processors.frameworks.rtvi.models). Converted to a plain dict with model_dump().

    • A dataclass instance. Converted to a plain dict with dataclasses.asdict.

    • A dict, forwarded as-is.

    • None, forwarded as an empty dict.
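The payload rules above amount to a small normalization step. The sketch below is hypothetical (normalize_payload and the Toast command are illustrative names, not part of Pipecat); it detects pydantic models by their model_dump method so it runs without pydantic installed, and raising TypeError for any other type is an assumption, since only these four forms are documented.

```python
import dataclasses
from typing import Any


def normalize_payload(payload: Any) -> dict:
    """Hypothetical helper mirroring the documented send_command payload rules."""
    if payload is None:
        return {}  # None is forwarded as an empty dict
    if hasattr(payload, "model_dump"):
        return payload.model_dump()  # pydantic BaseModel instances
    if dataclasses.is_dataclass(payload) and not isinstance(payload, type):
        return dataclasses.asdict(payload)  # dataclass instances
    if isinstance(payload, dict):
        return payload  # dicts are forwarded as-is
    raise TypeError(f"unsupported payload type: {type(payload).__name__}")


@dataclasses.dataclass
class Toast:
    # Hypothetical app-defined command payload.
    title: str
    duration_ms: int = 3000


print(normalize_payload(Toast(title="Saved")))  # → {'title': 'Saved', 'duration_ms': 3000}
print(normalize_payload(None))                  # → {}
```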

async scroll_to(ref: str) → None

Send a scroll_to UI command to bring an element into view.

Convenience wrapper around send_command("scroll_to", ScrollTo(ref=ref)). These scroll_to / highlight / select_text / click / set_input_value helpers are plain methods, not LLM tools: compose them inside a custom @tool body, or use ReplyToolMixin for the standard shape.

Parameters:

ref – Snapshot ref (e.g. "e42") from the latest <ui_state>.

async highlight(ref: str) → None

Send a highlight UI command to briefly flash an element.

Parameters:

ref – Snapshot ref (e.g. "e42") from the latest <ui_state>.

async select_text(ref: str, *, start_offset: int | None = None, end_offset: int | None = None) → None

Send a select_text UI command to select an element’s text.

Selects the whole element by default, or the character sub-range from start_offset to end_offset (over the element’s concatenated textContent) when both are given. Used for deixis: pointing at content via the page’s text selection.

Parameters:
  • ref – Snapshot ref (e.g. "e42") from the latest <ui_state>.

  • start_offset – Optional start character offset of the selection.

  • end_offset – Optional end character offset (exclusive).
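The offset rules can be checked on a paragraph whose text is split across inline elements. selected_substring is a hypothetical helper that models only the documented arithmetic (offsets index the concatenated textContent, end_offset is exclusive, and the whole element is selected unless both offsets are given); it says nothing about how the client applies the selection in the browser.

```python
def selected_substring(text_nodes: list[str], start_offset=None, end_offset=None) -> str:
    """Return the text a select_text command would cover, per the documented offset rules."""
    text_content = "".join(text_nodes)  # the element's concatenated textContent
    if start_offset is None or end_offset is None:
        return text_content  # whole element unless both offsets are given
    return text_content[start_offset:end_offset]  # end_offset is exclusive


# A paragraph whose text is split across an inline <em> element.
nodes = ["Refunds are issued ", "within 14 days", " of the return."]
print(selected_substring(nodes))          # → Refunds are issued within 14 days of the return.
print(selected_substring(nodes, 19, 33))  # → within 14 days
```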

async click(ref: str) → None

Send a click UI command (checkboxes, radios, submit buttons).

The standard client handler ignores disabled targets, so the worker cannot bypass affordances that are meant to stay under the user’s control.

Parameters:

ref – Snapshot ref (e.g. "e42") from the latest <ui_state>.

async set_input_value(ref: str, value: str, *, replace: bool = True) → None

Send a set_input_value UI command to fill a text input/textarea.

Parameters:
  • ref – Snapshot ref (e.g. "e42") of the input or textarea.

  • value – Text to write into the field.

  • replace – When True (the default), overwrite the field; when False, append (e.g. to continue a long answer in a textarea).

async on_bus_message(message: BusMessage) → None

Handle incoming UI events (dispatching them to @ui_event handlers) in addition to the base class’s lifecycle handling.

property current_job: BusJobRequestMessage | None

The job this worker is currently processing, or None when idle.

Set when a respond turn starts and cleared when the job completes. Lets @tool methods inspect the in-flight job without threading the message through every call.

Returns:

The in-flight BusJobRequestMessage, or None when idle.

render_query(message: BusJobRequestMessage) → str

Extract the user’s query text from a job request.

Override to read a different payload shape. The returned string is appended to the LLM context as a user message before the LLM runs. The default reads payload["query"].

Parameters:

message – The inbound job request.

Returns:

The query text to feed into the LLM.

async respond_to_job(answer: str | None = None, *, tts_speak: bool = False, status: JobStatus = JobStatus.COMPLETED) → None

Complete the in-flight job with the worker’s answer.

Called from a @tool once the worker has decided how to answer. tts_speak picks the delivery; the two modes are mutually exclusive (one voice per turn):

  • default: the job responds with {"answer": answer} for the requester’s voice LLM to phrase.

  • tts_speak=True: answer is spoken verbatim by the requester’s TTS (via BusTTSSpeakMessage, and added to its context) while the job responds with None, so the voice LLM doesn’t speak as well.

A falsy answer completes the turn silently. This is a no-op when no job is in flight or the job has already been answered.

Parameters:
  • answer – The worker’s answer – spoken verbatim (tts_speak=True) or handed to the requester’s voice LLM to phrase (default).

  • tts_speak – Speak answer verbatim via the requester’s TTS instead of returning it for the requester’s voice LLM to phrase.

  • status – Completion status. Defaults to JobStatus.COMPLETED.
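The two delivery modes, plus the silent case, can be laid out as a small table in code. delivery is a hypothetical function, not part of UIWorker: it returns the job response and the text spoken verbatim for each mode. The documentation does not say what the job responds when answer is falsy, so returning None there is an assumption.

```python
from typing import Any, Optional


def delivery(answer: Optional[str], tts_speak: bool = False) -> tuple[Any, Optional[str]]:
    """Return (job_response, text_spoken_verbatim) for the documented respond_to_job modes."""
    if not answer:
        return None, None  # falsy answer: the turn completes silently (response assumed None)
    if tts_speak:
        return None, answer  # requester's TTS speaks it; the job responds None
    return {"answer": answer}, None  # requester's voice LLM phrases the answer


print(delivery("It's in the top right corner."))
print(delivery("It's in the top right corner.", tts_speak=True))
print(delivery(""))
```

The two modes never both produce speech, which is the "one voice per turn" rule stated above.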

ui_job_group(*worker_names: str, name: str | None = None, payload: dict | None = None, timeout: float | None = None, cancel_on_error: bool = True, label: str | None = None, cancellable: bool = True) → UIJobGroupContext

Dispatch a job group whose lifecycle is forwarded to the client.

Like job_group(...), but also forwards the group’s lifecycle to the client as ui-job-group envelopes so the user can watch (and optionally cancel) the work. See UIJobGroupContext for the forwarding details.

Parameters:
  • *worker_names – Names of the workers to send the job to.

  • name – Optional job name for routing to named @job handlers on the workers.

  • payload – Optional structured data describing the work.

  • timeout – Optional timeout in seconds covering both the ready-wait and job execution.

  • cancel_on_error – Whether to cancel the group if a worker errors. Defaults to True.

  • label – Optional human-readable label surfaced to the client. The client UI uses it to title the in-flight job-group card.

  • cancellable – Whether the client may request cancellation of this group via the reserved __cancel_job_group event. Defaults to True.

Returns:

A UIJobGroupContext to use with async with.

Example:

# Inside an async UIWorker method; query holds the user's request.
async with self.ui_job_group(
    "researcher_a", "researcher_b",
    payload={"query": query},
    label=f"Research: {query}",
) as tg:
    async for event in tg:
        ...
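The order in which a client sees ui-job-group envelopes follows from the Bus*Message descriptions earlier on this page: group_started, then job_update and job_completed for each worker as they arrive, then group_completed. The sketch below is a hypothetical model of that sequence; its dict fields mirror the message parameters, but the actual envelope layout on the wire and the "completed" status string are assumptions.

```python
import time


def now_ms() -> int:
    """Epoch milliseconds, matching the `at` fields described above."""
    return int(time.time() * 1000)


def group_envelopes(job_id, workers, events, *, label=None, cancellable=True):
    """Model one group's ui-job-group envelopes in the order a client would see them.

    events: (worker_name, "update" | "response", data) tuples in arrival order.
    """
    envelopes = [{"kind": "group_started", "job_id": job_id, "workers": list(workers),
                  "label": label, "cancellable": cancellable, "at": now_ms()}]
    for worker_name, event, data in events:
        if event == "update":
            envelopes.append({"kind": "job_update", "job_id": job_id,
                              "worker_name": worker_name, "data": data, "at": now_ms()})
        else:
            # "completed" stands in for a JobStatus value (assumed string).
            envelopes.append({"kind": "job_completed", "job_id": job_id,
                              "worker_name": worker_name, "status": "completed",
                              "response": data, "at": now_ms()})
    envelopes.append({"kind": "group_completed", "job_id": job_id, "at": now_ms()})
    return envelopes


envelopes = group_envelopes(
    "job-1",
    ["researcher_a", "researcher_b"],
    [
        ("researcher_a", "update", {"progress": 0.5}),
        ("researcher_b", "response", {"summary": "..."}),
        ("researcher_a", "response", {"summary": "..."}),
    ],
    label="Research: solar panels",
)
print([e["kind"] for e in envelopes])
```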
async start_ui_job_group(*worker_names: str, name: str | None = None, payload: dict | None = None, timeout: float | None = None, cancel_on_error: bool = True, label: str | None = None, cancellable: bool = True) → str

Fire-and-forget version of ui_job_group.

Dispatches the group in the background and returns immediately (the lifecycle still forwards to the client). Use it when a @tool wants to kick off work and unblock the voice worker; use ui_job_group to consume worker events inline. Worker exceptions are logged, not propagated.

Parameters:
  • *worker_names – Names of the workers to send the job to.

  • name – Optional job name for routing to named @job handlers on the workers.

  • payload – Optional structured data describing the work.

  • timeout – Optional timeout in seconds covering both the ready-wait and job execution.

  • cancel_on_error – Whether to cancel the group if a worker errors. Defaults to True.

  • label – Optional human-readable label surfaced to the client. The client UI uses it to title the in-flight job-group card.

  • cancellable – Whether the client may request cancellation of this group via the reserved __cancel_job_group event. Defaults to True.

Returns:

The job_id of the dispatched group. Useful if the caller wants to track it (e.g. to cancel programmatically via cancel_job_group(job_id)).

Example:

@tool
async def reply(self, params, answer, research_query=None):
    if research_query:
        await self.start_ui_job_group(
            "wikipedia", "news", "scholar",
            payload={"query": research_query},
            label=f"Research: {research_query}",
        )
    await self.respond_to_job(answer)
    await params.result_callback(None)
render_ui_state() → str

Render the latest accessibility snapshot as a <ui_state> block.

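Produces Playwright-MCP-style indented text with stable element refs. With auto_inject_ui_state=True (the default), the output is appended to the LLM context before every inference; otherwise apps call inject_ui_state() when they want the LLM to see what’s on screen.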

When the snapshot carries a current text selection, a nested <selection ref="...">...</selection> block is appended inside <ui_state> so the LLM can resolve deictic references (“this paragraph”, “what I selected”) against on-page content.

Override to customize the rendered form.

Returns:

The <ui_state> block, or an empty string if no snapshot has been received yet.
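A minimal sketch of the documented rendering, assuming a hypothetical node dict shape ("role", "name", "tags", "ref", "children") and two-space indentation per level; it is not the SDK’s renderer. It shows the line format from the prompt guide, state tags such as [offscreen] and [level=N], the trailing <selection> block, and the empty string returned before any snapshot arrives. Value rendering (= "value") is left out because its position on the line is not specified here.

```python
def render_node(node: dict, depth: int = 0) -> list[str]:
    """Render one node as '- role "name" [tag...] [ref=eN]', children indented below it."""
    parts = [f'- {node["role"]}']
    if node.get("name"):
        parts.append(f'"{node["name"]}"')
    parts.extend(f"[{tag}]" for tag in node.get("tags", []))  # e.g. offscreen, level=1
    parts.append(f'[ref={node["ref"]}]')
    lines = ["  " * depth + " ".join(parts)]
    for child in node.get("children", []):
        lines.extend(render_node(child, depth + 1))
    return lines


def render_snapshot(root, selection=None) -> str:
    """Wrap the rendered tree in <ui_state>; return "" when there is no snapshot yet."""
    if root is None:
        return ""
    body = render_node(root)
    if selection is not None:
        ref, text = selection
        body.append(f'<selection ref="{ref}">{text}</selection>')
    return "<ui_state>\n" + "\n".join(body) + "\n</ui_state>"


snapshot = {
    "role": "main", "ref": "e1", "children": [
        {"role": "heading", "name": "Returns", "tags": ["level=1"], "ref": "e2"},
        {"role": "paragraph", "name": "Refunds are issued within 14 days.", "ref": "e3"},
        {"role": "button", "name": "Contact us", "tags": ["offscreen"], "ref": "e4"},
    ],
}
print(render_snapshot(snapshot, selection=("e3", "within 14 days")))
```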

async inject_ui_state() → None

Append the latest <ui_state> block to the LLM context.

No-op when no snapshot has been received. The frame it pushes sets run_llm=False: the snapshot is context, not a user turn, so appending it does not trigger an LLM run.

render_ui_event(message: BusUIEventMessage) → str

Render a UI event as a string for LLM context injection.

Override to customize the injected content. The default wraps the event in a single <ui_event> XML tag with a name attribute and a JSON-encoded payload as inner text.

Parameters:

message – The UI event to render.

Returns:

A string to append to the LLM context as a developer message.
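A sketch of the default form described above: a single <ui_event> tag with a name attribute and the JSON-encoded payload as inner text. UIEvent is a hypothetical stand-in for BusUIEventMessage, and the exact attribute escaping and JSON spacing the SDK uses are assumptions; an override on a UIWorker subclass would take the message and return a string in the same way.

```python
import json
from dataclasses import dataclass
from typing import Any
from xml.sax.saxutils import quoteattr


@dataclass
class UIEvent:
    # Hypothetical stand-in for BusUIEventMessage (only the fields used here).
    event_name: str
    payload: Any = None


def render_event(message: UIEvent) -> str:
    """One <ui_event> tag: quoted name attribute, JSON-encoded payload as inner text."""
    return f"<ui_event name={quoteattr(message.event_name)}>{json.dumps(message.payload)}</ui_event>"


print(render_event(UIEvent("nav_click", {"view": "settings"})))
# → <ui_event name="nav_click">{"view": "settings"}</ui_event>
```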

pipecat.workers.ui.ui_event(name: str)

Mark a worker method as a handler for a named UI event.

On UIWorker subclasses, decorated methods are automatically dispatched when a BusUIEventMessage with a matching name arrives.

Example:

from pipecat.workers.ui import UIWorker, ui_event

class MyUIWorker(UIWorker):
    @ui_event("nav_click")
    async def on_nav(self, message):
        view = message.payload.get("view")
        ...
Parameters:

name – The UI event name to match.
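Name-keyed dispatch of this kind can be sketched with the standard library alone. on_ui_event and EventDispatchSketch are hypothetical stand-ins, not Pipecat’s implementation: the decorator tags a method with an event name, the base class collects tagged methods into a name-to-handler table at construction, and dispatch calls the matching handler with the message (here a SimpleNamespace standing in for BusUIEventMessage).

```python
import asyncio
from types import SimpleNamespace


def on_ui_event(name: str):
    """Hypothetical stand-in for @ui_event: tag a method with the event name it handles."""
    def decorator(fn):
        fn._ui_event_name = name
        return fn
    return decorator


class EventDispatchSketch:
    """Collects tagged methods into a name -> bound-handler table at construction time."""

    def __init__(self):
        self._handlers = {}
        for attr in dir(type(self)):
            name = getattr(getattr(type(self), attr), "_ui_event_name", None)
            if name is not None:
                self._handlers[name] = getattr(self, attr)

    async def dispatch(self, message) -> bool:
        """Call the handler registered for message.event_name; return False if there is none."""
        handler = self._handlers.get(message.event_name)
        if handler is None:
            return False
        await handler(message)
        return True


class NavWorker(EventDispatchSketch):
    def __init__(self):
        super().__init__()
        self.views = []

    @on_ui_event("nav_click")
    async def on_nav(self, message):
        self.views.append(message.payload.get("view"))


worker = NavWorker()
asyncio.run(worker.dispatch(SimpleNamespace(event_name="nav_click", payload={"view": "settings"})))
print(worker.views)  # → ['settings']
```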

Submodules