ui_worker
UIWorker: an LLM worker that observes and drives a client GUI over RTVI.
- class pipecat.workers.ui.ui_worker.UIWorker(name: str, *, llm: LLMService[Any], context: LLMContext | None = None, assistant_params: LLMAssistantAggregatorParams | None = None, inject_events: bool = True, auto_inject_ui_state: bool = True, keep_history: bool = False, prompt_guide: str | None = '## UI context\n\nYour developer context includes two kinds of SDK-managed messages:\n\n- ``<ui_event name="..." >payload</ui_event>``: an event the user just triggered on the client (click, tab switch, navigation, etc.). The payload is JSON for that event.\n- ``<ui_state>...</ui_state>``: an accessibility snapshot of the current screen, injected at the start of every turn. Indented tree in Playwright-MCP style. Each line is ``- role "name" [state] [ref=eN]`` with children nested one level deeper. A line can also carry ``= "value"`` (an element\'s current value, e.g. text already typed into an input) and ``[level=N]`` (heading depth).\n\nState tags include ``[focused]``, ``[selected]``, ``[disabled]``, and ``[offscreen]``. A node tagged ``[offscreen]`` exists on the page but is not currently in the user\'s viewport; only visible (non-offscreen) nodes count for position-based references.\n\nGrids carry a ``[cols=N]`` tag. Their cells are listed in reading order (left-to-right, top-to-bottom); with N columns, cell K sits at row ``ceil(K/N)``, column ``((K-1) mod N) + 1``. Example with ``[cols=8]`` and 16 children: "top right" is cell 8, "bottom left" is cell 9.\n\nResolve position references ("top right", "the first one", "the third new release") against the most recent ``<ui_state>`` tree. Sibling order matches reading order on screen (top-to-bottom, left-to-right within each region).\n\nWhen the user has text selected on the page, the snapshot ends with a ``<selection ref="eN">selected text</selection>`` block inside ``<ui_state>``. Treat the selection as the deictic referent for "this", "that", "what I selected", and similar phrases. 
The ``ref`` identifies the closest enclosing element that has a ref in the tree; the inner text is the actual selected content (truncated if very long). Text inside ``<input>`` or ``<textarea>`` selections is faithful to ``selectionStart``/``selectionEnd`` on the element.\n\nRefs (``e42``) are stable handles for acting on elements: pass the ``ref`` from the most recent ``<ui_state>`` to any tool that operates on a node. The same element keeps its ref across snapshots while it stays on the page, so you can refer back to it across turns. Always resolve refs against the latest snapshot, and bring an ``[offscreen]`` element into view before acting on it.')[source]
Bases: LLMContextWorker

LLM worker that reads and drives a client GUI over the RTVI UI channel.

A UIWorker connects an LLM to whatever the user is looking at: it sees the screen as accessibility snapshots, reacts to the user’s UI events, and acts on the page by sending commands to the client. It is the delegate side of a voice/UI split – a voice layer (the main pipeline’s LLM, or a separate LLMWorker) handles speech and hands screen-relevant work to this worker.

Capabilities:

- See the screen. The latest accessibility snapshot is rendered as <ui_state> and auto-injected into the LLM context before each inference.
- React to UI events, dispatched to @ui_event(name) handlers.
- Drive the UI with send_command and the scroll_to / highlight / select_text / click / set_input_value helpers.
- Answer as a delegate. The built-in single-flight respond job runs one screen-grounded LLM turn that a @tool ends by calling respond_to_job (which decides how the answer reaches the user).
- Surface long work. ui_job_group / start_ui_job_group fan work out to peer workers as cancellable job-group cards on the client.

PipelineWorker connects a UIWorker to the client automatically when RTVI is enabled – no extra wiring. A working subclass needs only an LLM and a @tool that calls respond_to_job; override render_query to read a non-default job payload.

Example:

    class MyUIWorker(UIWorker):
        @ui_event("nav_click")
        async def on_nav(self, message):
            view = message.payload.get("view")
            ...

        @tool
        async def answer(self, params, text: str):
            await self.respond_to_job(text)
            await params.result_callback(None)

    worker = MyUIWorker("ui", llm=OpenAILLMService(api_key="..."))
Note
With client trackViewport on (the default), off-screen nodes carry [offscreen] in <ui_state>; scroll_to before acting on them.

- __init__(name: str, *, llm: LLMService[Any], context: LLMContext | None = None, assistant_params: LLMAssistantAggregatorParams | None = None, inject_events: bool = True, auto_inject_ui_state: bool = True, keep_history: bool = False, prompt_guide: str | None = UI_STATE_PROMPT_GUIDE)
(The prompt_guide default is the same UI context guide text shown in full in the class signature above.)
Initialize the UIWorker.
- Parameters:
name – Unique name for this worker.
llm – The LLM service.
context – Optional pre-built LLMContext. Seeded messages are part of the mutable history and are cleared on each keep_history=False reset; put durable instructions in the LLM’s system_instruction instead.
assistant_params – Optional assistant-aggregator parameters, e.g. to enable context summarization for keep_history=True workers.
inject_events – When True (the default), append each UI event to the context as a <ui_event> developer message. Override render_ui_event to change the content, or set False to disable.
auto_inject_ui_state – When True (the default), append the latest <ui_state> snapshot to the context before every inference (via the LLM’s on_before_process_frame hook). Set False to inject manually with inject_ui_state().
keep_history – When False (the default), the context is cleared at the start of every job, so each turn sees only the current <ui_state> and query – best for the stateless-delegate role. When True, history accumulates across jobs so the LLM can resolve multi-turn references (“the next one”, “the Pro version”), at the cost of more tokens and possible confusion from stale <ui_state> blocks. Use context summarization to prune the history when it gets too large.
prompt_guide – Wire-format guide appended to the LLM’s system_instruction so it can parse the <ui_state> / <ui_event> messages. Defaults to UI_STATE_PROMPT_GUIDE; pass a string to override or None to disable. Living in system_instruction, it survives context resets.
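The default prompt guide locates grid cells by reading order: with N columns, cell K sits at row ceil(K/N), column ((K-1) mod N) + 1. The arithmetic can be sketched as below. The helper name grid_position is hypothetical and is not part of UIWorker; it only restates the formula from the guide.

```python
import math


def grid_position(cell: int, cols: int) -> tuple[int, int]:
    """Return the 1-based (row, column) of a 1-based cell index in reading order."""
    if cell < 1 or cols < 1:
        raise ValueError("cell and cols must be positive")
    row = math.ceil(cell / cols)
    column = (cell - 1) % cols + 1
    return row, column


# The guide's own example: [cols=8] with 16 children.
print(grid_position(8, 8))  # (1, 8): "top right"
print(grid_position(9, 8))  # (2, 1): "bottom left"
```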
- async send_command(name: str, payload: Any = None) -> None
Send a named UI command to the client.
Publishes a BusUICommandMessage; when RTVI is enabled, PipelineWorker translates it into an RTVIUICommandFrame on the pipeline. Client-side handlers subscribed to RTVIEvent.UICommand (or React’s useUICommandHandler) dispatch on the command name.
- Parameters:
name – App-defined command name (e.g. "toast", "navigate", or any app-specific name).
payload – One of:
  - A pydantic BaseModel instance (including the built-in command models in pipecat.processors.frameworks.rtvi.models). Converted to a plain dict with model_dump().
  - A dataclass instance. Converted to a plain dict with dataclasses.asdict.
  - A dict, forwarded as-is.
  - None, forwarded as an empty dict.
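The payload rules above amount to a small normalization step. The sketch below is illustrative only: normalize_payload and Toast are hypothetical names, this is not pipecat's implementation, and pydantic models are detected by duck-typing on model_dump so the example runs without pydantic installed.

```python
import dataclasses
from typing import Any


def normalize_payload(payload: Any) -> dict:
    """Hypothetical helper mirroring the documented payload rules."""
    if payload is None:
        return {}  # None is forwarded as an empty dict
    if isinstance(payload, dict):
        return payload  # dicts are forwarded as-is
    if dataclasses.is_dataclass(payload) and not isinstance(payload, type):
        return dataclasses.asdict(payload)
    if hasattr(payload, "model_dump"):  # assumption: stands in for a pydantic BaseModel
        return payload.model_dump()
    raise TypeError(f"unsupported payload type: {type(payload).__name__}")


@dataclasses.dataclass
class Toast:  # hypothetical app-defined command payload
    title: str
    duration_ms: int = 3000


print(normalize_payload(Toast(title="Saved")))  # {'title': 'Saved', 'duration_ms': 3000}
print(normalize_payload(None))                  # {}
```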
- async scroll_to(ref: str) -> None
Send a scroll_to UI command to bring an element into view.
Convenience wrapper around send_command("scroll_to", ScrollTo(ref=ref)). These scroll_to / highlight / select_text / click / set_input_value helpers are plain methods, not LLM tools: compose them inside a custom @tool body, or use ReplyToolMixin for the standard shape.
- Parameters:
ref – Snapshot ref (e.g. "e42") from the latest <ui_state>.
- async highlight(ref: str) -> None
Send a highlight UI command to briefly flash an element.
- Parameters:
ref – Snapshot ref (e.g. "e42") from the latest <ui_state>.
- async select_text(ref: str, *, start_offset: int | None = None, end_offset: int | None = None) -> None
Send a select_text UI command to select an element’s text.
Selects the whole element by default, or the start_offset..end_offset character sub-range (over the element’s concatenated textContent) when both are given. Used for deixis – pointing at content via the page’s text selection.
- Parameters:
ref – Snapshot ref (e.g. "e42") from the latest <ui_state>.
start_offset – Optional start character offset of the selection.
end_offset – Optional end character offset (exclusive).
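Because end_offset is exclusive, the offsets behave like a Python slice over the element's text. A minimal sketch of how a caller might compute them is shown below. The function selected_text is hypothetical and only models which characters would end up selected; falling back to the whole text when just one offset is supplied is an assumption drawn from the "when both are given" wording.

```python
from typing import Optional


def selected_text(
    text_content: str,
    start_offset: Optional[int] = None,
    end_offset: Optional[int] = None,
) -> str:
    """Model the selection: whole text by default, a sub-range when both offsets are given."""
    if start_offset is None or end_offset is None:
        return text_content
    return text_content[start_offset:end_offset]  # end is exclusive, like a slice


text = "Pro plan: $20/month"
start = text.index("$20")
end = start + len("$20")
print(selected_text(text, start, end))  # $20
print(selected_text(text))              # Pro plan: $20/month
```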
- async click(ref: str) -> None
Send a click UI command (checkboxes, radios, submit buttons).
The standard client handler no-ops on disabled targets, so the worker can’t bypass affordances meant to be user-controlled.
- Parameters:
ref – Snapshot ref (e.g. "e42") from the latest <ui_state>.
- async set_input_value(ref: str, value: str, *, replace: bool = True) -> None
Send a set_input_value UI command to fill a text input/textarea.
- Parameters:
ref – Snapshot ref (e.g. "e42") of the input or textarea.
value – Text to write into the field.
replace – When True (the default), overwrite the field; when False, append (e.g. to continue a long answer in a textarea).
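The effect of replace on the field can be modelled in a few lines. This is a sketch of the documented semantics only; apply_input_value is a hypothetical name, and the real work happens in the client-side handler.

```python
def apply_input_value(current: str, value: str, *, replace: bool = True) -> str:
    """Return the field contents after a set_input_value command (modelled, not the client code)."""
    return value if replace else current + value


field = apply_input_value("old draft", "Dear team,")          # overwrite
field = apply_input_value(field, " thanks!", replace=False)   # append
print(field)  # Dear team, thanks!
```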
- async on_bus_message(message: BusMessage) -> None
Dispatch UI events alongside base lifecycle handling.
- property current_job: BusJobRequestMessage | None
The job this worker is currently processing, or None when idle.
Set when a respond turn starts and cleared when the job completes. Lets @tool methods inspect the in-flight job without threading the message through every call.
- Returns:
The in-flight BusJobRequestMessage, or None when idle.
- render_query(message: BusJobRequestMessage) -> str
Extract the user’s query text from a job request.
Override to read a different payload shape. The returned string is appended to the LLM context as a user message before the LLM runs. The default reads payload["query"].
- Parameters:
message – The inbound job request.
- Returns:
The query text to feed into the LLM.
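An override for a non-default payload shape might look like the sketch below. To stay runnable without pipecat, it uses a stand-in FakeJobRequest class instead of BusJobRequestMessage and plain functions instead of a UIWorker subclass. The payload attribute and the {"question", "locale"} shape are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class FakeJobRequest:
    """Stand-in for BusJobRequestMessage; only a payload attribute is modelled."""
    payload: dict[str, Any] = field(default_factory=dict)


def default_render_query(message: FakeJobRequest) -> str:
    """Documented default behaviour: read payload["query"]."""
    return message.payload["query"]


def custom_render_query(message: FakeJobRequest) -> str:
    """Hypothetical override for a payload shaped like {"question": ..., "locale": ...}."""
    question = message.payload.get("question", "")
    locale = message.payload.get("locale")
    return f"[{locale}] {question}" if locale else question


print(default_render_query(FakeJobRequest({"query": "What is on sale?"})))
print(custom_render_query(FakeJobRequest({"question": "What is on sale?", "locale": "en-GB"})))
```

In a real subclass the body of custom_render_query would become the render_query method.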
- async respond_to_job(answer: str | None = None, *, tts_speak: bool = False, status: JobStatus = JobStatus.COMPLETED) -> None
Complete the in-flight job with the worker’s answer.
Called from a @tool once the worker has decided how to answer. tts_speak picks the delivery; the two modes are mutually exclusive (one voice per turn):
  - default: the job responds with {"answer": answer} for the requester’s voice LLM to phrase.
  - tts_speak=True: answer is spoken verbatim by the requester’s TTS (via BusTTSSpeakMessage, and added to its context) while the job responds None so the voice LLM doesn’t also speak.
A falsy answer completes the turn silently. No-op when no job is in flight or it was already answered.
- Parameters:
answer – The worker’s answer – spoken verbatim (tts_speak=True) or handed to the requester’s voice LLM to phrase (default).
tts_speak – Speak answer verbatim via the requester’s TTS instead of returning it for the requester’s voice LLM to phrase.
status – Completion status. Defaults to JobStatus.COMPLETED.
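The two delivery modes can be summarized as a decision table. The sketch below models it with a hypothetical plan_delivery function; it is not the library's code. In particular, returning None as the job response for a falsy answer is an assumption, since the text above only says the turn completes silently.

```python
from typing import Optional


def plan_delivery(answer: Optional[str], *, tts_speak: bool = False) -> dict:
    """Return what would be spoken verbatim and what the job would respond with."""
    if not answer:
        # Falsy answer: the turn completes silently (job response assumed to be None).
        return {"speak": None, "job_response": None}
    if tts_speak:
        # Spoken verbatim by the requester's TTS; the job responds None.
        return {"speak": answer, "job_response": None}
    # Default: hand the answer to the requester's voice LLM to phrase.
    return {"speak": None, "job_response": {"answer": answer}}


print(plan_delivery("Your cart has 3 items."))
print(plan_delivery("Your cart has 3 items.", tts_speak=True))
print(plan_delivery(None))
```

Exactly one of the two fields is set for a non-empty answer, which is the "one voice per turn" rule.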
- ui_job_group(*worker_names: str, name: str | None = None, payload: dict | None = None, timeout: float | None = None, cancel_on_error: bool = True, label: str | None = None, cancellable: bool = True) -> UIJobGroupContext
Dispatch a job group whose lifecycle is forwarded to the client.
Like job_group(...), but also forwards the group’s lifecycle to the client as ui-job-group envelopes so the user can watch (and optionally cancel) the work. See UIJobGroupContext for the forwarding details.
- Parameters:
*worker_names – Names of the workers to send the job to.
name – Optional job name for routing to named @job handlers on the workers.
payload – Optional structured data describing the work.
timeout – Optional timeout in seconds covering both the ready-wait and job execution.
cancel_on_error – Whether to cancel the group if a worker errors. Defaults to True.
label – Optional human-readable label surfaced to the client. The client UI uses it to title the in-flight job-group card.
cancellable – Whether the client may request cancellation of this group via the reserved __cancel_job_group event. Defaults to True.
- Returns:
A UIJobGroupContext to use with async with.

Example:

    async with self.ui_job_group(
        "researcher_a",
        "researcher_b",
        payload={"query": query},
        label=f"Research: {query}",
    ) as tg:
        async for event in tg:
            ...
- async start_ui_job_group(*worker_names: str, name: str | None = None, payload: dict | None = None, timeout: float | None = None, cancel_on_error: bool = True, label: str | None = None, cancellable: bool = True) -> str
Fire-and-forget version of ui_job_group.
Dispatches the group in the background and returns immediately (the lifecycle still forwards to the client). Use it when a @tool wants to kick off work and unblock the voice worker; use ui_job_group to consume worker events inline. Worker exceptions are logged, not propagated.
- Parameters:
*worker_names – Names of the workers to send the job to.
name – Optional job name for routing to named @job handlers on the workers.
payload – Optional structured data describing the work.
timeout – Optional timeout in seconds covering both the ready-wait and job execution.
cancel_on_error – Whether to cancel the group if a worker errors. Defaults to True.
label – Optional human-readable label surfaced to the client. The client UI uses it to title the in-flight job-group card.
cancellable – Whether the client may request cancellation of this group via the reserved __cancel_job_group event. Defaults to True.
- Returns:
The job_id of the dispatched group. Useful if the caller wants to track it (e.g. to cancel programmatically via cancel_job_group(job_id)).

Example:

    @tool
    async def reply(self, params, answer, research_query=None):
        if research_query:
            await self.start_ui_job_group(
                "wikipedia",
                "news",
                "scholar",
                payload={"query": research_query},
                label=f"Research: {research_query}",
            )
        await self.respond_to_job(answer)
        await params.result_callback(None)
- render_ui_state() -> str
Render the latest accessibility snapshot as a <ui_state> block.
Produces Playwright-MCP-style indented text with stable element refs. Apps inject the output via inject_ui_state() when they want the LLM to see what’s on screen.
When the snapshot carries a current text selection, a nested <selection ref="...">...</selection> block is appended inside <ui_state> so the LLM can resolve deictic references (“this paragraph”, “what I selected”) against on-page content.
Override to customize the rendered form.
- Returns:
The <ui_state> block, or an empty string if no snapshot has been received yet.
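For tests or debugging it can help to read a rendered line back into its parts. The parser below is a rough sketch based only on the line shape described in the default prompt guide (- role "name" [state] [ref=eN], plus optional = "value" and [level=N]). The function name parse_node, the sample lines, and the position of = "value" after the tags are assumptions, not guarantees about the renderer's output.

```python
import re

LINE_RE = re.compile(r'^(?P<indent>\s*)- (?P<role>\S+)(?: "(?P<name>[^"]*)")?(?P<rest>.*)$')
TAG_RE = re.compile(r"\[([^\]=]+)(?:=([^\]]*))?\]")
VALUE_RE = re.compile(r'= "([^"]*)"')


def parse_node(line: str) -> dict:
    """Split one <ui_state> node line into role, name, tags, and value."""
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError(f"not a ui_state node line: {line!r}")
    rest = match.group("rest")
    # Bare tags such as [focused] become True; [ref=e12] keeps its string value.
    tags = {key: (val or True) for key, val in TAG_RE.findall(rest)}
    value = VALUE_RE.search(rest)
    return {
        "indent": len(match.group("indent")),
        "role": match.group("role"),
        "name": match.group("name"),
        "tags": tags,
        "value": value.group(1) if value else None,
    }


node = parse_node('  - textbox "Email" [focused] [ref=e12] = "ada@example.com"')
print(node["role"], node["tags"]["ref"], node["value"])
```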
- async inject_ui_state() -> None
Append the latest <ui_state> block to the LLM context.
No-op when no snapshot has been received. The frame has run_llm=False: the snapshot is context, not a user turn.
- render_ui_event(message: BusUIEventMessage) -> str
Render a UI event as a string for LLM context injection.
Override to customize the injected content. The default wraps the event in a single <ui_event> XML tag with a name attribute and a JSON-encoded payload as inner text.
- Parameters:
message – The UI event to render.
- Returns:
A string to append to the LLM context as a developer message.
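The default rendering described above can be approximated in a few lines. This standalone sketch takes the event name and payload directly rather than a BusUIEventMessage, and it escapes the name attribute as a precaution. Both choices are assumptions, so the library's exact output may differ.

```python
import html
import json
from typing import Any


def render_ui_event(name: str, payload: Any) -> str:
    """Wrap a JSON-encoded payload in a single <ui_event> tag with a name attribute."""
    safe_name = html.escape(name, quote=True)  # assumption: attribute escaping
    return f'<ui_event name="{safe_name}">{json.dumps(payload)}</ui_event>'


print(render_ui_event("nav_click", {"view": "settings"}))
# <ui_event name="nav_click">{"view": "settings"}</ui_event>
```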