Overview
Parse the current screen using OmniParser v2 to detect all visible UI elements. Returns structured data including element types, text content, interactivity levels, and bounding box coordinates. This method analyzes the entire screen and returns every detected element. It's useful for:
- Understanding the full UI layout of a screen
- Finding all clickable or interactive elements
- Building custom element-based logic
- Debugging what elements TestDriver can detect
- Accessibility auditing
Availability:
parse() requires an enterprise or self-hosted plan. It uses OmniParser v2 server-side for element detection.
Syntax
Parameters
None.
Returns
Promise<ParseResult> - Object containing detected UI elements
ParseResult
| Property | Type | Description |
|---|---|---|
| elements | ParsedElement[] | Array of detected UI elements |
| annotatedImageUrl | string | URL of the annotated screenshot with bounding boxes |
| imageWidth | number | Width of the analyzed screenshot |
| imageHeight | number | Height of the analyzed screenshot |
ParsedElement
| Property | Type | Description |
|---|---|---|
| index | number | Element index |
| type | string | Element type (e.g. "text", "icon", "button") |
| content | string | Text content or description of the element |
| interactivity | string | Interactivity level (e.g. "clickable", "non-interactive") |
| bbox | object | Bounding box in pixel coordinates {x0, y0, x1, y1} |
| boundingBox | object | Bounding box as {left, top, width, height} |
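The two tables above can be written down as TypeScript types. This is a sketch transcribed from the tables, not the SDK's published type definitions; the `BBox` and `BoundingBox` names and the `toBoundingBox` helper are hypothetical, and the helper assumes both box formats describe the same rectangle.

```typescript
// Corner form, as listed for `bbox`.
interface BBox { x0: number; y0: number; x1: number; y1: number }
// Origin-plus-size form, as listed for `boundingBox`.
interface BoundingBox { left: number; top: number; width: number; height: number }

interface ParsedElement {
  index: number;
  type: string;          // e.g. "text", "icon", "button"
  content: string;
  interactivity: string; // e.g. "clickable", "non-interactive"
  bbox: BBox;
  boundingBox: BoundingBox;
}

interface ParseResult {
  elements: ParsedElement[];
  annotatedImageUrl: string;
  imageWidth: number;
  imageHeight: number;
}

// Assumption: the two formats encode the same rectangle, so one converts to the other.
function toBoundingBox(b: BBox): BoundingBox {
  return { left: b.x0, top: b.y0, width: b.x1 - b.x0, height: b.y1 - b.y0 };
}

// Hand-built value used only to show the shape; real results come from parse().
const shapeExample: ParseResult = {
  elements: [
    {
      index: 0,
      type: "button",
      content: "Submit",
      interactivity: "clickable",
      bbox: { x0: 10, y0: 20, x1: 110, y1: 60 },
      boundingBox: toBoundingBox({ x0: 10, y0: 20, x1: 110, y1: 60 }),
    },
  ],
  annotatedImageUrl: "https://example.invalid/annotated.png", // placeholder URL
  imageWidth: 1280,
  imageHeight: 720,
};
```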
Examples
Get All Elements on Screen
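A minimal sketch of listing every detected element, assuming a result shaped like the ParseResult table. The `describeElements` helper is hypothetical, and `sampleResult` is hand-built data standing in for the value that parse() resolves to.

```typescript
// Minimal local shapes, copied from the ParseResult and ParsedElement tables.
interface ParsedElement {
  index: number;
  type: string;
  content: string;
  interactivity: string;
}
interface ParseResult {
  elements: ParsedElement[];
  imageWidth: number;
  imageHeight: number;
}

// Turn a parse result into one human-readable line per element.
function describeElements(result: ParseResult): string[] {
  return result.elements.map(
    (el) => `[${el.index}] ${el.type} "${el.content}" (${el.interactivity})`
  );
}

// Stand-in for a real parse() result.
const sampleResult: ParseResult = {
  elements: [
    { index: 0, type: "button", content: "Sign in", interactivity: "clickable" },
    { index: 1, type: "text", content: "Welcome back", interactivity: "non-interactive" },
  ],
  imageWidth: 1280,
  imageHeight: 720,
};

for (const line of describeElements(sampleResult)) {
  console.log(line);
}
```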
Find Clickable Elements
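One way to keep only the clickable elements, assuming the interactivity field carries the string "clickable" as in the ParsedElement table. The `clickableElements` helper and the sample data are illustrative, not part of the SDK.

```typescript
// Only the fields this example needs, per the ParsedElement table.
interface ParsedElement {
  index: number;
  type: string;
  content: string;
  interactivity: string;
}

// Keep elements whose interactivity level is exactly "clickable".
function clickableElements(elements: ParsedElement[]): ParsedElement[] {
  return elements.filter((el) => el.interactivity === "clickable");
}

// Hand-built sample standing in for result.elements.
const sampleElements: ParsedElement[] = [
  { index: 0, type: "button", content: "Save", interactivity: "clickable" },
  { index: 1, type: "text", content: "Draft saved", interactivity: "non-interactive" },
  { index: 2, type: "icon", content: "Settings", interactivity: "clickable" },
];

const clickable = clickableElements(sampleElements);
console.log(clickable.map((el) => el.content).join(", "));
```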
Find and Click an Element by Content
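A sketch of locating an element by its text and clicking its center. This page does not show how to issue a click at a coordinate, so `click` here is a hypothetical callback you would wire to your own click mechanism; `findByContent`, `centerOf`, and `clickByContent` are likewise illustrative names.

```typescript
interface BoundingBox { left: number; top: number; width: number; height: number }
interface ParsedElement {
  index: number;
  content: string;
  interactivity: string;
  boundingBox: BoundingBox;
}
interface Point { x: number; y: number }

// Case-insensitive substring match on the element's content.
function findByContent(elements: ParsedElement[], query: string): ParsedElement | undefined {
  const needle = query.trim().toLowerCase();
  return elements.find((el) => el.content.toLowerCase().includes(needle));
}

// Center of a {left, top, width, height} box, rounded to whole pixels.
function centerOf(box: BoundingBox): Point {
  return {
    x: Math.round(box.left + box.width / 2),
    y: Math.round(box.top + box.height / 2),
  };
}

// Returns false when nothing matches, so the caller can decide how to fail.
function clickByContent(
  elements: ParsedElement[],
  query: string,
  click: (p: Point) => void // hypothetical: supplied by the caller
): boolean {
  const match = findByContent(elements, query);
  if (!match) return false;
  click(centerOf(match.boundingBox));
  return true;
}
```

For a single known target, find() is usually the simpler route; this pattern is more useful when the match rule is custom.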
Filter by Element Type
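Filtering by the type field might look like the following sketch. The type strings come from the examples in the ParsedElement table; `filterByType` and `countByType` are hypothetical helpers.

```typescript
interface ParsedElement {
  index: number;
  type: string;
  content: string;
}

// Keep only elements of one type, e.g. "button".
function filterByType(elements: ParsedElement[], type: string): ParsedElement[] {
  return elements.filter((el) => el.type === type);
}

// Count how many elements of each type were detected.
function countByType(elements: ParsedElement[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const el of elements) {
    counts[el.type] = (counts[el.type] ?? 0) + 1;
  }
  return counts;
}

// Hand-built sample standing in for result.elements.
const typedSample: ParsedElement[] = [
  { index: 0, type: "button", content: "OK" },
  { index: 1, type: "text", content: "Are you sure?" },
  { index: 2, type: "button", content: "Cancel" },
];
```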
Build Custom Assertions
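A custom assertion can be a small function that throws when no element satisfies a predicate. The sketch below uses a plain Error so it works under any test runner; `expectElement` is a hypothetical helper, not an SDK function.

```typescript
interface ParsedElement {
  index: number;
  type: string;
  content: string;
  interactivity: string;
}
interface ParseResult {
  elements: ParsedElement[];
}

// Return the first matching element, or throw with a summary of what was detected.
function expectElement(
  result: ParseResult,
  predicate: (el: ParsedElement) => boolean,
  description: string
): ParsedElement {
  const match = result.elements.find(predicate);
  if (!match) {
    const seen = result.elements.map((el) => `${el.type}:"${el.content}"`).join(", ");
    throw new Error(`Expected ${description}, but detected only: ${seen || "nothing"}`);
  }
  return match;
}

// Hand-built sample standing in for a parse() result.
const assertSample: ParseResult = {
  elements: [
    { index: 0, type: "button", content: "Save", interactivity: "clickable" },
    { index: 1, type: "text", content: "Unsaved changes", interactivity: "non-interactive" },
  ],
};

// Passes: a clickable Save button is present.
expectElement(
  assertSample,
  (el) => el.content === "Save" && el.interactivity === "clickable",
  "a clickable Save button"
);
```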
Use Bounding Box Coordinates
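Since bbox is given in pixel coordinates, its center and its position relative to the screenshot can be computed directly. This sketch assumes x0/y0 is the top-left corner and x1/y1 the bottom-right; `centerOfBBox` and `normalizeBBox` are hypothetical helpers.

```typescript
interface BBox { x0: number; y0: number; x1: number; y1: number }

// Midpoint of the box in pixel coordinates.
function centerOfBBox(b: BBox): { x: number; y: number } {
  return { x: (b.x0 + b.x1) / 2, y: (b.y0 + b.y1) / 2 };
}

// Express the box as fractions of the screenshot size (0 to 1 on each axis).
function normalizeBBox(b: BBox, imageWidth: number, imageHeight: number): BBox {
  return {
    x0: b.x0 / imageWidth,
    y0: b.y0 / imageHeight,
    x1: b.x1 / imageWidth,
    y1: b.y1 / imageHeight,
  };
}

// Hand-built values standing in for element.bbox, result.imageWidth, and result.imageHeight.
const sampleBBox: BBox = { x0: 256, y0: 128, x1: 512, y1: 256 };
const bboxCenter = centerOfBBox(sampleBBox);
const bboxFraction = normalizeBBox(sampleBBox, 1024, 512);
```

Normalized coordinates are handy when comparing layouts captured at different resolutions.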
View Annotated Screenshot
How It Works
- TestDriver captures a screenshot of the current screen
- The image is sent to the TestDriver API
- OmniParser v2 analyzes the image to detect all UI elements
- Each element is classified by type (text, icon, button, etc.) and interactivity
- Bounding box coordinates are returned in pixel coordinates matching the screen resolution
OmniParser detects elements visually, so it works with any UI framework, native apps, and even non-standard interfaces. It does not rely on the DOM or accessibility trees.
Best Practices
Use find() for targeting specific elements
For locating and interacting with a specific element, prefer find(), which uses AI vision. Use parse() when you need a complete inventory of all elements on screen.
Filter by interactivity
Use the interactivity field to distinguish between clickable and non-interactive elements.
Wait for content to load
If elements aren’t being detected, the page may not be fully loaded. Add a wait first.
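Instead of a fixed delay, one option is to re-parse until the expected content shows up or a timeout passes. The sketch below is generic: `parse` is whatever function returns your parse result (passed in by the caller), and `parseUntil`, `timeoutMs`, and `intervalMs` are hypothetical names with arbitrary defaults.

```typescript
interface ParsedElement { content: string }
interface ParseResult { elements: ParsedElement[] }

// Call `parse` repeatedly until `isReady` accepts the result or the timeout elapses.
async function parseUntil(
  parse: () => Promise<ParseResult>,
  isReady: (result: ParseResult) => boolean,
  timeoutMs = 10000,
  intervalMs = 500
): Promise<ParseResult> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const result = await parse();
    if (isReady(result)) return result;
    if (Date.now() >= deadline) {
      throw new Error(`Screen not ready after ${timeoutMs} ms`);
    }
    // Pause before the next attempt.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Each attempt is a full screen parse, so a longer interval keeps the number of API calls down.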
Use the annotated image for debugging
The annotatedImageUrl provides a visual overlay showing all detected elements with their bounding boxes, which makes it useful for debugging.
Related
- find() - AI-powered element location
- assert() - Make AI-powered assertions about screen state
- screenshot() - Capture screenshots
- Elements Reference - Complete Element API

