Website • Paper • Doc • Data • Data Viewer • Discord • Cache
If you are running on a non-virtualized system (e.g., your desktop, laptop, or a bare metal machine) rather than a virtualized environment such as AWS, Azure, or k8s, follow the instructions below. If you are on a virtualized platform, refer to the Docker section instead.
Clone this repository and cd into it. Then install the dependencies listed in requirements.txt. We recommend using the latest version of Conda to manage the environment, but you can also install the dependencies manually. Make sure your Python version is >= 3.10.
# Clone the OSWorld repository
git clone https://github.com/xlang-ai/OSWorld
# Change directory into the cloned repository
cd OSWorld
# Optional: Create a Conda environment for OSWorld
# conda create -n osworld python=3.10
# conda activate osworld
# Install required dependencies
pip install -r requirements.txt
Alternatively, you can install the environment without any benchmark tasks:
pip install desktop-env
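The Python >= 3.10 requirement mentioned above can be checked before installing anything. The following is a minimal sketch; the helper name `python_is_supported` is hypothetical and not part of OSWorld.

```python
import sys

# Minimum version taken from the installation instructions above.
MIN_PYTHON = (3, 10)


def python_is_supported(version_info=None):
    """Return True if the given (or current) interpreter is >= MIN_PYTHON."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= MIN_PYTHON


if not python_is_supported():
    current = f"{sys.version_info.major}.{sys.version_info.minor}"
    print(f"Python {current} is too old; OSWorld asks for >= 3.10")
```

Running this inside the Conda environment you created confirms that the environment, not just the system interpreter, meets the requirement.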
Install VMware Workstation Pro and configure the vmrun command. For installation steps, see How to install VMware Workstation Pro. Verify the installation by running:
vmrun -T ws list
If the installation succeeded and the environment variable is set correctly, the command prints the virtual machines that are currently running.
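The same verification can be scripted. This sketch only checks whether `vmrun` is on the PATH and, if so, runs the `vmrun -T ws list` command shown above; the helper names `command_available` and `list_running_vms` are hypothetical.

```python
import shutil
import subprocess


def command_available(name):
    """Return True if an executable called `name` is found on the PATH."""
    return shutil.which(name) is not None


def list_running_vms():
    """Run `vmrun -T ws list` and return its output, or None if vmrun is missing."""
    if not command_available("vmrun"):
        return None
    result = subprocess.run(
        ["vmrun", "-T", "ws", "list"],
        capture_output=True,
        text=True,
        check=False,  # let the caller inspect the output instead of raising
    )
    return result.stdout
```

A return value of `None` suggests that the environment variable for `vmrun` has not been set up yet.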
Note: We also support using VirtualBox if you have issues with VMware Pro. However, features such as parallelism and macOS on Apple chips might not be well-supported.
All set! Our setup script will automatically download the necessary virtual machines and configure the environment for you.
If you are running on a non-bare metal server, or prefer not to use VMware and VirtualBox platforms, we recommend using our Docker support.
We recommend running the VM with KVM support. To check if your hosting platform supports KVM, run
egrep -c '(vmx|svm)' /proc/cpuinfo
on Linux. If the return value is greater than zero, the processor should be able to support KVM.
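For hosts where you would rather script this check, the sketch below mirrors what `egrep -c` does: it counts the lines of `/proc/cpuinfo` that mention `vmx` or `svm`. The function names are hypothetical, and a positive count only indicates that the processor advertises virtualization extensions, not that KVM is enabled.

```python
import re


def count_virtualization_flags(cpuinfo_text):
    """Count lines that mention vmx or svm, like `egrep -c '(vmx|svm)'`."""
    pattern = re.compile(r"vmx|svm")
    return sum(1 for line in cpuinfo_text.splitlines() if pattern.search(line))


def host_may_support_kvm(path="/proc/cpuinfo"):
    """Return True if the CPU info file lists virtualization flags."""
    try:
        with open(path, encoding="utf-8") as handle:
            return count_virtualization_flags(handle.read()) > 0
    except OSError:
        # The file does not exist on non-Linux hosts such as macOS.
        return False
```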
Note: macOS hosts generally do not support KVM. You are advised to use VMware if you would like to run OSWorld on macOS.
If your hosting platform supports a graphical user interface (GUI), you may refer to Install Docker Desktop on Linux or Install Docker Desktop on Windows based on your OS. Otherwise, you may Install Docker Engine.
Add the following arguments when initializing DesktopEnv:
- provider_name: docker
- os_type: Ubuntu or Windows, depending on the OS of the VM
Note: If the experiment is interrupted abnormally (e.g., by an interrupt signal), residual Docker containers may remain and degrade system performance over time. Please run
docker stop $(docker ps -q) && docker rm $(docker ps -a -q)
to clean up.
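The DesktopEnv arguments listed above could be passed as in the following sketch. The `provider_name` and `os_type` values come from this section; the helper names `docker_env_kwargs` and `make_docker_env` are hypothetical, and the import is deferred so that the argument logic can be tried without the package installed.

```python
def docker_env_kwargs(os_type="Ubuntu"):
    """Build the keyword arguments for a Docker-backed DesktopEnv."""
    if os_type not in ("Ubuntu", "Windows"):
        raise ValueError("os_type must be 'Ubuntu' or 'Windows'")
    return {"provider_name": "docker", "os_type": os_type}


def make_docker_env(os_type="Ubuntu"):
    """Create the environment; requires the OSWorld dependencies and Docker."""
    from desktop_env.desktop_env import DesktopEnv

    return DesktopEnv(**docker_env_kwargs(os_type))
```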
Running evaluation in parallel on cloud services can significantly speed it up (parallelization can bring the evaluation time down to within 1 hour), and the same setup can serve as infrastructure for training. We provide comprehensive AWS support with a Host-Client architecture that enables large-scale parallel evaluation of OSWorld tasks. For detailed setup instructions, see Public Evaluation Guideline and AWS Configuration Guide.
We are working on supporting more 👷. Please hold tight!
Run the following minimal example to interact with the environment:
from desktop_env.desktop_env import DesktopEnv
example = {
    "id": "94d95f96-9699-4208-98ba-3c3119edf9c2",
    "instruction": "I want to install Spotify on my current system. Could you please help me?",
    "config": [
        {
            "type": "execute",
            "parameters": {
                "command": [
                    "python",
                    "-c",
                    "import pyautogui; import time; pyautogui.click(960, 540); time.sleep(0.5);"
                ]
            }
        }
    ],
    "evaluator": {
        "func": "check_include_exclude",
        "result": {
            "type": "vm_command_line",
            "command": "which spotify"
        },
        "expected": {
            "type": "rule",
            "rules": {
                "include": ["spotify"],
                "exclude": ["not found"]
            }
        }
    }
}
env = DesktopEnv(action_space="pyautogui")
obs = env.reset(task_config=example)
obs, reward, done, info = env.step("pyautogui.rightClick()")
You will see all the logs of the system running normally, including the successful creation of the environment, completion of setup, and successful execution of actions. In the end, you will observe a successful right-click on the screen, which means you are ready to go.
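The evaluator entry in the example task pairs a command result with include and exclude rules. The sketch below shows one plausible reading of those rules, judging only from their names: the output of `which spotify` must contain every include string and none of the exclude strings. This is an assumption for illustration, not the actual `check_include_exclude` implementation, and `check_include_exclude_sketch` is a hypothetical name.

```python
def check_include_exclude_sketch(result, rules):
    """Score 1.0 if `result` has all include strings and no exclude strings."""
    include = rules.get("include", [])
    exclude = rules.get("exclude", [])
    has_all_included = all(text in result for text in include)
    has_no_excluded = all(text not in result for text in exclude)
    return 1.0 if has_all_included and has_no_excluded else 0.0


rules = {"include": ["spotify"], "exclude": ["not found"]}
print(check_include_exclude_sketch("/snap/bin/spotify", rules))
print(check_include_exclude_sketch("spotify not found", rules))
```

Under this reading, the first call passes because the path contains "spotify", and the second fails because the output also contains "not found".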
⚠️ Important Configuration Requirements:
- Google Account Tasks: Some tasks require Google account access and OAuth2.0 configuration. Please refer to Google Account Guideline for detailed setup instructions.
- Proxy Configuration: Some tasks may require proxy settings to function properly (this depends on the strength of website defenses against your network location). Please refer to your system's proxy configuration documentation.
- Impact of Missing Configuration: If these configurations are not properly set up, the corresponding tasks will fail to execute correctly, leading to lower evaluation scores.
If you wish to run the baseline agent used in our paper, you can execute the following command as an example under the GPT-4o pure-screenshot setting:
Set the OPENAI_API_KEY environment variable to your API key
export OPENAI_API_KEY='changeme'
Optionally, set OPENAI_BASE_URL to use a custom OpenAI-compatible API endpoint
export OPENAI_BASE_URL='http://your-custom-endpoint.com/v1' # Optional: defaults to https://api.openai.com
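If you launch runs from your own Python script, it can help to fail early when the key is missing. This is a minimal sketch under the assumption that only the two variables above matter; `resolve_openai_settings` is a hypothetical helper, and the fallback URL is the default stated in the comment above.

```python
import os

# Default endpoint as stated in the export example above.
DEFAULT_BASE_URL = "https://api.openai.com"


def resolve_openai_settings(environ=None):
    """Return (api_key, base_url), raising if the API key is not set."""
    if environ is None:
        environ = os.environ
    api_key = environ.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    base_url = environ.get("OPENAI_BASE_URL", DEFAULT_BASE_URL)
    return api_key, base_url
```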
Single-threaded execution (deprecated; using the vmware provider as an example)
python run.py \
--provider_name vmware \
--path_to_vm Ubuntu/Ubuntu.vmx \
--headless \
--observation_type screenshot \
--model gpt-4o \
--sleep_after_execution 3 \
--max_steps 15 \
--result_dir ./results \
--client_password password
Parallel execution (this example switches the provider to docker)
python run_multienv.py \
--provider_name docker \
--headless \
--observation_type screenshot \
--model gpt-4o \
--sleep_after_execution 3 \
--max_steps 15 \
--num_envs 10 \
--client_password password
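When sweeping over settings such as the number of environments, the parallel command above can be assembled programmatically. The sketch below uses only the flags shown in that command; `build_multienv_command` is a hypothetical helper and the defaults simply repeat the example values.

```python
import shlex


def build_multienv_command(
    provider_name="docker",
    model="gpt-4o",
    num_envs=10,
    max_steps=15,
    sleep_after_execution=3,
    observation_type="screenshot",
    client_password="password",
    headless=True,
):
    """Return the run_multienv.py invocation as an argument list."""
    cmd = ["python", "run_multienv.py", "--provider_name", provider_name]
    if headless:
        cmd.append("--headless")
    cmd += [
        "--observation_type", observation_type,
        "--model", model,
        "--sleep_after_execution", str(sleep_after_execution),
        "--max_steps", str(max_steps),
        "--num_envs", str(num_envs),
        "--client_password", client_password,
    ]
    return cmd


# Print a shell-ready version of the command for four parallel environments.
print(shlex.join(build_multienv_command(num_envs=4)))
```

The returned list can be handed to `subprocess.run` from the repository root, which avoids shell quoting issues.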
The results, which include screenshots, actions, and video recordings of the agent's task completion, will be saved in the ./results directory (or whichever result_dir you specified).
You can then run the following command to obtain the result:
python show_result.py
Please start by reading through the agent interface and the environment interface.
Correctly implement the agent interface and import your customized version in the run.py or run_multienv.py file.
Afterward, you can execute a command similar to the one in the previous section to run the benchmark on your agent.
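To make the shape of a custom agent more concrete, here is a rough sketch. It assumes an agent exposes `reset()` and `predict(instruction, obs)` returning a response plus a list of pyautogui action strings; treat that as an assumption and check the agent interface documentation for the real contract. `ClickCenterAgent` and `run_episode` are hypothetical names, and the only detail taken from this README is that `env.step` returns `obs, reward, done, info`.

```python
class ClickCenterAgent:
    """Hypothetical agent that always proposes the same pyautogui action."""

    def reset(self):
        self.steps_taken = 0

    def predict(self, instruction, obs):
        # A real agent would inspect `obs` (e.g. a screenshot) here.
        self.steps_taken += 1
        response = f"step {self.steps_taken} for task: {instruction}"
        actions = ["pyautogui.click(960, 540)"]
        return response, actions


def run_episode(agent, env, obs, instruction, max_steps=15):
    """Feed observations to the agent until done; return the executed step count."""
    agent.reset()
    executed = 0
    for _ in range(max_steps):
        _, actions = agent.predict(instruction, obs)
        for action in actions:
            obs, reward, done, info = env.step(action)
            executed += 1
            if done:
                return executed
    return executed
```

Here `obs` would be the value returned by `env.reset(task_config=...)`, as in the minimal example earlier in this README.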
If you want your results to be verified and displayed on the verified leaderboard, schedule a meeting with us (current maintainers: tianbaoxiexxx@gmail.com, yuanmengqi732@gmail.com) so that we can run your agent code on our side and report the results. You will need to upload your agent implementation under the OSWorld framework and allow us to disclose it (you may choose not to expose your model API to the public), along with a report that lets the public understand what is happening behind the scenes. Alternatively, if you are from a trusted institution, you can share your monitoring data and trajectories with us. Please follow the Public Evaluation Guideline carefully to get results.
The username and password for the virtual machines are as follows. For the vmware, virtualbox, and docker providers, the Ubuntu account credentials are user / password.
For cloud service providers like aws, we default to the password osworld-public-evaluation to prevent attacks due to weak passwords.
If you make further modifications, remember to set the client_password variable and pass it to DesktopEnv and Agent (if supported) when running experiments.
Some features like setting up proxy require the environment to have the client VM password to obtain sudo privileges, and for some OSWorld tasks, the agent needs the password to obtain sudo privileges to complete them.
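The defaults described above can be captured in a small lookup so that scripts pass a consistent `client_password`. This is a sketch that covers only the providers and passwords named in this section; `default_client_password` is a hypothetical helper, not an OSWorld function.

```python
# Default VM passwords per provider, as described in this section.
DEFAULT_PASSWORDS = {
    "vmware": "password",
    "virtualbox": "password",
    "docker": "password",
    "aws": "osworld-public-evaluation",
}


def default_client_password(provider_name):
    """Return the documented default password for a provider."""
    try:
        return DEFAULT_PASSWORDS[provider_name]
    except KeyError:
        raise ValueError(f"no documented default for provider {provider_name!r}")
```

If you have changed the VM password, pass your own value instead of relying on this lookup.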
See Account Guideline.
If you want to set up a proxy yourself, please refer to Proxy Guideline. We also provide a pre-configured solution based on dataimpulse; see the proxy-setup section in PUBLIC_EVALUATION_GUIDELINE.
Thanks to all the contributors!
If you find this environment useful, please consider citing our work:
@misc{OSWorld,
  title={OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments},
  author={Tianbao Xie and Danyang Zhang and Jixuan Chen and Xiaochuan Li and Siheng Zhao and Ruisheng Cao and Toh Jing Hua and Zhoujun Cheng and Dongchan Shin and Fangyu Lei and Yitao Liu and Yiheng Xu and Shuyan Zhou and Silvio Savarese and Caiming Xiong and Victor Zhong and Tao Yu},
  year={2024},
  eprint={2404.07972},
  archivePrefix={arXiv},
  primaryClass={cs.AI}
}