<p align="center">
<strong>A framework to enable multimodal models to operate a computer.</strong>
</p>
<p align="center">
Using the same inputs and outputs as a human operator, the model views the screen and decides on a series of mouse and keyboard actions to reach an objective.
</p>
<div align="center">
<!--
**This model is currently experiencing an outage so the self-operating computer may not work as expected.**
-->
## Key Features
- **Compatibility**: Designed for various multimodal models.
- **Integration**: Currently integrated with **GPT-4v** as the default model, with extended support for Gemini Pro Vision.
- **Future Plans**: Support for additional models.
## Ongoing Development
At [HyperwriteAI](https://www.hyperwriteai.com/), we are developing Agent-1-Vision, a multimodal model with more accurate click location predictions.
## Agent-1-Vision Model API Access
We will soon be offering API access to our Agent-1-Vision model.
If you're interested in gaining access to this API, sign up [here](https://othersideai.typeform.com/to/FszaJ1k8?typeform-source=www.hyperwriteai.com).
An additional model is now compatible with the Self-Operating Computer Framework. Try Google's `gemini-pro-vision` by following the instructions below.
Start `operate` with the Gemini model:
```
operate -m gemini-pro-vision
```
**Enter your Google AI Studio API key when the terminal prompts you for it.** If you don't have one, you can obtain a key [here](https://makersuite.google.com/app/apikey) after setting up your Google AI Studio account. You may also need to [authorize credentials for a desktop application](https://ai.google.dev/palm_docs/oauth_quickstart). It took me a bit of time to get it working; if anyone knows a simpler way, please make a PR.
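
If you'd rather not paste the key on each run, one option is to store it in the project's `.env` file. A minimal sketch, assuming the framework reads a `GOOGLE_API_KEY` entry from `.env` (both the file name and the variable name are assumptions here):

```
# Assumed: the framework loads GOOGLE_API_KEY from a .env file in the project directory
echo "GOOGLE_API_KEY=your-key-here" >> .env
```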
### Optical Character Recognition Mode `-m gpt-4-with-ocr`
The Self-Operating Computer Framework now integrates Optical Character Recognition (OCR) capabilities with the `gpt-4-with-ocr` mode. This mode gives GPT-4 a hash map of clickable elements, keyed by their on-screen text with screen coordinates as values. GPT-4 can decide to `click` an element by its text, and the code then looks up that element's coordinates in the hash map. For example, if OCR detects a "Submit" button, GPT-4 can ask to click "Submit" and the framework resolves that text to the button's coordinates.
Based on recent tests, OCR performs better than `som` and vanilla GPT-4, so we made it the default for the project. To use the OCR mode, you can simply write:
Either `operate` alone or `operate -m gpt-4-with-ocr` will work, as shown below.
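
Both invocations are equivalent:

```
operate                     # OCR mode is the default
operate -m gpt-4-with-ocr   # or select it explicitly
```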
### Set-of-Mark Prompting `-m gpt-4-with-som`
The Self-Operating Computer Framework now supports Set-of-Mark (SoM) Prompting with the `gpt-4-with-som` command. This new visual prompting method enhances the visual grounding capabilities of large multimodal models.
Learn more about SoM Prompting in the detailed arXiv paper: [here](https://arxiv.org/abs/2310.11441).
Start `operate` with the SoM model:

```
operate -m gpt-4-with-som
```
### Voice Mode `--voice`
The framework supports voice inputs for the objective. Try voice by following the instructions below.
**Clone the repo** to a directory on your computer:
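
A minimal sketch of the steps, assuming the repository's standard GitHub URL and that voice mode is started with the `--voice` flag from the heading above (extra audio dependencies may be required; check the repository if voice input fails):

```
git clone https://github.com/OthersideAI/self-operating-computer.git
cd self-operating-computer
# install dependencies per the repository instructions, then:
operate --voice
```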
If you want to contribute yourself, see [CONTRIBUTING.md](https://github.com/OthersideAI/self-operating-computer/blob/main/CONTRIBUTING.md).
## Feedback
For any input on improving this project, feel free to reach out to [Josh](https://twitter.com/josh_bickett) on Twitter.
## Join Our Discord Community
For real-time discussions and community support, join our Discord server.
- If you're already a member, join the discussion in [#self-operating-computer](https://discord.com/channels/877638638001877052/1181241785834541157).
- If you're new, first [join our Discord Server](https://discord.gg/YqaKtyBEzM) and then navigate to the [#self-operating-computer](https://discord.com/channels/877638638001877052/1181241785834541157) channel.
## Follow HyperWriteAI for More Updates
Stay updated with the latest developments:
- Follow HyperWriteAI on [Twitter](https://twitter.com/HyperWriteAI).
- Follow HyperWriteAI on [LinkedIn](https://www.linkedin.com/company/othersideai/).
## Compatibility
- This project is compatible with macOS, Windows, and Linux (with X server installed).
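
On Linux, a quick sanity check that an X display is available (a sketch; `$DISPLAY` is set by the X server, and Wayland-only sessions may need XWayland):

```
echo $DISPLAY   # a non-empty value such as :0 means an X display is available
```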
## OpenAI Rate Limiting Note
The `gpt-4-vision-preview` model is required. To unlock access to this model, your account needs to spend at least $5 in API credits. Pre-paying for these credits will unlock access if you haven't already spent the minimum $5.
Learn more **[here](https://platform.openai.com/docs/guides/rate-limits?context=tier-one)**.