<h1 align="center">Self-Operating Computer Framework</h1>

<p align="center">
  <strong>A framework to enable multimodal models to operate a computer.</strong>
</p>
<p align="center">
  Using the same inputs and outputs as a human operator, the model views the screen and decides on a series of mouse and keyboard actions to reach an objective. Self-Operating Computer was the first project to use a VLM to operate a computer.
</p>

<div align="center">

## Key Features
- **Compatibility**: Designed for various multimodal models.
- **Integration**: Currently integrated with **GPT-4o, o1, Gemini Pro Vision, Claude 3, and LLaVA.**
- **Future Plans**: Support for additional models.

## Demo

https://github.com/OthersideAI/self-operating-computer/assets/42594239/9e8abc96-c76a-46fb-9b13-03678b3c67e0

## Using `operate` Modes

#### OpenAI models

The default model for the project is `gpt-4o`, which you can use by simply typing `operate`. To try running OpenAI's new `o1` model, use the command below.

```
operate -m o1-with-ocr
```


### Multimodal Models `-m`
Try Google's `gemini-pro-vision` by following the instructions below. Start `operate` with the Gemini model:
```
operate -m gemini-pro-vision
```
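A minimal sketch of a full session follows. The environment variable names `OPENAI_API_KEY` and `GOOGLE_API_KEY` are assumptions here; if a key is not set, the tool prompts for one interactively.

```shell
# Sketch of a typical session. The OPENAI_API_KEY / GOOGLE_API_KEY variable
# names are assumptions; the tool prompts for a key if none is found.
export OPENAI_API_KEY="sk-..."   # placeholder, use your own key
operate                          # default gpt-4o mode

export GOOGLE_API_KEY="..."      # placeholder, use your own key
operate -m gemini-pro-vision     # Gemini mode
```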