ADDED WAIT OR NON OPERATION #229

Open
wants to merge 37 commits into
base: main
37 commits
89c5b36
Update operate.py
Koolkatze Feb 8, 2025
1760d98
Update operate.py
Koolkatze Feb 9, 2025
bd294ad
Update prompts.py
Koolkatze Feb 28, 2025
803cdc8
Update prompts.py
Koolkatze Feb 28, 2025
8ac908d
Update prompts.py
Koolkatze Feb 28, 2025
9832885
Update README.md
Koolkatze Mar 1, 2025
64929e2
Update config.py
Koolkatze Mar 1, 2025
e9fdb3a
Update config.py
Koolkatze Mar 1, 2025
2d1c6ad
Update apis.py
Koolkatze Mar 1, 2025
f1101af
Update prompts.py
Koolkatze Mar 1, 2025
4a6744b
Update screenshot.py
Koolkatze Mar 1, 2025
980e4ee
Update setup.py
Koolkatze Mar 1, 2025
cdb67af
Update config.py
Koolkatze Mar 1, 2025
eac3aee
Update config.py
Koolkatze Mar 1, 2025
24ef0d9
Update config.py
Koolkatze Mar 1, 2025
6277aa6
Update config.py
Koolkatze Mar 1, 2025
5c09de0
Update apis.py
Koolkatze Mar 1, 2025
2aa8aa4
Update apis.py
Koolkatze Mar 1, 2025
9b23a09
Update apis.py
Koolkatze Mar 1, 2025
20fdace
Update config.py
Koolkatze Mar 1, 2025
e703cb0
Update operate.py
Koolkatze Mar 1, 2025
34c7e21
Update apis.py
Koolkatze Mar 1, 2025
969cd07
Update operate.py
Koolkatze Mar 1, 2025
6c1c015
Update apis.py
Koolkatze Mar 1, 2025
5d55d53
Update prompts.py
Koolkatze Mar 1, 2025
40523ef
Update config.py
Koolkatze Mar 1, 2025
28272bb
A icon library for computer vision
Koolkatze Mar 1, 2025
6987cdc
Update README.md
Koolkatze Mar 1, 2025
dad71cf
Add files via upload
Koolkatze Mar 2, 2025
296ee78
Update operate.py
Koolkatze Mar 2, 2025
8724033
Update prompts.py
Koolkatze Mar 2, 2025
bc980c2
Update README.md
Koolkatze Mar 2, 2025
963b866
Update operate.py
Koolkatze Mar 3, 2025
e1d92bb
Update operate.py
Koolkatze Mar 3, 2025
fa9f9f5
Update operate.py
Koolkatze Mar 3, 2025
0ce1ff3
Update operate.py
Koolkatze Mar 3, 2025
3a4d0d9
Update prompts.py
Koolkatze Mar 3, 2025
121 changes: 121 additions & 0 deletions GUI_README.md
@@ -0,0 +1,121 @@
# Self-Operating Computer GUI

A graphical user interface for the Self-Operating Computer, allowing easy interaction with AI models to automate computer tasks.

## Features

- **Intuitive Chat Interface**: Communicate with the Self-Operating Computer through a familiar chat interface
- **Live Screenshot Preview**: See what the AI sees in real-time
- **Model Selection**: Choose from multiple AI models including GPT-4, Claude, Qwen, and more
- **Voice Control**: Speak your commands using the built-in voice recognition (requires whisper_mic)
- **Real-time Logs**: Monitor detailed operation logs as they are produced
- **Multi-platform**: Works on Windows, macOS, and Linux

## Installation

### Prerequisites

- Python 3.8 or higher
- Self-Operating Computer installed and configured
- pip (Python package manager)

### Required Packages

```bash
pip install PyQt5
pip install whisper_mic # Optional, for voice commands
```

## Usage

### Running the GUI

From the Self-Operating Computer directory:

```bash
python gui_main.py
```

### Command Line Options

```
usage: gui_main.py [-h] [-m MODEL] [--verbose] [--light]

Run the Self-Operating Computer GUI with a specified model.

optional arguments:
  -h, --help            show this help message and exit
  -m MODEL, --model MODEL
                        Specify the default model to use
  --verbose             Run with verbose logging
  --light               Use light mode instead of dark mode
```
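
The usage text above could be produced by an `argparse` parser along the following lines. This is a minimal sketch, not the code from `gui_main.py`: the function name `build_parser` and the `None` default for `--model` are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the options listed in the usage text.

    Hypothetical helper: the real gui_main.py may structure this differently.
    """
    parser = argparse.ArgumentParser(
        prog="gui_main.py",
        description="Run the Self-Operating Computer GUI with a specified model.",
    )
    # Assumption: no model is preselected when -m is omitted.
    parser.add_argument("-m", "--model", default=None,
                        help="Specify the default model to use")
    # Both flags are simple switches that default to False.
    parser.add_argument("--verbose", action="store_true",
                        help="Run with verbose logging")
    parser.add_argument("--light", action="store_true",
                        help="Use light mode instead of dark mode")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is easy to try.
args = build_parser().parse_args(["-m", "claude-3", "--light"])
print(args.model, args.verbose, args.light)
```

Passing an explicit list to `parse_args` keeps the example independent of how the script is launched; a real entry point would call `parse_args()` with no arguments.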

### Examples

```bash
# Run with GPT-4 model and verbose logging
python gui_main.py -m gpt-4-vision --verbose

# Run with Claude 3 model in light mode
python gui_main.py -m claude-3 --light
```

## Interface Guide

The GUI is divided into several sections:

1. **Top Bar**: Contains model selection dropdown and verbose mode toggle
2. **Left Panel**: Displays the current screenshot that the AI sees
3. **Right Panel - Top**: Chat history showing your requests and system messages
4. **Right Panel - Bottom**: Detailed logs of operations in real-time
5. **Bottom Input**: Text field for typing tasks, Send button, and voice recording button

## Model Support

The GUI supports all models that the Self-Operating Computer supports:

- GPT-4 Vision
- GPT-4 with SoM (Set-of-Mark prompting)
- GPT-4 with OCR
- Claude 3
- Claude 3.7
- Qwen-VL
- O1 with OCR
- Gemini Pro Vision
- LLaVA

## API Keys

The GUI uses the same API key configuration as the main Self-Operating Computer. If a required API key is missing, a prompt will appear asking you to enter it.
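
The "prompt when a key is missing" behaviour can be sketched as below. This is an illustration under stated assumptions, not the GUI's actual implementation: `ensure_api_key` is a hypothetical helper, keys are assumed to live in environment variables, and the `prompt` callable stands in for whatever dialog the GUI shows.

```python
import os
from typing import Callable, MutableMapping


def ensure_api_key(
    name: str,
    prompt: Callable[[str], str],
    env: MutableMapping[str, str] = os.environ,
) -> str:
    """Return the key stored under `name`, asking the user only if it is missing.

    Hypothetical helper: `prompt` receives a message and returns the user's input.
    """
    key = env.get(name, "").strip()
    if not key:
        key = prompt(f"Enter a value for {name}: ").strip()
        if not key:
            raise ValueError(f"{name} is required but was not provided")
        env[name] = key  # remember the key for the rest of this process
    return key
```

On the command line, `ensure_api_key("OPENAI_API_KEY", input)` would ask on the terminal; a GUI would pass a function that opens an input dialog instead. Injecting `prompt` and `env` keeps the logic testable without a display.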

## Troubleshooting

### Voice Recognition Not Working

Make sure you have installed whisper_mic:
```bash
pip install whisper_mic
```

### GUI Not Launching

Check that PyQt5 is properly installed:
```bash
pip install PyQt5
```
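
Before reinstalling, it can help to confirm which packages Python actually fails to find. The sketch below uses the standard library's `importlib.util.find_spec`; `missing_modules` is a hypothetical helper, and the check assumes the import names match the package names `PyQt5` and `whisper_mic`.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found on the import path."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(["PyQt5", "whisper_mic"])
if missing:
    print("Missing packages: " + ", ".join(missing))
else:
    print("All GUI dependencies are importable.")
```

Run it with the same interpreter that launches the GUI; a package installed into a different environment will still show up as missing here.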

### Model Not Responding

Ensure your API keys are properly configured in the Self-Operating Computer settings.

## Integration with Existing Codebase

The GUI integrates seamlessly with the existing Self-Operating Computer codebase:

- It uses the same `operate.py` functions for executing tasks
- It leverages the same model APIs from `apis.py`
- It inherits configuration from `config.py`
- It preserves the same prompt formats from `prompts.py`

The UI simply provides a graphical wrapper around these core components, making them more accessible to users who prefer not to use the comman
19 changes: 17 additions & 2 deletions README.md
@@ -2,7 +2,7 @@ ome
<h1 align="center">Self-Operating Computer Framework</h1>

<p align="center">
<strong>A framework to enable multimodal models to operate a computer.</strong>
<strong>A framework to enable multimodal models to operate a computer, with a GUI included and double-click, right-click, scroll, and wait operations defined.</strong>
</p>
<p align="center">
Using the same inputs and outputs as a human operator, the model views the screen and decides on a series of mouse and keyboard actions to reach an objective. Released Nov 2023, the Self-Operating Computer Framework was one of the first examples of using a multimodal model to view the screen and operate a computer.
@@ -20,7 +20,7 @@ ome

## Key Features
- **Compatibility**: Designed for various multimodal models.
- **Integration**: Currently integrated with **GPT-4o, o1, Gemini Pro Vision, Claude 3 and LLaVa.**
- **Integration**: Currently integrated with **GPT-4o, o1, Claude 3.7, Gemini Pro Vision, Claude 3, Qwen-VL and LLaVa.**
- **Future Plans**: Support for additional models.

## Demo
@@ -62,6 +62,14 @@ operate -m o1-with-ocr


### Multimodal Models `-m`

#### Try Claude 3.7 `-m claude-3.7`
Use Claude 3.7 with Vision to see how it stacks up to GPT-4-Vision at operating a computer. Navigate to the [Anthropic dashboard](https://console.anthropic.com/dashboard) to get an API key and run the command below to try it.

```
operate -m claude-3.7
```

Try Google's `gemini-pro-vision` by following the instructions below. Start `operate` with the Gemini model
```
operate -m gemini-pro-vision
@@ -76,6 +84,13 @@ Use Claude 3 with Vision to see how it stacks up to GPT-4-Vision at operating a
operate -m claude-3
```

#### Try Qwen `-m qwen-vl`
Use Qwen-VL with Vision to see how it stacks up to GPT-4-Vision at operating a computer. Navigate to the [Qwen dashboard](https://bailian.console.aliyun.com/) to get an API key and run the command below to try it.

```
operate -m qwen-vl
```

#### Try LLaVa Hosted Through Ollama `-m llava`
If you wish to experiment with the Self-Operating Computer Framework using LLaVA on your own machine, you can with Ollama!
*Note: Ollama currently only supports MacOS and Linux. Windows now in Preview*