A real-time multimodal AI agent that watches your environment through your webcam and warns you before accidents happen, using Gemini Vision, COCO-SSD object detection, and voice alerts.
Every year, thousands of preventable accidents happen at home and at work: a knife left too close to the edge, a distracted moment with a sharp tool, a hazard nobody noticed. Second Pair of Eyes is an always-on AI safety agent that acts as a second observer in your environment, detecting dangerous objects and situations in real time and warning you before harm occurs.
| Feature | How it works |
|---|---|
| Real-time object detection | COCO-SSD (MobileNet v2) runs on-device via TensorFlow.js, so detection is low-latency and the raw video stream never leaves your browser |
| Bounding boxes + confidence | Each detected object gets a labeled box with confidence % |
| Distance estimation | Pinhole camera model estimates how far each object is from the camera |
| Gemini multimodal vision | Every 8 seconds, a JPEG frame is sent to Gemini, which analyzes the scene and describes the hazards it sees |
| Gemini text safety advice | When dangerous objects are detected, Gemini generates specific, actionable safety advice |
| Voice alerts | Web Speech API speaks warnings aloud, hands-free |
| Risk scoring | HIGH / MEDIUM / LOW risk computed from detected object combination |
| Event timeline | Scrollable log of all detections and AI responses |
| Interrupt system | Memory context tracks patterns and can interrupt with proactive warnings |
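The HIGH / MEDIUM / LOW risk score in the table above is computed from the combination of detected objects. A plausible sketch of that rule in Python (the category sets and thresholds here are illustrative assumptions, not the project's actual logic):

```python
# Hypothetical sketch of HIGH/MEDIUM/LOW risk scoring from COCO-SSD labels.
# The category sets below are assumptions for illustration.
DANGEROUS = {"knife", "scissors", "fork", "baseball bat"}
DISTRACTING = {"cell phone", "remote"}

def risk_level(labels):
    """Return 'HIGH', 'MEDIUM', or 'LOW' for a list of detected labels."""
    detected = set(labels)
    dangerous = detected & DANGEROUS
    # A sharp object together with a distraction (e.g. a phone) is worst.
    if dangerous and detected & DISTRACTING:
        return "HIGH"
    if dangerous:
        return "MEDIUM"
    return "LOW"
```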
```
┌───────────────────────────────────────────────────────────┐
│                     BROWSER (Frontend)                    │
│                                                           │
│   Webcam → COCO-SSD (TF.js) → Canvas Overlay              │
│                  │ objects JSON                           │
│   WebSocket ─────┴───────────────────────── WebSocket     │
│                  │ advice JSON                            │
│   Web Speech API (voice alerts)                           │
└───────────────────────────────────────────────────────────┘
                         │ WebSocket (ws://)
┌───────────────────────────────────────────────────────────┐
│                GOOGLE CLOUD RUN (Backend)                 │
│                                                           │
│   FastAPI + uvicorn                                       │
│    ├── /voice WebSocket endpoint                          │
│    ├── gemini_agent.py → OpenRouter → Gemini 2.0 Flash    │
│    ├── vision_agent.py → Gemini multimodal vision         │
│    └── memory.py → Context + interrupt logic              │
└───────────────────────────────────────────────────────────┘
                         │ HTTPS
┌───────────────────────────────────────────────────────────┐
│                      OPENROUTER API                       │
│             google/gemini-2.0-flash-exp:free              │
│             Multimodal vision + text generation           │
└───────────────────────────────────────────────────────────┘
```
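The `memory.py` component shown above keeps context across detections and decides when to interrupt with a proactive warning. A minimal sketch of that idea, where the class name, window size, threshold, and trigger label are all assumptions:

```python
from collections import deque

class MemoryContext:
    """Hypothetical sketch of memory.py's interrupt logic: remember recent
    detection frames and fire a proactive warning when a hazard persists."""

    def __init__(self, window=5, threshold=3):
        self.recent = deque(maxlen=window)   # last N detection snapshots
        self.threshold = threshold           # repeats needed to interrupt

    def observe(self, labels):
        """Record one detection frame; return an interrupt message or None."""
        self.recent.append(set(labels))
        hits = sum(1 for frame in self.recent if "knife" in frame)
        if hits >= self.threshold:
            self.recent.clear()              # avoid repeating the same alert
            return "A knife has been in view for a while. Stay alert."
        return None
```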
- Cloud Run: serverless container hosting for the FastAPI backend
- Cloud Build: automatic container image build on `gcloud run deploy --source`
- Artifact Registry: stores built container images
- Secret Manager: stores the Gemini API key securely (recommended for production)
- Python 3.11+
- Node.js / any static file server (e.g. VS Code Live Server)
- A Gemini API key from aistudio.google.com OR an OpenRouter key
```
git clone https://github.com/YOUR_USERNAME/second-pair-of-eyes
cd second-pair-of-eyes
```

```
cd backend
python3 -m venv venv
source venv/bin/activate   # Windows: venv\Scripts\activate
pip install -r requirements.txt
```

```
cp .env.example .env
# Edit .env and add your API key:
# GEMINI_API_KEY=sk-or-v1-YOUR_OPENROUTER_KEY
```

```
uvicorn main:app --reload --port 8000
```

Open `frontend/index.html` with VS Code Live Server (or any static server on port 5500), then click ▶ START AGENT.
- Google Cloud CLI installed and authenticated
- A Google Cloud project with billing enabled
```
cd backend
GEMINI_API_KEY=your_key bash ../deploy.sh
```

Or manually:

```
gcloud run deploy second-pair-of-eyes \
  --source ./backend \
  --region us-central1 \
  --platform managed \
  --allow-unauthenticated \
  --port 8000 \
  --set-env-vars "GEMINI_API_KEY=your_key"
```

After deployment, update the `WS_URL` in `frontend/index.html`:

```
// Change from:
const WS_URL = "ws://127.0.0.1:8000/voice"
// To your Cloud Run URL:
const WS_URL = "wss://second-pair-of-eyes-XXXXX-uc.a.run.app/voice"
```

```
second-pair-of-eyes/
├── frontend/
│   └── index.html         # Single-file frontend (TF.js + WebSocket + UI)
├── backend/
│   ├── main.py            # FastAPI app + WebSocket handler
│   ├── gemini_agent.py    # Gemini text safety analysis
│   ├── vision_agent.py    # Gemini multimodal vision analysis
│   ├── memory.py          # Context memory + interrupt logic
│   ├── requirements.txt   # Python dependencies
│   ├── Dockerfile         # Container definition for Cloud Run
│   └── .env.example       # Environment variable template
├── deploy.sh              # One-command Cloud Run deployment
└── README.md
```
When COCO-SSD detects objects, their labels + computed risk level are sent to Gemini with a safety-focused prompt. Gemini returns 1-2 sentences of specific, actionable advice.
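As a sketch, the request could be assembled roughly as follows. OpenRouter exposes an OpenAI-style chat-completions endpoint; the prompt wording and function name here are illustrative assumptions, not the project's actual code:

```python
import json

# OpenRouter's OpenAI-compatible chat-completions endpoint.
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_advice_request(labels, risk):
    """Build a JSON payload asking Gemini for short, actionable safety advice.
    The prompt text is a hypothetical example."""
    prompt = (
        f"Detected objects: {', '.join(labels)}. Risk level: {risk}. "
        "Give 1-2 sentences of specific, actionable safety advice."
    )
    return json.dumps({
        "model": "google/gemini-2.0-flash-exp:free",
        "messages": [{"role": "user", "content": prompt}],
    })
```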
Every 8 seconds, a JPEG snapshot of the live camera feed is sent to Gemini Vision, which analyzes the image directly and describes any hazards it can visually identify: objects COCO-SSD might miss, unsafe postures, environmental hazards.
This dual approach (local COCO-SSD for speed, Gemini Vision for visual understanding) means the system catches both common dangerous objects in real time and subtle hazards that require scene-level reasoning.
| Layer | Technology |
|---|---|
| Object detection | TensorFlow.js + COCO-SSD (MobileNet v2) |
| AI vision + advice | Google Gemini 2.0 Flash (via OpenRouter) |
| Backend | FastAPI + WebSocket (Python) |
| Hosting | Google Cloud Run |
| Voice output | Web Speech API (browser-native) |
| Distance estimation | Pinhole camera model |
| Frontend | Vanilla HTML/CSS/JS |
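The pinhole-model distance estimate in the table reduces to one formula: distance = (real object width × focal length in pixels) / bounding-box width in pixels. A sketch, where the focal length and object widths are assumed calibration values rather than the project's actual constants:

```python
# Hypothetical pinhole-model distance estimate.
# FOCAL_PX and KNOWN_WIDTH_M are assumed calibration values.
FOCAL_PX = 600.0                        # focal length in pixels
KNOWN_WIDTH_M = {"knife": 0.20, "bottle": 0.08}

def estimate_distance_m(label, bbox_width_px):
    """distance = (real width * focal length) / apparent width in pixels."""
    real_width = KNOWN_WIDTH_M.get(label)
    if real_width is None or bbox_width_px <= 0:
        return None                     # unknown object or degenerate box
    return real_width * FOCAL_PX / bbox_width_px
```

A wider bounding box means the object is closer, so the estimate shrinks as the box grows.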
COCO-SSD + Gemini Vision together can detect: knife, scissors, fork, baseball bat, bottle, gun, cell phone, remote, and any other object Gemini Vision identifies visually.
- All video processing happens on-device (COCO-SSD runs in your browser)
- Only small JPEG snapshots (320Γ240) are sent to Gemini Vision every 8 seconds
- Object labels (not video) are sent to Gemini for text advice
- No video is stored anywhere
Gemini Live Agent Challenge (Live Agents category)
Real-time interaction with audio/vision, multimodal Gemini integration, hosted on Google Cloud.
MIT
- Detecting sharp objects near users
- Alerting users when distracted by phones
- Preventing unsafe actions in workspaces
- Safety monitoring in workshops or labs
- YOLOv8 object detection for higher accuracy
- Edge AI acceleration
- Multi-camera monitoring
- Predictive hazard detection using spatial analysis
- Mobile deployment
The idea behind Second Pair of Eyes is to create an AI assistant that proactively protects users by continuously analyzing the surrounding environment.
Krishna Vasnani
B.Tech Computer Science Engineering
JECRC University