A Streamlit-based benchmarking dashboard for comparing LLMs like OpenAI GPT, Google Gemini, Cohere, and Anthropic Claude.
The LLM Comparison Tool allows users to benchmark, analyze, and compare various Large Language Models (LLMs) across latency, accuracy, and cost per 1K tokens.
✅ Compare multiple LLMs from OpenAI, Google Gemini, Cohere, and Anthropic
✅ Interactive UI using Streamlit for seamless benchmarking
✅ Performance Metrics: Latency, cost, and accuracy visualization
✅ User Feedback Collection & Cloud Storage Logging
✅ Fully Deployable on Cloud Run
llm-comparison-tool/
│── backend/                              # Backend API
│   ├── .gcloudignore
│   ├── Dockerfile
│   ├── llm_benchmark_api.py              # Backend API script
│   ├── requirements.txt
│
│── frontend/                             # Streamlit Dashboard UI
│   ├── Dockerfile
│   ├── llm_benchmark_dashboard.py        # UI script
│   ├── requirements.txt
│
│── scripts/                              # Deployment Scripts
│   ├── deploy_llm_comparison_api.sh
│   ├── deploy_llm_comparison_dashboard.sh
│
│── README.md
Clone the repository:

git clone https://github.com/saurabhmi2212/llm-comparison-tool
cd llm-comparison-tool

Install the backend dependencies:

cd backend
pip install -r requirements.txt

Install the frontend dependencies:

cd ../frontend
pip install -r requirements.txt

Start the backend API locally:

cd ../backend
python llm_benchmark_api.py

In a separate terminal, from the repository root, launch the Streamlit dashboard:

cd frontend
streamlit run llm_benchmark_dashboard.py
You can deploy both the frontend (Streamlit UI) and the backend (Flask API) to Cloud Run.
To deploy the backend API, run:
./scripts/deploy_llm_comparison_api.sh
This will:
- Build & push the backend API Docker image
- Deploy it to Google Cloud Run
To deploy the frontend dashboard, run:
./scripts/deploy_llm_comparison_dashboard.sh
This will:
- Build & push the frontend Streamlit UI Docker image
- Deploy it to Google Cloud Run
| Endpoint | Method | Description |
|---|---|---|
| `/benchmark` | POST | Run a benchmark for a model |
| `/update-feedback` | POST | Update user feedback |
| `/past-results` | GET | Retrieve past benchmark results |
Example cURL request:
curl -X POST "https://YOUR_BACKEND_URL/benchmark" \
-H "Content-Type: application/json" \
-d '{"model_name": "gemini-1.5-pro-001", "prompt": "What is Generative AI?"}'
- Compare multiple models side-by-side
- Measure latency, accuracy, and cost per 1K tokens
- Interactive bar & box plots
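As an illustration of how these metrics can be derived, the sketch below times a single model call and converts token usage into a dollar cost from a per-1K-token price table. The prices, helper names, and `generate_fn` stand-in are hypothetical and not necessarily what the backend uses.

```python
import time

# Hypothetical per-1K-token prices in USD; the actual backend may use different rates.
PRICE_PER_1K_TOKENS = {
    "gemini-1.5-pro-001": {"input": 0.00125, "output": 0.005},
}

def benchmark_call(model_name: str, prompt: str, generate_fn):
    """Time one model call and estimate its cost.

    `generate_fn` is a stand-in for a provider SDK call that returns
    (response_text, input_tokens, output_tokens).
    """
    start = time.perf_counter()
    text, input_tokens, output_tokens = generate_fn(model_name, prompt)
    latency_s = time.perf_counter() - start

    rates = PRICE_PER_1K_TOKENS[model_name]
    cost = (input_tokens / 1000) * rates["input"] + (output_tokens / 1000) * rates["output"]
    return {
        "model": model_name,
        "latency_s": round(latency_s, 3),
        "cost_usd": round(cost, 6),
        "response": text,
    }
```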
- Rate model responses (1-5)
- Store & update feedback in Google Cloud Storage
- View previous benchmarking data
- Sort & filter by model type, latency, and cost
- Add More LLM Providers (Meta Llama, Mistral)
- Live Model Cost Tracking
- Auto-generated Model Insights
- Multi-user login for personalized tracking
Contributions are welcome! Feel free to fork, submit PRs, or open issues.
Saurabh Mishra
This project is licensed under the MIT License.