AI CSV Column Matching Tool

Overview

This tool uses AI embeddings to detect matching columns between two CSV files and automatically merges them. It's designed to solve the common problem of semantically similar columns with different names (e.g., "operational_expenses" vs "cost_of_operation").

Features

AI-based column matching: Uses lightweight Sentence Transformers to find semantically similar columns
Manual matching interface: Visual interface to manually match unmatched columns using mouse selection
Sorted recommendations: "Best Potential Matches" section with columns sorted by similarity scores
Real-time similarity scoring: Shows similarity scores with sample values for better decisions
Performance optimized: Caching system prevents redundant similarity calculations
Sorted dropdowns: Dropdowns show columns sorted by similarity when making manual matches
Auto-refreshing UI: Dropdowns automatically update to remove matched columns after applying matches
Automatic merging: Merges CSV files based on both AI and manual matches
Schema generation: Creates a JSON schema describing the unified columns
Web interface: Simple Streamlit UI for easy file upload and processing
Export functionality: Exports both merged CSV and schema JSON
Two output modes: "Mapped columns only" and "Full data with mapped + unmapped columns"
Proper column naming: Unmapped columns get proper suffixes with actual file names
Correct column ordering: Mapped → unmapped_file1 → unmapped_file2

Installation

For Python Developers

Clone or download the repository
Install dependencies:
```
pip install -r requirements.txt
```

For End Users (Recommended)

Download the standalone executable CSV_Column_Matcher.exe
No installation needed - simply run the executable
The application will launch automatically in your browser

Usage

Method 1: Run the Web Interface from Source

streamlit run main.py

Method 2: Run the Standalone Executable

Download the CSV_Column_Matcher.exe file
Double-click to run (first run may take longer as AI models load)
The application will automatically open in your default browser at http://localhost:8501

Note: On first execution, Windows may show a security warning - this is normal for unsigned executables.

Method 3: Install via Installer (if available)

Run the installer CSV_Column_Matcher_Installer.exe
Follow the installation wizard
Launch the application from the Start Menu

How It Works

Upload: Upload two CSV files with potentially different column names
Analysis: The tool analyzes column names and sample values using AI embeddings
Matching: Identifies semantically similar columns across the two files
Merging: Creates a unified dataset based on the similarity-mapped columns
Export: Generates both merged CSV and schema JSON files

Example

File 1 has columns: Cost_Center, Expense_Total
File 2 has columns: Org_Code, Actual_Spend
Tool recognizes: Cost_Center ≈ Org_Code and Expense_Total ≈ Actual_Spend
Merges the files accordingly

Project Structure

ai_csv_column_matching_tool/
├── main.py                 # Streamlit web interface
├── csv_matcher.py          # CSV ingestion and parsing
├── embeddings_matcher.py   # Sentence Transformers matching logic
├── csv_merger.py           # Merge and export functionality
├── run_app.py              # Application launcher script
├── streamlit_wrapper.py    # Wrapper for creating executable
├── csv_matcher_app.spec    # PyInstaller spec file
├── csv_matcher_installer.iss # Inno Setup installer script
├── USAGE_EXE.md            # Executable usage instructions
├── government_projects.csv # Sample test file 1
├── commercial_engagements.csv # Sample test file 2
├── merged.csv              # Sample merged output
├── schema.json             # Sample schema output
├── generate_test_data.py   # Generate synthetic test files
├── requirements.txt        # Python dependencies
├── README.md               # This file
└── project_progress.md     # Project progress tracking

Configuration

Similarity Threshold: Adjust the minimum score for column matches (0.0 to 1.0)
Primary Model: By default, uses "sentence-transformers/all-MiniLM-L6-v2" model

Output Files

merged.csv: Unified dataset with matched columns
schema.json: Detailed schema describing unified columns and their sources

Requirements

Python 3.8+ (Recommended v3.11)
See requirements.txt for full dependency list

Testing

The project includes synthetic test data to verify functionality. Run tests with:

python test_app.py

How Our Solution Exceeds Requirements

The original requirements called for a simple tool to match CSV columns using AI embeddings with a basic UI. Our solution exceeds these requirements by providing:

Enhanced AI matching: Not only basic embeddings, adjustable threshold, but also manual verification
Human in the Loop: [Optional] based on best possible suggestions, with manual override capabilities
Superior UI experience: Sorted recommendations, sample previews, and real-time similarity scoring
Performance optimization: Caching system that prevents redundant calculations
Intuitive workflow: Best matches highlighted with sample values for better decision-making
Robust architecture: Modular design that's easy to extend and maintain

Creating the Executable

To create the executable version of the application:

Install PyInstaller: pip install pyinstaller
Run: pyinstaller csv_matcher_app.spec
The executable will be in the dist folder

To create an installer:

Install Inno Setup
Compile the csv_matcher_installer.iss script using Inno Setup Compiler

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AI CSV Column Matching Tool

Overview

Features

Installation

For Python Developers

For End Users (Recommended)

Usage

Method 1: Run the Web Interface from Source

Method 2: Run the Standalone Executable

Method 3: Install via Installer (if available)

How It Works

Example

Project Structure

Configuration

Output Files

Requirements

Testing

How Our Solution Exceeds Requirements

Creating the Executable

License & attribution

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
USAGE_INSTRUCTIONS.md		USAGE_INSTRUCTIONS.md
commercial_engagements.csv		commercial_engagements.csv
csv_matcher.py		csv_matcher.py
csv_matcher_app.spec		csv_matcher_app.spec
csv_matcher_installer.iss		csv_matcher_installer.iss
csv_merger.py		csv_merger.py
embeddings_matcher.py		embeddings_matcher.py
fallback_matcher.py		fallback_matcher.py
generate_test_data.py		generate_test_data.py
government_projects.csv		government_projects.csv
main.py		main.py
merged_output.csv		merged_output.csv
requirements.txt		requirements.txt
run_app.py		run_app.py
schema_output.json		schema_output.json
streamlit_wrapper.py		streamlit_wrapper.py

Folders and files

Latest commit

History

Repository files navigation

AI CSV Column Matching Tool

Overview

Features

Installation

For Python Developers

For End Users (Recommended)

Usage

Method 1: Run the Web Interface from Source

Method 2: Run the Standalone Executable

Method 3: Install via Installer (if available)

How It Works

Example

Project Structure

Configuration

Output Files

Requirements

Testing

How Our Solution Exceeds Requirements

Creating the Executable

License & attribution

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages