Skip to content

amitportal/similar_column_merge_tool

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

AI CSV Column Matching Tool

Overview

This tool uses AI embeddings to detect matching columns between two CSV files and automatically merges them. It's designed to solve the common problem of semantically similar columns with different names (e.g., "operational_expenses" vs "cost_of_operation").

Features

  • AI-based column matching: Uses lightweight Sentence Transformers to find semantically similar columns
  • Manual matching interface: Visual interface to manually match unmatched columns using mouse selection
  • Sorted recommendations: "Best Potential Matches" section with columns sorted by similarity scores
  • Real-time similarity scoring: Shows similarity scores with sample values for better decisions
  • Performance optimized: Caching system prevents redundant similarity calculations
  • Sorted dropdowns: Dropdowns show columns sorted by similarity when making manual matches
  • Auto-refreshing UI: Dropdowns automatically update to remove matched columns after applying matches
  • Automatic merging: Merges CSV files based on both AI and manual matches
  • Schema generation: Creates a JSON schema describing the unified columns
  • Web interface: Simple Streamlit UI for easy file upload and processing
  • Export functionality: Exports both merged CSV and schema JSON
  • Two output modes: "Mapped columns only" and "Full data with mapped + unmapped columns"
  • Proper column naming: Unmapped columns get proper suffixes with actual file names
  • Correct column ordering: Mapped → unmapped_file1 → unmapped_file2

Installation

For Python Developers

  1. Clone or download the repository
  2. Install dependencies:
    pip install -r requirements.txt

For End Users (Recommended)

  1. Download the standalone executable CSV_Column_Matcher.exe
  2. No installation needed - simply run the executable
  3. The application will launch automatically in your browser

Usage

Method 1: Run the Web Interface from Source

streamlit run main.py

Method 2: Run the Standalone Executable

  1. Download the CSV_Column_Matcher.exe file
  2. Double-click to run (first run may take longer as AI models load)
  3. The application will automatically open in your default browser at http://localhost:8501

Note: On first execution, Windows may show a security warning - this is normal for unsigned executables.

Method 3: Install via Installer (if available)

  1. Run the installer CSV_Column_Matcher_Installer.exe
  2. Follow the installation wizard
  3. Launch the application from the Start Menu

How It Works

  1. Upload: Upload two CSV files with potentially different column names
  2. Analysis: The tool analyzes column names and sample values using AI embeddings
  3. Matching: Identifies semantically similar columns across the two files
  4. Merging: Creates a unified dataset based on the similarity-mapped columns
  5. Export: Generates both merged CSV and schema JSON files

Example

  • File 1 has columns: Cost_Center, Expense_Total
  • File 2 has columns: Org_Code, Actual_Spend
  • Tool recognizes: Cost_CenterOrg_Code and Expense_TotalActual_Spend
  • Merges the files accordingly

Project Structure

ai_csv_column_matching_tool/
├── main.py                 # Streamlit web interface
├── csv_matcher.py          # CSV ingestion and parsing
├── embeddings_matcher.py   # Sentence Transformers matching logic
├── csv_merger.py           # Merge and export functionality
├── run_app.py              # Application launcher script
├── streamlit_wrapper.py    # Wrapper for creating executable
├── csv_matcher_app.spec    # PyInstaller spec file
├── csv_matcher_installer.iss # Inno Setup installer script
├── USAGE_EXE.md            # Executable usage instructions
├── government_projects.csv # Sample test file 1
├── commercial_engagements.csv # Sample test file 2
├── merged.csv              # Sample merged output
├── schema.json             # Sample schema output
├── generate_test_data.py   # Generate synthetic test files
├── requirements.txt        # Python dependencies
├── README.md               # This file
└── project_progress.md     # Project progress tracking

Configuration

  • Similarity Threshold: Adjust the minimum score for column matches (0.0 to 1.0)
  • Primary Model: By default, uses "sentence-transformers/all-MiniLM-L6-v2" model

Output Files

  • merged.csv: Unified dataset with matched columns
  • schema.json: Detailed schema describing unified columns and their sources

Requirements

  • Python 3.8+ (Recommended v3.11)
  • See requirements.txt for full dependency list

Testing

The project includes synthetic test data to verify functionality. Run tests with:

python test_app.py

How Our Solution Exceeds Requirements

The original requirements called for a simple tool to match CSV columns using AI embeddings with a basic UI. Our solution exceeds these requirements by providing:

  • Enhanced AI matching: Not only basic embeddings, adjustable threshold, but also manual verification
  • Human in the Loop: [Optional] based on best possible suggestions, with manual override capabilities
  • Superior UI experience: Sorted recommendations, sample previews, and real-time similarity scoring
  • Performance optimization: Caching system that prevents redundant calculations
  • Intuitive workflow: Best matches highlighted with sample values for better decision-making
  • Robust architecture: Modular design that's easy to extend and maintain

Creating the Executable

To create the executable version of the application:

  1. Install PyInstaller: pip install pyinstaller
  2. Run: pyinstaller csv_matcher_app.spec
  3. The executable will be in the dist folder

To create an installer:

  1. Install Inno Setup
  2. Compile the csv_matcher_installer.iss script using Inno Setup Compiler

License & attribution

© 2025 Amit Kumar. MIT License.

About

This tool uses word embeddings to detect matching columns between two CSV files and automatically merges them. It's designed to solve the common problem of semantically similar columns with different names (e.g., "oprational_expenses" vs "cost_of_operation").

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors