This tool uses AI embeddings to detect matching columns between two CSV files and automatically merges them. It's designed to solve the common problem of semantically similar columns with different names (e.g., "operational_expenses" vs "cost_of_operation").
- AI-based column matching: Uses lightweight Sentence Transformers to find semantically similar columns
- Manual matching interface: Visual interface to manually match unmatched columns using mouse selection
- Sorted recommendations: "Best Potential Matches" section with columns sorted by similarity scores
- Real-time similarity scoring: Shows similarity scores with sample values for better decisions
- Performance optimized: Caching system prevents redundant similarity calculations
- Sorted dropdowns: Dropdowns show columns sorted by similarity when making manual matches
- Auto-refreshing UI: Dropdowns automatically update to remove matched columns after applying matches
- Automatic merging: Merges CSV files based on both AI and manual matches
- Schema generation: Creates a JSON schema describing the unified columns
- Web interface: Simple Streamlit UI for easy file upload and processing
- Export functionality: Exports both merged CSV and schema JSON
- Two output modes: "Mapped columns only" and "Full data with mapped + unmapped columns"
- Proper column naming: Unmapped columns get proper suffixes with actual file names
- Correct column ordering: Mapped → unmapped_file1 → unmapped_file2
- Clone or download the repository
- Install dependencies:
pip install -r requirements.txt
- Download the standalone executable
CSV_Column_Matcher.exe - No installation needed - simply run the executable
- The application will launch automatically in your browser
streamlit run main.py- Download the
CSV_Column_Matcher.exefile - Double-click to run (first run may take longer as AI models load)
- The application will automatically open in your default browser at
http://localhost:8501
Note: On first execution, Windows may show a security warning - this is normal for unsigned executables.
- Run the installer
CSV_Column_Matcher_Installer.exe - Follow the installation wizard
- Launch the application from the Start Menu
- Upload: Upload two CSV files with potentially different column names
- Analysis: The tool analyzes column names and sample values using AI embeddings
- Matching: Identifies semantically similar columns across the two files
- Merging: Creates a unified dataset based on the similarity-mapped columns
- Export: Generates both merged CSV and schema JSON files
- File 1 has columns:
Cost_Center,Expense_Total - File 2 has columns:
Org_Code,Actual_Spend - Tool recognizes:
Cost_Center≈Org_CodeandExpense_Total≈Actual_Spend - Merges the files accordingly
ai_csv_column_matching_tool/
├── main.py # Streamlit web interface
├── csv_matcher.py # CSV ingestion and parsing
├── embeddings_matcher.py # Sentence Transformers matching logic
├── csv_merger.py # Merge and export functionality
├── run_app.py # Application launcher script
├── streamlit_wrapper.py # Wrapper for creating executable
├── csv_matcher_app.spec # PyInstaller spec file
├── csv_matcher_installer.iss # Inno Setup installer script
├── USAGE_EXE.md # Executable usage instructions
├── government_projects.csv # Sample test file 1
├── commercial_engagements.csv # Sample test file 2
├── merged.csv # Sample merged output
├── schema.json # Sample schema output
├── generate_test_data.py # Generate synthetic test files
├── requirements.txt # Python dependencies
├── README.md # This file
└── project_progress.md # Project progress tracking
- Similarity Threshold: Adjust the minimum score for column matches (0.0 to 1.0)
- Primary Model: By default, uses "sentence-transformers/all-MiniLM-L6-v2" model
merged.csv: Unified dataset with matched columnsschema.json: Detailed schema describing unified columns and their sources
- Python 3.8+ (Recommended v3.11)
- See
requirements.txtfor full dependency list
The project includes synthetic test data to verify functionality. Run tests with:
python test_app.pyThe original requirements called for a simple tool to match CSV columns using AI embeddings with a basic UI. Our solution exceeds these requirements by providing:
- Enhanced AI matching: Not only basic embeddings, adjustable threshold, but also manual verification
- Human in the Loop: [Optional] based on best possible suggestions, with manual override capabilities
- Superior UI experience: Sorted recommendations, sample previews, and real-time similarity scoring
- Performance optimization: Caching system that prevents redundant calculations
- Intuitive workflow: Best matches highlighted with sample values for better decision-making
- Robust architecture: Modular design that's easy to extend and maintain
To create the executable version of the application:
- Install PyInstaller:
pip install pyinstaller - Run:
pyinstaller csv_matcher_app.spec - The executable will be in the
distfolder
To create an installer:
- Install Inno Setup
- Compile the
csv_matcher_installer.issscript using Inno Setup Compiler
© 2025 Amit Kumar. MIT License.