Intelligent CSV search using sentence embeddings and FAISS
Features • Installation • Quick Start • Usage • Documentation
Semantic CSV is a CLI tool that enables semantic search over CSV files. It automatically detects text columns, generates embeddings using sentence transformers, builds a FAISS similarity index, and provides fast, intuitive search capabilities.
```
# Index your CSV file
$ semantic-csv index products.csv --output-dir ./index

Indexing: products.csv
Embedding 10 rows... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100%
Index created successfully!

# Search semantically
$ semantic-csv search "computer peripherals" --index ./index --k 3

Query: computer peripherals
Found 3 results in 624ms
┌────┬────────┬─────────────────────┬─────────────────────────┬─────────────┐
│  # │  Score │ title               │ description             │ category    │
├────┼────────┼─────────────────────┼─────────────────────────┼─────────────┤
│  1 │  0.519 │ Wireless Mouse      │ Ergonomic wireless...   │ Electronics │
│  2 │  0.476 │ Mechanical Keyboard │ RGB backlit gaming...   │ Electronics │
│  3 │  0.361 │ Monitor Arm         │ Dual monitor desk...    │ Accessories │
└────┴────────┴─────────────────────┴─────────────────────────┴─────────────┘
```

- Automatic text detection: Intelligently identifies columns suitable for semantic indexing
- Fast similarity search: Uses FAISS for efficient approximate nearest neighbor search
- Rich CLI: Beautiful terminal output with progress bars and formatted tables
- Flexible: Support for custom embedding models and configurable search parameters
- Lightweight: Efficient storage with Parquet format and optimized indexing
- Type-safe: Full type hints and clean, modular architecture
Traditional CSV search relies on exact keyword matching. Semantic CSV understands meaning, not just keywords:
| Query | Traditional Search | Semantic Search ✨ |
|---|---|---|
| "computer peripherals" | ❌ No matches (exact keyword not in CSV) | ✅ Finds: mouse, keyboard, webcam |
| "ergonomic workspace setup" | ❌ Only finds "ergonomic mouse" | ✅ Finds: ergonomic mouse, laptop stand, monitor arm |
| "audio devices" | ❌ Misses "headphones" | ✅ Finds: headphones, webcam with mic |
- Product catalogs: Find similar products by description
- Customer support: Search tickets by intent, not keywords
- Research datasets: Discover related papers or data points
- Documentation: Find relevant docs even with different wording
- Data exploration: Understand what's in your CSV without manual inspection
```
# Clone the repository
git clone https://github.com/yourusername/semantic-csv.git
cd semantic-csv

# Install the package
make install
# or
pip install -e .

# Install with development dependencies
make install-dev
# or
pip install -e ".[dev]"

# Install with GPU-accelerated FAISS
make install-gpu
# or
pip install -e ".[gpu]"
```

```
semantic-csv index data.csv --output-dir ./my-index
```

This will:
- Detect text columns automatically
- Generate sentence embeddings
- Build a FAISS index
- Save everything to `./my-index`
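The auto-detection step can be pictured with a simple heuristic: treat a column as "text" when its values are reasonably long and varied, which filters out ID and code columns. This is an illustrative sketch (the function name, thresholds, and logic are made up for this example, not the tool's actual implementation):

```python
def looks_like_text(values: list[str], min_avg_len: float = 15.0,
                    min_unique_ratio: float = 0.5) -> bool:
    """Heuristic sketch: long-ish, varied string values suggest free
    text rather than IDs or category codes."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return False
    avg_len = sum(len(v) for v in non_empty) / len(non_empty)
    unique_ratio = len(set(non_empty)) / len(non_empty)
    return avg_len >= min_avg_len and unique_ratio >= min_unique_ratio

columns = {
    "id": ["1", "2", "3"],
    "title": ["Wireless Mouse", "Laptop Stand", "Mechanical Keyboard"],
    "description": [
        "Ergonomic wireless mouse with USB receiver",
        "Adjustable aluminum laptop stand",
        "RGB backlit gaming keyboard",
    ],
    "category": ["Electronics", "Accessories", "Electronics"],
}

text_columns = [name for name, vals in columns.items() if looks_like_text(vals)]
print(text_columns)  # → ['title', 'description']
```

Short, repetitive columns like `id` and `category` are skipped; free-text columns like `title` and `description` are kept for embedding.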
```
semantic-csv search "your query here" --index ./my-index --k 5
```

This returns the 5 most similar rows with similarity scores.
```
semantic-csv info --index ./my-index
```

Shows metadata about the index including size, columns, and configuration.
```
# Index a products CSV
semantic-csv index products.csv

# Search for similar products
semantic-csv search "wireless headphones" --k 10

# View index details
semantic-csv info
```

```
# Specify which columns to index
semantic-csv index data.csv \
  --text-columns title description \
  --output-dir ./custom-index

# Use a different embedding model
semantic-csv index data.csv \
  --model sentence-transformers/all-mpnet-base-v2 \
  --output-dir ./index

# Search with custom output columns
semantic-csv search "laptop" \
  --index ./index \
  --k 20 \
  --columns title price category
```

```
# Create a sample CSV
cat > products.csv << EOF
id,title,description,category,price
1,Wireless Mouse,Ergonomic wireless mouse with USB receiver,Electronics,29.99
2,Laptop Stand,Adjustable aluminum laptop stand,Accessories,49.99
3,Mechanical Keyboard,RGB backlit gaming keyboard,Electronics,89.99
4,USB-C Cable,High-speed charging cable,Cables,12.99
5,Desk Lamp,LED desk lamp with adjustable brightness,Lighting,34.99
EOF

# Index the CSV
semantic-csv index products.csv --output-dir ./products-index

# Search for related items
semantic-csv search "computer peripherals" --index ./products-index

# Search with specific output columns
semantic-csv search "ergonomic setup" --index ./products-index --columns title price
```

Create a searchable index from a CSV file.
```
semantic-csv index <CSV_FILE> [OPTIONS]
```

Options:

- `--output-dir, -o PATH`: Directory to save index files (default: `./index`)
- `--text-columns, -c TEXT`: Specific columns to embed (auto-detected if not specified)
- `--model, -m TEXT`: Sentence transformer model name (default: `all-MiniLM-L6-v2`)
- `--batch-size, -b INT`: Batch size for embedding generation (default: `32`)
Example:
```
semantic-csv index data.csv \
  --output-dir ./my-index \
  --text-columns title description \
  --model all-mpnet-base-v2 \
  --batch-size 64
```

Search the index for similar rows.
```
semantic-csv search <QUERY> [OPTIONS]
```

Options:

- `--index, -i PATH`: Directory containing index files (default: `./index`)
- `--k INT`: Number of results to return (default: `5`)
- `--columns, -c TEXT`: Specific columns to display in results
Example:
```
semantic-csv search "wireless devices" \
  --index ./my-index \
  --k 10 \
  --columns title category price
```

Show information about an index.

```
semantic-csv info [OPTIONS]
```

Options:

- `--index, -i PATH`: Directory containing index files (default: `./index`)
Example:
```
semantic-csv info --index ./my-index
```

Show version information.

```
semantic-csv version
```

The project follows a clean, modular architecture:
```
semantic_csv/
├── __init__.py      # Package initialization
├── cli.py           # CLI layer (Typer commands)
├── indexer.py       # CSV indexing and embedding logic
├── searcher.py      # Similarity search logic
├── models.py        # Data models (dataclasses)
└── utils.py         # Helper functions
```
When you create an index, the following files are generated:
```
index/
├── index.faiss      # FAISS similarity index
├── rows.parquet     # Original CSV data in Parquet format
└── meta.json        # Index metadata (columns, model, dimensions)
```
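The metadata file is plain JSON. As an illustration of what it has to carry (the field names below are hypothetical, not the tool's exact schema), a searcher needs at least the model name, embedding dimensions, and indexed columns so a query can be embedded the same way the rows were:

```python
import json
import os
import tempfile

# Hypothetical metadata contents, for illustration only.
meta = {
    "model": "all-MiniLM-L6-v2",
    "dimensions": 384,
    "text_columns": ["title", "description"],
    "num_rows": 5,
}

index_dir = tempfile.mkdtemp()
path = os.path.join(index_dir, "meta.json")
with open(path, "w") as f:
    json.dump(meta, f, indent=2)

# At search time, this is read back so queries are embedded
# with the same model the index was built with.
with open(path) as f:
    loaded = json.load(f)
print(loaded["model"], loaded["dimensions"])
```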
```
# Install with dev dependencies
make install-dev

# Format code
make format

# Run linting
make lint

# Type checking
make type-check

# Run tests
make test

# Run tests with coverage
make test-cov
```

```
semantic-csv/
├── semantic_csv/        # Main package
│   ├── __init__.py
│   ├── cli.py
│   ├── indexer.py
│   ├── searcher.py
│   ├── models.py
│   └── utils.py
├── tests/               # Test suite (to be added)
├── examples/            # Example files
├── pyproject.toml       # Project configuration
├── Makefile             # Development commands
└── README.md            # This file
```
```
make run-example
```

This creates a sample products CSV, indexes it, and runs example searches.
- Text detection: The indexer automatically identifies columns containing meaningful text (skipping IDs, codes, etc.)
- Embedding generation: Text from the selected columns is combined and embedded using a sentence transformer model (default: `all-MiniLM-L6-v2`)
- FAISS indexing: Embeddings are normalized and stored in a FAISS index optimized for cosine similarity search
- Search: Query text is embedded with the same model, and FAISS finds the most similar vectors
- Results: Original CSV rows corresponding to the similar vectors are returned with similarity scores
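The normalization step relies on a standard identity: the inner product of two L2-normalized vectors equals their cosine similarity, which is why an exact inner-product index such as FAISS's `IndexFlatIP` returns cosine-ranked neighbors. Here is a numpy-only sketch of that math, with tiny made-up 4-dimensional vectors standing in for real sentence embeddings:

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    """L2-normalize rows so inner product == cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Toy "embeddings" for three rows (real models produce 384+ dims).
row_vectors = normalize(np.array([
    [0.9, 0.1, 0.0, 0.1],   # row 0
    [0.8, 0.2, 0.1, 0.0],   # row 1
    [0.0, 0.1, 0.9, 0.2],   # row 2
], dtype=np.float32))

query = normalize(np.array([[1.0, 0.0, 0.0, 0.0]], dtype=np.float32))

# Inner products of normalized vectors are cosine similarities;
# FAISS's IndexFlatIP performs this same computation at scale.
scores = (row_vectors @ query.T).ravel()
top_k = np.argsort(-scores)[:2]          # indices of the 2 best rows
print(top_k, scores[top_k])
```

Rows 0 and 1 point in nearly the same direction as the query and score highest; row 2 is nearly orthogonal and ranks last.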
- Indexing: ~1000 rows/second (varies by text length and model)
- Search: <10ms for most queries (with IndexFlatIP)
- Storage: Typically 5-10MB per 10k rows (depends on number of columns)
Any sentence-transformers model from HuggingFace Hub works. Popular choices:
- `all-MiniLM-L6-v2` (default): Fast, lightweight, 384-dim
- `all-mpnet-base-v2`: Higher quality, 768-dim
- `multi-qa-mpnet-base-dot-v1`: Optimized for question answering
- `paraphrase-multilingual-mpnet-base-v2`: Multi-language support
See SBERT Models for more options.
- Python 3.11+
- Dependencies:
  - `polars` - Fast CSV loading
  - `sentence-transformers` - Embedding generation
  - `faiss-cpu` - Similarity search
  - `typer` - CLI framework
  - `rich` - Terminal formatting
  - `numpy` - Numerical operations
  - `pyarrow` - Parquet support
If auto-detection fails, manually specify columns:

```
semantic-csv index data.csv --text-columns col1 col2 col3
```

For very large CSVs (>1M rows), consider:
- Increasing the batch size: `--batch-size 128`
- Using a smaller embedding model
- Processing in chunks (split the CSV)
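Splitting the CSV can be done with a short script. The following is a generic stdlib sketch (the `split_csv` helper is made up for this example; semantic-csv itself has no chunking flag): stream the rows, write fixed-size pieces that each repeat the header, then index each piece separately.

```python
import csv
import itertools
from pathlib import Path

def split_csv(src: str, rows_per_chunk: int, out_dir: str) -> list[Path]:
    """Stream `src` and write chunks of `rows_per_chunk` data rows,
    repeating the header in each chunk file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    chunks = []
    with open(src, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        for i in itertools.count():
            rows = list(itertools.islice(reader, rows_per_chunk))
            if not rows:
                break
            path = out / f"chunk_{i:03d}.csv"
            with open(path, "w", newline="") as g:
                writer = csv.writer(g)
                writer.writerow(header)
                writer.writerows(rows)
            chunks.append(path)
    return chunks

# Each chunk can then be indexed on its own, e.g.:
#   semantic-csv index chunk_000.csv --output-dir ./index-000
```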
If indexing runs out of memory:

- Reduce the batch size: `--batch-size 16`
- Use a smaller embedding model
- Close other applications
MIT License - See LICENSE file for details
This project demonstrates several software engineering best practices:
- Clean separation of concerns: Business logic separated from CLI layer
- Type-safe: Full type hints throughout the codebase
- Modular design: Easy to extend with new models or index types
- Data classes: Immutable, well-structured data models
- Fast CSV processing: Polars for efficient data manipulation
- Optimized storage: Parquet format reduces disk usage
- FAISS indexing: Sub-millisecond similarity search
- Lazy loading: Resources loaded only when needed
- Rich CLI: Beautiful terminal output with progress bars
- Comprehensive docs: Docstrings, type hints, and usage examples
- Development tools: Makefile for common tasks, pre-configured linting
- Error handling: Graceful failures with helpful error messages
- polars: High-performance DataFrame library (faster than pandas)
- sentence-transformers: State-of-the-art text embeddings
- FAISS: Facebook's similarity search library
- typer: Modern CLI framework with type hints
- rich: Terminal formatting and progress bars
Contributions are welcome! Please see CONTRIBUTING.md for detailed guidelines.
Quick start:
- Fork the repository
- Create a feature branch
- Make your changes
- Run `make format` and `make lint`
- Submit a pull request
- Core semantic search functionality
- Automatic text column detection
- FAISS similarity indexing
- Rich CLI with progress bars
- Configurable embedding models
- Efficient Parquet storage
- Full type hints and documentation
- Comprehensive unit test suite
- Support for incremental indexing (add rows to existing index)
- Multi-language text detection
- Query caching for repeated searches
- Web interface (optional)
- Export search results to CSV/JSON
- IVF indices for billion-scale datasets
- Batch search API
For questions, issues, or suggestions, please open an issue on GitHub.