Semantic CSV transforms any CSV into a local semantic search engine. It detects text columns, builds embeddings, and lets you query rows using natural language—all from a fast, simple CLI powered by Polars, FAISS, and sentence-transformers.

adzynia/semantic-csv


Semantic CSV

Intelligent CSV search using sentence embeddings and FAISS

Python 3.11+ | License: MIT | Code style: black

Features | Installation | Quick Start | Usage | Documentation


Semantic CSV is a CLI tool that enables semantic search over CSV files. It automatically detects text columns, generates embeddings using sentence transformers, builds a FAISS similarity index, and provides fast, intuitive search capabilities.

Demo

# Index your CSV file
$ semantic-csv index products.csv --output-dir ./index
Indexing: products.csv
  Embedding 10 rows... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100%
Index created successfully!

# Search semantically
$ semantic-csv search "computer peripherals" --index ./index --k 3

Query: computer peripherals
Found 3 results in 624ms

┌────┬────────┬─────────────────────┬─────────────────────────┬─────────────┐
│ #  │ Score  │ title               │ description             │ category    │
├────┼────────┼─────────────────────┼─────────────────────────┼─────────────┤
│ 1  │ 0.519  │ Wireless Mouse      │ Ergonomic wireless...   │ Electronics │
│ 2  │ 0.476  │ Mechanical Keyboard │ RGB backlit gaming...   │ Electronics │
│ 3  │ 0.361  │ Monitor Arm         │ Dual monitor desk...    │ Accessories │
└────┴────────┴─────────────────────┴─────────────────────────┴─────────────┘

Features

  • Automatic text detection: Intelligently identifies columns suitable for semantic indexing
  • Fast similarity search: Uses FAISS for efficient nearest neighbor search
  • Rich CLI: Beautiful terminal output with progress bars and formatted tables
  • Flexible: Support for custom embedding models and configurable search parameters
  • Lightweight: Efficient storage with Parquet format and optimized indexing
  • Type-safe: Full type hints and clean, modular architecture

Why Semantic CSV?

Traditional CSV search relies on exact keyword matching. Semantic CSV understands meaning, not just keywords:

| Query | Traditional Search | Semantic Search ✨ |
|---|---|---|
| "computer peripherals" | ❌ No matches (exact keyword not in CSV) | ✅ Finds: mouse, keyboard, webcam |
| "ergonomic workspace setup" | ❌ Only finds "ergonomic mouse" | ✅ Finds: ergonomic mouse, laptop stand, monitor arm |
| "audio devices" | ❌ Misses "headphones" | ✅ Finds: headphones, webcam with mic |
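
To make the contrast concrete, here is a minimal sketch of the keyword-matching baseline in plain Python. It is not part of Semantic CSV: the function name `keyword_search` and the two sample rows are illustrative assumptions. It shows why an exact substring match returns nothing for "computer peripherals" when no row contains those words.

```python
def keyword_search(rows: list[dict[str, str]], query: str) -> list[dict[str, str]]:
    """Return rows where the whole query appears verbatim in any field."""
    needle = query.lower()
    return [
        row for row in rows
        if any(needle in value.lower() for value in row.values())
    ]


# Hypothetical sample rows, modeled on the products example in this README.
rows = [
    {"title": "Wireless Mouse", "description": "Ergonomic wireless mouse with USB receiver"},
    {"title": "Mechanical Keyboard", "description": "RGB backlit gaming keyboard"},
]

print(keyword_search(rows, "computer peripherals"))  # no row contains this phrase
print(keyword_search(rows, "ergonomic"))             # matches only the mouse row
```

An embedding-based search compares vectors instead of substrings, which is how a query can match rows that share no words with it.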

Use Cases

  • Product catalogs: Find similar products by description
  • Customer support: Search tickets by intent, not keywords
  • Research datasets: Discover related papers or data points
  • Documentation: Find relevant docs even with different wording
  • Data exploration: Understand what's in your CSV without manual inspection

Installation

Basic Installation

# Clone the repository
git clone https://github.com/adzynia/semantic-csv.git
cd semantic-csv

# Install the package
make install
# or
pip install -e .

Development Installation

# Install with development dependencies
make install-dev
# or
pip install -e ".[dev]"

GPU Support (Optional)

# Install with GPU-accelerated FAISS
make install-gpu
# or
pip install -e ".[gpu]"

Quick Start

1. Index a CSV file

semantic-csv index data.csv --output-dir ./my-index

This will:

  • Detect text columns automatically
  • Generate sentence embeddings
  • Build a FAISS index
  • Save everything to ./my-index

2. Search the index

semantic-csv search "your query here" --index ./my-index --k 5

This returns the 5 most similar rows with similarity scores.

3. View index information

semantic-csv info --index ./my-index

Shows metadata about the index including size, columns, and configuration.

Usage Examples

Basic Workflow

# Index a products CSV
semantic-csv index products.csv

# Search for similar products
semantic-csv search "wireless headphones" --k 10

# View index details
semantic-csv info

Custom Options

# Specify which columns to index
semantic-csv index data.csv \
  --text-columns title description \
  --output-dir ./custom-index

# Use a different embedding model
semantic-csv index data.csv \
  --model sentence-transformers/all-mpnet-base-v2 \
  --output-dir ./index

# Search with custom output columns
semantic-csv search "laptop" \
  --index ./index \
  --k 20 \
  --columns title price category

Complete Example

# Create a sample CSV
cat > products.csv << EOF
id,title,description,category,price
1,Wireless Mouse,Ergonomic wireless mouse with USB receiver,Electronics,29.99
2,Laptop Stand,Adjustable aluminum laptop stand,Accessories,49.99
3,Mechanical Keyboard,RGB backlit gaming keyboard,Electronics,89.99
4,USB-C Cable,High-speed charging cable,Cables,12.99
5,Desk Lamp,LED desk lamp with adjustable brightness,Lighting,34.99
EOF

# Index the CSV
semantic-csv index products.csv --output-dir ./products-index

# Search for related items
semantic-csv search "computer peripherals" --index ./products-index

# Search with specific output columns
semantic-csv search "ergonomic setup" --index ./products-index --columns title price

CLI Commands

index

Create a searchable index from a CSV file.

semantic-csv index <CSV_FILE> [OPTIONS]

Options:

  • --output-dir, -o PATH: Directory to save index files (default: ./index)
  • --text-columns, -c TEXT: Specific columns to embed (auto-detect if not specified)
  • --model, -m TEXT: Sentence transformer model name (default: all-MiniLM-L6-v2)
  • --batch-size, -b INT: Batch size for embedding generation (default: 32)

Example:

semantic-csv index data.csv \
  --output-dir ./my-index \
  --text-columns title description \
  --model all-mpnet-base-v2 \
  --batch-size 64
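
As a rough illustration of what a batch size means for embedding generation, the sketch below groups rows into consecutive batches in plain Python. It assumes the indexer embeds one batch per model call, which this README does not spell out; the helper name `batches` is hypothetical.

```python
from collections.abc import Iterator


def batches(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


texts = [f"row {i}" for i in range(70)]
print([len(b) for b in batches(texts, 32)])  # [32, 32, 6]
```

Larger batches mean fewer model calls but more memory held at once, which is the trade-off behind raising or lowering --batch-size.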

search

Search the index for similar rows.

semantic-csv search <QUERY> [OPTIONS]

Options:

  • --index, -i PATH: Directory containing index files (default: ./index)
  • --k INT: Number of results to return (default: 5)
  • --columns, -c TEXT: Specific columns to display in results

Example:

semantic-csv search "wireless devices" \
  --index ./my-index \
  --k 10 \
  --columns title category price

info

Show information about an index.

semantic-csv info [OPTIONS]

Options:

  • --index, -i PATH: Directory containing index files (default: ./index)

Example:

semantic-csv info --index ./my-index

version

Show version information.

semantic-csv version

Architecture

The project follows a clean, modular architecture:

semantic_csv/
├── __init__.py       # Package initialization
├── cli.py            # CLI layer (Typer commands)
├── indexer.py        # CSV indexing and embedding logic
├── searcher.py       # Similarity search logic
├── models.py         # Data models (dataclasses)
└── utils.py          # Helper functions

Index Structure

When you create an index, the following files are generated:

index/
├── index.faiss       # FAISS similarity index
├── rows.parquet      # Original CSV data in Parquet format
└── meta.json         # Index metadata (columns, model, dimensions)
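
A quick way to sanity-check an index directory is to confirm that the three files listed above exist. The sketch below does this with the standard library only. The helper name `missing_index_files` is an assumption for illustration and is not part of the package.

```python
from pathlib import Path

# File names taken from the index layout listed above.
EXPECTED_FILES = ("index.faiss", "rows.parquet", "meta.json")


def missing_index_files(index_dir: str) -> list[str]:
    """Return the expected index files that are absent from index_dir."""
    root = Path(index_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# An empty list means all three files are present.
print(missing_index_files("./index"))
```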

Development

Setup Development Environment

# Install with dev dependencies
make install-dev

# Format code
make format

# Run linting
make lint

# Type checking
make type-check

# Run tests
make test

# Run tests with coverage
make test-cov

Project Structure

semantic-csv/
├── semantic_csv/          # Main package
│   ├── __init__.py
│   ├── cli.py
│   ├── indexer.py
│   ├── searcher.py
│   ├── models.py
│   └── utils.py
├── tests/                 # Test suite (to be added)
├── examples/              # Example files
├── pyproject.toml         # Project configuration
├── Makefile               # Development commands
└── README.md              # This file

Running the Example

make run-example

This creates a sample products CSV, indexes it, and runs example searches.

How It Works

  1. Text Detection: The indexer automatically identifies columns containing meaningful text (skipping IDs, codes, etc.)

  2. Embedding Generation: Text from selected columns is combined and embedded using a sentence transformer model (default: all-MiniLM-L6-v2)

  3. FAISS Indexing: Embeddings are normalized and stored in a FAISS index optimized for cosine similarity search

  4. Search: Query text is embedded with the same model, and FAISS finds the most similar vectors

  5. Results: Original CSV rows corresponding to similar vectors are returned with similarity scores
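
Steps 3 to 5 can be sketched without any dependencies. The example below uses plain Python in place of FAISS and hand-written toy vectors in place of real embeddings, so it illustrates the idea rather than the project's implementation. The names `normalize` and `top_k` are assumptions. The key point is that once vectors are scaled to unit length, their inner product equals their cosine similarity.

```python
import math


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query: list[float], vectors: list[list[float]], k: int) -> list[tuple[int, float]]:
    """Return (row_index, score) pairs for the k highest inner products."""
    q = normalize(query)
    scores = [
        (i, sum(a * b for a, b in zip(q, normalize(v))))
        for i, v in enumerate(vectors)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 2-dimensional "embeddings"; real models produce hundreds of dimensions.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k([2.0, 0.1], vectors, k=2))
```

The returned row indices are what map back to the original CSV rows in step 5. This brute-force scan scores every vector, which is also what a flat index does.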

Performance

  • Indexing: ~1000 rows/second (varies by text length and model)
  • Search: <10ms for most queries (with IndexFlatIP)
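  • Storage: A flat float32 index needs 4 bytes per dimension per row, so about 15MB of vectors per 10k rows at 384 dimensions, plus the Parquet copy of the rows (depends on number of columns)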
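
As a rough cross-check on index size, the arithmetic below estimates raw vector storage. It assumes a flat float32 index (such as IndexFlatIP) that stores every vector uncompressed, and it ignores file headers and the Parquet file. The function name `flat_index_bytes` is hypothetical.

```python
def flat_index_bytes(rows: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw vector storage for a flat index: one float32 per dimension per row."""
    return rows * dim * bytes_per_value


print(flat_index_bytes(10_000, 384) / 1e6)  # 15.36 (MB, 384-dim model)
print(flat_index_bytes(10_000, 768) / 1e6)  # 30.72 (MB, 768-dim model)
```

Under these assumptions, doubling the embedding dimension doubles the index size, which is worth weighing when choosing a larger model.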

Supported Embedding Models

Any sentence-transformers model from the Hugging Face Hub should work. Popular choices:

  • all-MiniLM-L6-v2 (default): Fast, lightweight, 384-dim
  • all-mpnet-base-v2: Higher quality, 768-dim
  • multi-qa-mpnet-base-dot-v1: Optimized for question-answering
  • paraphrase-multilingual-mpnet-base-v2: Multi-language support

See SBERT Models for more options.

Requirements

  • Python 3.11+
  • Dependencies:
    • polars - Fast CSV loading
    • sentence-transformers - Embedding generation
    • faiss-cpu - Similarity search
    • typer - CLI framework
    • rich - Terminal formatting
    • numpy - Numerical operations
    • pyarrow - Parquet support

Troubleshooting

"No suitable text columns detected"

If auto-detection fails, manually specify columns:

semantic-csv index data.csv --text-columns col1 col2 col3
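
If you are unsure which columns to pass, a simple heuristic can shortlist candidates. The sketch below picks columns whose values average at least two words. This is an illustrative rule, not the detection logic Semantic CSV actually uses, and the name `guess_text_columns` and its threshold are assumptions.

```python
import csv
import io


def guess_text_columns(csv_text: str, min_avg_words: float = 2.0) -> list[str]:
    """Return columns whose values average at least min_avg_words words."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return []
    selected = []
    for name in rows[0]:
        # Missing values (short rows) count as empty strings.
        avg_words = sum(len((row[name] or "").split()) for row in rows) / len(rows)
        if avg_words >= min_avg_words:
            selected.append(name)
    return selected


sample = """id,title,description,category,price
1,Wireless Mouse,Ergonomic wireless mouse with USB receiver,Electronics,29.99
2,Laptop Stand,Adjustable aluminum laptop stand,Accessories,49.99
"""
print(guess_text_columns(sample))  # ['title', 'description']
```

Single-word columns such as id, category, and price fall below the threshold, so only the free-text columns remain.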

Large CSV files

For very large CSVs (>1M rows), consider:

  • Increasing batch size: --batch-size 128
  • Using a smaller embedding model
  • Processing in chunks (split the CSV)
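
Splitting a CSV into chunks can be done with the standard library. The sketch below writes fixed-size chunk files that each repeat the header row. It assumes the input has a header line, and the function name `split_csv` and the `chunk_0000.csv` naming scheme are illustrative choices.

```python
import csv
from pathlib import Path


def split_csv(path: str, rows_per_chunk: int, out_dir: str) -> list[Path]:
    """Write path into numbered chunk files, each repeating the header row."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    chunks: list[Path] = []
    with open(path, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader)  # assumes the first line is a header
        writer = None
        handle = None
        for i, row in enumerate(reader):
            if i % rows_per_chunk == 0:
                # Start a new chunk file and close the previous one.
                if handle:
                    handle.close()
                chunk_path = out / f"chunk_{len(chunks):04d}.csv"
                handle = open(chunk_path, "w", newline="", encoding="utf-8")
                writer = csv.writer(handle)
                writer.writerow(header)
                chunks.append(chunk_path)
            writer.writerow(row)
        if handle:
            handle.close()
    return chunks
```

For example, `split_csv("data.csv", 250_000, "./chunks")` would produce one file per 250,000 rows. Since incremental indexing is listed as planned rather than implemented, each chunk would presumably be indexed into its own --output-dir.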

Out of memory

  • Reduce batch size: --batch-size 16
  • Use a smaller embedding model
  • Close other applications

License

MIT License - See LICENSE file for details

Technical Highlights

This project demonstrates several software engineering best practices:

Architecture & Design

  • Clean separation of concerns: Business logic separated from CLI layer
  • Type-safe: Full type hints throughout the codebase
  • Modular design: Easy to extend with new models or index types
  • Data classes: Immutable, well-structured data models

Performance & Efficiency

  • Fast CSV processing: Polars for efficient data manipulation
  • Optimized storage: Parquet format reduces disk usage
  • FAISS indexing: Fast similarity search (typically under 10ms per query, as noted under Performance)
  • Lazy loading: Resources loaded only when needed

Developer Experience

  • Rich CLI: Beautiful terminal output with progress bars
  • Comprehensive docs: Docstrings, type hints, and usage examples
  • Development tools: Makefile for common tasks, pre-configured linting
  • Error handling: Graceful failures with helpful error messages

Technology Stack

  • polars: High-performance DataFrame library (often faster than pandas for CSV loading)
  • sentence-transformers: State-of-the-art text embeddings
  • FAISS: Facebook's similarity search library
  • typer: Modern CLI framework with type hints
  • rich: Terminal formatting and progress bars

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for detailed guidelines.

Quick start:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Run make format and make lint
  5. Submit a pull request

Roadmap

Implemented ✅

  • Core semantic search functionality
  • Automatic text column detection
  • FAISS similarity indexing
  • Rich CLI with progress bars
  • Configurable embedding models
  • Efficient Parquet storage
  • Full type hints and documentation

Planned 🚧

  • Comprehensive unit test suite
  • Support for incremental indexing (add rows to existing index)
  • Multi-language text detection
  • Query caching for repeated searches
  • Web interface (optional)
  • Export search results to CSV/JSON
  • IVF indices for billion-scale datasets
  • Batch search API

Contact

For questions, issues, or suggestions, please open an issue on GitHub.
