Intelligent CSV search using sentence embeddings and FAISS
Features • Installation • Quick Start • Usage • Documentation
Semantic CSV is a CLI tool that enables semantic search over CSV files. It automatically detects text columns, generates embeddings using sentence transformers, builds a FAISS similarity index, and provides fast, intuitive search capabilities.
```
# Index your CSV file
$ semantic-csv index products.csv --output-dir ./index

Indexing: products.csv
Embedding 10 rows... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100%
Index created successfully!

# Search semantically
$ semantic-csv search "computer peripherals" --index ./index --k 3

Query: computer peripherals
Found 3 results in 624ms
┌────┬────────┬─────────────────────┬─────────────────────────┬─────────────┐
│  # │  Score │ title               │ description             │ category    │
├────┼────────┼─────────────────────┼─────────────────────────┼─────────────┤
│  1 │  0.519 │ Wireless Mouse      │ Ergonomic wireless...   │ Electronics │
│  2 │  0.476 │ Mechanical Keyboard │ RGB backlit gaming...   │ Electronics │
│  3 │  0.361 │ Monitor Arm         │ Dual monitor desk...    │ Accessories │
└────┴────────┴─────────────────────┴─────────────────────────┴─────────────┘
```

- Automatic text detection: Intelligently identifies columns suitable for semantic indexing
- Fast similarity search: Uses FAISS for efficient approximate nearest neighbor search
- Rich CLI: Beautiful terminal output with progress bars and formatted tables
- Flexible: Support for custom embedding models and configurable search parameters
- Lightweight: Efficient storage with Parquet format and optimized indexing
- Type-safe: Full type hints and clean, modular architecture
Traditional CSV search relies on exact keyword matching. Semantic CSV understands meaning, not just keywords:
| Query | Traditional Search | Semantic Search ✨ |
|---|---|---|
| "computer peripherals" | ❌ No matches (exact keyword not in CSV) | ✅ Finds: mouse, keyboard, webcam |
| "ergonomic workspace setup" | ❌ Only finds "ergonomic mouse" | ✅ Finds: ergonomic mouse, laptop stand, monitor arm |
| "audio devices" | ❌ Misses "headphones" | ✅ Finds: headphones, webcam with mic |
- Product catalogs: Find similar products by description
- Customer support: Search tickets by intent, not keywords
- Research datasets: Discover related papers or data points
- Documentation: Find relevant docs even with different wording
- Data exploration: Understand what's in your CSV without manual inspection
```
# Clone the repository
git clone https://github.com/yourusername/semantic-csv.git
cd semantic-csv

# Install the package
make install
# or
pip install -e .

# Install with development dependencies
make install-dev
# or
pip install -e ".[dev]"

# Install with GPU-accelerated FAISS
make install-gpu
# or
pip install -e ".[gpu]"
```

```
semantic-csv index data.csv --output-dir ./my-index
```

This will:
- Detect text columns automatically
- Generate sentence embeddings
- Build a FAISS index
- Save everything to `./my-index`
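The auto-detection step can be pictured with a simple heuristic: treat a column as "text" when its values are reasonably long and varied, which filters out ID and code columns. This is an illustrative sketch (the function name, thresholds, and logic are made up for this example, not the tool's actual implementation):

```python
def looks_like_text(values: list[str], min_avg_len: float = 15.0,
                    min_unique_ratio: float = 0.5) -> bool:
    """Heuristic sketch: long-ish, varied string values suggest free
    text rather than IDs or category codes."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return False
    avg_len = sum(len(v) for v in non_empty) / len(non_empty)
    unique_ratio = len(set(non_empty)) / len(non_empty)
    return avg_len >= min_avg_len and unique_ratio >= min_unique_ratio

columns = {
    "id": ["1", "2", "3"],
    "title": ["Wireless Mouse", "Laptop Stand", "Mechanical Keyboard"],
    "description": [
        "Ergonomic wireless mouse with USB receiver",
        "Adjustable aluminum laptop stand",
        "RGB backlit gaming keyboard",
    ],
    "category": ["Electronics", "Accessories", "Electronics"],
}

text_columns = [name for name, vals in columns.items() if looks_like_text(vals)]
print(text_columns)  # → ['title', 'description']
```

Short, repetitive columns like `id` and `category` are skipped; free-text columns like `title` and `description` are kept for embedding.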
```
semantic-csv search "your query here" --index ./my-index --k 5
```

This returns the 5 most similar rows with similarity scores.
```
semantic-csv info --index ./my-index
```

Shows metadata about the index including size, columns, and configuration.
```
# Index a products CSV
semantic-csv index products.csv

# Search for similar products
semantic-csv search "wireless headphones" --k 10

# View index details
semantic-csv info
```

```
# Specify which columns to index
semantic-csv index data.csv \
  --text-columns title description \
  --output-dir ./custom-index

# Use a different embedding model
semantic-csv index data.csv \
  --model sentence-transformers/all-mpnet-base-v2 \
  --output-dir ./index

# Search with custom output columns
semantic-csv search "laptop" \
  --index ./index \
  --k 20 \
  --columns title price category
```

```
# Create a sample CSV
cat > products.csv << EOF
id,title,description,category,price
1,Wireless Mouse,Ergonomic wireless mouse with USB receiver,Electronics,29.99
2,Laptop Stand,Adjustable aluminum laptop stand,Accessories,49.99
3,Mechanical Keyboard,RGB backlit gaming keyboard,Electronics,89.99
4,USB-C Cable,High-speed charging cable,Cables,12.99
5,Desk Lamp,LED desk lamp with adjustable brightness,Lighting,34.99
EOF

# Index the CSV
semantic-csv index products.csv --output-dir ./products-index

# Search for related items
semantic-csv search "computer peripherals" --index ./products-index

# Search with specific output columns
semantic-csv search "ergonomic setup" --index ./products-index --columns title price
```

Create a searchable index from a CSV file.
```
semantic-csv index <CSV_FILE> [OPTIONS]
```

Options:

- `--output-dir, -o PATH`: Directory to save index files (default: `./index`)
- `--text-columns, -c TEXT`: Specific columns to embed (auto-detected if not specified)
- `--model, -m TEXT`: Sentence transformer model name (default: `all-MiniLM-L6-v2`)
- `--batch-size, -b INT`: Batch size for embedding generation (default: `32`)
Example:
```
semantic-csv index data.csv \
  --output-dir ./my-index \
  --text-columns title description \
  --model all-mpnet-base-v2 \
  --batch-size 64
```

Search the index for similar rows.
```
semantic-csv search <QUERY> [OPTIONS]
```

Options:

- `--index, -i PATH`: Directory containing index files (default: `./index`)
- `--k INT`: Number of results to return (default: `5`)
- `--columns, -c TEXT`: Specific columns to display in results
Example:
```
semantic-csv search "wireless devices" \
  --index ./my-index \
  --k 10 \
  --columns title category price
```

Show information about an index.

```
semantic-csv info [OPTIONS]
```

Options:

- `--index, -i PATH`: Directory containing index files (default: `./index`)
Example:
```
semantic-csv info --index ./my-index
```

Show version information.

```
semantic-csv version
```

The project follows a clean, modular architecture:
```
semantic_csv/
├── __init__.py      # Package initialization
├── cli.py           # CLI layer (Typer commands)
├── indexer.py       # CSV indexing and embedding logic
├── searcher.py      # Similarity search logic
├── models.py        # Data models (dataclasses)
└── utils.py         # Helper functions
```
When you create an index, the following files are generated:
```
index/
├── index.faiss      # FAISS similarity index
├── rows.parquet     # Original CSV data in Parquet format
└── meta.json        # Index metadata (columns, model, dimensions)
```
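The metadata file is plain JSON. As an illustration of what it has to carry (the field names below are hypothetical, not the tool's exact schema), a searcher needs at least the model name, embedding dimensions, and indexed columns so a query can be embedded the same way the rows were:

```python
import json
import os
import tempfile

# Hypothetical metadata contents, for illustration only.
meta = {
    "model": "all-MiniLM-L6-v2",
    "dimensions": 384,
    "text_columns": ["title", "description"],
    "num_rows": 5,
}

index_dir = tempfile.mkdtemp()
path = os.path.join(index_dir, "meta.json")
with open(path, "w") as f:
    json.dump(meta, f, indent=2)

# At search time, this is read back so queries are embedded
# with the same model the index was built with.
with open(path) as f:
    loaded = json.load(f)
print(loaded["model"], loaded["dimensions"])
```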
```
# Install with dev dependencies
make install-dev

# Format code
make format

# Run linting
make lint

# Type checking
make type-check

# Run tests
make test

# Run tests with coverage
make test-cov
```

```
semantic-csv/
├── semantic_csv/        # Main package
│   ├── __init__.py
│   ├── cli.py
│   ├── indexer.py
│   ├── searcher.py
│   ├── models.py
│   └── utils.py
├── tests/               # Test suite (to be added)
├── examples/            # Example files
├── pyproject.toml       # Project configuration
├── Makefile             # Development commands
└── README.md            # This file
```
```
make run-example
```

This creates a sample products CSV, indexes it, and runs example searches.
- Text detection: The indexer automatically identifies columns containing meaningful text (skipping IDs, codes, etc.)
- Embedding generation: Text from the selected columns is combined and embedded using a sentence transformer model (default: `all-MiniLM-L6-v2`)
- FAISS indexing: Embeddings are normalized and stored in a FAISS index optimized for cosine similarity search
- Search: Query text is embedded with the same model, and FAISS finds the most similar vectors
- Results: Original CSV rows corresponding to the similar vectors are returned with similarity scores
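The normalization step relies on a standard identity: the inner product of two L2-normalized vectors equals their cosine similarity, which is why an exact inner-product index such as FAISS's `IndexFlatIP` returns cosine-ranked neighbors. Here is a numpy-only sketch of that math, with tiny made-up 4-dimensional vectors standing in for real sentence embeddings:

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    """L2-normalize rows so inner product == cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Toy "embeddings" for three rows (real models produce 384+ dims).
row_vectors = normalize(np.array([
    [0.9, 0.1, 0.0, 0.1],   # row 0
    [0.8, 0.2, 0.1, 0.0],   # row 1
    [0.0, 0.1, 0.9, 0.2],   # row 2
], dtype=np.float32))

query = normalize(np.array([[1.0, 0.0, 0.0, 0.0]], dtype=np.float32))

# Inner products of normalized vectors are cosine similarities;
# FAISS's IndexFlatIP performs this same computation at scale.
scores = (row_vectors @ query.T).ravel()
top_k = np.argsort(-scores)[:2]          # indices of the 2 best rows
print(top_k, scores[top_k])
```

Rows 0 and 1 point in nearly the same direction as the query and score highest; row 2 is nearly orthogonal and ranks last.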
- Indexing: ~1000 rows/second (varies by text length and model)
- Search: <10ms for most queries (with IndexFlatIP)
- Storage: Typically 5-10MB per 10k rows (depends on number of columns)
Any sentence-transformers model from HuggingFace Hub works. Popular choices:
- `all-MiniLM-L6-v2` (default): Fast, lightweight, 384-dim
- `all-mpnet-base-v2`: Higher quality, 768-dim
- `multi-qa-mpnet-base-dot-v1`: Optimized for question answering
- `paraphrase-multilingual-mpnet-base-v2`: Multi-language support
See SBERT Models for more options.
- Python 3.11+
- Dependencies:
  - `polars` - Fast CSV loading
  - `sentence-transformers` - Embedding generation
  - `faiss-cpu` - Similarity search
  - `typer` - CLI framework
  - `rich` - Terminal formatting
  - `numpy` - Numerical operations
  - `pyarrow` - Parquet support
If auto-detection fails, manually specify columns:

```
semantic-csv index data.csv --text-columns col1 col2 col3
```

For very large CSVs (>1M rows), consider:
- Increasing the batch size: `--batch-size 128`
- Using a smaller embedding model
- Processing in chunks (split the CSV)
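Splitting the CSV can be done with a short script. The following is a generic stdlib sketch (the `split_csv` helper is made up for this example; semantic-csv itself has no chunking flag): stream the rows, write fixed-size pieces that each repeat the header, then index each piece separately.

```python
import csv
import itertools
from pathlib import Path

def split_csv(src: str, rows_per_chunk: int, out_dir: str) -> list[Path]:
    """Stream `src` and write chunks of `rows_per_chunk` data rows,
    repeating the header in each chunk file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    chunks = []
    with open(src, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        for i in itertools.count():
            rows = list(itertools.islice(reader, rows_per_chunk))
            if not rows:
                break
            path = out / f"chunk_{i:03d}.csv"
            with open(path, "w", newline="") as g:
                writer = csv.writer(g)
                writer.writerow(header)
                writer.writerows(rows)
            chunks.append(path)
    return chunks

# Each chunk can then be indexed on its own, e.g.:
#   semantic-csv index chunk_000.csv --output-dir ./index-000
```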
If indexing runs out of memory:

- Reduce the batch size: `--batch-size 16`
- Use a smaller embedding model
- Close other applications
MIT License - See LICENSE file for details
This project demonstrates several software engineering best practices:
- Clean separation of concerns: Business logic separated from CLI layer
- Type-safe: Full type hints throughout the codebase
- Modular design: Easy to extend with new models or index types
- Data classes: Immutable, well-structured data models
- Fast CSV processing: Polars for efficient data manipulation
- Optimized storage: Parquet format reduces disk usage
- FAISS indexing: Sub-millisecond similarity search
- Lazy loading: Resources loaded only when needed
- Rich CLI: Beautiful terminal output with progress bars
- Comprehensive docs: Docstrings, type hints, and usage examples
- Development tools: Makefile for common tasks, pre-configured linting
- Error handling: Graceful failures with helpful error messages
- polars: High-performance DataFrame library (faster than pandas)
- sentence-transformers: State-of-the-art text embeddings
- FAISS: Facebook's similarity search library
- typer: Modern CLI framework with type hints
- rich: Terminal formatting and progress bars
Contributions are welcome! Please see CONTRIBUTING.md for detailed guidelines.
Quick start:
- Fork the repository
- Create a feature branch
- Make your changes
- Run `make format` and `make lint`
- Submit a pull request
- Core semantic search functionality
- Automatic text column detection
- FAISS similarity indexing
- Rich CLI with progress bars
- Configurable embedding models
- Efficient Parquet storage
- Full type hints and documentation
- Comprehensive unit test suite
- Support for incremental indexing (add rows to existing index)
- Multi-language text detection
- Query caching for repeated searches
- Web interface (optional)
- Export search results to CSV/JSON
- IVF indices for billion-scale datasets
- Batch search API
For questions, issues, or suggestions, please open an issue on GitHub.