LLM Embeddings Tutorial

Initial Setup

First, make sure the plugin that provides local embedding models is installed and the model is registered.

Register Embedding Model

We’ll use the all-MiniLM-L6-v2 model from the llm-sentence-transformers plugin for local embeddings.

llm install llm-sentence-transformers
llm sentence-transformers register all-MiniLM-L6-v2

Basic Embeddings

Let’s start with some simple embedding examples.

Generate Single Embedding

Create an embedding for a basic string.

llm embed -m sentence-transformers/all-MiniLM-L6-v2 -c "Hello world"

Create Collection

Store embeddings in a named collection.

# Store first phrase
llm embed phrases hello -m sentence-transformers/all-MiniLM-L6-v2 -c "Hello world"

# Store second phrase (the collection remembers its model, so -m can be omitted)
llm embed phrases goodbye -c "Goodbye world"

# View collections
llm embed-db collections

File Processing

Process multiple files and create embeddings.

Embed README Files

Process all the Org documents in the repository (they serve as its README-style docs); a Python equivalent is sketched after the command.

llm embed-multi readmes \
  --model sentence-transformers/all-MiniLM-L6-v2 \
  --files . '**/*.org'
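
The same batch ingestion can be sketched with the llm Python API; the glob and collection name below simply mirror the CLI call above.

#!/usr/bin/env python3
# Sketch: embed every .org file in the repository via the Python API,
# mirroring `llm embed-multi readmes --files . '**/*.org'`.
from pathlib import Path

import llm

collection = llm.Collection(
    "readmes", model_id="sentence-transformers/all-MiniLM-L6-v2"
)
collection.embed_multi(
    ((str(path), path.read_text()) for path in Path(".").glob("**/*.org")),
    store=True,
)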

Search Similar Content

Find similar content in our embedded documents.

llm similar readmes -c "llm commands"
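
llm similar ranks stored items by the cosine similarity between their vectors and the query's vector. A minimal sketch of that calculation, with placeholder phrases:

#!/usr/bin/env python3
# Sketch: the cosine-similarity score that `llm similar` reports,
# computed by hand for one query and one document.
import math

import llm


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


model = llm.get_embedding_model("sentence-transformers/all-MiniLM-L6-v2")
query = model.embed("llm commands")
document = model.embed("Command line usage of the llm tool")
print(f"cosine similarity: {cosine_similarity(query, document):.3f}")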

Clustering Examples

Demonstrate clustering capabilities with embeddings.

Set Up Clustering

First, install the clustering plugin.

llm install llm-cluster

Process Repository Issues

Fetch the repository's GitHub issues and embed their titles.

curl -s "https://api.github.com/repos/defrecord/llm-lab/issues" | \
  jq '[.[] | {id: .id, title: .title}]' | \
  llm embed-multi llm-lab-issues - \
    --database data/embeddings/issues.db \
    --model sentence-transformers/all-MiniLM-L6-v2 \
    --store

# Run clustering analysis
llm cluster llm-lab-issues 5 --database data/embeddings/issues.db --summary
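
Conceptually, the plugin groups the stored vectors with k-means. A rough sketch of the same idea using scikit-learn directly (assumes scikit-learn is installed; the issue titles are placeholders):

#!/usr/bin/env python3
# Sketch: k-means clustering of embedding vectors, roughly what
# llm-cluster does for a collection. Issue titles are made-up examples.
import llm
from sklearn.cluster import KMeans

model = llm.get_embedding_model("sentence-transformers/all-MiniLM-L6-v2")
titles = [
    "Add a clustering tutorial",
    "Clustering fails on an empty collection",
    "Fix typo in the embeddings doc",
    "Document the Python API",
]
vectors = [model.embed(title) for title in titles]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(vectors)
for title, label in zip(titles, labels):
    print(f"cluster {label}: {title}")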

Python Integration

Examples of using the llm Python API for embeddings.

#!/usr/bin/env python3
import llm

def embed_text():
    """Example of embedding text with Python."""
    model = llm.get_embedding_model("sentence-transformers/all-MiniLM-L6-v2")
    vector = model.embed("This is text to embed")
    print(f"Embedding vector: {vector[:5]}...")  # Show first 5 elements

def work_with_collections():
    """Example of working with collections."""
    collection = llm.Collection("entries", 
                              model_id="sentence-transformers/all-MiniLM-L6-v2")
    
    # Store items with metadata
    collection.embed_multi(
        [
            ("code", "Python implementation details"),
            ("docs", "Documentation and examples"),
            ("test", "Test suite and coverage"),
        ],
        store=True,
    )
    
    # Find similar items
    results = collection.similar("implementation guide")
    for result in results:
        print(f"Match: {result.id} - Score: {result.score}")

if __name__ == "__main__":
    embed_text()
    work_with_collections()

Advanced Examples

JSON Processing

By default, llm embed prints a JSON array of floating point numbers, which can be piped straight into jq.

llm embed -m sentence-transformers/all-MiniLM-L6-v2 -c "Advanced example" | \
  jq 'length'
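
The same dimensionality check from Python, using the model directly:

#!/usr/bin/env python3
# Check the embedding dimensionality from Python instead of jq.
import llm

model = llm.get_embedding_model("sentence-transformers/all-MiniLM-L6-v2")
print(len(model.embed("Advanced example")))  # 384 for all-MiniLM-L6-v2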

Clustering Analysis

Run clustering on a collection.

llm cluster entries 3 --database data/embeddings/vector.db --summary

Export Data

Export embeddings for external use. Collections live in an ordinary SQLite database (the default path is shown by llm embed-db path), so the data can be dumped with sqlite-utils (assumes sqlite-utils is installed: pip install sqlite-utils).

# Dump the IDs and stored content of the entries collection as JSON
sqlite-utils data/embeddings/vector.db \
  "select embeddings.id, embeddings.content from embeddings join collections on collections.id = embeddings.collection_id where collections.name = 'entries'"
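
To work with the vectors themselves in Python, the stored blobs can be unpacked; this sketch assumes llm's packed 32-bit-float storage format and the table layout used in the query above.

#!/usr/bin/env python3
# Sketch: read stored vectors back into Python. Assumes the embedding
# blobs are packed 32-bit floats (llm's documented binary format).
import sqlite3
import struct

conn = sqlite3.connect("data/embeddings/vector.db")
rows = conn.execute(
    """
    select embeddings.id, embeddings.embedding
    from embeddings
    join collections on collections.id = embeddings.collection_id
    where collections.name = 'entries'
    """
).fetchall()
for item_id, blob in rows:
    vector = struct.unpack(f"{len(blob) // 4}f", blob)
    print(item_id, len(vector), "dimensions")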

Implementation Notes

  • All outputs are stored in data/embeddings/
  • Using sentence-transformers/all-MiniLM-L6-v2 for local embeddings
  • Python examples are tangled to data/embeddings/
  • Clustering requires the llm-cluster plugin