[Feat] Add LiteLLMEmbeddings - Support SemanticChunking through LiteLLM #154

Open · wants to merge 3 commits into base: development

Changes from all commits

138 changes: 108 additions & 30 deletions benchmarks/README.md
@@ -4,71 +4,149 @@

Ever wondered how much CHONKier other text splitting libraries are? Well, wonder no more! We've put Chonkie up against some of the most popular RAG libraries out there, and the results are... well, let's just say Moto Moto might need to revise his famous quote!

## ⚡ Speed Benchmarks

> ZOOOOOM! Watch Chonkie run! 🏃‍♂️💨

### 100K Wikipedia Articles

The following benchmarks were run on 100,000 Wikipedia articles from the
[`chonkie-ai/wikipedia-100k`](https://huggingface.co/datasets/chonkie-ai/wikipedia-100k) dataset.

All tests were run on a Google Colab A100 instance.
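
For context, a timing run in this spirit can be reproduced with a few lines of Python (a rough sketch, not the exact benchmark script; the `train` split, the `text` column name, and `TokenChunker`'s default settings are assumptions):

```python
# Rough timing sketch -- assumes the dataset exposes a "train" split with a
# "text" column and that TokenChunker's defaults match the benchmark setup.
import time

from chonkie import TokenChunker
from datasets import load_dataset

articles = load_dataset("chonkie-ai/wikipedia-100k", split="train")
chunker = TokenChunker(tokenizer="gpt2")

start = time.perf_counter()
for article in articles:
    chunker.chunk(article["text"])
elapsed = time.perf_counter() - start
print(f"Chunked {len(articles)} articles in {elapsed:.1f} s")
```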

#### Token Chunking

| Library | Time | Speed Factor |
|---------|-----------|--------------|
| 🦛 Chonkie | 58 sec | 1x |
| 🔗 LangChain | 1 min 10 sec | 1.21x slower |
| 📚 LlamaIndex | 50 min | 51.7x slower |

#### Sentence Chunking

| Library | Time | Speed Factor |
|---------|-----------|--------------|
| 🦛 Chonkie | 59 sec | 1x |
| 📚 LlamaIndex | 3 min 59 sec | 4.05x slower |
| 🔗 LangChain | N/A | Doesn't exist |

#### Recursive Chunking

| Library | Time | Speed Factor |
|---------|-----------|--------------|
| 🦛 Chonkie | 1 min 19 sec | 1x |
| 🔗 LangChain | 2 min 45 sec | 2.09x slower |
| 📚 LlamaIndex | N/A | Doesn't exist |

#### Semantic Chunking

Tested with `sentence-transformers/all-minilm-l6-v2` model unless specified otherwise.

| Library | Time | Speed Factor |
|---------|-----------|--------------|
| 🦛 Chonkie (with default settings) | 13 min 59 sec | 1x |
| 🦛 Chonkie | 1 hour 8 min 53 sec | 4.92x slower |
| 🔗 LangChain | 1 hour 13 sec | 4.35x slower |
| 📚 LlamaIndex | 1 hour 24 min 15 sec | 6.07x slower |
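
For reference, a semantic-chunking run along the lines of the rows above might be set up roughly like this (a sketch, not the benchmark script; the `embedding_model` and `chunk_size` parameter names are assumptions):

```python
from chonkie import SemanticChunker

# Assumed parameter names; the second Chonkie row above swaps the default
# embedding model for all-MiniLM-L6-v2.
chunker = SemanticChunker(
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    chunk_size=512,
)
chunks = chunker.chunk("One Wikipedia article's worth of text goes here...")
```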

### 500K Wikipedia Articles

The following benchmarks were run on 500,000 Wikipedia articles from the
[`chonkie-ai/wikipedia-500k`](https://huggingface.co/datasets/chonkie-ai/wikipedia-500k) dataset.

All tests were run on a `c3-highmem-4` VM from Google Cloud with 32 GB RAM and a 200 GB SSD Persistent Disk attachment.

#### Token Chunking

| Library | Time | Speed Factor |
|---------|-----------|--------------|
| 🦛 Chonkie | 2 min 17 sec | 1x |
| 🔗 LangChain | 2 min 42 sec | 1.18x slower |
| 📚 LlamaIndex | 50 min | 21.9x slower |

#### Sentence Chunking

| Library | Time | Speed Factor |
|---------|-----------|--------------|
| 🦛 Chonkie | 7 min 16 sec | 1x |
| 📚 LlamaIndex | 10 min 55 sec | 1.5x slower |
| 🔗 LangChain | N/A | Doesn't exist |

#### Recursive Chunking

| Library | Time | Speed Factor |
|---------|-----------|--------------|
| 🦛 Chonkie | 3 min 42 sec | 1x |
| 🔗 LangChain | 7 min 36 sec | 2.05x slower |
| 📚 LlamaIndex | N/A | Doesn't exist |

### Paul Graham Essays Dataset

The following benchmarks were run on the Paul Graham Essays dataset using the GPT-2 tokenizer.

#### Token Chunking

| Library | Time (ms) | Speed Factor |
|---------|-----------|--------------|
| 🦛 Chonkie | 8.18 | 1x |
| 🔗 LangChain | 8.68 | 1.06x slower |
| 📚 LlamaIndex | 272 | 33.25x slower |

#### Sentence Chunking

| Library | Time (ms) | Speed Factor |
|---------|-----------|--------------|
| 🦛 Chonkie | 52.6 | 1x |
| 📚 LlamaIndex | 91.2 | 1.73x slower |
| 🔗 LangChain | N/A | Doesn't exist |

#### Semantic Chunking

| Library | Time | Speed Factor |
|---------|------|--------------|
| 🦛 Chonkie | 482ms | 1x |
| 🔗 LangChain | 899ms | 1.86x slower |
| 📚 LlamaIndex | 1.2s | 2.49x slower |

## 📊 Size Comparison (Package Size)

### Default Installation (Basic Chunking)

| Library | Size | Chonk Factor |
|---------|------|--------------|
| 🦛 Chonkie | 11.2 MiB | 1x |
| 🔗 LangChain | 80 MiB | ~7.1x CHONKier |
| 📚 LlamaIndex | 171 MiB | ~15.3x CHONKier |

### With Semantic Features

| Library | Size | Chonk Factor |
|---------|------|--------------|
| 🦛 Chonkie | 62 MiB | 1x |
| 🔗 LangChain | 625 MiB | ~10x CHONKier |
| 📚 LlamaIndex | 678 MiB | ~11x CHONKier |

## 💡 Why These Numbers Matter

### Speed Benefits

1. **Faster Processing**: Chonkie leads in all chunking methods!
2. **Production Ready**: Optimized for real-world usage
3. **Consistent Performance**: Fast across all chunking types
4. **Scale Friendly**: Process more text in less time

### Size Benefits

1. **Faster Installation**: Less to download = faster to get started
2. **Lower Memory Footprint**: Lighter package = less RAM usage
3. **Cleaner Dependencies**: Only install what you actually need
4. **CI/CD Friendly**: Faster builds and deployments

Remember what Chonkie always says:
> "I may be a hippo, but I don't have to be heavy... and I can still run fast!" 🦛✨
> "I may be a hippo, but I'm still light and fast!" 🦛✨

---

*Note: All measurements were taken using Python 3.8+ on a clean virtual environment. Your actual mileage may vary slightly depending on your specific setup and dependencies.*
5 changes: 3 additions & 2 deletions pyproject.toml
@@ -43,8 +43,9 @@ Documentation = "https://docs.chonkie.ai"
model2vec = ["model2vec>=0.3.0", "numpy>=1.23.0, <2.2"]
st = ["sentence-transformers>=3.0.0", "numpy>=1.23.0, <2.2"]
openai = ["openai>=1.0.0", "numpy>=1.23.0, <2.2"]
semantic = ["model2vec>=0.3.0", "numpy>=1.23.0, <2.2"]
all = ["sentence-transformers>=3.0.0", "numpy>=1.23.0, <2.2", "openai>=1.0.0", "model2vec>=0.3.0"]
semantic = ["model2vec>=0.1.0", "numpy>=1.23.0, <2.2"]
litellm = ["litellm>=1.57.10", "numpy>=1.23.0, <2.2"]
all = ["sentence-transformers>=3.0.0", "numpy>=1.23.0, <2.2", "openai>=1.0.0", "model2vec>=0.3.0", "litellm>=1.57.10"]
dev = [
"pytest>=6.2.0",
"pytest-cov>=4.0.0",
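
The new `litellm` extra sits alongside the existing optional backends, so installing it (e.g. `pip install "chonkie[litellm]"`) is what makes the new class importable. A quick pre-flight check mirroring the `is_available()` guard in the new module might look like this (a sketch; the install command is an assumption about how you manage your environment):

```python
import importlib.util

# Same check the new LiteLLMEmbeddings.is_available() performs.
if importlib.util.find_spec("litellm") is None:
    raise ImportError('Missing optional dependency; install with: pip install "chonkie[litellm]"')

from chonkie import LiteLLMEmbeddings  # safe to import once litellm is present

embeddings = LiteLLMEmbeddings(model="huggingface/microsoft/codebert-base")
```
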
2 changes: 2 additions & 0 deletions src/chonkie/__init__.py
@@ -16,6 +16,7 @@
Model2VecEmbeddings,
OpenAIEmbeddings,
SentenceTransformerEmbeddings,
LiteLLMEmbeddings,
)
from .refinery import (
BaseRefinery,
@@ -78,6 +79,7 @@
"SentenceTransformerEmbeddings",
"OpenAIEmbeddings",
"AutoEmbeddings",
"LiteLLMEmbeddings",
]

# Add all refinery classes to __all__
2 changes: 2 additions & 0 deletions src/chonkie/embeddings/__init__.py
@@ -3,6 +3,7 @@
from .model2vec import Model2VecEmbeddings
from .openai import OpenAIEmbeddings
from .sentence_transformer import SentenceTransformerEmbeddings
from .litellm import LiteLLMEmbeddings

# Add all embeddings classes to __all__
__all__ = [
@@ -11,4 +12,5 @@
"SentenceTransformerEmbeddings",
"OpenAIEmbeddings",
"AutoEmbeddings",
"LiteLLMEmbeddings",
]
6 changes: 6 additions & 0 deletions src/chonkie/embeddings/auto.py
@@ -24,6 +24,9 @@ class AutoEmbeddings:
# Get Anthropic embeddings
embeddings = AutoEmbeddings.get_embeddings("anthropic://claude-v1", api_key="...")

# Get LiteLLM embeddings
embeddings = AutoEmbeddings.get_embeddings("huggingface/microsoft/codebert-base", api_key="...")

"""

@classmethod
@@ -52,6 +55,9 @@ def get_embeddings(
# Get Anthropic embeddings
embeddings = AutoEmbeddings.get_embeddings("anthropic://claude-v1", api_key="...")

# Get LiteLLM embeddings
embeddings = AutoEmbeddings.get_embeddings("huggingface/microsoft/codebert-base", api_key="...")

"""
# Load embeddings instance if already provided
if isinstance(model, BaseEmbeddings):
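
Per the docstring example added above, a `huggingface/...` model string is the route into the new backend. Assuming the (collapsed) routing in `get_embeddings` resolves that string to a `LiteLLMEmbeddings` instance, downstream usage matches the other embeddings classes (a sketch; the API key value is a placeholder):

```python
from chonkie import AutoEmbeddings

embeddings = AutoEmbeddings.get_embeddings(
    "huggingface/microsoft/codebert-base", api_key="..."
)

vector = embeddings.embed("Chonkie chunks text.")                  # np.ndarray
vectors = embeddings.embed_batch(["First text.", "Second text."])  # list of np.ndarray
n_tokens = embeddings.count_tokens("Chonkie chunks text.")         # int
```
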
148 changes: 148 additions & 0 deletions src/chonkie/embeddings/litellm.py
@@ -0,0 +1,148 @@
import importlib.util
import os
import time
from typing import Callable, List, Optional

import numpy as np
from litellm import embedding, token_counter

from .base import BaseEmbeddings


class LiteLLMEmbeddings(BaseEmbeddings):
    """Embeddings backend that calls LiteLLM's unified `embedding()` API."""

    def __init__(
        self,
        model: str = 'huggingface/microsoft/codebert-base',
        input: str = "Hello, my dog is cute",
Collaborator:

Hey @Dhan996!

I believe this `input` is just for checking that the embedding response comes through correctly? We don't have to offer the user the option to change this as part of the signature; we can keep it fixed inside `__init__`. It would be a good idea to offer as minimal an interface to the user as possible.

Thanks!

        user: Optional[str] = None,
        dimensions: Optional[int] = None,
        api_key: Optional[str] = None,
        api_type: Optional[str] = None,
        api_version: Optional[str] = None,
        api_base: Optional[str] = None,
        encoding_format: Optional[str] = None,
        timeout: Optional[int] = 300,
        input_type: Optional[str] = "feature-extraction",
    ):
"""Initialize LiteLLM embeddings.

Args:
model: Name of the LiteLLM embedding model to use
input: Text to embed
user: User ID for API requests
dimensions: Number of dimensions for the embedding model
api_key: API key for the model
api_type: Type of API to use
api_version: Version of the API to use
api_base: Base URL for the API
encoding_format: Encoding format for the input text
timeout: Timeout in seconds for API requests

"""
        super().__init__()
        if not self.is_available():
            raise ImportError(
                "LiteLLM package is not available. Please install it via pip."
            )

        # Check that LiteLLM works with the given parameters
        try:
            # Fall back to the HuggingFace key when no api_key is given (default provider)
            api_key = api_key if api_key is not None else os.environ.get("HUGGINGFACE_API_KEY")
            response = embedding(
                model=model,
                input=[input],
                user=user,
                dimensions=dimensions,
                api_key=api_key,
                api_type=api_type,
                api_version=api_version,
                api_base=api_base,
                encoding_format=encoding_format,
                timeout=timeout,
            )
        except Exception as e:
            raise ValueError(f"LiteLLM failed to initialize with the given parameters: {e}")

        self.kwargs = {
            "user": user,
            "dimensions": dimensions,
            "api_key": api_key,
            "api_type": api_type,
            "api_version": api_version,
            "api_base": api_base,
            "encoding_format": encoding_format,
            "timeout": timeout,
        }
        self.model = model
        if dimensions is None:
            self._dimension = len(response.data[0]['embedding'])
        else:
            self._dimension = dimensions

    @property
    def dimension(self) -> int:
        return self._dimension

    def embed(self, text: str) -> "np.ndarray":
        """Embed a single text string and return its embedding vector."""
        if isinstance(text, str):
            text = [text]
        retries = 5  # Number of retry attempts
        wait_time = 10  # Seconds to wait between retries
        response = None
        for i in range(retries):
Collaborator:

Hey @Dhan996!

Just a doubt, but does LiteLLM do any retries internally?

If they handle it, then we can push any API retries to their end; otherwise we should offer retries as a parameter during init.

Thanks!

            try:
                response = embedding(model=self.model, input=text, **self.kwargs)
            except Exception as e:
                print(f"Attempt {i+1}/{retries} failed ({e}); retrying in {wait_time} seconds...")
                time.sleep(wait_time)
            else:
                break
        if response is None:
            raise RuntimeError(f"Embedding request failed after {retries} attempts.")
        embeddings = response.data[0]['embedding']
        return np.array(embeddings)

    def embed_batch(self, texts: List[str]) -> List["np.ndarray"]:
        """Embed a batch of texts and return a list of embedding vectors."""
        if isinstance(texts, str):
            texts = [texts]
        retries = 5  # Number of retry attempts
        wait_time = 10  # Seconds to wait between retries
        responses = None
        for i in range(retries):
            try:
                responses = embedding(
                    model=self.model,
                    input=texts,
                    **self.kwargs,
                )
            except Exception as e:
                print(f"Attempt {i+1}/{retries} failed ({e}); retrying in {wait_time} seconds...")
                time.sleep(wait_time)
            else:
                # Exit the loop on success
                break
        if responses is None:
            raise RuntimeError(f"Embedding request failed after {retries} attempts.")
        return [np.array(entry['embedding']) for entry in responses.data]

    def count_tokens(self, text: str) -> int:
        """Count tokens in a single text with LiteLLM's token_counter."""
        return token_counter(model=self.model, text=text)

    def count_tokens_batch(self, texts: List[str]) -> List[int]:
        """Count tokens for each text in a batch."""
        return [token_counter(model=self.model, text=text) for text in texts]

    def _tokenizer_helper(self, text: str) -> int:
        return token_counter(model=self.model, text=text)

    def get_tokenizer_or_token_counter(self) -> "Callable[[str], int]":
        return self._tokenizer_helper


    def similarity(self, u: np.ndarray, v: np.ndarray) -> float:
        """Compute cosine similarity between two embeddings."""
        return np.divide(
            np.dot(u, v), np.linalg.norm(u) * np.linalg.norm(v), dtype=float
        )

    @classmethod
    def is_available(cls) -> bool:
        """Return True if the litellm package can be imported."""
        return importlib.util.find_spec("litellm") is not None

    def __repr__(self) -> str:
        return f"LiteLLMEmbeddings(model={self.model})"