Commit

Merge branch 'main' into improve_python_dependency
AruneshSingh authored Oct 23, 2024
2 parents 89701d4 + 0640845 commit e22e016
Showing 43 changed files with 407 additions and 219 deletions.
4 changes: 2 additions & 2 deletions docs/articles/semantic_chunking.md
Original file line number Diff line number Diff line change
@@ -1,8 +1,8 @@
# Semantic Chunking

<!-- SEO: Explore semantic chunking for Retrieval Augmented Generation (RAG) in this comprehensive guide. Learn about embedding similarity, hierarchical clustering, and LLM-based methods for optimal text chunking. Discover how semantic chunking improves RAG performance compared to traditional rule-based approaches. Includes code examples, evaluation metrics, and comparisons using HotpotQA and SQUAD datasets with BAAI/bge-small-en-v1.5 embeddings.
-->

# Semantic Chunking

Chunking in Natural Language Processing is simply dividing large bodies of text into smaller pieces that computers can manage more easily. Splitting large datasets into chunks enables your Retrieval Augmented Generation (RAG) system to embed, index, and store even very large datasets optimally. But *how* you chunk your data is crucial in determining whether you can efficiently return only the most relevant results to your user queries.

To get your RAG system to handle user queries better, you need a chunking method that's a good fit for your data. Some widely used chunking algorithms are **rule-based** - e.g., fixed character splitter, recursive character splitter, document-specific splitter, among others. But in some real-world applications, rule-based methods have trouble. If, for example, your dataset has multi-topic documents, rule-based splitting algorithms can result in incomplete contexts or noise-filled chunks. **Semantic chunking**, on the other hand - because it divides text on the basis of meaning rather than rules - creates chunks that are semantically independent and cohesive, and therefore results in more effective text processing and information retrieval.
6 changes: 3 additions & 3 deletions docs/tools/vdb_table/data/activeloop.json
@@ -157,7 +157,7 @@
"comment": ""
},
"github_stars": {
"value": 8081,
"value": 8101,
"source_url": "https://github.com/activeloopai/deeplake",
"comment": "",
"value_90_days": 0
@@ -169,10 +169,10 @@
"value_90_days": 0
},
"pypi_downloads": {
"value": 917329,
"value": 949405,
"source_url": "https://pypi.org/project/deeplake/",
"comment": "",
"value_90_days": 174429
"value_90_days": 167920
},
"npm_downloads": {
"value": 0,
4 changes: 2 additions & 2 deletions docs/tools/vdb_table/data/aerospike.json
@@ -178,10 +178,10 @@
"value_90_days": 0
},
"pypi_downloads": {
"value": 1191,
"value": 1671,
"source_url": "https://pypi.org/project/aerospike-vector/",
"comment": "",
"value_90_days": 524
"value_90_days": 926
},
"npm_downloads": {
"value": 0,
4 changes: 2 additions & 2 deletions docs/tools/vdb_table/data/anariai.json
@@ -174,10 +174,10 @@
"value_90_days": 0
},
"npm_downloads": {
"value": 4909,
"value": 5096,
"source_url": "https://www.npmjs.com/package/epsillajs",
"comment": "",
"value_90_days": 521
"value_90_days": 608
},
"crates_io_downloads": {
"value": 0,
16 changes: 8 additions & 8 deletions docs/tools/vdb_table/data/apachecassandra.json
@@ -157,33 +157,33 @@
"comment": "via Lucene"
},
"github_stars": {
"value": 8755,
"value": 8793,
"source_url": "https://github.com/apache/cassandra",
"comment": "",
"value_90_days": 0
},
"docker_pulls": {
"value": 215709756,
"value": 216043667,
"source_url": "https://hub.docker.com/_/cassandra",
"comment": "",
"value_90_days": 0
},
"pypi_downloads": {
"value": 83982033,
"value": 85375406,
"source_url": "https://pypi.org/project/cassandra-driver/",
"comment": "",
"value_90_days": 8977518
"value_90_days": 9192091
},
"npm_downloads": {
"value": 4571141,
"value": 4598405,
"source_url": "https://www.npmjs.com/package/cassandra-driver",
"comment": "",
"value_90_days": 884795
"value_90_days": 868693
},
"crates_io_downloads": {
"value": 88518,
"value": 89762,
"source_url": "https://crates.io/crates/cassandra",
"comment": "",
"value_90_days": 974
"value_90_days": 1044
}
}
16 changes: 8 additions & 8 deletions docs/tools/vdb_table/data/apachesolr.json
@@ -157,33 +157,33 @@
"comment": "via Lucene"
},
"github_stars": {
"value": 1178,
"value": 1197,
"source_url": "https://github.com/apache/solr",
"comment": "",
"value_90_days": 0
},
"docker_pulls": {
"value": 343830137,
"value": 344024528,
"source_url": "https://hub.docker.com/_/solr",
"comment": "",
"value_90_days": 0
},
"pypi_downloads": {
"value": 11847305,
"value": 12001490,
"source_url": "https://pypi.org/project/pysolr/",
"comment": "",
"value_90_days": 924379
"value_90_days": 919478
},
"npm_downloads": {
"value": 599287,
"value": 595209,
"source_url": "https://www.npmjs.com/package/solr-client",
"comment": "",
"value_90_days": 109339
"value_90_days": 110632
},
"crates_io_downloads": {
"value": 1483,
"value": 1515,
"source_url": "https://crates.io/crates/solr",
"comment": "",
"value_90_days": 215
"value_90_days": 221
}
}
8 changes: 4 additions & 4 deletions docs/tools/vdb_table/data/aperturedb.json
@@ -168,16 +168,16 @@
"value_90_days": 0
},
"pypi_downloads": {
"value": 153360,
"value": 170148,
"source_url": "https://pypi.org/project/aperturedb/",
"comment": "",
"value_90_days": 19781
"value_90_days": 33621
},
"npm_downloads": {
"value": 18936,
"value": 18587,
"source_url": "https://www.npmjs.com/package/aperture",
"comment": "",
"value_90_days": 1697
"value_90_days": 1644
},
"crates_io_downloads": {
"value": 0,
8 changes: 4 additions & 4 deletions docs/tools/vdb_table/data/azureai.json
@@ -179,16 +179,16 @@
"value_90_days": 0
},
"pypi_downloads": {
"value": 30230401,
"value": 31788852,
"source_url": "https://pypi.org/project/azure-ai-ml/",
"comment": "",
"value_90_days": 9291085
"value_90_days": 9778906
},
"npm_downloads": {
"value": 6479858,
"value": 6865795,
"source_url": "https://www.npmjs.com/package/@azure/openai",
"comment": "",
"value_90_days": 2155449
"value_90_days": 2258047
},
"crates_io_downloads": {
"value": 0,
14 changes: 7 additions & 7 deletions docs/tools/vdb_table/data/chroma.json
@@ -157,7 +157,7 @@
"comment": ""
},
"github_stars": {
"value": 14811,
"value": 15014,
"source_url": "https://github.com/chroma-core/chroma",
"comment": "",
"value_90_days": 0
@@ -169,21 +169,21 @@
"value_90_days": 0
},
"pypi_downloads": {
"value": 18897962,
"value": 19924126,
"source_url": "https://pypi.org/project/chromadb/",
"comment": "",
"value_90_days": 5238534
"value_90_days": 5528579
},
"npm_downloads": {
"value": 1726093,
"value": 1805790,
"source_url": "https://www.npmjs.com/package/chromadb",
"comment": "",
"value_90_days": 475082
"value_90_days": 494462
},
"crates_io_downloads": {
"value": 13492,
"value": 14791,
"source_url": "https://crates.io/crates/chromadb",
"comment": "",
"value_90_days": 1368
"value_90_days": 1597
}
}
14 changes: 7 additions & 7 deletions docs/tools/vdb_table/data/clickhouse.json
@@ -156,7 +156,7 @@
"comment": "HNSW via USearch"
},
"github_stars": {
"value": 37016,
"value": 37224,
"source_url": "https://github.com/ClickHouse/ClickHouse",
"comment": "",
"value_90_days": 0
@@ -168,21 +168,21 @@
"value_90_days": 0
},
"pypi_downloads": {
"value": 200794,
"value": 202127,
"source_url": "https://pypi.org/project/clickhouse/",
"comment": "",
"value_90_days": 8612
"value_90_days": 8169
},
"npm_downloads": {
"value": 10197127,
"value": 11576731,
"source_url": "https://www.npmjs.com/package/@clickhouse/client",
"comment": "",
"value_90_days": 4165890
"value_90_days": 5018953
},
"crates_io_downloads": {
"value": 439372,
"value": 457291,
"source_url": "https://crates.io/crates/clickhouse",
"comment": "",
"value_90_days": 92556
"value_90_days": 95418
}
}
16 changes: 8 additions & 8 deletions docs/tools/vdb_table/data/couchbase.json
@@ -135,33 +135,33 @@
"comment": "Automatic algorithm selection between: IdMap2,Flat, IVF,Flat, IVF,SQ8"
},
"github_stars": {
"value": 1620,
"value": 1625,
"source_url": "https://github.com/couchbase/couchbase-lite-ios",
"comment": "",
"value_90_days": 0
},
"docker_pulls": {
"value": 87116407,
"value": 87146631,
"source_url": "https://hub.docker.com/_/couchbase",
"comment": "50M+",
"value_90_days": 0
},
"pypi_downloads": {
"value": 13405354,
"value": 13592323,
"source_url": "https://pypi.org/project/couchbase/",
"comment": "",
"value_90_days": 1148597
"value_90_days": 1098901
},
"npm_downloads": {
"value": 1129807,
"value": 1134244,
"source_url": "https://www.npmjs.com/package/couchbase",
"comment": "",
"value_90_days": 207614
"value_90_days": 207212
},
"crates_io_downloads": {
"value": 37828,
"value": 38241,
"source_url": "https://crates.io/crates/couchbase",
"comment": "",
"value_90_days": 1018
"value_90_days": 1141
}
}
16 changes: 8 additions & 8 deletions docs/tools/vdb_table/data/cratedb.json
@@ -157,33 +157,33 @@
"comment": "via Lucene"
},
"github_stars": {
"value": 4065,
"value": 4083,
"source_url": "https://github.com/crate/crate",
"comment": "",
"value_90_days": 0
},
"docker_pulls": {
"value": 18703647,
"value": 18711061,
"source_url": "https://hub.docker.com/_/crate",
"comment": "",
"value_90_days": 0
},
"pypi_downloads": {
"value": 1342001,
"value": 1356972,
"source_url": "https://pypi.org/project/crate/",
"comment": "",
"value_90_days": 67143
"value_90_days": 70259
},
"npm_downloads": {
"value": 25992,
"value": 25865,
"source_url": "https://www.npmjs.com/package/node-crate",
"comment": "",
"value_90_days": 4597
"value_90_days": 4438
},
"crates_io_downloads": {
"value": 7059,
"value": 7436,
"source_url": "https://crates.io/crates/cratedb",
"comment": "",
"value_90_days": 866
"value_90_days": 1031
}
}