Commit 6f71152

Initial alpha version (#1)
Initial alpha version
1 parent da14c81 commit 6f71152

18 files changed: +1041 -15 lines changed

Diff for: .github/workflows/CI.yml

+3 -10
@@ -1,10 +1,7 @@
 name: CI
 on:
-  push:
-    branches:
-      - main
-    tags: '*'
-  pull_request:
+  - push
+  - pull_request
 jobs:
   test:
     name: Julia ${{ matrix.version }} - ${{ matrix.os }} - ${{ matrix.arch }} - ${{ github.event_name }}
@@ -13,19 +10,15 @@ jobs:
       fail-fast: false
       matrix:
         version:
-          - '1.0'
           - '1.6'
+          - '1.7'
           - 'nightly'
         os:
           - ubuntu-latest
           - macOS-latest
           - windows-latest
         arch:
           - x64
-          - x86
-        exclude:
-          - os: macOS-latest
-            arch: x86
     steps:
       - uses: actions/checkout@v2
       - uses: julia-actions/setup-julia@v1

Diff for: .gitignore

+1
@@ -3,3 +3,4 @@
 *.jl.mem
 /Manifest.toml
 /docs/build/
+.vscode

Diff for: Project.toml

+8 -1
@@ -3,11 +3,18 @@ uuid = "2e3c4037-312d-4650-b9c0-fcd0fc09aae4"
 authors = ["Bernard Brenyah"]
 version = "0.1.0"
 
+[deps]
+CircularArrays = "7a955b69-7140-5f4e-a0ed-f168c5e2e749"
+DataStructures = "864edb3b-99cc-5e75-8d2d-829cb0a9cfe8"
+OffsetArrays = "6fe1bfb0-de20-5000-8ca7-80f57d26f881"
+ProgressMeter = "92933f4c-e287-5a05-a399-4b506db050ca"
+
 [compat]
 julia = "1"
 
 [extras]
+Faker = "0efc519c-db33-5916-ab87-703215c3906f"
 Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
 
 [targets]
-test = ["Test"]
+test = ["Test", "Faker"]

Diff for: README.md

+42
@@ -6,3 +6,45 @@
 [![Coverage](https://codecov.io/gh/PyDataBlog/SimString.jl/branch/main/graph/badge.svg)](https://codecov.io/gh/PyDataBlog/SimString.jl)
 [![Code Style: Blue](https://img.shields.io/badge/code%20style-blue-4495d1.svg)](https://github.com/invenia/BlueStyle)
 [![ColPrac: Contributor's Guide on Collaborative Practices for Community Packages](https://img.shields.io/badge/ColPrac-Contributor's%20Guide-blueviolet)](https://github.com/SciML/ColPrac)
+
+A native Julia implementation of the CPMerge algorithm, which is designed for approximate string matching.
+This package is particularly useful for natural language processing tasks that require retrieving strings or texts from a very large corpus. It currently supports both character-based and word-based n-gram feature generation, and there are plans to open the package up to custom user-defined feature generation methods.
+
+## Features
+
+- [X] Fast algorithm for string matching
+- [X] 100% exact retrieval
+- [X] Unicode support
+- [ ] Custom user-defined feature generation methods
+- [ ] Mecab-based tokenizer support
+
+## Supported String Similarity Measures
+
+- [X] Dice coefficient
+- [X] Jaccard coefficient
+- [X] Cosine coefficient
+- [X] Overlap coefficient
+
+## Installation
+
+You can grab the latest stable version of this package from the Julia registries by running:
+
+*NB:* Don't forget to invoke Julia's package manager with `]`
+
+```julia
+pkg> add SimString
+```
+
+The brave few can grab the current experimental features by adding the master branch to their development environment after invoking the package manager with `]`:
+
+```julia
+pkg> add SimString#master
+```
+
+You are now good to go with bleeding-edge features and breakages!
+
+To revert to a stable version, you can simply run:
+
+```julia
+pkg> free SimString
+```
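
As a rough illustration of the four measures listed in the README, the sketch below computes each coefficient on plain sets of features. The function names (`dice_sim`, `jaccard_sim`, `cosine_sim`, `overlap_sim`) are hypothetical and are not claimed to match SimString.jl's internals; they are simply the textbook set-based definitions.

```julia
# Textbook set-based similarity coefficients.
# Hypothetical helper names, not taken from SimString.jl itself.
dice_sim(x::Set, y::Set) = 2 * length(intersect(x, y)) / (length(x) + length(y))
jaccard_sim(x::Set, y::Set) = length(intersect(x, y)) / length(union(x, y))
cosine_sim(x::Set, y::Set) = length(intersect(x, y)) / sqrt(length(x) * length(y))
overlap_sim(x::Set, y::Set) = length(intersect(x, y)) / min(length(x), length(y))

# Two small feature sets that share two of their three elements.
x = Set(["ab", "bc", "cd"])
y = Set(["bc", "cd", "de"])

println("dice    = ", dice_sim(x, y))
println("jaccard = ", jaccard_sim(x, y))
println("cosine  = ", cosine_sim(x, y))
println("overlap = ", overlap_sim(x, y))
```

All four share the same numerator (the size of the intersection) and differ only in how they normalise it, which is why a search routine can plausibly treat the measure as a pluggable argument.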

Diff for: docs/src/index.md

+70
@@ -6,6 +6,76 @@ CurrentModule = SimString
 
 Documentation for [SimString](https://github.com/PyDataBlog/SimString.jl).
 
+A native Julia implementation of the CPMerge algorithm, which is designed for approximate string matching.
+This package is particularly useful for natural language processing tasks that require retrieving strings or texts from a very large corpus. It currently supports both character-based and word-based n-gram feature generation, and there are plans to open the package up to custom user-defined feature generation methods.
+
+## Features
+
+- [X] Fast algorithm for string matching
+- [X] 100% exact retrieval
+- [X] Unicode support
+- [ ] Custom user-defined feature generation methods
+- [ ] Mecab-based tokenizer support
+
+## Supported String Similarity Measures
+
+- [X] Dice coefficient
+- [X] Jaccard coefficient
+- [X] Cosine coefficient
+- [X] Overlap coefficient
+
+## Installation
+
+You can grab the latest stable version of this package from the Julia registries by running:
+
+*NB:* Don't forget to invoke Julia's package manager with `]`
+
+```julia
+pkg> add SimString
+```
+
+The brave few can grab the current experimental features by adding the master branch to their development environment after invoking the package manager with `]`:
+
+```julia
+pkg> add SimString#master
+```
+
+You are now good to go with bleeding-edge features and breakages!
+
+To revert to a stable version, you can simply run:
+
+```julia
+pkg> free SimString
+```
+
+## Usage
+
+```julia
+using SimString
+
+# Initialise the database and add some strings
+db = DictDB(CharacterNGrams(2, " "));
+push!(db, "foo");
+push!(db, "bar");
+push!(db, "fooo");
+
+# A convenient approach for multiple entries is to append an array of strings: `append!(db, ["foo", "bar", "fooo"]);`
+
+# Retrieve the closest match(es)
+res = search(Dice(), db, "foo"; α=0.8, ranked=true)
+# 2-element Vector{Tuple{String, Float64}}:
+# ("foo", 1.0)
+# ("fooo", 0.8888888888888888)
+
+
+```
+
+## TODO: Benchmarks
+
+## Release History
+
+- 0.1.0 Initial release.
+
 ```@index
 ```
 
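
The `0.8888888888888888` score in the usage example is consistent with a Dice coefficient over padded character bigrams in which repeated n-grams are kept apart by an occurrence counter. The sketch below reproduces that number under those assumptions. `padded_ngrams` and `dice` are hypothetical helpers written for illustration, and the package's actual feature extraction may differ.

```julia
# Hypothetical helper: pad the string, slide a window of n characters over it,
# and tag every n-gram with its occurrence count so that repeats stay distinct.
function padded_ngrams(s::AbstractString, n::Int, pad::AbstractString)
    chars = collect(pad^(n - 1) * s * pad^(n - 1))
    counts = Dict{String,Int}()
    features = Tuple{String,Int}[]
    for i in 1:length(chars)-n+1
        gram = String(chars[i:i+n-1])
        counts[gram] = get(counts, gram, 0) + 1
        push!(features, (gram, counts[gram]))
    end
    return features
end

# Dice coefficient on two feature collections (assumed definition).
dice(a, b) = 2 * length(intersect(a, b)) / (length(a) + length(b))

query = padded_ngrams("foo", 2, " ")    # " f", "fo", "oo", "o "
entry = padded_ngrams("fooo", 2, " ")   # " f", "fo", "oo", "oo" (2nd), "o "

println(dice(query, query))
println(dice(query, entry))
```

With four features for `"foo"`, five for `"fooo"` and four in common, the second score is 2 * 4 / (4 + 5) = 8/9, which clears the `α=0.8` threshold used in the example.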

Diff for: extras/examples.jl

+46
@@ -0,0 +1,46 @@
+using SimString
+using Faker
+using BenchmarkTools
+using DataStructures
+
+################################# Benchmark Bulk addition #####################
+db = DictDB(CharacterNGrams(3, " "));
+Faker.seed(2020)
+@time fake_names = [string(Faker.first_name(), " ", Faker.last_name()) for i in 1:100_000];
+
+
+f(d, x) = append!(d, x)
+@time f(db, fake_names)
+
+
+
+################################ Simple Addition ###############################
+
+db = DictDB(CharacterNGrams(2, " "));
+push!(db, "foo");
+push!(db, "bar");
+push!(db, "fooo");
+
+f(x, c, s) = search(x, c, s)
+test = "foo";
+col = db;
+sim = Cosine();
+
+f(Cosine(), db, "foo")
+
+@btime f($sim, $col, $test)
+@btime search(Cosine(), db, "foo"; α=0.8, ranked=true)
+
+
+
+db2 = DictDB(CharacterNGrams(3, " "));
+append!(db2, ["foo", "bar", "fooo", "foor"]) # also works via multiple dispatch on a vector
+
+results = search(Cosine(), db, "foo"; α=0.8, ranked=true) # yet to be implemented
+
+bs = ["foo", "bar", "foo", "foo", "bar"]
+SimString.extract_features(CharacterNGrams(3, " "), "prepress")
+SimString.extract_features(WordNGrams(2, " ", " "), "You are a really really really cool dude.")
+
+db = DictDB(WordNGrams(2, " ", " "))
+push!(db, "You are a really really really cool dude.")
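
The last lines of `extras/examples.jl` exercise word-level bigrams. As a minimal sketch of what word n-gram extraction could look like, the hypothetical `word_ngrams` helper below splits on a separator and slides a window of `n` words over the result. It deliberately leaves out padding and duplicate handling, so it is not a reproduction of `SimString.extract_features`.

```julia
# Hypothetical helper: word-level n-grams via a sliding window.
# Padding and duplicate handling are intentionally left out of this sketch.
function word_ngrams(s::AbstractString, n::Int, splitter::AbstractString)
    words = split(s, splitter)
    return [join(words[i:i+n-1], " ") for i in 1:length(words)-n+1]
end

grams = word_ngrams("You are a really really really cool dude.", 2, " ")
foreach(println, grams)
```

The sentence has eight words, so the sketch yields seven bigrams, and `"really really"` appears twice. That repetition is the kind of case where a real extractor needs some way to tell duplicate features apart.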

Diff for: extras/py_benchmarks.py

+16
@@ -0,0 +1,16 @@
+from simstring.feature_extractor.character_ngram import CharacterNgramFeatureExtractor
+from simstring.measure.cosine import CosineMeasure
+from simstring.database.dict import DictDatabase
+from simstring.searcher import Searcher
+from faker import Faker
+
+db = DictDatabase(CharacterNgramFeatureExtractor(3))
+
+fake = Faker()
+fake_names = [fake.name() for i in range(100_000)]
+
+def f(x):
+    for i in x:
+        db.add(i)
+
+# %time f(fake_names)

Diff for: src/SimString.jl

+25 -1
@@ -1,5 +1,29 @@
 module SimString
 
-# Write your package code here.
+import Base: push!, append!
+using DataStructures: DefaultOrderedDict, DefaultDict
+# using ProgressMeter
+# using CircularArrays
+# using OffsetArrays
+
+######### Import modules & utils ################
+include("db_collection.jl")
+include("dictdb.jl")
+include("features.jl")
+include("measures.jl")
+include("search.jl")
+
+
+
+####### Global export of user API #######
+export Dice, Jaccard, Cosine, Overlap,
+        AbstractSimStringDB, DictDB,
+        CharacterNGrams, WordNGrams,
+        search
+
+
+
+
+
 
 end

Diff for: src/db_collection.jl

+35
@@ -0,0 +1,35 @@
+# Custom Collections
+
+"""
+Base type for all custom db collections.
+"""
+abstract type AbstractSimStringDB end
+
+
+"""
+Abstract type for feature extraction structs
+"""
+abstract type FeatureExtractor end
+
+
+# Feature Extraction Definitions
+
+"""
+Feature extraction on character-level ngrams
+"""
+struct CharacterNGrams{T1<:Int, T2<:AbstractString} <: FeatureExtractor
+    n::T1 # size of each n-gram, in characters
+    padder::T2 # string to use to pad n-grams
+end
+
+
+"""
+Feature extraction based on word-level ngrams
+"""
+struct WordNGrams{T1<:Int, T2<:AbstractString} <: FeatureExtractor
+    n::T1 # size of each n-gram, in words
+    padder::T2 # string to use to pad n-grams
+    splitter::T2 # string to use to split words
+end
+
+
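
One consequence of these definitions is worth spelling out. `Int` is a concrete type, so the `T1<:Int` parameter can in practice only be `Int` itself, and the shared `T2` in `WordNGrams` means `padder` and `splitter` must have the same string type. The sketch below re-declares the two structs under different, hypothetical names (`DemoExtractor`, `CharGrams`, `WordGrams`) so that it runs without the package, and shows what the default constructors accept and reject.

```julia
# Stand-alone copies of the struct shapes above, renamed (hypothetical names)
# so the sketch does not clash with SimString's own types.
abstract type DemoExtractor end

struct CharGrams{T1<:Int, T2<:AbstractString} <: DemoExtractor
    n::T1
    padder::T2
end

struct WordGrams{T1<:Int, T2<:AbstractString} <: DemoExtractor
    n::T1
    padder::T2
    splitter::T2
end

# Type parameters are inferred from the constructor arguments.
c = CharGrams(2, " ")
println(typeof(c))

w = WordGrams(2, " ", " ")

# Here padder is a String but splitter is a SubString{String}; no single T2
# fits both fields, so the default constructor raises a MethodError.
mixed_ok = try
    WordGrams(2, " ", SubString(" ", 1, 1))
    true
catch err
    err isa MethodError ? false : rethrow()
end
println("mixed string types accepted: ", mixed_ok)
```

If mixed string types should be supported, one option would be separate type parameters for `padder` and `splitter`; whether that is desirable for this package is a design choice the commit does not address.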
