
Commit 75ba12c

Merge branch 'master' into feature/generate_friendman_functions
2 parents 51343d6 + d6dcc86 commit 75ba12c

3 files changed: 118 additions, 17 deletions
README.md

Lines changed: 15 additions & 10 deletions
@@ -25,16 +25,21 @@ The package has an interface for the dataset generator of the [ScikitLearn](http
 ### ScikitLearn
 List of package datasets:
 
-Dataset | Title | Reference
-----------------|------------------------------------------------------------------------|--------------------------------------------------
-make_blobs | Generate isotropic Gaussian blobs for clustering. | [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_moons.html)
-make_moons | Make two interleaving half circles | [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html)
-make_s_curve | Generate an S curve dataset. | [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_s_curve.html)
-make_regression | Generate a random regression problem. | [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html])
-make_classification | Generate a random n-class classification problem. | [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html])
-make_friedman1 | Generate the “Friedman #1” regression problem. | [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_friedman1.html)
-make_friedman2 | Generate the “Friedman #2” regression problem. | [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_friedman2.html)
-make_friedman3 | Generate the “Friedman #3” regression problem. | [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_friedman3.html)
+Dataset | Title | Reference
+---------------------|-------------------------------------------------------------------------|--------------------------------------------------
+make_blobs | Generate isotropic Gaussian blobs for clustering. | [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html)
+make_moons | Make two interleaving half circles. | [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_moons.html)
+make_s_curve | Generate an S curve dataset. | [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_s_curve.html)
+make_regression | Generate a random regression problem. | [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html)
+make_classification | Generate a random n-class classification problem. | [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html)
+make_friedman1 | Generate the “Friedman #1” regression problem. | [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_friedman1.html)
+make_friedman2 | Generate the “Friedman #2” regression problem. | [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_friedman2.html)
+make_friedman3 | Generate the “Friedman #3” regression problem. | [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_friedman3.html)
+make_circles | Make a large circle containing a smaller circle in 2d. | [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_circles.html)
+make_regression | Generate a random regression problem. | [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html)
+make_classification | Generate a random n-class classification problem. | [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html)
+make_low_rank_matrix | Generate a mostly low rank matrix with bell-shaped singular values. | [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_low_rank_matrix.html)
+make_swiss_roll | Generate a swiss roll dataset. | [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_swiss_roll.html)
 
 **Disclaimer**: SyntheticDatasets.jl borrows code and documentation from
 [scikit-learn](https://scikit-learn.org/stable/modules/classes.html#samples-generator) in the dataset module, but *it is not an official part

src/sklearn.jl

Lines changed: 83 additions & 5 deletions
@@ -85,6 +85,37 @@ function generate_s_curve(; n_samples::Int = 100,
     return convert(features, labels)
 end
 
+"""
+    generate_circles(; n_samples::Union{Int, Tuple{Int, Int}} = 100,
+                       shuffle::Bool = true,
+                       noise::Union{Nothing, Float64} = nothing,
+                       random_state::Union{Int, Nothing} = nothing,
+                       factor::Float64 = 0.8)::DataFrame
+Make a large circle containing a smaller circle in 2d. Sklearn interface to make_circles.
+# Arguments
+- `n_samples::Union{Int, Tuple{Int, Int}} = 100`: If an int, the total number of points generated. For odd numbers, the inner circle has one point more than the outer circle. If a two-element tuple, the number of points in the outer circle and in the inner circle.
+- `shuffle::Bool = true`: Whether to shuffle the samples.
+- `noise::Union{Nothing, Float64} = nothing`: Standard deviation of Gaussian noise added to the data.
+- `random_state::Union{Int, Nothing} = nothing`: Determines random number generation for dataset shuffling and noise. Pass an int for reproducible output across multiple function calls.
+- `factor::Float64 = 0.8`: Scale factor between inner and outer circle.
+Reference: [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_circles.html)
+
+"""
+function generate_circles(; n_samples::Union{Int, Tuple{Int, Int}} = 100,
+                            shuffle::Bool = true,
+                            noise::Union{Nothing, Float64} = nothing,
+                            random_state::Union{Int, Nothing} = nothing,
+                            factor::Float64 = 0.8)::DataFrame
+
+    (features, labels) = datasets.make_circles(n_samples = n_samples,
+                                               shuffle = shuffle,
+                                               noise = noise,
+                                               random_state = random_state,
+                                               factor = factor)
+
+    return convert(features, labels)
+end
+
 """
     generate_regression(; n_samples::Int = 100,
                           n_features::Int = 100,
@@ -124,7 +155,6 @@ function generate_regression(; n_samples::Int = 100,
                               coef::Bool = false,
                               random_state::Union{Int, Nothing}= nothing)
 
-
     (features, labels) = datasets.make_regression( n_samples = n_samples,
                                                    n_features = n_features,
                                                    n_informative = n_informative,
@@ -136,10 +166,8 @@ function generate_regression(; n_samples::Int = 100,
                                                    shuffle = shuffle,
                                                    coef = coef,
                                                    random_state = random_state)
-
 
     return convert(features, labels)
-
 end
 
 """
@@ -193,7 +221,6 @@ function generate_classification(; n_samples::Int = 100,
                                   shuffle::Bool = true,
                                   random_state::Union{Int, Nothing} = nothing)
 
-
     (features, labels) = datasets.make_classification( n_samples = n_samples,
                                                        n_features = n_features,
                                                        n_informative = n_informative,
@@ -282,4 +309,55 @@ function generate_friedman3(; n_samples::Int = 100,
                                                   random_state = random_state)
 
     return convert(features, labels)
-end
+end
+
+"""
+    generate_low_rank_matrix(; n_samples::Int = 100,
+                               n_features::Int = 100,
+                               effective_rank::Int = 10,
+                               tail_strength::Float64 = 0.5,
+                               random_state::Union{Int, Nothing} = nothing)
+Generate a mostly low rank matrix with bell-shaped singular values. Sklearn interface to make_low_rank_matrix.
+# Arguments
+- `n_samples::Int = 100`: The number of samples.
+- `n_features::Int = 100`: The number of features.
+- `effective_rank::Int = 10`: The approximate number of singular vectors required to explain most of the data by linear combinations.
+- `tail_strength::Float64 = 0.5`: The relative importance of the fat noisy tail of the singular values profile.
+- `random_state::Union{Int, Nothing} = nothing`: Determines random number generation for dataset creation. Pass an int for reproducible output across multiple function calls. See the scikit-learn Glossary.
+Reference: [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_low_rank_matrix.html)
+"""
+function generate_low_rank_matrix(; n_samples::Int = 100,
+                                    n_features::Int = 100,
+                                    effective_rank::Int = 10,
+                                    tail_strength::Float64 = 0.5,
+                                    random_state::Union{Int, Nothing} = nothing)
+
+    features = datasets.make_low_rank_matrix(n_samples = n_samples,
+                                             n_features = n_features,
+                                             effective_rank = effective_rank,
+                                             tail_strength = tail_strength,
+                                             random_state = random_state)
+    return features
+end
+
+"""
+    generate_swiss_roll(; n_samples::Int = 100,
+                          noise::Float64 = 0.0,
+                          random_state::Union{Int, Nothing} = nothing)
+Generate a swiss roll dataset. Sklearn interface to make_swiss_roll.
+# Arguments
+- `n_samples::Int = 100`: The number of samples.
+- `noise::Float64 = 0.0`: Standard deviation of Gaussian noise added to the data.
+- `random_state::Union{Int, Nothing} = nothing`: Determines random number generation for dataset creation. Pass an int for reproducible output across multiple function calls. See the scikit-learn Glossary.
+Reference: [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_swiss_roll.html)
+"""
+function generate_swiss_roll(; n_samples::Int = 100,
+                               noise::Float64 = 0.0,
+                               random_state::Union{Int, Nothing} = nothing)
+
+    (features, labels) = datasets.make_swiss_roll(n_samples = n_samples,
+                                                  noise = noise,
+                                                  random_state = random_state)
+
+    return convert(features, labels)
+end
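
The `factor` and `noise` arguments of `generate_circles` are easier to picture with the geometry written out. The following is a minimal pure-Julia sketch, assuming the outer circle has radius 1, the inner circle has radius `factor`, and points are spaced evenly by angle. `circles_sketch` and its `rng` keyword are hypothetical names used only for illustration; the package itself delegates to scikit-learn's `make_circles` rather than running code like this.

```julia
using Random

# Hypothetical helper, not part of SyntheticDatasets.jl: an illustrative
# approximation of the two-circle geometry, without shuffling.
function circles_sketch(n_samples::Int; factor::Float64 = 0.8,
                        noise::Float64 = 0.0,
                        rng::AbstractRNG = Random.default_rng())
    0 <= factor < 1 || throw(ArgumentError("factor must be in [0, 1)"))
    n_outer = n_samples ÷ 2          # for odd counts the inner circle gets the extra point
    n_inner = n_samples - n_outer

    # Evenly spaced angles on [0, 2π), end point excluded.
    angles(n) = [2π * k / n for k in 0:n-1]

    outer = [(cos(θ), sin(θ)) for θ in angles(n_outer)]
    inner = [(factor * cos(θ), factor * sin(θ)) for θ in angles(n_inner)]
    points = vcat(outer, inner)

    features = [p[i] for p in points, i in 1:2]            # n_samples × 2 matrix
    labels = vcat(zeros(Int, n_outer), ones(Int, n_inner)) # 0 = outer, 1 = inner

    # Isotropic Gaussian noise with standard deviation `noise`.
    features .+= noise .* randn(rng, size(features)...)
    return features, labels
end

X, y = circles_sketch(101; factor = 0.5)
size(X)            # (101, 2)
count(==(1), y)    # 51
```

Two feature columns plus one label column give the three columns that the test for `generate_circles` checks.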

test/runtests.jl

Lines changed: 20 additions & 2 deletions
@@ -27,6 +27,10 @@ using Test
     @test size(data)[1] == samples
     @test size(data)[2] == 4
 
+    data = SyntheticDatasets.generate_circles(n_samples = samples)
+
+    @test size(data)[1] == samples
+    @test size(data)[2] == 3
 
     data = SyntheticDatasets.generate_regression(n_samples = samples,
                                                  n_features = features,
@@ -40,7 +44,6 @@ using Test
                                                     n_features = features,
                                                     n_classes = 1)
 
-
     @test size(data)[1] == samples
     @test size(data)[2] == features + 1
 
@@ -60,4 +63,19 @@ using Test
     @test size(data)[1] == samples
     @test size(data)[2] == 5
 
-end
+    data = SyntheticDatasets.generate_low_rank_matrix(n_samples = samples,
+                                                      n_features = features,
+                                                      effective_rank = 10,
+                                                      tail_strength = 0.5,
+                                                      random_state = 5)
+
+    @test size(data)[1] == samples
+    @test size(data)[2] == features
+
+    data = SyntheticDatasets.generate_swiss_roll(n_samples = samples,
+                                                 noise = 2.2,
+                                                 random_state = 5)
+
+    @test size(data)[1] == samples
+    @test size(data)[2] == 4
+end
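
The swiss roll test expects four columns: three coordinates plus one label column holding each point's position along the roll. A minimal pure-Julia sketch of that shape follows, assuming the parametrisation used in scikit-learn's `make_swiss_roll` (`t = 1.5π(1 + 2u)` with `u` uniform on [0, 1), and a height drawn uniformly from [0, 21)). `swiss_roll_sketch` and its `rng` keyword are hypothetical and are not how the package produces the data.

```julia
using Random

# Hypothetical helper, not part of SyntheticDatasets.jl: illustrates the
# 3-coordinate-plus-label layout that the test above checks.
function swiss_roll_sketch(n_samples::Int; noise::Float64 = 0.0,
                           rng::AbstractRNG = Random.default_rng())
    t = 1.5π .* (1 .+ 2 .* rand(rng, n_samples))   # position along the roll
    height = 21 .* rand(rng, n_samples)            # coordinate across the roll
    features = hcat(t .* cos.(t), height, t .* sin.(t))
    features .+= noise .* randn(rng, n_samples, 3) # Gaussian noise, std = `noise`
    return features, t
end

X, t = swiss_roll_sketch(50; noise = 2.2, rng = MersenneTwister(5))
size(hcat(X, t))   # (50, 4)
```

`generate_low_rank_matrix` differs in that it returns only the feature matrix, so its test compares the column count with `features` directly.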
