Commit eedc517

Merge remote-tracking branch 'origin/feature/generate_twospirals_function' into feature/generate_twospirals_function
2 parents da8ace3 + ac7fd92 commit eedc517

3 files changed, with 206 additions and 13 deletions

README.md

Lines changed: 15 additions & 7 deletions
Original file line number | Diff line number | Diff line change
@@ -25,13 +25,21 @@ The package has an interface for the dataset generator of the [ScikitLearn](http
2525
### ScikitLearn
2626
List of package datasets:
2727

28-
Dataset | Title | Reference
29-
----------------|------------------------------------------------------------------------|--------------------------------------------------
30-
make_blobs | Generate isotropic Gaussian blobs for clustering. | [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_moons.html)
31-
make_moons | Make two interleaving half circles | [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html)
32-
make_s_curve | Generate an S curve dataset. | [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_s_curve.html)
33-
make_regression | Generate a random regression problem. | [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html])
34-
make_classification | Generate a random n-class classification problem. | [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html])
28+
Dataset | Title | Reference
29+
---------------------|-------------------------------------------------------------------------|--------------------------------------------------
30+
make_blobs | Generate isotropic Gaussian blobs for clustering. | [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html)
31+
make_moons | Make two interleaving half circles. | [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_moons.html)
32+
make_s_curve | Generate an S curve dataset. | [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_s_curve.html)
33+
make_regression | Generate a random regression problem. | [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html)
34+
make_classification | Generate a random n-class classification problem. | [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html)
35+
make_friedman1 | Generate the “Friedman #1” regression problem. | [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_friedman1.html)
36+
make_friedman2 | Generate the “Friedman #2” regression problem. | [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_friedman2.html)
37+
make_friedman3 | Generate the “Friedman #3” regression problem. | [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_friedman3.html)
38+
make_circles | Make a large circle containing a smaller circle in 2d | [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_circles.html)
39+
make_regression | Generate a random regression problem. | [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html)
40+
make_classification | Generate a random n-class classification problem. | [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html)
41+
make_low_rank_matrix | Generate a mostly low rank matrix with bell-shaped singular values. | [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_low_rank_matrix.html)
42+
make_swiss_roll | Generate a swiss roll dataset. | [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_swiss_roll.html)
3543

3644
**Disclaimer**: SyntheticDatasets.jl borrows code and documentation from
3745
[scikit-learn](https://scikit-learn.org/stable/modules/classes.html#samples-generator) in the dataset module, but *it is not an official part

src/sklearn.jl

Lines changed: 154 additions & 5 deletions
Original file line number | Diff line number | Diff line change
@@ -85,6 +85,37 @@ function generate_s_curve(; n_samples::Int = 100,
8585
return convert(features, labels)
8686
end
8787

88+
"""
89+
function generate_circles(; n_samples::Union{Int, Tuple{Int, Int}} = 100,
90+
shuffle::Bool = true,
91+
noise::Union{Nothing, Float64} = nothing,
92+
random_state::Union{Int, Nothing} = nothing,
93+
factor::Float64 = 0.8)::DataFrame
94+
Make a large circle containing a smaller circle in 2d. Sklearn interface to make_circles.
95+
# Arguments
96+
- `n_samples::Union{Int, Tuple{Int, Int}} = 100`: If int, it is the total number of points generated. For odd numbers, the inner circle will have one point more than the outer circle. If two-element tuple, number of points in outer circle and inner circle.
97+
- `shuffle::Bool = true`: Whether to shuffle the samples.
98+
- `noise::Union{Nothing, Float64} = nothing`: Standard deviation of Gaussian noise added to the data.
99+
- `random_state::Union{Int, Nothing} = nothing`: Determines random number generation for dataset shuffling and noise. Pass an int for reproducible output across multiple function calls.
100+
- `factor::Float64 = 0.8`: Scale factor between inner and outer circle.
101+
Reference: [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_circles.html)
102+
103+
"""
104+
function generate_circles(; n_samples::Union{Int, Tuple{Int, Int}} = 100,
105+
shuffle::Bool = true,
106+
noise::Union{Nothing, Float64} = nothing,
107+
random_state::Union{Int, Nothing} = nothing,
108+
factor::Float64 = 0.8)::DataFrame
109+
110+
(features, labels) = datasets.make_circles( n_samples = n_samples,
111+
shuffle = shuffle,
112+
noise = noise,
113+
random_state = random_state,
114+
factor = factor)
115+
116+
return convert(features, labels)
117+
end
118+
88119
"""
89120
generate_regression(; n_samples::Int = 100,
90121
n_features::Int = 100,
@@ -124,7 +155,6 @@ function generate_regression(; n_samples::Int = 100,
124155
coef::Bool = false,
125156
random_state::Union{Int, Nothing}= nothing)
126157

127-
128158
(features, labels) = datasets.make_regression( n_samples = n_samples,
129159
n_features = n_features,
130160
n_informative = n_informative,
@@ -136,10 +166,8 @@ function generate_regression(; n_samples::Int = 100,
136166
shuffle = shuffle,
137167
coef = coef,
138168
random_state = random_state)
139-
140169

141170
return convert(features, labels)
142-
143171
end
144172

145173
"""
@@ -193,7 +221,6 @@ function generate_classification(; n_samples::Int = 100,
193221
shuffle::Bool = true,
194222
random_state::Union{Int, Nothing} = nothing)
195223

196-
197224
(features, labels) = datasets.make_classification( n_samples = n_samples,
198225
n_features = n_features,
199226
n_informative = n_informative,
@@ -211,4 +238,126 @@ function generate_classification(; n_samples::Int = 100,
211238
random_state = random_state)
212239

213240
return convert(features, labels)
214-
end
241+
end
242+
243+
"""
244+
function generate_friedman1(; n_samples::Int = 100,
245+
n_features::Int = 10,
246+
noise::Float64 = 0.0,
247+
random_state::Union{Int, Nothing} = nothing)::DataFrame
248+
Generate the “Friedman #1” regression problem. Sklearn interface to make_friedman1.
249+
# Arguments
250+
- `n_samples::Int = 100`: The number of samples.
251+
- `n_features::Int = 10`: The number of features. Should be at least 5.
252+
- `noise::Float64 = 0.0`: The standard deviation of the Gaussian noise applied to the output.
253+
- `random_state::Union{Int, Nothing} = nothing`: Determines random number generation for dataset noise. Pass an int for reproducible output across multiple function calls.
254+
Reference: [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_friedman1.html)
255+
"""
256+
function generate_friedman1(; n_samples::Int = 100,
257+
n_features::Int = 10,
258+
noise::Float64 = 0.0,
259+
random_state::Union{Int, Nothing} = nothing)::DataFrame
260+
261+
(features, labels) = datasets.make_friedman1( n_samples = n_samples,
262+
n_features = n_features,
263+
noise = noise,
264+
random_state = random_state)
265+
266+
return convert(features, labels)
267+
end
268+
269+
"""
270+
function generate_friedman2(; n_samples::Int = 100,
271+
noise::Float64 = 0.0,
272+
random_state::Union{Int, Nothing} = nothing)::DataFrame
273+
Generate the “Friedman #2” regression problem. Sklearn interface to make_friedman2.
274+
# Arguments
275+
- `n_samples::Int = 100`: The number of samples.
276+
- (No `n_features` argument: this generator always produces 4 input features.)
277+
- `noise::Float64 = 0.0`: The standard deviation of the Gaussian noise applied to the output.
278+
- `random_state::Union{Int, Nothing} = nothing`: Determines random number generation for dataset noise. Pass an int for reproducible output across multiple function calls.
279+
Reference: [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_friedman2.html)
280+
"""
281+
function generate_friedman2(; n_samples::Int = 100,
282+
noise::Float64 = 0.0,
283+
random_state::Union{Int, Nothing} = nothing)::DataFrame
284+
285+
(features, labels) = datasets.make_friedman2( n_samples = n_samples,
286+
noise = noise,
287+
random_state = random_state)
288+
289+
return convert(features, labels)
290+
end
291+
292+
"""
293+
function generate_friedman3(; n_samples::Int = 100,
294+
noise::Float64 = 0.0,
295+
random_state::Union{Int, Nothing} = nothing)::DataFrame
296+
Generate the “Friedman #3” regression problem. Sklearn interface to make_friedman3.
297+
# Arguments
298+
- `n_samples::Int = 100`: The number of samples.
299+
- `noise::Float64 = 0.0`: The standard deviation of the Gaussian noise applied to the output.
300+
- `random_state::Union{Int, Nothing} = nothing`: Determines random number generation for dataset noise. Pass an int for reproducible output across multiple function calls.
301+
Reference: [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_friedman3.html)
302+
"""
303+
function generate_friedman3(; n_samples::Int = 100,
304+
noise::Float64 = 0.0,
305+
random_state::Union{Int, Nothing} = nothing)::DataFrame
306+
307+
(features, labels) = datasets.make_friedman3( n_samples = n_samples,
308+
noise = noise,
309+
random_state = random_state)
310+
311+
return convert(features, labels)
312+
end
313+
314+
"""
315+
function generate_low_rank_matrix(; n_samples::Int =100,
316+
n_features::Int =100,
317+
effective_rank::Int =10,
318+
tail_strength::Float64 =0.5,
319+
random_state::Union{Int, Nothing} = nothing)
320+
Generate a mostly low rank matrix with bell-shaped singular values. Sklearn interface to make_low_rank_matrix.
321+
# Arguments
322+
- `n_samples::Int = 100`: The number of samples.
323+
- `n_features::Int = 100`: The number of features.
324+
- `effective_rank::Int = 10`: The approximate number of singular vectors required to explain most of the data by linear combinations.
325+
- `tail_strength::Float64 = 0.5`: The relative importance of the fat noisy tail of the singular values profile.
326+
- `random_state::Union{Int, Nothing} = nothing`: Determines random number generation for dataset creation. Pass an int for reproducible output across multiple function calls. See Glossary.
327+
Reference: [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_low_rank_matrix.html)
328+
"""
329+
function generate_low_rank_matrix(; n_samples::Int = 100,
330+
n_features::Int = 100,
331+
effective_rank::Int = 10,
332+
tail_strength::Float64 = 0.5,
333+
random_state::Union{Int, Nothing} = nothing)
334+
335+
features = datasets.make_low_rank_matrix(n_samples = n_samples,
336+
n_features = n_features,
337+
effective_rank = effective_rank,
338+
tail_strength = tail_strength,
339+
random_state = random_state)
340+
return features
341+
end
342+
343+
"""
344+
function generate_swiss_roll(; n_samples::Int = 100,
345+
noise::Float64 = 0.0,
346+
random_state::Union{Int,Nothing} = nothing)
347+
Generate a swiss roll dataset.
348+
# Arguments
349+
- `n_samples::Int = 100`: The number of samples.
350+
- `noise::Float64 = 0.0`: Standard deviation of Gaussian noise added to the data.
351+
- `random_state::Union{Int, Nothing} = nothing`: Determines random number generation for dataset creation. Pass an int for reproducible output across multiple function calls. See Glossary.
352+
Reference: [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_swiss_roll.html)
353+
"""
354+
function generate_swiss_roll(; n_samples::Int = 100,
355+
noise::Float64 = 0.0,
356+
random_state::Union{Int,Nothing} = nothing)
357+
358+
(features, labels) = datasets.make_swiss_roll( n_samples = n_samples,
359+
noise = noise,
360+
random_state = random_state)
361+
362+
return convert(features, labels)
363+
end

test/runtests.jl

Lines changed: 37 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -27,6 +27,10 @@ using Test
2727
@test size(data)[1] == samples
2828
@test size(data)[2] == 4
2929

30+
data = SyntheticDatasets.generate_circles(n_samples = samples)
31+
32+
@test size(data)[1] == samples
33+
@test size(data)[2] == 3
3034

3135
data = SyntheticDatasets.generate_regression(n_samples = samples,
3236
n_features = features,
@@ -40,10 +44,43 @@ using Test
4044
n_features = features,
4145
n_classes = 1)
4246

47+
@test size(data)[1] == samples
48+
@test size(data)[2] == features + 1
4349

4450
@test size(data)[1] == samples
4551
@test size(data)[2] == features + 1
4652

53+
data = SyntheticDatasets.generate_friedman1(n_samples = samples,
54+
n_features = features)
55+
56+
@test size(data)[1] == samples
57+
@test size(data)[2] == features + 1
58+
59+
data = SyntheticDatasets.generate_friedman2(n_samples = samples)
60+
61+
@test size(data)[1] == samples
62+
@test size(data)[2] == 5
63+
64+
data = SyntheticDatasets.generate_friedman3(n_samples = samples)
65+
66+
@test size(data)[1] == samples
67+
@test size(data)[2] == 5
68+
69+
data = SyntheticDatasets.generate_low_rank_matrix(n_samples = samples,
70+
n_features = features,
71+
effective_rank = 10,
72+
tail_strength = 0.5,
73+
random_state = 5)
74+
75+
@test size(data)[1] == samples
76+
@test size(data)[2] == features
77+
78+
data = SyntheticDatasets.generate_swiss_roll(n_samples = samples,
79+
noise = 2.2,
80+
random_state = 5)
81+
82+
@test size(data)[1] == samples
83+
@test size(data)[2] == 4
4784
end
4885

4986
@testset "Matlab Generators" begin
@@ -52,6 +89,5 @@ end
5289
data = SyntheticDatasets.generate_twospirals(n_samples = samples,
5390
noise = 2.2)
5491

55-
5692
@test size(data)[1] == samples
5793
end
