-
Notifications
You must be signed in to change notification settings - Fork 20
/
Copy pathtutorial.jl
265 lines (189 loc) · 6.88 KB
/
tutorial.jl
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
using Pkg # hideall
Pkg.activate("_literate/D0-dataframe/Project.toml")
Pkg.instantiate()
# This tutorial is loosely adapted from [this pandas tutorial](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html) as well as the [DataFrames.jl documentation](http://juliadata.github.io/DataFrames.jl/latest/man/getting_started/).
# It is by no means meant to be a complete introduction, rather, it focuses on some key functionalities that are particularly useful in a classical machine learning context.
#
# @@dropdown
# ## Basics
# @@
# @@dropdown-content
#
# To start with, we will use the `Boston` dataset which is very simple.
using RDatasets
using DataFrames
boston = dataset("MASS", "Boston");
# The `dataset` function returns a `DataFrame` object:
typeof(boston)
# @@dropdown
# ### Accessing data
# @@
# @@dropdown-content
#
# Intuitively a DataFrame is just a wrapper around a number of columns, each of which is a `Vector` of some type with a name"
names(boston)
# You can view the first few rows using `first` and specifying a number of rows:
first(boston, 4)
# You can access one of those columns easily using `.colname`, this returns a vector that you can access like any Julia vector:
boston.Crim[1:5]
# You can also just access the dataframe as you would a big matrix:
boston[3, 5]
# or specifying a range of rows/columns:
boston[1:5, [:Crim, :Zn]]
# or, similarly,
boston[1:5, 1:2]
# The `select` function is very convenient to get sub dataframes of interest:
b1 = select(boston, [:Crim, :Zn, :Indus])
first(b1, 2)
# The `Not` syntax is also very useful:
b2 = select(boston, Not(:NOx))
first(b2, 2)
# Finally, if you would like to drop columns, you can use `select!` which will mutate the dataframe in place:
select!(b1, Not(:Crim))
first(b1, 2)
#
# @@
# @@dropdown
# ### Describing the data
# @@
# @@dropdown-content
#
# `StatsBase` offers a convenient `describe` function which you can use on a DataFrame to get an overview of the data:
using StatsBase
describe(boston, :min, :max, :mean, :median, :std)
# You can pass a number of symbols to the `describe` function to indicate which statistics to compute for each feature:
#
# * `mean`, `std`, `min`, `max`, `median`, `first`, `last` are all fairly self explanatory
# * `q25`, `q75` are respectively for the 25th and 75th percentile,
# * `eltype`, `nunique`, `nmissing` can also be used
#
# You can also pass your custom function with a pair `name => function` for instance:
foo(x) = sum(abs.(x)) / length(x)
d = describe(boston, :mean, :median, foo => :foo)
first(d, 3)
# The `describe` function returns a derived object with one row per feature and one column per required statistic.
#
# Further to `StatsBase`, `Statistics` offers a range of useful functions for data analysis.
using Statistics
#
#
# @@
# @@dropdown
# ### Converting the data
# @@
# @@dropdown-content
#
# If you want to get the content of the dataframe as one big matrix, use `convert`:
mat = Matrix(boston)
mat[1:3, 1:3]
#
# @@
# @@dropdown
# ### Adding columns
# @@
# @@dropdown-content
#
# Adding a column to a dataframe is very easy:
boston.Crim_x_Zn = boston.Crim .* boston.Zn;
# that's it! Remember also that you can drop columns or make subselections with `select` and `select!`.
#
# @@
# @@dropdown
# ### Missing values
# @@
# @@dropdown-content
#
# Let's load a dataset with missing values
mao = dataset("gap", "mao")
describe(mao, :nmissing)
# Lots of missing values...
# If you wanted to compute simple functions on columns, they may just return `missing`:
std(mao.Age)
# The `skipmissing` function can help counter this easily:
std(skipmissing(mao.Age))
#
#
# @@
#
# @@
# @@dropdown
# ## Split-Apply-Combine
# @@
# @@dropdown-content
#
# This is a shorter version of the [DataFrames.jl tutorial](http://juliadata.github.io/DataFrames.jl/latest/man/split_apply_combine/).
iris = dataset("datasets", "iris")
first(iris, 3)
# @@dropdown
# ### `groupby`
# @@
# @@dropdown-content
#
# The `groupby` function allows to form "sub-dataframes" corresponding to groups of rows.
# This can be very convenient to run specific analyses for specific groups without copying the data.
#
# The basic usage is `groupby(df, cols)` where `cols` specifies one or several columns to use for the grouping.
#
# Consider a simple example: in `iris` there is a `Species` column with 3 species:
unique(iris.Species)
# We can form views for each of these:
gdf = groupby(iris, :Species);
# The `gdf` object now corresponds to **views** of the original dataframe for each of the 3 species; the first species is `"setosa"` with:
subdf_setosa = gdf[1]
describe(subdf_setosa, :min, :mean, :max)
# Note that `subdf_setosa` is a `SubDataFrame` meaning that it is just a view of the parent dataframe `iris`; if you modify that parent dataframe then the sub dataframe is also modified.
#
# See `?groupby` for more information.
#
#
# @@
# @@dropdown
# ### `combine`
# @@
# @@dropdown-content
#
# The `combine` function allows to derive a new dataframe out of transformations of an existing one.
# Here's an example taken from the official doc (see `?combine`):
df = DataFrame(a=1:3, b=4:6)
combine(df, :a => sum, nrow)
# what happened here is that the derived DataFrame has two columns obtained respectively by (1) computing the sum of the first column and (2) applying the `nrow` function on the `df`.
#
# The transformation can produce one or several values, `combine` will try to concatenate these columns as it can, for instance:
foo(v) = v[1:2]
combine(df, :a => maximum, :b => foo)
# here the maximum value of `a` is copied twice so that the two columns have the same number of rows.
#
bar(v) = v[end-1:end]
combine(df, :a => foo, :b => bar)
#
# @@
# @@dropdown
# ### `combine` with `groupby`
# @@
# @@dropdown-content
#
# Combining `groupby` with `combine` is very useful.
# For instance you might want to compute statistics across groups for different variables:
combine(groupby(iris, :Species), :PetalLength => mean)
# let's decompose that:
#
# 1. the `groupby(iris, :Species)` creates groups using the `:Species` column (which has values `setosa`, `versicolor`, `virginica`)
# 2. the `combine` creates a derived dataframe by applying the `mean` function to the `:PetalLength` column
# 3. since there are three groups, we get one column (mean of `PetalLength`) and three rows (one per group).
#
#
# You can do this for several columns/statistics at the time and give new column names to the results:
gdf = groupby(iris, :Species)
combine(gdf, :PetalLength => mean => :MPL, :PetalLength => std => :SPL)
# so here we assign the names `:MPL` and `:SPL` to the derived columns.
# If you want to apply something on all columns apart from the grouping one, using `names` and `Not` comes in handy:
combine(gdf, names(iris, Not(:Species)) .=> std)
#
# where
names(iris, Not(:Species))
#
# and note the use of `.` in `.=>` to indicate that we broadcast the function over each column.
#
# @@
#
# @@