-
Notifications
You must be signed in to change notification settings - Fork 20
/
Copy pathtutorial.jl
181 lines (153 loc) · 7.47 KB
/
tutorial.jl
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
using Pkg # hideall
Pkg.activate("_literate/D0-loading/Project.toml")
Pkg.instantiate()
# In this short tutorial we discuss two ways to easily load data in Julia:
#
# 1. loading a standard dataset via `RDatasets.jl`,
# 1. loading a local file with `CSV.jl`,
#
# @@dropdown
# ## Using RDatasets
# @@
# @@dropdown-content
#
# The package [RDatasets.jl](https://github.com/JuliaStats/RDatasets.jl) provides access to most of the many datasets listed on [this page](http://vincentarelbundock.github.io/Rdatasets/datasets.html).
# These are well known, standard datasets that can be used to get started with data processing and classical machine learning such as for instance `iris`, `crabs`, `Boston`, etc.
#
# To load such a dataset, you will need to specify which R package it belongs to as well as its name; for instance `Boston` is part of `MASS`.
using RDatasets
import DataFrames
boston = dataset("MASS", "Boston");
# The fact that `Boston` is part of `MASS` is clearly indicated on the [list](http://vincentarelbundock.github.io/Rdatasets/datasets.html) linked to earlier.
# While it can be a bit slow, loading a dataset via RDatasets is very simple and convenient as you don't have to worry about setting the names of columns etc.
#
# The `dataset` function returns a `DataFrame` object from the [DataFrames.jl](https://github.com/JuliaData/DataFrames.jl) package.
typeof(boston)
# For a short introduction to DataFrame objects, see [this tutorial](/data/dataframe).
#
# @@
# @@dropdown
# ## Using CSV
# @@
# @@dropdown-content
#
# The package [CSV.jl](https://github.com/JuliaData/CSV.jl) offers a powerful way to read arbitrary CSV files efficiently.
# In particular the `CSV.read` function allows to read a file and return a DataFrame.
#
# @@dropdown
# ### Basic usage
# @@
# @@dropdown-content
#
# Let's say you have a file `foo.csv` at some path `fpath=joinpath("data", "foo.csv")` with the content
#
# ```
# col1,col2,col3,col4,col5,col6,col7,col8
# ,1,1.0,1,one,2019-01-01,2019-01-01T00:00:00,true
# ,2,2.0,2,two,2019-01-02,2019-01-02T00:00:00,false
# ,3,3.0,3.14,three,2019-01-03,2019-01-03T00:00:00,true
# ```
#
c = """
col1,col2,col3,col4,col5,col6,col7,col8
,1,1.0,1,one,2019-01-01,2019-01-01T00:00:00,true
,2,2.0,2,two,2019-01-02,2019-01-02T00:00:00,false
,3,3.0,3.14,three,2019-01-03,2019-01-03T00:00:00,true
"""
fpath, = mktemp()
write(fpath, c);
# You can read it with CSV using
using CSV
data = CSV.read(fpath, DataFrames.DataFrame)
# Note that we use this `joinpath` for compatibility with our system but you could pass any valid path on your system for instance `CSV.read("path/to/file.csv")`.
# The data is also returned as a dataframe
typeof(data)
# Some of the useful arguments for `read` are:
#
# * `header=` to specify whether there's a header, or which line the header is on or to specify a full header yourself,
# * `skipto=` to specify how many rows to skip before starting to read the data,
# * `limit=` to specify a maximum number of rows to parse,
# * `missingstring=` to specify a string or vector of strings that should be parsed as missing values,
# * `delim=','` a char or string to specify how columns are separated.
#
# For more details see `?CSV.File`.
#
#
# @@
# @@dropdown
# ### Example 1
# @@
# @@dropdown-content
#
# Let's consider [this dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/00504/), the content of which we saved in a file at path `fpath`.
c = """
3.26;0.829;1.676;0;1;1.453;3.770
2.189;0.58;0.863;0;0;1.348;3.115
2.125;0.638;0.831;0;0;1.348;3.531
3.027;0.331;1.472;1;0;1.807;3.510
2.094;0.827;0.86;0;0;1.886;5.390
3.222;0.331;2.177;0;0;0.706;1.819
3.179;0;1.063;0;0;2.942;3.947
3;0;0.938;1;0;2.851;3.513
2.62;0.499;0.99;0;0;2.942;4.402
2.834;0.134;0.95;0;0;1.591;3.021
2.405;0.134;0.843;0;0;1.769;3.210
2.728;0.223;0.953;0;0;1.591;2.371
2.512;0.223;0.929;1;0;1.769;3.919
2.834;0.134;1.237;0;0;1.859;3.030
2.819;0.331;1.271;0;1;0.981;2.736
2.126;0.251;1.114;0;0;0.143;2.157
2.834;0.134;1.322;0;0;1.199;2.413
3.014;0.56;1.781;0;0;-0.115;0.898
3.024;0.452;2.698;0;0;1.107;0.450
3.036;0.405;1.205;1;0;1.807;3.733
2.707;0.972;1.889;0;3;-1.169;2.976
2.978;1.246;1.103;0;1;3.988;6.535
3.111;0.732;0.923;0;0;4.068;5.643
"""
fpath, = mktemp()
write(fpath, c);
# It doesn't have a header so we have to provide it ourselves.
header = ["CIC0", "SM1_Dz", "GATS1i",
"NdsCH", "NdssC", "MLOGP", "LC50"]
data = CSV.read(fpath, DataFrames.DataFrame, header=header)
first(data, 3)
#
# @@
# @@dropdown
# ### Example 2
# @@
# @@dropdown-content
#
# Let's consider [this dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/00423/), the content of which we saved at `fpath`.
c = """
1,0,1,0,0,0,0,1,0,1,1,?,1,0,0,0,0,1,0,0,0,0,1,67,137,15,0,1,1,1.53,95,13.7,106.6,4.9,99,3.4,2.1,34,41,183,150,7.1,0.7,1,3.5,0.5,?,?,?,1
0,?,0,0,0,0,1,1,?,?,1,0,0,1,0,0,0,1,0,0,0,0,1,62,0,?,0,1,1,?,?,?,?,?,?,?,?,?,?,?,?,?,?,1,1.8,?,?,?,?,1
1,0,1,1,0,1,0,1,0,1,0,0,0,1,1,0,0,0,0,1,0,1,1,78,50,50,2,1,2,0.96,5.8,8.9,79.8,8.4,472,3.3,0.4,58,68,202,109,7,2.1,5,13,0.1,28,6,16,1
1,1,1,0,0,0,0,1,0,1,1,0,0,1,0,0,0,0,0,0,0,1,1,77,40,30,0,1,1,0.95,2440,13.4,97.1,9,279,3.7,0.4,16,64,94,174,8.1,1.11,2,15.7,0.2,?,?,?,0
1,1,1,1,0,1,0,1,0,1,0,0,0,1,1,0,0,0,0,0,0,0,1,76,100,30,0,1,1,0.94,49,14.3,95.1,6.4,199,4.1,0.7,147,306,173,109,6.9,1.8,1,9,?,59,15,22,1
1,0,1,0,?,0,0,1,0,?,0,1,0,0,0,0,0,1,1,1,0,0,1,75,?,?,1,1,2,1.58,110,13.4,91.5,5.4,85,3.4,3.5,91,122,242,396,5.6,0.9,1,10,1.4,53,22,111,0
1,0,0,0,?,1,1,1,0,0,1,0,?,0,0,0,0,0,0,0,0,0,1,49,0,0,0,1,1,1.4,138.9,10.4,102,3.2,42000,2.35,2.72,119,183,143,211,7.3,0.8,5,2.6,2.19,171,126,1452,0
1,1,1,0,?,0,0,1,0,1,1,?,0,0,0,0,0,0,1,1,1,0,1,61,?,20,3,1,1,1.46,9860,10.8,92,3,58,3.1,3.2,79,108,184,300,7.1,0.52,2,9,1.3,42,25,706,0
1,1,1,0,0,0,0,1,0,1,1,0,0,1,0,0,0,?,1,1,0,0,1,50,100,32,1,1,2,3.14,8.8,11.9,107.5,4.9,70,1.9,3.3,26,59,115,63,6.1,0.59,1,6.4,1.2,85,73,982,1
1,1,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,43,100,0,0,1,1,1.12,1.8,11.8,87.8,5100,193000,4.2,0.5,71,45,256,303,7.1,0.59,1,9.3,0.7,?,?,?,1
1,0,1,0,0,0,1,1,?,?,0,0,0,0,0,0,0,?,1,1,0,0,1,41,?,?,0,1,2,1.05,100809,13,94.2,5.7,196,4.4,3,90,334,494,236,7.6,0.8,5,?,1.1,?,?,?,0
1,0,1,0,0,0,1,1,1,0,0,0,0,1,0,0,0,?,0,1,0,0,1,74,?,0,0,1,1,1.33,86,15.7,96.7,4,61,3.7,1.3,132,168,113,154,?,7.6,5,1.9,0.3,144,41,277,1
1,0,1,0,0,0,0,1,0,1,1,0,0,1,0,0,?,?,1,1,1,0,0,66,?,30,0,1,1,1.53,60,13.3,90.1,5.5,207000,4.4,8.5,25,36,35,74,8.5,0.73,1,5,0.8,?,?,?,1
1,?,0,0,0,0,1,1,?,?,0,0,0,0,0,0,0,0,0,0,0,0,1,56,0,?,0,1,1,1.2,6.6,13.7,93.8,4.1,91000,4.5,1,103,96,205,70,8.8,0.88,1,22,?,82,24,?,1
1,0,1,0,0,0,0,1,0,?,1,0,0,1,0,0,?,1,1,1,0,0,1,63,?,?,2,2,2,1.25,29,13.5,93,6,128,3.15,10.5,76,116,165,163,7.3,1.07,4,4.5,4.5,197,84,302,1
0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,1,1,1,0,1,41,100,0,1,1,2,1.61,4.6,10.2,89.6,5.5,161,3.1,3.1,24,57,163,176,5,0.8,2,2.6,1.3,25,13,60,1
1,0,1,0,0,0,0,1,?,1,1,?,1,0,0,0,?,?,1,1,1,0,1,72,?,?,3,2,1,2.14,60,12.1,99.2,5,58,2.4,9.8,69,63,201,235,6.2,0.96,2,2,2.9,136,95,767,0
1,1,1,0,0,0,0,1,0,1,0,0,?,0,0,0,?,1,1,1,1,1,1,60,100,60,2,1,1,1.05,9.2,10.3,103.7,5.4,159,3.8,0.5,56,91,459,146,5.4,1.23,5,13.5,3.8,187,58,443,1
1,?,1,0,0,0,0,1,?,1,0,?,0,1,1,0,0,1,0,0,0,0,1,64,200,78,1,1,1,1.13,8.8,14.9,94.8,6.3,137,4.3,0.9,16,23,82,180,6.5,4.95,1,5.4,0.9,144,49,295,1
1,1,1,0,0,0,0,1,?,?,0,0,0,1,0,0,0,1,1,1,1,0,1,75,500,?,0,1,3,1.44,34,15.9,103.4,9600,101000,3.4,3.4,27,87,260,147,6.3,0.9,5,2.3,1.6,67,34,774,0
"""
fpath, = mktemp()
write(fpath, c);
# It does not have a header and missing values indicated by `?`.
data = CSV.read(fpath, DataFrames.DataFrame, header=false, missingstring="?")
first(data[:, 1:5], 3)
#
# @@
#
# @@