# This fairness analysis of the COMPAS dataset is adapted in part from the [COMPAS analysis by Aequitas](https://dssg.github.io/aequitas/examples/compas_demo.html).
# ## Introduction to fairness and bias analysis
#
# Recent work in the machine learning community has raised concerns about the risk of unintended bias in algorithmic decision-making systems, which can affect individuals unfairly. While many bias metrics and fairness definitions have been proposed in recent years, the community has not reached a consensus on which definitions and metrics should be used, and there have been very few empirical analyses of real-world problems using the proposed metrics.
# ## COMPAS Dataset
#
# Northpointe’s COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) is one of the most widely used risk assessment tools in the criminal justice system, guiding decisions such as how to set bail. The ProPublica dataset represents two years of COMPAS predictions from Broward County, FL.
# ## Getting started
using DataFrames, MLJ, CSV, VegaLite
using HTTP
MLJ.color_off() # hide
req = HTTP.get("https://raw.githubusercontent.com/dssg/aequitas/master/examples/data/compas_for_aequitas.csv")
df = CSV.read(req.body)
df[1:5, :] |> pretty
#
schema(df)
#
df = coerce(df, Textual=>OrderedFactor)
df = coerce(df, :score=>Count)
schema(df)
# ## Levels of recidivism
df |>
@vlplot(
:bar,
width=50,
height=50,
column="race:o",
y={"count()", axis={title="count", grid=false}},
x={"label_value:n", axis={title=""}},
color={"label_value:n", scale={range=["#675193", "#ca8861"]}},
spacing=10,
config={
view={stroke=:transparent},
axis={domainWidth=1}
}
) |> save(joinpath(@OUTPUT,"COMPAS_plot1.svg"))
# \figalt{Levels of recidivism}{COMPAS_plot1.svg}
# ## Model Training
#
# Now we will train an AdaBoostClassifier to predict label_value. The full COMPAS dataset contains many more columns, but for simplicity this tutorial trains on only four: entity_id, race, sex and age_cat.
# ## Data Preprocessing
#
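# We unpack the DataFrame into target and features and convert the target labels to a categorical vector. We then keep the four feature columns, coerce Count columns to Continuous, and apply the OneHotEncoder transformer provided by MLJ.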
y, X = unpack(df, ==(:label_value), col -> true);
y = categorical(y);
X = X[:, [:entity_id, :race, :sex, :age_cat]]  # select the four feature columns
X = coerce(X, Count=>Continuous);
X = transform(fit!(machine(OneHotEncoder(), X)), X);
train, test = partition(eachindex(y), 0.7, shuffle=true);
schema(X)
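# To make the encoding step concrete, here is a minimal sketch of one-hot encoding in plain Julia. It is not how MLJ's OneHotEncoder is implemented; the helper `onehot_sketch` and the toy vector `toy_race` are hypothetical names used only for illustration.
function onehot_sketch(xs::AbstractVector)
    levels = sort(unique(xs))  # one indicator column per distinct level
    # each column holds 1.0 where the value equals that level and 0.0 elsewhere
    columns = Dict(l => Float64.(xs .== l) for l in levels)
    return levels, columns
end
toy_race = ["Caucasian", "Hispanic", "Caucasian"]  # made-up values, not rows from the dataset
toy_levels, toy_columns = onehot_sketch(toy_race);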
#
aboost = @load AdaBoostClassifier pkg=ScikitLearn
aboost_m = machine(aboost, X, y);
fit!(aboost_m, rows=train);
pred_aboost = MLJ.predict(aboost_m, rows=test);
# Each element of pred_aboost is a UnivariateFinite distribution holding the predicted probability of each label. To simplify the discussion, we convert pred_aboost to a plain array of labels, choosing the label with the higher probability.
y_pred = Array{Int64, 1}(undef, length(pred_aboost));  # one slot per test-row prediction
for i in 1:length(pred_aboost)
    ## the first class is label 0: predict 0 when its probability exceeds 0.5, otherwise 1
    y_pred[i] = pred_aboost[i].prob_given_class[1] > 0.5 ? 0 : 1
end
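# The same thresholding rule can be sketched without MLJ. The vector `toy_p0` is made-up data standing in for the predicted probability of label 0, and `threshold_labels` is a hypothetical helper, not part of any package.
threshold_labels(p0::AbstractVector{<:Real}; cutoff=0.5) = [p > cutoff ? 0 : 1 for p in p0]
toy_p0 = [0.9, 0.2, 0.5]
toy_labels = threshold_labels(toy_p0);  # 0.5 is not greater than the cutoff, so the last row gets label 1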
# Now we create a DataFrame of test rows and create a new column for the predictions the model made.
df_test = df[test, :]
insertcols!(df_test, 2, :pred=>y_pred);
schema(df_test)
# ## Plot of the count of predicted labels for each value of race
df_test |>
@vlplot(
:bar,
width=50,
height=50,
column="race:o",
y={"count()", axis={title="count", grid=false}},
x={"pred:n", axis={title=""}},
color={"pred:n", scale={range=["#675193", "#ca8861"]}},
spacing=10,
config={
view={stroke=:transparent},
axis={domainWidth=1}
}
) |> save(joinpath(@OUTPUT,"COMPAS_plot2.svg"))
# \figalt{count of predicted labels}{COMPAS_plot2.svg}
# ## Fairness Metrics
#
# Now we compute the False Negative Rate, False Positive Rate, True Negative Rate and True Positive Rate for each group. Other metrics, such as the Equal Opportunity Score, can be calculated in a similar way.
y_test = convert(CategoricalArray, y[test])  # true labels of the test rows, computed once
for r in ["African-American", "Caucasian", "Hispanic"]
    indices = [x == r for x in df_test.race]  # Boolean mask selecting the rows of this group
    ŷ = convert(CategoricalArray, df_test[indices, :pred])  # predictions for this group
    println("Printing values for the race : ", r)
    println("False Negative Rate : ", false_negative_rate(ŷ, y_test[indices]))
    println("False Positive Rate : ", false_positive_rate(ŷ, y_test[indices]))
    println("True Negative Rate : ", true_negative_rate(ŷ, y_test[indices]))
    println("True Positive Rate : ", true_positive_rate(ŷ, y_test[indices]))
    println()
end
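# For reference, the four rates can be derived from confusion-matrix counts. The sketch below is plain Julia and assumes that label 1 is the positive class; `rates_sketch` and the toy vectors are hypothetical, and this is not MLJ's implementation.
function rates_sketch(pred::AbstractVector{<:Integer}, truth::AbstractVector{<:Integer})
    tp = count((pred .== 1) .& (truth .== 1))  # predicted 1, actually 1
    tn = count((pred .== 0) .& (truth .== 0))  # predicted 0, actually 0
    fp = count((pred .== 1) .& (truth .== 0))  # predicted 1, actually 0
    fn = count((pred .== 0) .& (truth .== 1))  # predicted 0, actually 1
    ## rates over actual positives (tp + fn) and over actual negatives (tn + fp)
    return (tpr = tp / (tp + fn), fnr = fn / (tp + fn),
            tnr = tn / (tn + fp), fpr = fp / (tn + fp))
end
toy_pred  = [1, 0, 1, 0, 1]  # made-up predictions
toy_truth = [1, 1, 0, 0, 1]  # made-up true labels
toy_rates = rates_sketch(toy_pred, toy_truth);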
# The analysis above was performed on the sensitive attribute race. A similar analysis could be performed on the other protected attributes, sex and age.
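#
# As a rough sketch of how the per-group loop extends to any attribute, the hypothetical helper `groupwise_fpr` below computes the false positive rate for every distinct value of a grouping vector. It is plain Julia on made-up data, assumes label 1 is the positive class, and returns NaN for a group with no actual negatives.
function groupwise_fpr(pred::AbstractVector, truth::AbstractVector, group::AbstractVector)
    result = Dict{eltype(group), Float64}()
    for g in unique(group)
        negatives = (group .== g) .& (truth .== 0)  # actual negatives within this group
        result[g] = count(negatives .& (pred .== 1)) / count(negatives)
    end
    return result
end
toy_sex = ["Male", "Female", "Male", "Female"]  # made-up attribute values
toy_fpr = groupwise_fpr([1, 0, 0, 1], [0, 0, 0, 1], toy_sex);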