Commit 4ef8e9e

committed
first commit
0 parents  commit 4ef8e9e

File tree

3,094 files changed

+1181893
-0
lines changed

.ipynb_checkpoints/NaiveBayes-checkpoint.ipynb

+583
@@ -0,0 +1,384 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {
6+
"deletable": true,
7+
"editable": true
8+
},
9+
"source": [
10+
"# Conditional Probability Activity & Exercise"
11+
]
12+
},
13+
{
14+
"cell_type": "markdown",
15+
"metadata": {
16+
"deletable": true,
17+
"editable": true
18+
},
19+
"source": [
20+
"Below is some code to create some fake data on how much stuff people purchase given their age range.\n",
21+
"\n",
22+
"It generates 100,000 random \"people\" and randomly assigns them as being in their 20's, 30's, 40's, 50's, 60's, or 70's.\n",
23+
"\n",
24+
"It then assigns a lower probability for young people to buy stuff.\n",
25+
"\n",
26+
"In the end, we have two Python dictionaries:\n",
27+
"\n",
28+
"\"totals\" contains the total number of people in each age group.\n",
29+
"\"purchases\" contains the total number of things purchased by people in each age group.\n",
30+
"The grand total of purchases is in totalPurchases, and we know the total number of people is 100,000.\n",
31+
"\n",
32+
"Let's run it and have a look:"
33+
]
34+
},
35+
{
36+
"cell_type": "code",
37+
"execution_count": 1,
38+
"metadata": {
39+
"collapsed": false,
40+
"deletable": true,
41+
"editable": true
42+
},
43+
"outputs": [],
44+
"source": [
45+
"from numpy import random\n",
46+
"random.seed(0)\n",
47+
"\n",
48+
"totals = {20:0, 30:0, 40:0, 50:0, 60:0, 70:0}\n",
49+
"purchases = {20:0, 30:0, 40:0, 50:0, 60:0, 70:0}\n",
50+
"totalPurchases = 0\n",
51+
"for _ in range(100000):\n",
52+
" ageDecade = random.choice([20, 30, 40, 50, 60, 70])\n",
53+
" purchaseProbability = float(ageDecade) / 100.0\n",
54+
" totals[ageDecade] += 1\n",
55+
" if (random.random() < purchaseProbability):\n",
56+
" totalPurchases += 1\n",
57+
" purchases[ageDecade] += 1"
58+
]
59+
},
60+
{
61+
"cell_type": "code",
62+
"execution_count": 2,
63+
"metadata": {
64+
"collapsed": false,
65+
"deletable": true,
66+
"editable": true
67+
},
68+
"outputs": [
69+
{
70+
"data": {
71+
"text/plain": [
72+
"{20: 16576, 30: 16619, 40: 16632, 50: 16805, 60: 16664, 70: 16704}"
73+
]
74+
},
75+
"execution_count": 2,
76+
"metadata": {},
77+
"output_type": "execute_result"
78+
}
79+
],
80+
"source": [
81+
"totals"
82+
]
83+
},
84+
{
85+
"cell_type": "code",
86+
"execution_count": 3,
87+
"metadata": {
88+
"collapsed": false,
89+
"deletable": true,
90+
"editable": true
91+
},
92+
"outputs": [
93+
{
94+
"data": {
95+
"text/plain": [
96+
"{20: 3392, 30: 4974, 40: 6670, 50: 8319, 60: 9944, 70: 11713}"
97+
]
98+
},
99+
"execution_count": 3,
100+
"metadata": {},
101+
"output_type": "execute_result"
102+
}
103+
],
104+
"source": [
105+
"purchases"
106+
]
107+
},
108+
{
109+
"cell_type": "code",
110+
"execution_count": 4,
111+
"metadata": {
112+
"collapsed": false,
113+
"deletable": true,
114+
"editable": true
115+
},
116+
"outputs": [
117+
{
118+
"data": {
119+
"text/plain": [
120+
"45012"
121+
]
122+
},
123+
"execution_count": 4,
124+
"metadata": {},
125+
"output_type": "execute_result"
126+
}
127+
],
128+
"source": [
129+
"totalPurchases"
130+
]
131+
},
132+
{
133+
"cell_type": "markdown",
134+
"metadata": {
135+
"deletable": true,
136+
"editable": true
137+
},
138+
"source": [
139+
"Let's play with conditional probability.\n",
140+
"\n",
141+
"First let's compute P(E|F), where E is \"purchase\" and F is \"you're in your 30's\". The probability of someone in their 30's buying something is just the fraction of people in their 30's who bought something:"
142+
]
143+
},
144+
{
145+
"cell_type": "code",
146+
"execution_count": 6,
147+
"metadata": {
148+
"collapsed": false,
149+
"deletable": true,
150+
"editable": true
151+
},
152+
"outputs": [
153+
{
154+
"name": "stdout",
155+
"output_type": "stream",
156+
"text": [
157+
"P(purchase | 30s): 0.29929598652145134\n"
158+
]
159+
}
160+
],
161+
"source": [
162+
"PEF = float(purchases[30]) / float(totals[30])\n",
163+
"print('P(purchase | 30s): ' + str(PEF))"
164+
]
165+
},
166+
{
167+
"cell_type": "markdown",
168+
"metadata": {
169+
"deletable": true,
170+
"editable": true
171+
},
172+
"source": [
173+
"P(F) is just the probability of being in your 30's in this data set:"
174+
]
175+
},
176+
{
177+
"cell_type": "code",
178+
"execution_count": 7,
179+
"metadata": {
180+
"collapsed": false,
181+
"deletable": true,
182+
"editable": true
183+
},
184+
"outputs": [
185+
{
186+
"name": "stdout",
187+
"output_type": "stream",
188+
"text": [
189+
"P(30's): 0.16619\n"
190+
]
191+
}
192+
],
193+
"source": [
194+
"PF = float(totals[30]) / 100000.0\n",
195+
"print(\"P(30's): \" + str(PF))"
196+
]
197+
},
198+
{
199+
"cell_type": "markdown",
200+
"metadata": {
201+
"deletable": true,
202+
"editable": true
203+
},
204+
"source": [
205+
"And P(E) is the overall probability of buying something, regardless of your age:"
206+
]
207+
},
208+
{
209+
"cell_type": "code",
210+
"execution_count": 8,
211+
"metadata": {
212+
"collapsed": false,
213+
"deletable": true,
214+
"editable": true
215+
},
216+
"outputs": [
217+
{
218+
"name": "stdout",
219+
"output_type": "stream",
220+
"text": [
221+
"P(Purchase):0.45012\n"
222+
]
223+
}
224+
],
225+
"source": [
226+
"PE = float(totalPurchases) / 100000.0\n",
227+
"print(\"P(Purchase):\" + str(PE))"
228+
]
229+
},
230+
{
231+
"cell_type": "markdown",
232+
"metadata": {
233+
"deletable": true,
234+
"editable": true
235+
},
236+
"source": [
237+
"If E and F were independent, then we would expect P(E|F) to be about the same as P(E). But they're not; P(E) is 0.45, and P(E|F) is 0.3. So, that tells us that E and F are dependent (which we know they are in this example).\n",
238+
"\n",
239+
"What is P(E)P(F)?"
240+
]
241+
},
242+
{
243+
"cell_type": "code",
244+
"execution_count": 9,
245+
"metadata": {
246+
"collapsed": false,
247+
"deletable": true,
248+
"editable": true
249+
},
250+
"outputs": [
251+
{
252+
"name": "stdout",
253+
"output_type": "stream",
254+
"text": [
255+
"P(30's)P(Purchase)0.07480544280000001\n"
256+
]
257+
}
258+
],
259+
"source": [
260+
"print(\"P(30's)P(Purchase)\" + str(PE * PF))"
261+
]
262+
},
263+
{
264+
"cell_type": "markdown",
265+
"metadata": {
266+
"deletable": true,
267+
"editable": true
268+
},
269+
"source": [
270+
"P(E,F) is different from P(E|F). P(E,F) would be the probability of both being in your 30's and buying something, out of the total population - not just the population of people in their 30's:"
271+
]
272+
},
273+
{
274+
"cell_type": "code",
275+
"execution_count": 10,
276+
"metadata": {
277+
"collapsed": false,
278+
"deletable": true,
279+
"editable": true
280+
},
281+
"outputs": [
282+
{
283+
"name": "stdout",
284+
"output_type": "stream",
285+
"text": [
286+
"P(30's, Purchase)0.04974\n"
287+
]
288+
}
289+
],
290+
"source": [
291+
"print(\"P(30's, Purchase)\" + str(float(purchases[30]) / 100000.0))"
292+
]
293+
},
294+
{
295+
"cell_type": "markdown",
296+
"metadata": {
297+
"deletable": true,
298+
"editable": true
299+
},
300+
"source": [
301+
"If E and F were independent, P(E,F) would equal P(E)P(F). Here they differ: P(E)P(F) is about 0.075, while P(E,F) is about 0.050. That gap is much too large to come from the randomness of the data alone; it is there because E and F are actually dependent on each other.\n",
302+
"\n",
303+
"We can also check that P(E|F) = P(E,F)/P(F) and sure enough, it is:"
304+
]
305+
},
306+
{
307+
"cell_type": "code",
308+
"execution_count": 11,
309+
"metadata": {
310+
"collapsed": false,
311+
"deletable": true,
312+
"editable": true
313+
},
314+
"outputs": [
315+
{
316+
"name": "stdout",
317+
"output_type": "stream",
318+
"text": [
319+
"0.29929598652145134\n"
320+
]
321+
}
322+
],
323+
"source": [
324+
"print((purchases[30] / 100000.0) / PF)"
325+
]
326+
},
327+
{
328+
"cell_type": "markdown",
329+
"metadata": {
330+
"deletable": true,
331+
"editable": true
332+
},
333+
"source": [
334+
"## Your Assignment"
335+
]
336+
},
337+
{
338+
"cell_type": "markdown",
339+
"metadata": {
340+
"deletable": true,
341+
"editable": true
342+
},
343+
"source": [
344+
"Modify the code above such that the purchase probability does NOT vary with age, making E and F actually independent.\n",
345+
"\n",
346+
"Then, confirm that P(E|F) is about the same as P(E), showing that the conditional probability of purchase for a given age is no different from the a priori probability of purchase regardless of age.\n"
347+
]
348+
},
349+
{
350+
"cell_type": "code",
351+
"execution_count": null,
352+
"metadata": {
353+
"collapsed": false,
354+
"deletable": true,
355+
"editable": true
356+
},
357+
"outputs": [],
358+
"source": [
359+
"PE = "
360+
]
361+
}
362+
],
363+
"metadata": {
364+
"kernelspec": {
365+
"display_name": "Python 3",
366+
"language": "python",
367+
"name": "python3"
368+
},
369+
"language_info": {
370+
"codemirror_mode": {
371+
"name": "ipython",
372+
"version": 3
373+
},
374+
"file_extension": ".py",
375+
"mimetype": "text/x-python",
376+
"name": "python",
377+
"nbconvert_exporter": "python",
378+
"pygments_lexer": "ipython3",
379+
"version": "3.5.2"
380+
}
381+
},
382+
"nbformat": 4,
383+
"nbformat_minor": 0
384+
}
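
The relationships the notebook walks through can be re-checked directly from the counts recorded in its cell outputs. The following is a minimal sketch rather than part of the commit: the variable names (`n_people`, `p_e`, `p_f`, `p_ef`, `p_e_given_f`) are assumptions introduced here, and the counts are copied from the `totals` and `purchases` outputs shown in the diff.

```python
# Counts copied from the notebook's recorded outputs.
totals = {20: 16576, 30: 16619, 40: 16632, 50: 16805, 60: 16664, 70: 16704}
purchases = {20: 3392, 30: 4974, 40: 6670, 50: 8319, 60: 9944, 70: 11713}

n_people = sum(totals.values())               # total population
p_e = sum(purchases.values()) / n_people      # P(E): purchase, any age
p_f = totals[30] / n_people                   # P(F): in your 30's
p_ef = purchases[30] / n_people               # P(E,F): in your 30's AND purchased
p_e_given_f = purchases[30] / totals[30]      # P(E|F): purchase given 30's

# The definition of conditional probability holds exactly on counts.
print("P(E,F) / P(F):", p_ef / p_f)
print("P(E|F):       ", p_e_given_f)

# Independence would require P(E,F) == P(E)P(F); here the two differ.
print("P(E)P(F):", p_e * p_f)
print("P(E,F):  ", p_ef)
```

Comparing the last two printed values is a quick independence check: the further apart they are relative to sampling noise, the stronger the evidence that E and F are dependent.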

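One possible approach to the assignment cell, which the commit leaves unfinished at `PE = `, is sketched below. It is not the author's solution. It assumes a fixed purchase probability of 0.4 (an arbitrary choice), uses Python's standard `random` module instead of `numpy.random` so that it runs without extra dependencies, and wraps the loop in a hypothetical helper named `simulate_independent`.

```python
import random

def simulate_independent(n_people=100000, purchase_probability=0.4, seed=0):
    """Simulate purchases whose probability does NOT depend on age."""
    rng = random.Random(seed)
    decades = [20, 30, 40, 50, 60, 70]
    totals = {d: 0 for d in decades}
    purchases = {d: 0 for d in decades}
    total_purchases = 0
    for _ in range(n_people):
        age_decade = rng.choice(decades)
        totals[age_decade] += 1
        # Same probability for every age group, so E and F are independent.
        if rng.random() < purchase_probability:
            total_purchases += 1
            purchases[age_decade] += 1
    return totals, purchases, total_purchases

totals, purchases, total_purchases = simulate_independent()

p_e = total_purchases / sum(totals.values())   # P(E)
p_e_given_f = purchases[30] / totals[30]       # P(E|F) for people in their 30's

print("P(E):  ", p_e)
print("P(E|F):", p_e_given_f)
```

With the age dependence removed, the two printed values should agree up to sampling noise, which is the behavior the assignment asks you to confirm.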