|
35 | 35 | "### Some probability vocabulary:\n", |
36 | 36 | "A **random variable**: \"A variable quantity whose value depends on possible outcomes.\" In this case, $T_1$ and $T_2$ are \"random variables\" \n", |
37 | 37 | "An **event**: A measurable outcome of the experiment (i.e., \"$T_1 = t_1$,\" where $t_1$ is some specific number of Retweets in the sample) \n", |
38 | | - "**Independent events**: Two (or more) events whose outcomes do not affect eachother. In this example, all of our events are pretty much independent of eachother.\n", |
39 | | - "A **PMF** or **probabilty mass function**: For a discrete probabilty distribution (in this case, we're dealing with a discrete distribution, because we cannot collect a sample of 10.5 Tweets), the probability mass function is a function that gives the probability of each possible event, and the sum over all possible events is 1. \n", |
| 38 | + "**Independent events**: Two (or more) events whose outcomes do not affect eachother. In this example, all of our events are pretty much independent of eachother. \n", |
| 39 | + "A **PMF** or **probabilty mass function**: For a discrete probabilty distribution (in this case, we're dealing with a discrete distribution, because we cannot collect a sample of 10.5 Tweets), the probability mass function is a function that gives the probability of each possible event, and the sum over all possible events is 1. \n", |
40 | 40 | "A **joint PMF**: A function that gives the probabilty of two events occurring together. For two independent events, we can just multiply the two individual PMFs together. Easy.\n", |
41 | 41 | "\n", |
42 | 42 | "\n", |
|
110 | 110 | "source": [ |
111 | 111 | "# Set up the problem. We have to fix tau_1, tau_2, and s\n", |
112 | 112 | "tau1 = 100\n", |
113 | | - "tau2 = 90\n", |
| 113 | + "tau2 = 75\n", |
114 | 114 | "s = .1\n", |
115 | 115 | "\n", |
116 | 116 | "# We can use the binomial PMF from scipy, which is better than coding it up myself \n", |
|
144 | 144 | { |
145 | 145 | "cell_type": "code", |
146 | 146 | "execution_count": null, |
147 | | - "metadata": {}, |
| 147 | + "metadata": { |
| 148 | + "collapsed": false |
| 149 | + }, |
148 | 150 | "outputs": [], |
149 | 151 | "source": [ |
150 | 152 | "plt.plot([x[0] for x in probabilities_t1],[x[1] for x in probabilities_t1])\n", |
|
157 | 159 | "cell_type": "markdown", |
158 | 160 | "metadata": {}, |
159 | 161 | "source": [ |
160 | | - "#### How likely is it that we get _exactly_ X% of $tau_1$? \n", |
| 162 | + "#### How likely is it that we get _exactly_ X% of $\\tau_1$? \n", |
161 | 163 | "$$ P(t_1 = s\\tau_1) = {\\tau_1\\choose s\\tau_1} s^{t_1}(1-s)^{(\\tau_1 - s\\tau_1)}$$" |
162 | 164 | ] |
163 | 165 | }, |
164 | 166 | { |
165 | 167 | "cell_type": "code", |
166 | 168 | "execution_count": null, |
167 | | - "metadata": {}, |
| 169 | + "metadata": { |
| 170 | + "collapsed": false |
| 171 | + }, |
168 | 172 | "outputs": [], |
169 | 173 | "source": [ |
170 | 174 | "print(\"The probability of getting exactly s*tau_1 samples is: {:f}\".format(binomial_tau1.pmf(s*tau1)))" |
|
181 | 185 | { |
182 | 186 | "cell_type": "code", |
183 | 187 | "execution_count": null, |
184 | | - "metadata": {}, |
| 188 | + "metadata": { |
| 189 | + "collapsed": false |
| 190 | + }, |
185 | 191 | "outputs": [], |
186 | 192 | "source": [ |
187 | | - "plus_or_minus_percent = .01\n", |
| 193 | + "plus_or_minus_percent = .1\n", |
188 | 194 | "lower_bound_t1 = int((1 - plus_or_minus_percent)*s*tau1)\n", |
189 | 195 | "upper_bound_t1 = int((1 + plus_or_minus_percent)*s*tau1)\n", |
190 | 196 | "probability_t1_interval = 0\n", |
191 | 197 | "for i in range(lower_bound_t1,upper_bound_t1):\n", |
192 | 198 | " probability_t1_interval += binomial_tau1.pmf(i)\n", |
193 | | - "print(\"P({} < t_1 < {}) = {:f}\".format(lower_bound_t1,upper_bound_t1,probability_t1_interval))" |
| 199 | + "print(\"P({} <= t_1 < {}) = {:f}\".format(lower_bound_t1,upper_bound_t1,probability_t1_interval))" |
194 | 200 | ] |
195 | 201 | }, |
196 | 202 | { |
|
202 | 208 | "It's relatively simple to work out the standard deviation of the Binomial distribution, if we remember a few things about expected values (averages), standard deviation and variance: \n", |
203 | 209 | "1. Expected value is a linear operator (i.e., $E[A + 2B] = E[A] + 2E[B]$) \n", |
204 | 210 | "2. We'll call $E[X] = \\mu$ \n", |
205 | | - "3. Variance: $Var[X] = E[(X - \\mu^2)] = E[X^2] - E[X]^2$\n", |
| 211 | + "3. Variance: $Var[X] = E[(X - \\mu)^2] = E[X^2] - E[X]^2$\n", |
206 | 212 | "4. If two random variables are uncorrelated, Variance is a linear operator: $Var[\\sum_i X_i] = \\sum_i Var[X_i]$\n", |
207 | 213 | "5. Standard deviation: $\\sqrt{Var[X]}$ \n", |
208 | 214 | "\n", |
|
221 | 227 | { |
222 | 228 | "cell_type": "code", |
223 | 229 | "execution_count": null, |
224 | | - "metadata": {}, |
| 230 | + "metadata": { |
| 231 | + "collapsed": false |
| 232 | + }, |
225 | 233 | "outputs": [], |
226 | 234 | "source": [ |
227 | 235 | "# standard deviation varies as the size of tau1 varies\n", |
|
250 | 258 | { |
251 | 259 | "cell_type": "code", |
252 | 260 | "execution_count": null, |
253 | | - "metadata": {}, |
| 261 | + "metadata": { |
| 262 | + "collapsed": false |
| 263 | + }, |
254 | 264 | "outputs": [], |
255 | 265 | "source": [ |
256 | 266 | "# Let's show how likely it is to get exactly t_1 of T_1. Just for practice\n", |
|
271 | 281 | { |
272 | 282 | "cell_type": "code", |
273 | 283 | "execution_count": null, |
274 | | - "metadata": {}, |
| 284 | + "metadata": { |
| 285 | + "collapsed": false |
| 286 | + }, |
275 | 287 | "outputs": [], |
276 | 288 | "source": [ |
277 | 289 | "# the total probability of falling within 3 standard deviations of the mean (s*tau_1):\n", |
278 | 290 | "plus_or_minus_stdv = .01\n", |
279 | 291 | "lower_bound_t1 = max(0,int(s*tau1 - 3 * sqrt(tau1*s*(1-s))))\n", |
280 | 292 | "upper_bound_t1 = int(s*tau1 + 3 * sqrt(tau1*s*(1-s)))\n", |
281 | 293 | "probability_t1_interval = 0\n", |
282 | | - "for i in range(lower_bound_t1,upper_bound_t1):\n", |
| 294 | + "for i in range(lower_bound_t1,upper_bound_t1+1):\n", |
283 | 295 | " probability_t1_interval += binomial_tau1.pmf(i)\n", |
284 | 296 | "print(\"P({} < t_1 < {}) = {:f} (very nearly 99.7%)\".format(lower_bound_t1,upper_bound_t1,probability_t1_interval))" |
285 | 297 | ] |
|
294 | 306 | { |
295 | 307 | "cell_type": "code", |
296 | 308 | "execution_count": null, |
297 | | - "metadata": {}, |
| 309 | + "metadata": { |
| 310 | + "collapsed": false |
| 311 | + }, |
298 | 312 | "outputs": [], |
299 | 313 | "source": [ |
300 | 314 | "# t1 and t2 most likely fall within some range\n", |
|
352 | 366 | { |
353 | 367 | "cell_type": "code", |
354 | 368 | "execution_count": null, |
355 | | - "metadata": {}, |
| 369 | + "metadata": { |
| 370 | + "collapsed": false |
| 371 | + }, |
356 | 372 | "outputs": [], |
357 | 373 | "source": [ |
358 | 374 | "figure, ax = plt.subplots(1,1,figsize = (6,12))\n", |
|
369 | 385 | { |
370 | 386 | "cell_type": "code", |
371 | 387 | "execution_count": null, |
372 | | - "metadata": {}, |
| 388 | + "metadata": { |
| 389 | + "collapsed": false |
| 390 | + }, |
373 | 391 | "outputs": [], |
374 | 392 | "source": [ |
375 | 393 | "# The total probabilties should sum to 1\n", |
|
381 | 399 | { |
382 | 400 | "cell_type": "code", |
383 | 401 | "execution_count": null, |
384 | | - "metadata": {}, |
| 402 | + "metadata": { |
| 403 | + "collapsed": false |
| 404 | + }, |
385 | 405 | "outputs": [], |
386 | 406 | "source": [ |
387 | 407 | "# Now, sum only the lower triangle--the piece where t_2 >= t_1\n", |
|
421 | 441 | "Call each PMF $P(X_i = x_i; \\theta)$ (where $\"; \\theta\"$ denotes the value of the parameter that we're trying to estimate, which the PMF is dependent on) then the probability (likelihood) of a set of $n$ results is dependent on a independent variable $\\theta$:\n", |
422 | 442 | "$$ L(\\theta) = P(X_1 = x_1, X_2 = x_2, X_3 = x_3, \\dots, X_n = x_n; \\theta) = \\prod_{i = 0}^n P(X_i= x_i; \\theta)$$\n", |
423 | 443 | "\n", |
424 | | - "Now, if we find the value of the parameter $\\theta$ that maximizes the likelihood of our results $x_i$ coming from the distribution, we have found the MAximum Likelihood Estimator for $\\theta$. Super.\n", |
| 444 | + "Now, if we find the value of the parameter $\\theta$ that maximizes the likelihood of our results $x_i$ coming from the distribution, we have found the Maximum Likelihood Estimator for $\\theta$. Super.\n", |
425 | 445 | "\n", |
426 | 446 | "#### 2. A simple example of MLE estimation\n", |
427 | 447 | "\n", |
|
458 | 478 | { |
459 | 479 | "cell_type": "code", |
460 | 480 | "execution_count": null, |
461 | | - "metadata": {}, |
| 481 | + "metadata": { |
| 482 | + "collapsed": true |
| 483 | + }, |
462 | 484 | "outputs": [], |
463 | 485 | "source": [ |
464 | 486 | "# set up the experiment\n", |
|
474 | 496 | { |
475 | 497 | "cell_type": "code", |
476 | 498 | "execution_count": null, |
477 | | - "metadata": {}, |
| 499 | + "metadata": { |
| 500 | + "collapsed": true |
| 501 | + }, |
478 | 502 | "outputs": [], |
479 | 503 | "source": [ |
480 | 504 | "# get the likelihood results\n", |
|
485 | 509 | { |
486 | 510 | "cell_type": "code", |
487 | 511 | "execution_count": null, |
488 | | - "metadata": {}, |
| 512 | + "metadata": { |
| 513 | + "collapsed": false |
| 514 | + }, |
489 | 515 | "outputs": [], |
490 | 516 | "source": [ |
491 | 517 | "# plot\n", |
|
525 | 551 | { |
526 | 552 | "cell_type": "code", |
527 | 553 | "execution_count": null, |
528 | | - "metadata": {}, |
| 554 | + "metadata": { |
| 555 | + "collapsed": true |
| 556 | + }, |
529 | 557 | "outputs": [], |
530 | 558 | "source": [ |
531 | 559 | "# get the likelihood results\n", |
|
536 | 564 | { |
537 | 565 | "cell_type": "code", |
538 | 566 | "execution_count": null, |
539 | | - "metadata": {}, |
| 567 | + "metadata": { |
| 568 | + "collapsed": false |
| 569 | + }, |
540 | 570 | "outputs": [], |
541 | 571 | "source": [ |
542 | 572 | "# plot\n", |
|
572 | 602 | { |
573 | 603 | "cell_type": "code", |
574 | 604 | "execution_count": null, |
575 | | - "metadata": {}, |
| 605 | + "metadata": { |
| 606 | + "collapsed": false |
| 607 | + }, |
576 | 608 | "outputs": [], |
577 | 609 | "source": [ |
578 | 610 | "# let's see what we would estimate s to be analytically\n", |
|
593 | 625 | { |
594 | 626 | "cell_type": "code", |
595 | 627 | "execution_count": null, |
596 | | - "metadata": {}, |
| 628 | + "metadata": { |
| 629 | + "collapsed": false |
| 630 | + }, |
597 | 631 | "outputs": [], |
598 | 632 | "source": [ |
599 | 633 | "# set up the experiment\n", |
600 | 634 | "# number of trials\n", |
601 | | - "n = 1000\n", |
| 635 | + "n = 100\n", |
602 | 636 | "# probabilty, s. i'll use the same \"s\" that we're using for the problem setup\n", |
603 | 637 | "print(\"s = {}\".format(s))\n", |
604 | 638 | "# now, fix a \"real\" tau\n", |
|
616 | 650 | { |
617 | 651 | "cell_type": "code", |
618 | 652 | "execution_count": null, |
619 | | - "metadata": {}, |
| 653 | + "metadata": { |
| 654 | + "collapsed": true |
| 655 | + }, |
620 | 656 | "outputs": [], |
621 | 657 | "source": [ |
622 | 658 | "# Now, evaluate the binomial distribution likelihood for this collection of results\n", |
|
636 | 672 | { |
637 | 673 | "cell_type": "code", |
638 | 674 | "execution_count": null, |
639 | | - "metadata": {}, |
| 675 | + "metadata": { |
| 676 | + "collapsed": false |
| 677 | + }, |
640 | 678 | "outputs": [], |
641 | 679 | "source": [ |
642 | 680 | "plt.plot([x[0] for x in likelihood_tau], [x[1] for x in likelihood_tau])\n", |
|
667 | 705 | { |
668 | 706 | "cell_type": "code", |
669 | 707 | "execution_count": null, |
670 | | - "metadata": {}, |
| 708 | + "metadata": { |
| 709 | + "collapsed": false |
| 710 | + }, |
671 | 711 | "outputs": [], |
672 | 712 | "source": [ |
673 | 713 | "%%time\n", |
|
692 | 732 | " binomial_tau2 = binom(j_tau1_i_tau2[1],s)\n", |
693 | 733 | " return (j_tau1_i_tau2, binomial_tau1.pmf(t1)*binomial_tau2.pmf(t2))\n", |
694 | 734 | "# create the processes\n", |
695 | | - "pool = Pool(processes=12)\n", |
| 735 | + "pool = Pool(processes=4)\n", |
696 | 736 | "tau2_and_tau1_list = []\n", |
697 | 737 | "probabilities_tau2_and_tau1_list = []\n", |
698 | 738 | "for j_tau1 in range(window_est_tau1[0],window_est_tau1[1]):\n", |
|
708 | 748 | { |
709 | 749 | "cell_type": "code", |
710 | 750 | "execution_count": null, |
711 | | - "metadata": {}, |
| 751 | + "metadata": { |
| 752 | + "collapsed": false |
| 753 | + }, |
712 | 754 | "outputs": [], |
713 | 755 | "source": [ |
714 | 756 | "figure, ax = plt.subplots(1,1,figsize = (6,12))\n", |
|
723 | 765 | { |
724 | 766 | "cell_type": "code", |
725 | 767 | "execution_count": null, |
726 | | - "metadata": {}, |
| 768 | + "metadata": { |
| 769 | + "collapsed": false |
| 770 | + }, |
727 | 771 | "outputs": [], |
728 | 772 | "source": [ |
729 | 773 | "print(\"Estimate of the probabilty that tau_1 < tau_2 given that t_1 = {} and t_2 = {}: {}\".format(\n", |
|