|
35 | 35 | "### Some probability vocabulary:\n",
|
36 | 36 | "A **random variable**: \"A variable quantity whose value depends on possible outcomes.\" In this case, $T_1$ and $T_2$ are \"random variables\" \n",
|
37 | 37 | "An **event**: A measurable outcome of the experiment (i.e., \"$T_1 = t_1$,\" where $t_1$ is some specific number of Retweets in the sample) \n",
|
38 |
| - "**Independent events**: Two (or more) events whose outcomes do not affect eachother. In this example, all of our events are pretty much independent of eachother.\n", |
39 |
| - "A **PMF** or **probabilty mass function**: For a discrete probabilty distribution (in this case, we're dealing with a discrete distribution, because we cannot collect a sample of 10.5 Tweets), the probability mass function is a function that gives the probability of each possible event, and the sum over all possible events is 1. \n", |
| 38 | + "**Independent events**: Two (or more) events whose outcomes do not affect eachother. In this example, all of our events are pretty much independent of eachother. \n", |
| 39 | + "A **PMF** or **probabilty mass function**: For a discrete probabilty distribution (in this case, we're dealing with a discrete distribution, because we cannot collect a sample of 10.5 Tweets), the probability mass function is a function that gives the probability of each possible event, and the sum over all possible events is 1. \n", |
40 | 40 | "A **joint PMF**: A function that gives the probabilty of two events occurring together. For two independent events, we can just multiply the two individual PMFs together. Easy.\n",
|
41 | 41 | "\n",
|
42 | 42 | "\n",
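To make the vocabulary concrete, here is a minimal sketch using `scipy.stats.binom`, assuming the same `s`, `tau1`, and `tau2` values fixed in the setup cell below; the specific counts plugged into the PMFs (10 and 7) are just illustrative.

```python
# Minimal sketch: a PMF and a joint PMF for two independent binomial
# random variables, using the values fixed in the setup cell below.
from scipy.stats import binom

s = 0.1                  # sampling probability (from the setup cell)
tau1, tau2 = 100, 75     # "true" totals (from the setup cell)

T1 = binom(tau1, s)      # random variable T_1 ~ Binomial(tau1, s)
T2 = binom(tau2, s)      # random variable T_2 ~ Binomial(tau2, s)

# PMF: the probability of the single event "T_1 = 10"
print(T1.pmf(10))

# A PMF sums to 1 over all possible events
print(sum(T1.pmf(k) for k in range(tau1 + 1)))

# Joint PMF of two independent events: multiply the individual PMFs
print(T1.pmf(10) * T2.pmf(7))
```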
|
|
110 | 110 | "source": [
|
111 | 111 | "# Set up the problem. We have to fix tau_1, tau_2, and s\n",
|
112 | 112 | "tau1 = 100\n",
|
113 |
| - "tau2 = 90\n", |
| 113 | + "tau2 = 75\n", |
114 | 114 | "s = .1\n",
|
115 | 115 | "\n",
|
116 | 116 | "# We can use the binomial PMF from scipy, which is better than coding it up myself \n",
|
|
144 | 144 | {
|
145 | 145 | "cell_type": "code",
|
146 | 146 | "execution_count": null,
|
147 |
| - "metadata": {}, |
| 147 | + "metadata": { |
| 148 | + "collapsed": false |
| 149 | + }, |
148 | 150 | "outputs": [],
|
149 | 151 | "source": [
|
150 | 152 | "plt.plot([x[0] for x in probabilities_t1],[x[1] for x in probabilities_t1])\n",
|
|
157 | 159 | "cell_type": "markdown",
|
158 | 160 | "metadata": {},
|
159 | 161 | "source": [
|
160 |
| - "#### How likely is it that we get _exactly_ X% of $tau_1$? \n", |
| 162 | + "#### How likely is it that we get _exactly_ X% of $\\tau_1$? \n", |
161 | 163 | "$$ P(t_1 = s\\tau_1) = {\\tau_1\\choose s\\tau_1} s^{t_1}(1-s)^{(\\tau_1 - s\\tau_1)}$$"
|
162 | 164 | ]
|
163 | 165 | },
|
164 | 166 | {
|
165 | 167 | "cell_type": "code",
|
166 | 168 | "execution_count": null,
|
167 |
| - "metadata": {}, |
| 169 | + "metadata": { |
| 170 | + "collapsed": false |
| 171 | + }, |
168 | 172 | "outputs": [],
|
169 | 173 | "source": [
|
170 | 174 | "print(\"The probability of getting exactly s*tau_1 samples is: {:f}\".format(binomial_tau1.pmf(s*tau1)))"
|
|
181 | 185 | {
|
182 | 186 | "cell_type": "code",
|
183 | 187 | "execution_count": null,
|
184 |
| - "metadata": {}, |
| 188 | + "metadata": { |
| 189 | + "collapsed": false |
| 190 | + }, |
185 | 191 | "outputs": [],
|
186 | 192 | "source": [
|
187 |
| - "plus_or_minus_percent = .01\n", |
| 193 | + "plus_or_minus_percent = .1\n", |
188 | 194 | "lower_bound_t1 = int((1 - plus_or_minus_percent)*s*tau1)\n",
|
189 | 195 | "upper_bound_t1 = int((1 + plus_or_minus_percent)*s*tau1)\n",
|
190 | 196 | "probability_t1_interval = 0\n",
|
191 | 197 | "for i in range(lower_bound_t1,upper_bound_t1):\n",
|
192 | 198 | " probability_t1_interval += binomial_tau1.pmf(i)\n",
|
193 |
| - "print(\"P({} < t_1 < {}) = {:f}\".format(lower_bound_t1,upper_bound_t1,probability_t1_interval))" |
| 199 | + "print(\"P({} <= t_1 < {}) = {:f}\".format(lower_bound_t1,upper_bound_t1,probability_t1_interval))" |
194 | 200 | ]
|
195 | 201 | },
|
196 | 202 | {
|
|
202 | 208 | "It's relatively simple to work out the standard deviation of the Binomial distribution, if we remember a few things about expected values (averages), standard deviation and variance: \n",
|
203 | 209 | "1. Expected value is a linear operator (i.e., $E[A + 2B] = E[A] + 2E[B]$) \n",
|
204 | 210 | "2. We'll call $E[X] = \\mu$ \n",
|
205 |
| - "3. Variance: $Var[X] = E[(X - \\mu^2)] = E[X^2] - E[X]^2$\n", |
| 211 | + "3. Variance: $Var[X] = E[(X - \\mu)^2] = E[X^2] - E[X]^2$\n", |
206 | 212 | "4. If two random variables are uncorrelated, Variance is a linear operator: $Var[\\sum_i X_i] = \\sum_i Var[X_i]$\n",
|
207 | 213 | "5. Standard deviation: $\\sqrt{Var[X]}$ \n",
|
208 | 214 | "\n",
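As a quick sanity check of the facts above, the closed-form standard deviation $\sqrt{\tau_1 s(1-s)}$ of a Binomial($\tau_1$, $s$) count can be compared against scipy's built-in value; this is a minimal sketch, assuming the `tau1` and `s` values from the setup cell.

```python
# Sketch: compare sqrt(tau1 * s * (1 - s)) with scipy's built-in
# standard deviation for a Binomial(tau1, s) random variable.
from math import sqrt
from scipy.stats import binom

tau1, s = 100, 0.1                   # values from the setup cell
by_hand = sqrt(tau1 * s * (1 - s))   # closed-form standard deviation
from_scipy = binom(tau1, s).std()    # scipy's value
print(by_hand, from_scipy)           # both are 3.0 for these values
```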
|
|
221 | 227 | {
|
222 | 228 | "cell_type": "code",
|
223 | 229 | "execution_count": null,
|
224 |
| - "metadata": {}, |
| 230 | + "metadata": { |
| 231 | + "collapsed": false |
| 232 | + }, |
225 | 233 | "outputs": [],
|
226 | 234 | "source": [
|
227 | 235 | "# standard deviation varies as the size of tau1 varies\n",
|
|
250 | 258 | {
|
251 | 259 | "cell_type": "code",
|
252 | 260 | "execution_count": null,
|
253 |
| - "metadata": {}, |
| 261 | + "metadata": { |
| 262 | + "collapsed": false |
| 263 | + }, |
254 | 264 | "outputs": [],
|
255 | 265 | "source": [
|
256 | 266 | "# Let's show how likely it is to get exactly t_1 of T_1. Just for practice\n",
|
|
271 | 281 | {
|
272 | 282 | "cell_type": "code",
|
273 | 283 | "execution_count": null,
|
274 |
| - "metadata": {}, |
| 284 | + "metadata": { |
| 285 | + "collapsed": false |
| 286 | + }, |
275 | 287 | "outputs": [],
|
276 | 288 | "source": [
|
277 | 289 | "# the total probability of falling within 3 standard deviations of the mean (s*tau_1):\n",
|
278 | 290 | "plus_or_minus_stdv = .01\n",
|
279 | 291 | "lower_bound_t1 = max(0,int(s*tau1 - 3 * sqrt(tau1*s*(1-s))))\n",
|
280 | 292 | "upper_bound_t1 = int(s*tau1 + 3 * sqrt(tau1*s*(1-s)))\n",
|
281 | 293 | "probability_t1_interval = 0\n",
|
282 |
| - "for i in range(lower_bound_t1,upper_bound_t1):\n", |
| 294 | + "for i in range(lower_bound_t1,upper_bound_t1+1):\n", |
283 | 295 | " probability_t1_interval += binomial_tau1.pmf(i)\n",
|
284 | 296 | "print(\"P({} < t_1 < {}) = {:f} (very nearly 99.7%)\".format(lower_bound_t1,upper_bound_t1,probability_t1_interval))"
|
285 | 297 | ]
|
|
294 | 306 | {
|
295 | 307 | "cell_type": "code",
|
296 | 308 | "execution_count": null,
|
297 |
| - "metadata": {}, |
| 309 | + "metadata": { |
| 310 | + "collapsed": false |
| 311 | + }, |
298 | 312 | "outputs": [],
|
299 | 313 | "source": [
|
300 | 314 | "# t1 and t2 most likely fall within some range\n",
|
|
352 | 366 | {
|
353 | 367 | "cell_type": "code",
|
354 | 368 | "execution_count": null,
|
355 |
| - "metadata": {}, |
| 369 | + "metadata": { |
| 370 | + "collapsed": false |
| 371 | + }, |
356 | 372 | "outputs": [],
|
357 | 373 | "source": [
|
358 | 374 | "figure, ax = plt.subplots(1,1,figsize = (6,12))\n",
|
|
369 | 385 | {
|
370 | 386 | "cell_type": "code",
|
371 | 387 | "execution_count": null,
|
372 |
| - "metadata": {}, |
| 388 | + "metadata": { |
| 389 | + "collapsed": false |
| 390 | + }, |
373 | 391 | "outputs": [],
|
374 | 392 | "source": [
|
375 | 393 | "# The total probabilties should sum to 1\n",
|
|
381 | 399 | {
|
382 | 400 | "cell_type": "code",
|
383 | 401 | "execution_count": null,
|
384 |
| - "metadata": {}, |
| 402 | + "metadata": { |
| 403 | + "collapsed": false |
| 404 | + }, |
385 | 405 | "outputs": [],
|
386 | 406 | "source": [
|
387 | 407 | "# Now, sum only the lower triangle--the piece where t_2 >= t_1\n",
|
|
421 | 441 | "Call each PMF $P(X_i = x_i; \\theta)$ (where $\"; \\theta\"$ denotes the value of the parameter that we're trying to estimate, which the PMF is dependent on) then the probability (likelihood) of a set of $n$ results is dependent on a independent variable $\\theta$:\n",
|
422 | 442 | "$$ L(\\theta) = P(X_1 = x_1, X_2 = x_2, X_3 = x_3, \\dots, X_n = x_n; \\theta) = \\prod_{i = 0}^n P(X_i= x_i; \\theta)$$\n",
|
423 | 443 | "\n",
|
424 |
| - "Now, if we find the value of the parameter $\\theta$ that maximizes the likelihood of our results $x_i$ coming from the distribution, we have found the MAximum Likelihood Estimator for $\\theta$. Super.\n", |
| 444 | + "Now, if we find the value of the parameter $\\theta$ that maximizes the likelihood of our results $x_i$ coming from the distribution, we have found the Maximum Likelihood Estimator for $\\theta$. Super.\n", |
425 | 445 | "\n",
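For the binomial setting used throughout this notebook, the maximization can also be done in closed form. As a standard worked example, treat $\tau_1$ as known and $s$ as the parameter to estimate, take the log of the likelihood, and set its derivative to zero:

$$ \log L(s) = \log{\tau_1 \choose t_1} + t_1 \log s + (\tau_1 - t_1)\log(1-s) $$

$$ \frac{d}{ds}\log L(s) = \frac{t_1}{s} - \frac{\tau_1 - t_1}{1-s} = 0 \quad\Rightarrow\quad \hat{s} = \frac{t_1}{\tau_1} $$

That is, the maximum likelihood estimate of $s$ is simply the observed proportion $t_1/\tau_1$.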
|
426 | 446 | "#### 2. A simple example of MLE estimation\n",
|
427 | 447 | "\n",
|
|
458 | 478 | {
|
459 | 479 | "cell_type": "code",
|
460 | 480 | "execution_count": null,
|
461 |
| - "metadata": {}, |
| 481 | + "metadata": { |
| 482 | + "collapsed": true |
| 483 | + }, |
462 | 484 | "outputs": [],
|
463 | 485 | "source": [
|
464 | 486 | "# set up the experiment\n",
|
|
474 | 496 | {
|
475 | 497 | "cell_type": "code",
|
476 | 498 | "execution_count": null,
|
477 |
| - "metadata": {}, |
| 499 | + "metadata": { |
| 500 | + "collapsed": true |
| 501 | + }, |
478 | 502 | "outputs": [],
|
479 | 503 | "source": [
|
480 | 504 | "# get the likelihood results\n",
|
|
485 | 509 | {
|
486 | 510 | "cell_type": "code",
|
487 | 511 | "execution_count": null,
|
488 |
| - "metadata": {}, |
| 512 | + "metadata": { |
| 513 | + "collapsed": false |
| 514 | + }, |
489 | 515 | "outputs": [],
|
490 | 516 | "source": [
|
491 | 517 | "# plot\n",
|
|
525 | 551 | {
|
526 | 552 | "cell_type": "code",
|
527 | 553 | "execution_count": null,
|
528 |
| - "metadata": {}, |
| 554 | + "metadata": { |
| 555 | + "collapsed": true |
| 556 | + }, |
529 | 557 | "outputs": [],
|
530 | 558 | "source": [
|
531 | 559 | "# get the likelihood results\n",
|
|
536 | 564 | {
|
537 | 565 | "cell_type": "code",
|
538 | 566 | "execution_count": null,
|
539 |
| - "metadata": {}, |
| 567 | + "metadata": { |
| 568 | + "collapsed": false |
| 569 | + }, |
540 | 570 | "outputs": [],
|
541 | 571 | "source": [
|
542 | 572 | "# plot\n",
|
|
572 | 602 | {
|
573 | 603 | "cell_type": "code",
|
574 | 604 | "execution_count": null,
|
575 |
| - "metadata": {}, |
| 605 | + "metadata": { |
| 606 | + "collapsed": false |
| 607 | + }, |
576 | 608 | "outputs": [],
|
577 | 609 | "source": [
|
578 | 610 | "# let's see what we would estimate s to be analytically\n",
|
|
593 | 625 | {
|
594 | 626 | "cell_type": "code",
|
595 | 627 | "execution_count": null,
|
596 |
| - "metadata": {}, |
| 628 | + "metadata": { |
| 629 | + "collapsed": false |
| 630 | + }, |
597 | 631 | "outputs": [],
|
598 | 632 | "source": [
|
599 | 633 | "# set up the experiment\n",
|
600 | 634 | "# number of trials\n",
|
601 |
| - "n = 1000\n", |
| 635 | + "n = 100\n", |
602 | 636 | "# probabilty, s. i'll use the same \"s\" that we're using for the problem setup\n",
|
603 | 637 | "print(\"s = {}\".format(s))\n",
|
604 | 638 | "# now, fix a \"real\" tau\n",
|
|
616 | 650 | {
|
617 | 651 | "cell_type": "code",
|
618 | 652 | "execution_count": null,
|
619 |
| - "metadata": {}, |
| 653 | + "metadata": { |
| 654 | + "collapsed": true |
| 655 | + }, |
620 | 656 | "outputs": [],
|
621 | 657 | "source": [
|
622 | 658 | "# Now, evaluate the binomial distribution likelihood for this collection of results\n",
|
|
636 | 672 | {
|
637 | 673 | "cell_type": "code",
|
638 | 674 | "execution_count": null,
|
639 |
| - "metadata": {}, |
| 675 | + "metadata": { |
| 676 | + "collapsed": false |
| 677 | + }, |
640 | 678 | "outputs": [],
|
641 | 679 | "source": [
|
642 | 680 | "plt.plot([x[0] for x in likelihood_tau], [x[1] for x in likelihood_tau])\n",
|
|
667 | 705 | {
|
668 | 706 | "cell_type": "code",
|
669 | 707 | "execution_count": null,
|
670 |
| - "metadata": {}, |
| 708 | + "metadata": { |
| 709 | + "collapsed": false |
| 710 | + }, |
671 | 711 | "outputs": [],
|
672 | 712 | "source": [
|
673 | 713 | "%%time\n",
|
|
692 | 732 | " binomial_tau2 = binom(j_tau1_i_tau2[1],s)\n",
|
693 | 733 | " return (j_tau1_i_tau2, binomial_tau1.pmf(t1)*binomial_tau2.pmf(t2))\n",
|
694 | 734 | "# create the processes\n",
|
695 |
| - "pool = Pool(processes=12)\n", |
| 735 | + "pool = Pool(processes=4)\n", |
696 | 736 | "tau2_and_tau1_list = []\n",
|
697 | 737 | "probabilities_tau2_and_tau1_list = []\n",
|
698 | 738 | "for j_tau1 in range(window_est_tau1[0],window_est_tau1[1]):\n",
|
|
708 | 748 | {
|
709 | 749 | "cell_type": "code",
|
710 | 750 | "execution_count": null,
|
711 |
| - "metadata": {}, |
| 751 | + "metadata": { |
| 752 | + "collapsed": false |
| 753 | + }, |
712 | 754 | "outputs": [],
|
713 | 755 | "source": [
|
714 | 756 | "figure, ax = plt.subplots(1,1,figsize = (6,12))\n",
|
|
723 | 765 | {
|
724 | 766 | "cell_type": "code",
|
725 | 767 | "execution_count": null,
|
726 |
| - "metadata": {}, |
| 768 | + "metadata": { |
| 769 | + "collapsed": false |
| 770 | + }, |
727 | 771 | "outputs": [],
|
728 | 772 | "source": [
|
729 | 773 | "print(\"Estimate of the probabilty that tau_1 < tau_2 given that t_1 = {} and t_2 = {}: {}\".format(\n",
|
|