---
output:
  html_document:
    fig_caption: yes
    number_sections: yes
    theme: readable
    toc: yes
    toc_depth: 3
editor_options:
  chunk_output_type: console
---
```{r setup, echo=FALSE, message=FALSE}
library(kableExtra)
library(knitr)
```
# Required Math Skills and Symbols {#chMath}
In this chapter we introduce most of the mathematics and symbols that are used in the book. We start with the concept and symbols for summation used to represent sums of many numbers generally and succinctly. We use simple numerical examples for which we give the mathematical equations and the corresponding R code to perform the operations. In order to present the symbols in a context that is relevant for the rest of the book, we explore equations for averages and variances, and we present models. At this point, the definitions of mean, variance, etc., and the formulation of specific models are not the main focus, and they will be presented later in more detail.
A **key point** in this chapter is that
```{block StatModels, type = 'stattip'}
- In statistics we are always using models, and making those models explicit by writing them down completely is essential to understanding and communicating what we are doing.
```
This chapter develops models of increasing complexity and uses them to present the equations, symbols, and mathematical concepts that are used throughout the rest of the book.
A second **key point**, related to the first one, is that
```{block UnknownParameters, type = 'stattip'}
- We rarely, if ever, know the actual "true" value of parameters, variables or quantities in the real world (as opposed to simulations).
```
We do not know parameter values because we rarely know or even have access to the whole population, which would be necessary to "measure" a parameter like the true mean. Even if we had access to the whole population to measure each individual, we would not know parameter values with certainty, because measurements always have "errors." Measurements are imperfect because measuring instruments have limited precision, and because they are used by "imperfect" humans who make all sorts of errors. The imperfection of measurements should rarely be an issue for the topics covered in this course, and our focus is on estimation of unknown parameters. However, when dealing with chaotic systems, for example in weather forecasting, the imprecision of the measurements is a major limitation, because even minute errors in measurement grow exponentially over time. Although the physical properties of the atmosphere are well known, the imperfection of measurements prevents long-term prediction of specific weather with high confidence.
In reading this chapter, focus on the symbols and concepts. Before doing the exercises for this chapter, students should make sure that they can reproduce the tables of symbols with subscripts, and the symbols for specific table elements and sums of elements. We recommend that you write down the equations as you read them, and then try to reproduce them from memory and a description of the calculation (e.g., sum of all moisture values for corn in fields 3 through 7). This will greatly facilitate understanding and interpretation of the symbols and details of each equation, and will reveal those aspects that are not yet clear to you.
## Learning Objectives for Chapter {#mathObj}
1. Identify and write down simple statistical models.
1. Read and interpret mathematical equations.
1. Identify random and fixed components of statistical models.
1. Pair symbols with corresponding concepts or definitions.
1. Express sums using the summation symbol and expand sums defined with the summation symbol.
1. Identify the symbols that correspond to standard deviation, variance, population mean, and sample average.
1. List different strategies we can use to find the optimal value for an unknown parameter.
1. Identify and write down polynomials and linear combinations of parameters.
## Data Columns and Summation {#mathSum}
Summations, sums of several numbers organized with subscripts, are used frequently in statistics. To read equations and understand the concepts in them, you have to be familiar with summations. The capital Greek letter sigma, $\Sigma$, is used to represent summation. For example, $\sum_{i=1}^{4} Y_i$ represents the sum $Y_1 + Y_2 + Y_3 + Y_4$. Sometimes, for simplicity, when we want to sum over all possible values of the subscript $i$, from $i = 1$ to a maximum but unspecified value $i = k$, we write $\sum_{i=1}^k Y_i$, meaning "the sum of all values of $Y_i$."
We can write a symbol for all the $Y_i$ to avoid having to write each one of them. All the values of $Y$ in the correct order constitute a *vector* of values. We use bold letters to refer to vectors. Suppose we have a vector with the numbers 7, 5, 9 and 6. We write this as:
<br>
\begin{equation}
\mathbf{Y} = \begin{pmatrix} Y_1 & Y_2 & Y_3 & Y_4 \end{pmatrix} = \begin{pmatrix} 7 & 5 & 9 & 6 \end{pmatrix} = \left( Y_i \right) \quad i = 1, \dots, 4
(\#eq:math01)
\end{equation}
<br>
where it is implied that $Y_1 = 7$, $Y_2 = 5$, $Y_3 = 9$ and $Y_4 = 6$. There are no calculations or concepts here other than the fact that vectors are sets of numbers with order, and that we can refer to them with symbols. In R we can define the vector and give it values using the `c` function, which stands for "combine," as follows:
```{r, message=FALSE}
Y <- c(7, 5, 9, 6)
print(Y)
# The "[1]" on the left simply indicates that 7 is the first element in the output.
```
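To make the correspondence between $\sum_{i=1}^{4} Y_i$ and code explicit, the summation can also be written as an accumulating loop. This is only a sketch of what the notation means; in practice we use the vectorized `sum()` shown below.
```{r, message=FALSE}
total <- 0                  # start the accumulator at zero
for (i in 1:4) {
  total <- total + Y[i]     # add Y_i at each step, i = 1, ..., 4
}
total                       # 7 + 5 + 9 + 6 = 27
```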
These summation equations have parallels in R code. In R, we put all values of Y in a vector, and the subscripts are automatically assigned based on the position of each value of Y.
We can refer to each element in the vector Y by using its index number inside brackets and right next to the name of the vector. For example, `Y[2]` is the second element of Y, which is the number 5. Using different expressions instead of a single number, we can extract any parts of Y we need to operate with.
```{r, message=FALSE}
# Extract the third element of Y
Y[3]
# The "[1]" on the left simply indicates that 3 is the first element
# in the requested output.
# Extract elements 2 to 3 of Y
Y[2:3]
# The "[1]" on the left simply indicates that 5 is the first element
# in the requested output
# The ":" symbol indicates that you would like to retrieve
# the 2nd to the 3rd elements of Y. "2:4" would retrieve
# elements 2, 3, and 4.
# Extract elements 1, 3 and 4 of Y
Y[c(1, 3:4)]
# We used the c() function again to combine elements.
# In this case, we combined the index numbers 1, 3, and 4.
```
Consider the following to get an idea of what the symbols mean and how they work.
<br>
\begin{equation}
\sum\mathbf{Y} = \sum_{i=1}^{i=4} Y_i = Y_1 + Y_2 + Y_3 + Y_4 = 7 + 5 + 9 + 6 = 27
(\#eq:math02)
\end{equation}
<br>
Because most R operations are vectorized, if we want to use the whole vector we do not need to specify subscripts: R takes `sum(Y)` to mean "sum all elements of Y." To sum parts of a vector, we use the extraction methods described above.
```{r, message=FALSE}
sum(Y)
```
<br>
\begin{equation}
\displaystyle\sum_{i=2}^{i=4} Y_i = Y_2 + Y_3 + Y_4 = 5 + 9 + 6 = 20
(\#eq:math03)
\end{equation}
<br>
```{r, message=FALSE}
sum(Y[2:4])
```
The subscripts in a summation can themselves be expressions, which makes this notation very versatile, although in this course we will not use many complicated subscripts.
<br>
\begin{equation}
\displaystyle\sum_{i=1}^{i=3} Y_{i} = \displaystyle\sum_{i=2}^{i=4} Y_{i-1} = Y_1 + Y_2 + Y_3 = 7 + 5 + 9 = 21
(\#eq:math04)
\end{equation}
<br>
<br>
\begin{equation}
\begin{aligned}
\displaystyle\sum_{i=1}^{i=3} \left( Y_{i} + Y_{i+1} \right) &= \left( Y_1 + Y_2 \right) + \left( Y_2 + Y_3 \right) + \left( Y_3 + Y_4 \right) \\ &=
Y_1 + Y_2 + Y_2 + Y_3 + Y_3 + Y_4 \\ \\ &=
7 + 5 + 5 + 9 + 9 + 6 = 41
\end{aligned}
(\#eq:math05)
\end{equation}
<br>
As usual, there are many ways to complete a task in R. To perform the sum above we can use the following alternatives:
```{r, message=FALSE}
sum(Y[1:3] + Y[2:4])
sum(Y[1:3]) + sum(Y[2:4])
i <- 1:3
sum(Y[i] + Y[i + 1])
```
In order to explain further properties of the summation notation, let's use a second vector represented by the letter $\mathbf{X}$, where
<br>
\begin{equation}
\mathbf{X} = \begin{pmatrix} X_1 & X_2 & X_3 & X_4 \end{pmatrix} = \begin{pmatrix} 3 & 2 & 4 & 5 \end{pmatrix} = \left( X_i \right) \quad i = 1, \dots, 4
(\#eq:math06)
\end{equation}
<br>
```{r, message=FALSE}
X <- c(3, 2, 4, 5)
print(X)
```
The notation is very general and can be used to express sums like the following sum of products of corresponding elements in $\mathbf{X}$ and $\mathbf{Y}$:
<br>
\begin{equation}
\begin{aligned}
\displaystyle\sum_{i=1}^{i=4} (X_{i} \ Y_{i})
&= (X_1 \ Y_1) + (X_2 \ Y_2) + (X_3 \ Y_3) + (X_4 \ Y_4) \\ &=
3 \times 7 + 2 \times 5 + 4 \times 9 + 5 \times 6 = 97
\end{aligned}
(\#eq:math07)
\end{equation}
<br>
```{r, message=FALSE}
sum(X * Y) # R understands what you mean
# Optional learning:
X %*% Y
# the same operation in matrix algebra yields a 1x1 matrix. You can see the output
# has a row [1,] and column index [,1] for the single number, indicating it is a matrix,
# an R object class that is described below.
```
## Two-dimensional Data Tables and Summation {#math2Dtbl}
The vectors $\mathbf{X}$ and $\mathbf{Y}$ used above are 1-dimensional; they are single columns of numbers. Frequently, data are organized in more than one dimension, as in a 2-dimensional table with rows and columns. A conventional and very clear way to represent those tables is a matrix where each element is identified by its "address" in the table, given by the row and column numbers, always in that order [row, column].
> In matrices, subscripts are always row, column. \
> $Y_{3,7}$ is the value in row 3 and column 7
The subscript $i$ refers to rows and $j$ refers to columns. We illustrate this with an example in which soil moisture was measured under wheat, corn, and tomatoes in each of 7 fields (numbers are fictitious). The matrix is built by using rows for crop and columns for fields. Note that the subscripts are not numbers with 2 digits, but two single-digit integers written together; subscript 26 does **not** mean 2 times 10 plus 6 times 1, but simply the second row and sixth column. For situations where there are more than 9 rows and/or columns, we use a separator between the row and column numbers to avoid ambiguity (for example, to be able to tell if $Y_{162}$ is row 1 column 62 or row 16 column 2, we use $Y_{1,62}$ or $Y_{16,2}$). The comma can be omitted when neither subscript is ever greater than 9. The following is an example of a 3-row by 7-column matrix filled with the soil moisture data:
<br>
\
\
\begin{equation}
\mathbf{Y} =
\begin{pmatrix}
Y_{1,1} & Y_{1,2} & Y_{1,3} & Y_{1,4} & Y_{1,5} & Y_{1,6} & Y_{1,7} \\
Y_{2,1} & Y_{2,2} & Y_{2,3} & Y_{2,4} & Y_{2,5} & Y_{2,6} & Y_{2,7} \\
Y_{3,1} & Y_{3,2} & Y_{3,3} & Y_{3,4} & Y_{3,5} & Y_{3,6} & Y_{3,7}
\end{pmatrix} \\[35pt] =
\begin{pmatrix}
34 & 32 & 27 & 37 & 25 & 23 & 30 \\
30 & 26 & 26 & 33 & 24 & 22 & 27 \\
29 & 27 & 24 & 34 & 23 & 20 & 26
\end{pmatrix}
(\#eq:math08)
\end{equation}
<br>
In R, two-dimensional data (rows and columns) can be stored as either a matrix or a data frame. In a matrix, all values must be of the same type. In a data frame, columns can differ in the type of value they contain; for example, column 1 can contain integers, column 2 character values (e.g., "wheat", "corn"), and column 3 factor levels. We often use data frames even when all values are of the same type, because data frames are more versatile and are the most common form of data in R and in science in general. Instead of Y, in the code below we call the data "Ydata" to avoid conflicts with the names of the vectors used above. First, we make vectors of data for each crop, and then we join them into a data frame with the `rbind` function, which stands for "row bind."
```{r, message=FALSE}
wheat <- c(34, 32, 27, 37, 25, 23, 30)
corn <- c(30, 26, 26, 33, 24, 22, 27)
tomato <- c(29, 27, 24, 34, 23, 20, 26)
Ydata <- as.data.frame(
rbind(wheat, corn, tomato)
)
names(Ydata) <- c("field1",
"field2",
"field3",
"field4",
"field5",
"field6",
"field7")
Ydata
Ydata[2, 5] # Soil moisture for field 5 with corn
Ydata["corn", "field5"] # You can also directly use the row and column names to extract values.
```
In general, for a table of values with k rows and r columns, we write:
<br>
\begin{equation}
\mathbf{Y}_{k,r} =
\begin{pmatrix}
Y_{1,1} & Y_{1,2} & \cdots & Y_{1,r} \\
Y_{2,1} & Y_{2,2} & \cdots & Y_{2,r} \\
\vdots & \vdots & \ddots & \vdots \\
Y_{k,1} & Y_{k,2} & \cdots & Y_{k,r}
\end{pmatrix}
(\#eq:math09)
\end{equation}
<br>
As an exercise to make sure you understand the subscripts, write down the matrix $\mathbf{Y}$ with all subscripts without looking at the book.
If we want to average all values in the table, we need to use a double summation to indicate that the sum is supposed to proceed over both rows and columns (Equation \@ref(eq:math10)).
<br>
\begin{equation}
\begin{aligned}
&\sum_{i=1}^{i=3} \displaystyle\sum_{j=1}^{j=7} \frac{Y_{ij}} {3 \times 7} =
\frac{1}{3 \times 7} \ \sum_{i=1}^{i=3}\sum_{j=1}^{j=7}Y_{ij} \\[20pt]
&= \frac{1}{3 \times 7} \left(
\sum_{j=1}^{j=7}Y_{1j} +
\sum_{j=1}^{j=7}Y_{2j} +
\sum_{j=1}^{j=7}Y_{3j} \right) \\[18pt]
&= \frac{1}{3 \times 7} \ (Y_{11} + Y_{12} + Y_{13} + Y_{14} + Y_{15} + Y_{16} + Y_{17} \\
& \qquad \qquad + Y_{21} + Y_{22} + Y_{23} + Y_{24} + Y_{25} + Y_{26} + Y_{27} \\[8pt]
& \qquad \qquad + Y_{31} + Y_{32} + Y_{33} + Y_{34} + Y_{35} + Y_{36} + Y_{37}) \\[20pt]
& = \frac{1}{3 \times 7} \ (34 + 32 + 27 + 37 + 25 + 23 + 30 \\
& \qquad \qquad + 30 + 26 + 26 + 33 + 24 + 22 + 27 \\
& \qquad \qquad + 29 + 27 + 24 + 34 + 23 + 20 + 26) = \frac{579}{21} = 27.57
\end{aligned}
(\#eq:math10)
\end{equation}
<br>
Using R, we can calculate sums and averages very easily. Note that the functions `nrow` and `ncol` extract the number of rows and columns of a data frame or matrix. The function `mean` yields the average for the whole table.
```{r, message=FALSE}
sum(Ydata) # Sums all values in table. Double summation.
# (Works only when all columns are numeric.)
nrow(Ydata) # Ydata has 3 rows: k = 3
ncol(Ydata) # and 7 columns: r = 7
sum(Ydata) / (nrow(Ydata) * ncol(Ydata)) # Overall average
mean(as.matrix(Ydata)) # Average of all values in table
# (Works only when all columns are numeric.)
```
If we just want the average for corn, we extract just the row for corn (row 2) and sum over all columns for row 2. If we want the results for a field or a set of fields, we specify the corresponding column numbers. Let's calculate the average moisture for corn and then the average for fields 3, 4, 5 and 7.
<br>
\begin{equation}
\begin{aligned}
&corn \ average = \frac{1}{7} \left( \sum_{j=1}^{j=7}Y_{2j} \right) \\[15pt]
&= \frac{1}{7} (Y_{21} + Y_{22} + Y_{23} + Y_{24} + Y_{25} + Y_{26} + Y_{27}) \\[15pt]
&= \frac{1}{7} (30 + 26 + 26 + 33 + 24 + 22 + 27) = 26.86
\end{aligned}
(\#eq:math11)
\end{equation}
<br>
<br>
\begin{equation}
\begin{aligned}
&fields \ 3, \ 4, \ 5 \ \& \ 7 \ average = \frac{1}{4 \times 3} \left( \sum_{i=1}^{i=3}Y_{i3} +
\sum_{i=1}^{i=3}Y_{i4} +
\sum_{i=1}^{i=3}Y_{i5} +
\sum_{i=1}^{i=3}Y_{i7} \right) \\[20pt]
&= \frac{1}{4 \times 3} (Y_{13} + Y_{23} + Y_{33} + Y_{14} + Y_{24} + Y_{34} + Y_{15} + Y_{25} + Y_{35} + Y_{17} + Y_{27} + Y_{37}) \\[20pt]
&= \frac{1}{4 \times 3} (27 + 26 + 24 + 37 + 33 + 34 + 25 + 24 + 23 + 30 + 27 + 26) \\[20pt]
&= \frac{336}{12} = 28.0
\end{aligned}
(\#eq:math12)
\end{equation}
<br>
We obtain those averages in R with the following code. Note that when we leave the place for row (or column) number empty, it is interpreted as all rows (or columns).
```{r, message=FALSE}
mean(as.numeric(Ydata[2, ])) # Average for corn across fields
# Note that the row has to be transformed into a numeric vector using as.numeric()
# before it can be used with the function "mean."
# Alternatively, you can calculate the corn row mean using the following functions:
sum(Ydata[2, ]) # This sums all values for row 2, which is corn
length(Ydata[2, ]) # Number of values in row 2, i.e., r = 7 for corn
sum(Ydata[2, ]) / length(Ydata[2, ]) # Average for corn
```
## Models {#mathModl}
In statistics we use models all the time, and it is necessary to make those models explicit. Explicit models make it easier to understand and critically evaluate the use of statistics. Specifically, we can determine what parts of the model may be inadequate to solve the problem at hand, and we can modify the model to improve it. This section of the book presents a brief overview of the role of modeling in statistics. I strongly recommend <a href="http://r4ds.had.co.nz/model-basics.html" target="_blank">Chapter 24</a> of *R for Data Science* [@r4ds] for a more extensive treatment of the topic.
### The simplest model{#mathSmplMod}
Say that we are interested in studying the milk production per dairy cow in $kg \ day^{-1}$ for all dairy cows in the US today. For example, we want to estimate what proportion of cows produce less than $24 \ kg \ day^{-1}$ to gauge the number of cows that could be targets of a program to increase productivity. One way to do this is to create a *model* for the statistical distribution of milk production, take a sample to estimate the parameters of the model, and then use the model with estimated parameters to estimate the proportion of cows producing less than $24 \ kg \ day^{-1}$, or any other quantity of interest.
A first attempt to estimate the needed proportion is to *model* milk production per cow as a normally distributed random variable with unknown mean and variance. Using the letter $Y$ to represent milk production of a cow, we write $Y \sim N(\mu, \sigma^2)$, which is read "Y is a random variable with normal distribution, with mean $\mu$ and variance $\sigma^2$." Sometimes we write $\sigma$, the standard deviation, instead of $\sigma^2$. The standard deviation is simply the square root of the variance.
The model is not correct, and it could not be correct, for a number of reasons. First, normal distributions can take any values between $-\infty$ and $+\infty$, whereas milk production is zero or positive, but it cannot be negative. Moreover, if we are considering the population to be all the dairy cows in the US, although there are many dairy cows in the US (USDA estimated 9.3 million cows and heifers in 2014), the number is finite, whereas the normal distribution is for infinite populations. Yet, these flaws of our model are **not** important!
```{block, type = 'stattip'}
- Models are not supposed to be perfect representations of reality,
they are supposed to be **useful** representations of reality.
```
Assuming that the variance of milk production is small relative to the mean, the fact that production has to be positive is not a problem, because only a tiny and irrelevant piece of the normal distribution used will fall below zero. The fact that the number of cows is finite is not a problem because we can either think of the total population as a very, very large sample of a truly infinite theoretical cow population or simply use the normal as an approximation to the truly discrete and finite real population.
The simple model that we are using can also be written as:
<br>
\begin{equation}
\begin{aligned}
Y_i = \mu + \epsilon_i \\[10pt]
\epsilon_i \sim N(0, \sigma^2)
\end{aligned}
(\#eq:math13)
\end{equation}
<br>
where $Y_i$ is the milk produced by cow number $i$, $\mu$ is the population mean and $\epsilon$ is a random variable with a normal distribution with mean 0 and variance $\sigma^2$. This is exactly the same model as before, but now we have isolated the deviations of each cow from the mean for the whole population using the random variable $\epsilon$, which we call "errors," not because there is anything wrong, but because we choose not to be interested in why cows differ in milk production per day. In this model, milk production for cow i is partitioned into an overall mean for the population of cows and a random "error."
The equation $\epsilon_i \sim N(0, \sigma^2)$ is actually a simplified version of $\epsilon_i \sim iid \quad N(0, \sigma^2) \quad \forall \ i$, which means that the errors for all^[$\forall$ means "for all" in math] observations have independent and identical normal distributions with mean 0 and variance $\sigma^2$. Why is it necessary to state that the errors come from "identical" distributions? Isn't $\epsilon$ just one random variable? Actually, in reality, it is possible for the error for each observation to come from a different distribution, so we need to specify how we are modeling the errors. However, a model with such complexity would probably not be useful. Errors can also come from distributions that are not independent, so we need to explicitly state that the errors among observations are independent. At this point, you should just keep in mind that in real-world applications of statistics, one frequently finds errors that are not normal, have unequal variances, and/or are correlated. More advanced models exist that include parts and parameters to deal with lack of normality, heterogeneity of variances and lack of independence, but they are beyond the scope of this course.
In order to use the simple model to estimate the proportion of cows that produce less than $24 \ kg \ day^{-1}$, we can get a random sample of r cows and, based on that sample, estimate the mean as
<br>
\begin{equation}
\hat{\mu} = \frac{1}{r} \sum_{i=1}^{i=r} Y_i
(\#eq:math14)
\end{equation}
<br>
and the variance as
<br>
\begin{equation}
\hat{\sigma}^2 = \frac{1}{r-1} \sum_{i=1}^{i=r} (Y_i-\hat{\mu})^2
(\#eq:math15)
\end{equation}
<br>
where $e = \hat{\epsilon} = (Y_i-\hat{\mu})$ are the deviations of each cow from the estimated overall mean. These equations are presented in more detail in a later [chapter](#ch.anova). Finally, the proportion of cows whose daily milk production is less than $24 \ kg \ day^{-1}$ is estimated as the area under the normal distribution curve between $-\infty$ and $24 \ kg \ day^{-1}$. Using the capital letter $P$ to indicate "probability," we can write:
<br>
\begin{equation}
\begin{aligned}
&\hat{P}(Y \le 24 \ kg \ day^{-1}) = area \ under \ curve = \int_{-\infty}^{24} f(Y) \ \mathrm{d}Y \\[20pt]
&where \ f(Y) = \frac{1}{{\hat{\sigma} \sqrt{2\pi}}} \ \mathrm{exp} \left( { - \left( {Y - \hat{\mu} } \right) ^ 2 / {2\hat{\sigma}^2}} \right)
\end{aligned}
(\#eq:math16)
\end{equation}
<br>
Suppose that the estimated mean and variance were $28.3 \ kg \ day^{-1}$ and $25 \ kg^2 \ day^{-2}$. USDA reported an average production of $28.3 \ kg \ day^{-1}$ of milk per milking cow in 2016. <a href="https://www.nass.usda.gov/Publications/Ag_Statistics/2017/Chapter08.pdf" target="_blank">See report here</a>. The variance was not reported, so we use a fictitious number: $\hat\sigma^2 = 25$. The area under the curve (Figure \@ref(fig:propLtMean)) is calculated in R as follows (note how the code is written in several lines and indented for better readability):
```{r, message=FALSE}
pnorm( # function that calculates areas under the normal
q = 24, # quantile desired, max. milk production in set
mean = 28.3, # mean of the normal distribution
  sd = 5       # standard deviation of the normal distribution
)
```
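The estimates themselves are easy to compute in R. As a sketch of equations \@ref(eq:math14) and \@ref(eq:math15), suppose we had a random sample of $r = 50$ cows; here the sample is simulated, because we do not have the individual cow data:
```{r, message=FALSE}
set.seed(123)
y.sample <- rnorm(50, mean = 28.3, sd = 5)  # hypothetical sample of 50 cows
mu.hat <- sum(y.sample) / length(y.sample)  # estimated mean, as in eq. math14
sigma2.hat <- sum((y.sample - mu.hat)^2) /
  (length(y.sample) - 1)                    # estimated variance, as in eq. math15
c(mu.hat, sigma2.hat)
c(mean(y.sample), var(y.sample))  # the built-in functions give the same results
```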
<br>
```{r propLtMean, message=FALSE, warning=FALSE, paged.print=FALSE, echo=FALSE, fig.cap="Simplest model for daily milk production per cow. The blue line is the probability density function (pdf) for a normal distribution with mean 28.3 and variance 25. According to this model, the area under the curve between any two values of Y (horizontal axis) represents the proportion of cows that have production between those values. The light blue area is the estimated proportion of cows that produce less than 24 kg per day. This is an estimated proportion because the actual mean and variance of the cow population are not known and were estimated based on a sample."}
library(plotrix)
x = seq(5, 50, length = 400)
y = dnorm(x, mean = 28.3, sd = 5)
plot(x, y, type = "l",
lwd = 2, col = "blue",
xlab = "Y, milk production per cow in kg per day",
ylab = "f(Y), Normal probability density function")
x = seq(0, 24, length = 200)
y = dnorm(x, mean = 28.3, sd = 5)
polygon(c(0, x, 24), c(0, y, 0), col = "lightblue")
cow.prop <- as.character(
round(
pnorm(q = 24, mean = 28.3, sd = 5),
2)
)
text(x = 20.5,
y = 0.01,
labels = cow.prop)
temp <- textbox(x = c(4, 15),
y = 0.05,
cex = 0.9,
textlist =
c("Estimated proportion of cows ",
"that produce less than 24 kg per day"),
box = FALSE)
arrows(x0 = 13.5, y0 = 0.030, x1 = 20, y1 = 0.02,
length = 0.15, lwd = 2)
```
<br>
We used a very simple model for milk production with just two parameters, mean and variance. Mean and variance are called the "fixed" or non-random parts, and the error is the "random" component. When we use statistics and create models, we need to choose what will be represented with fixed parameters and what parts will be represented with random variables or components. Except for the case where we consider sub-sampling, this course will only deal with models that have a single random component and we use the Greek letter $\epsilon$ (epsilon) for it.
<br>
$$ data = Systematic \ Structure + Random \ Variation \\[10pt]
= Fixed \ Component + Random \ Component \\[10pt]
= Fixed \ Component + Structured \ Random + White \ Noise$$
<br>
Models that partition the random component into structured and unstructured (white noise) random parts are beyond the scope of this book.
<br>
```{r mathTab01, echo=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE}
#------dairy data-----
milk <- read.table(header = TRUE, text = "
State Kcows lb.cow.yr
Alabama 7.0 13143
Alaska 0.3 11667
Arizona 196.0 24429
Arkansas 6.0 13167
California 1762.0 22968
Colorado 151.0 25980
Connecticut 19.0 21474
Delaware 5.0 19100
Florida 123.0 20350
Georgia 84.0 21786
Hawaii 2.4 14542
Idaho 595.0 24647
Illinois 94.0 20245
Indiana 184.0 22560
Iowa 213.0 23634
Kansas 146.0 22801
Kentucky 58.0 18069
Louisiana 12.0 14083
Maine 30.0 21000
Maryland 48.0 19917
Massachusetts 12.0 17917
Michigan 419.0 25957
Minnesota 461.0 20967
Mississippi 10.0 14400
Missouri 88.0 15602
Montana 14.0 21071
Nebraska 60.0 23317
Nevada 30.0 22000
NewHampshire 14.0 20286
NewJersey 7.0 17429
NewMexico 315.0 24479
NewYork 620.0 23815
NorthCarolina 46.0 20978
NorthDakota 16.0 21563
Ohio 265.0 20875
Oklahoma 37.0 18703
Oregon 125.0 20744
Pennsylvania 529.0 20454
RhodeIsland 0.8 17500
SouthCarolina 15.0 16667
SouthDakota 115.0 22139
Tennessee 42.0 16571
Texas 475.0 22680
Utah 92.0 22772
Vermont 130.0 20954
Virginia 90.0 19144
Washington 276.0 24094
WestVirginia 9.0 14889
Wisconsin 1279.0 23552
Wyoming 6.0 23300
"
)
milk$milk.tot.Mlb <- c(92.0, 3.5, 4788.0, 79.0, 40469.0, 3923.0, 408.0, 95.5,
2503.0, 1830.0, 34.9, 14665.0, 1903.0, 4151.0, 5034.0,
3329.0, 1048.0, 169.0, 630.0, 956.0, 215.0, 10876.0,
9666.0, 144.0, 1373.0, 295.0, 1399.0, 660.0, 284.0,
122.0, 7711.0, 14765.0, 965.0, 345.0, 5532.0, 692.0,
2593.0, 10820.0, 14.0, 250.0, 2546.0, 696.0, 10773.0,
2095.0, 2724.0, 1723.0, 6650.0, 134.0, 30123.0, 139.8)
library(measurements)
milk$tot.Mkg <- conv_unit(milk$milk.tot.Mlb, "lbs", "kg")
milk$kg.cow.day <- conv_unit(milk$lb.cow.yr, "lbs", "kg") / 365
# milk$kg.cow.d.2 <- with(milk, 1000 * tot.Mkg / Kcows) / 365
milk.a <- milk[1:25, c("State", "kg.cow.day")]
milk.b <- milk[26:50, c("State", "kg.cow.day")]
us.avg <- with(milk, sum(kg.cow.day * Kcows) / sum(Kcows) )
knitr::kable(
list(
milk.a,
milk.b
),
booktabs = TRUE,
row.names = FALSE,
digits = 1,
caption = 'Milk production per lactating dairy cow in the US, by state in 2016. Based on USDA reports.'
)
milk <- milk[, c("State", "Kcows", "tot.Mkg", "kg.cow.day")]
```
<br>
### A slightly more complex model {#mathBttrMod}
The model used above and stated in equation \@ref(eq:math13) can be improved in many ways. For example, we may guess that a lot of the variance in productivity is between states, due to different levels of investment and development of the industry in different states. By adding a term to the fixed part of the model to account for differences among states, we can make much better predictions of the proportion of cows below $15 \ kg \ day^{-1}$ for each state. The model would look like this:
<br>
\begin{equation}
\begin{aligned}
Y_{ij} &= \mu_{i.} + \epsilon_{ij} \\[10pt]
&= \mu_{..} + \tau_{i.} + \epsilon_{ij} \\[10pt]
\epsilon &\sim N(0, \sigma^2)
\end{aligned}
(\#eq:math17)
\end{equation}
<br>
where $\mu_{i.}$ is the mean for state $i$ and $\tau_{i.}$ is called the "effect" of state $i$. In general, when we have observations that are grouped, like cows in states, we can view the model as allowing a separate mean for each group or as each group having an effect on the mean. Both views are equivalent and their relationship is given by the fact that the effect is defined as the difference between each group mean and the overall mean: $\tau_{i.} = \mu_{i.} - \mu_{..}$. The "dots" in the subscript are used as placeholders for the subscript and to indicate that the average (or sum) has been taken over all values of that subscript. Because the number of cows differs among states, the overall mean in this model has to be estimated as the average of the state averages, where each state receives equal weight: $\hat{\mu}_{..} = \sum_{i=1}^{50} \hat{\mu}_{i.} / 50$. This estimated overall mean is `r round(mean(milk$kg.cow.day), 2)` $kg \ day^{-1}$.
### Grouping or Indicator Variables {#mathGroups}
The model above can be written in a different form that is useful for understanding more advanced linear models. This form will be used in the formulation of contrasts, custom comparisons, and linear combinations of parameters, which are presented in the chapter on [treatment structure and comparisons](#chTrtstr). It is presented here to prepare students for that topic and for more advanced studies that require writing down the models in matrix notation. Keep in mind that the use of matrix notation does not have to involve any more complicated concepts than the understanding that we are working with vectors and tables of numbers, instead of single numbers at a time.
Grouping or indicator variables take a value of 0 or 1. We multiply these binary variables by parameters to determine if the parameters are included in the model for a given observation. The trick is that the equation includes all the parameters, but grouping variables are used to apply each parameter to the right rows of data. There are as many grouping variables as there are groups. For example, if an experiment has 3 treatments, there will be 3 grouping variables. Essentially, each variable "marks" each observation that belongs to a group.
Perhaps an analogy will help clarify the concept before we show the equations. Think of a model as a recipe where the parameters are the ingredients and the grouping variables indicate if the ingredient is included in the recipe. Consider a simple recipe that can be made with chicken, tofu or beef, and with mild or spicy chili peppers. The recipe will have a list of amounts and ingredients ^[For this simple example we merged the units with the ingredients. Explicit use of units is always necessary. Measure twice, cut once! One of many costly examples: NASA lost the Mars Climate Orbiter due to confusion between metric and imperial units.]:
<br>
```{r recipe, message=FALSE, warning=FALSE, paged.print=FALSE, echo=FALSE}
library(kableExtra)
amounts <- c(2, 1, 1, 1, 1, 1, 1, 2)
ingredient <- c("egg",
"pinch of salt",
"dash of black pepper",
"cup of diced chicken, tofu, or beef",
"cup of chopped onion",
"tbs of curry paste",
"cup of jalapeños or poblano peppers",
"cup of red cabbage")
knitr::kable(cbind(amounts, ingredient),
align = c("c", "l"),
caption = "Flexible recipe") %>%
kable_styling(full_width = FALSE)
```
<br>
This looks like a single recipe, but it can yield 6 different dishes depending on whether you use beef, tofu or chicken, and jalapeños or poblano peppers. It is a lot simpler and shorter to write the recipe as above than to write 6 different recipes that are almost the same, at least in structure. The same happens with models. In fact, theoretically, any recipe can be expressed with this format if all ingredients are listed separately. With one row per ingredient we can decide how much (possibly none) of each ingredient to add. Let's look at the ingredients in the example. The idea can be extended to any number of ingredients, and it is based on having a list of ingredients, so we know what the quantities refer to.
<br>
```{r recipe2, message=FALSE, warning=FALSE, paged.print=FALSE, echo=FALSE}
hot_chicken <- c(2, 1, 1, 1, 0, 0, 1, 1, 1, 0, 2)
mild_chicken <- c(2, 1, 1, 1, 0, 0, 1, 1, 0, 1, 2)
hot_tofu <- c(2, 1, 1, 0, 1, 0, 1, 1, 1, 0, 2)
mild_tofu <- c(2, 1, 1, 0, 1, 0, 1, 1, 0, 1, 2)
hot_beef <- c(2, 1, 1, 0, 0, 1, 1, 1, 1, 0, 2)
mild_beef <- c(2, 1, 1, 0, 0, 1, 1, 1, 0, 1, 2)
ingredient <- c("eggs",
"pinch of salt",
"dash of black pepper",
"cup of diced chicken",
"cup of diced tofu",
"cup of diced beef",
"cup of chopped onion",
"tbs of curry paste",
"cup of jalapeños",
"cup of poblano peppers",
"cup of red cabbage")
knitr::kable(cbind(ingredient,
hot_chicken,
mild_chicken,
hot_tofu,
mild_tofu,
hot_beef,
mild_beef),
align = c("c", "c", "c", "c", "c", "c", "l"),
caption = "Numerical recipes") %>%
kable_styling(full_width = FALSE)
```
<br>
Now, the table represents each recipe with a set of numbers, and all recipes use the same list of potential ingredients. Note that when we want the chicken, the row for chicken shows a 1 and the rows for tofu and beef **must** show 0's. The same happens with jalapeños and poblano peppers. The amounts for those ingredients are the grouping or indicator variables that indicate which recipe type or group we want.
We can easily show that the statistical models are like recipes ^[In reality, linear statistical models are *almost* like recipes because in statistical models we can have negative values for the amounts of a parameter put into the combination, but in recipes we cannot have negative amounts of ingredients.] by changing the symbols we use. We can simplify things by noting that all recipes we are considering take eggs, salt, pepper, onions, curry paste, and cabbage in the same amounts across recipes. We will refer to all of those ingredients together in the specified quantities as "common" ingredients, and to conform with the usual mathematical notation, we transpose the recipe table and put the dishes in rows and ingredients in columns with shortened names.
<br>
```{r recipe3, message=FALSE, warning=FALSE, paged.print=FALSE, echo=FALSE}
hot_chicken <- c(1, 1, 0, 0, 1, 0)
mild_chicken <- c(1, 1, 0, 0, 0, 1)
hot_tofu <- c(1, 0, 1, 0, 1, 0)
mild_tofu <- c(1, 0, 1, 0, 0, 1)
hot_beef <- c(1, 0, 0, 1, 1, 0)
mild_beef <- c(1, 0, 0, 1, 0, 1)
ingredient <- c("common",
"chicken",
"tofu",
"beef",
"jalap",
"pobla")
curries <- data.frame(rbind(hot_chicken,
mild_chicken,
hot_tofu,
mild_tofu,
hot_beef,
mild_beef))
names(curries) <- ingredient
knitr::kable(curries,
align = c("c", "c", "c", "c", "c", "c", "l"),
caption = "Mathematical recipes") %>%
kable_styling(full_width = FALSE)
```
<br>
For example, the first row of the table indicates that to make the hot chicken version you add all common ingredients plus the chicken plus none of the tofu plus none of the beef, plus the jalapeño plus none of the poblanos. Note that we have three columns for protein type and two for pepper type; there is one column per option or group in each category. This is because any protein option can be combined with any pepper option, but only one protein and only one pepper type is allowed in any one dish. Additionally, whatever dish you make, it must have a protein source and a pepper type. As a consequence of this, if you add the columns for protein or the columns for chili peppers, you always get 1, regardless of the row.
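We can verify that claim in R with the `curries` data frame created for the table above:
```{r, message=FALSE}
# Each dish uses exactly one protein and exactly one pepper type,
# so within each category the indicator columns add up to 1 in every row
rowSums(curries[, c("chicken", "tofu", "beef")])
rowSums(curries[, c("jalap", "pobla")])
```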
The column for chicken is the grouping variable that groups all recipes with chicken protein. Its value is defined as follows:
<br>
\begin{equation}
chicken \ group = G_1 =
\begin{cases}
1 & \quad \text{if dish has chicken, i = 1}\\
0 & \quad \text{otherwise}
\end{cases}
(\#eq:math18)
\end{equation}
<br>
We name that column $G_1$ because it is the first grouping column, i.e., the subscript "1" is for chicken (think of chicken as treatment 1). Analogously, the columns for tofu and beef are defined as
<br>
\begin{equation}
tofu \ group = G_2 =
\begin{cases}
1 & \quad \text{if dish has tofu, i = 2}\\
0 & \quad \text{otherwise}
\end{cases}
(\#eq:math19)
\end{equation}
<br>
\begin{equation}
beef \ group = G_3 =
\begin{cases}
1 & \quad \text{if dish has beef, i = 3}\\
0 & \quad \text{otherwise}
\end{cases}
(\#eq:math20)
\end{equation}
<br>
The same sort of concepts are involved with the choice of pepper. For simplicity, now we consider only the recipes that differ in protein source and assume that all recipes are with poblanos, which are now included in the common ingredients. We can write the recipes as follows:
<br>
$$dish = 1 \times common + G_1 \times chicken + G_2 \times tofu + G_3 \times beef$$
<br>
We can think of this equation/recipe as follows: if we want the chicken version, then $G_1 = 1$, $G_2 = 0$ and $G_3 = 0$. The result tells us to include the common ingredients, plus the chicken, plus none of the tofu, plus none of the beef, which is simply: common ingredients plus chicken. The common ingredients are like an overall mean, and chicken, tofu and beef are the effects of each of the protein choices; think of them as treatments, in which case the equation becomes:
<br>
$$\mu + G_1 \ \tau_1 + G_2 \ \tau_2 + G_3 \ \tau_3 = \mu + \sum_{i=1}^3G_i \ \tau_i$$
<br>
Therefore, the model is like a recipe where the amounts before each ingredient are 1, $G_1$, $G_2$ and $G_3$, and the ingredients are $\mu$, $\tau_1$, $\tau_2$ and $\tau_3$. **The $G$'s are grouping or indicator variables.**
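In R, grouping variables like the $G$'s do not need to be typed by hand: `model.matrix` builds them automatically from a factor. The following sketch uses a made-up vector of protein choices; the `- 1` in the formula removes the intercept so that each column is a plain 0/1 grouping variable, one per group:
```{r, message=FALSE}
protein <- factor(c("chicken", "tofu", "beef", "chicken", "beef"))
# One 0/1 column per protein group; each row has exactly one 1
model.matrix(~ protein - 1)
```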
## Deviations, Errors and Residuals {#mathDevs}
The difference between an observation and the mean is called a *deviation*, *residual* or *error*. The specific name depends on the context, but in general, we will use *deviation* for the total difference between observation and overall average, *residual* for the difference between an observation and the value estimated by a model, and *error* for the random component of a model.
<br>
\begin{equation}
\begin{aligned}
&total \ deviation = Y_{ij} - \mu_{..} &\textrm{difference between observation and mean} \\[15pt]
&residual = e = \hat{\epsilon} = Y_{ij} - \hat{Y_{ij}} &\textrm{estimated error based on sample}\\[15pt]
&error = \epsilon &\textrm{a random variable}
\end{aligned}
(\#eq:math21)
\end{equation}
<br>
The symbol $\hat{Y}_{ij}$ is read "Y hat sub i j" and it represents the value that would be expected for that observation if the experiment were repeated many, many times. Intuitively, it is the best guess that you have for the value that would be observed if a new measurement were done for $Y_{ij}$, based on the data and model. Keep in mind that the subscripts depend on the organization of the data. For the simple model, we only need one subscript because there are no groups.
Unfortunately, the terms *error*, *deviation* and *residual* are frequently used interchangeably in practice, but the good news is that the meaning can be easily derived from the context. **Errors** $\epsilon$ are usually not observable and are unknown, unless we create them in a computer simulation (see [Simulation](#mathSimu) section). The main reason for this is that all measurements include some sort of random component that we cannot control, and we usually do not know all the values in the population of interest; we only have a sample.
Usually, population parameters are unknown and have to be estimated using samples, which is a main focus in statistics. With a model and a sample we can estimate parameter values, such as $\hat{\mu}$, and then estimate the errors by the difference between the model and our observations. Estimated errors are called **residuals** in part because they are the parts of the observations that remain after all other parts accounted for by the model are subtracted. The total **deviation** is the difference between the observed $Y_{ij}$ and the overall mean $\mu_{..}$, typically estimated using the overall sample average $\bar{Y}_{..}$.
In the two models above (equations \@ref(eq:math13) and \@ref(eq:math17)), the equations to calculate the residuals are different, because observations are modeled differently. For the simple model, the best guess or estimate that we have for the production of a randomly selected cow (observation) is the overall average given in equation \@ref(eq:math22). Because we are considering all cows **not** grouped by state, the overall average is equal to the weighted average of state averages, or more simply, the average of all cows, regardless of state:
<br>
\begin{equation}
\hat{Y}_i = \hat{\mu} = \bar{Y}_. \\[10pt]
e = \hat{\epsilon} = Y_i - \hat{Y}_i = Y_i - \hat{\mu} = Y_i - \bar{Y}_. \\[10pt]
= Observation - Fixed \ Component
(\#eq:math22)
\end{equation}
<br>
In the case of the simplest model, the error is equal to the total deviation, because the model does not have any other fixed component beyond the overall mean.
For the better model in equation \@ref(eq:math17), the fixed part of the model includes both an overall mean and a state effect $\tau_{i.}$ where $i$ is the subscript that specifies what state is being considered. The estimation for any given observation includes both components. Thus, the residual is the observed value $Y_{ij}$ minus the estimated overall mean $\hat{\mu}_{..}$ minus the estimated state effect $\hat{\tau}_{i.}$.
<br>
\begin{equation}
\hat{Y}_{ij} = \hat{\mu}_{i.} = \bar{Y}_{i.} = \hat{\mu} + \hat{\tau}_{i.} = \bar{Y}_{..} + \hat{\tau}_{i.}\\[10pt]
e = \hat{\epsilon} = Y_{ij} - \hat{Y}_{ij} = Y_{ij} - \bar{Y}_{i.} = Y_{ij} - \bar{Y}_{..} - \hat{\tau}_{i.} \\[10pt]
(\#eq:math23)
\end{equation}
<br>
Keep in mind that the effect of state will be a negative number for those states that have mean milk production that is lower than the average of all states. We can use the values reported by USDA (2017) to determine the state effects for all states.
The data frame `milk` has the data, including the state name, average number of milking cows in 2016 (in thousands), total milk production per year (in millions of kg) and average production per cow (in kg per day). First, we calculate the overall average to estimate $\mu$. Then, the effect of each state is obtained by subtracting the overall mean from the average for each state. (Residuals cannot be calculated because the data reported is at the state level, not for individual cows.)
<br>
```{r, message=FALSE}
str(milk) # str() shows the structure of the dataframe
(mu.hat <- mean(milk$kg.cow.day)) # mean of milk production
milk$state.effect <- milk$kg.cow.day - mu.hat # subtract the overall mean (mu.hat) from
# the average for each state (milk$kg.cow.day)
```
<br>
```{r, echo = F, message=F}
knitr::kable(
list(
milk[1:25, c("State", "state.effect")],
milk[26:50, c("State", "state.effect")]
),
booktabs = TRUE,
row.names = FALSE,
digits = 1,
caption = 'Effect of state on average milk production per cow per day.'
)
```
<br>
### Simulation: Population is Known {#mathSimu}
A simulation consists of making up a model with known parameters and generating data by taking random samples from the theoretical distribution specified in the model. Then, the data are analyzed as if the model were not known, and the results are compared to the known parameter values. This is an empirical way to check whether the method of analysis leads to good estimates. By repeating the random sampling many times, we can get simulated distributions of the estimated parameters.
Here, we will use a simulation to show how sample results and estimated quantities are different from the "real values." We cannot do this with real data because we typically do not know the true parameter values.
The simulation will create data analogous to the US milk production data. For simplicity, we will use a sample of 100 cows per state. Thus, the weighted and unweighted averages will be the same. We will use the same values as above for state means. Note that here we use the averages reported by USDA as if they were the known means for each state. Because the variance within each state is not known, we use $\sigma = 4$ for all states. The model used to generate the data is:
<br>
\begin{equation}
Y_{ij} = \mu_{i.} + \epsilon_{ij} = \mu_{..} + \tau_{i.} + \epsilon_{ij} \\
\epsilon_{ij} \sim N(0, \ \sigma = 4)
(\#eq:math24)
\end{equation}
<br>
For each state, we simulate 100 observations by generating 100 random draws from a normal distribution with mean 0 and standard deviation 4, and to each draw, we add the state average to simulate the effect of being in that state. Figure \@ref(fig:resdlVsError) shows that residuals are not equal to the true errors, but they are related.
<br>
```{r resdlVsError, fig.cap="Plot of estimated error or residuals vs. the true errors based on a simulation. The true errors come from a normal distribution, whereas the residuals were estimated as the difference between each observation and the corresponding group average. The green line is Y = X.", message=FALSE}
set.seed(39)
milk.sim.dat <- data.frame(State = rep(milk$State, 100),
state.mean = rep(milk$kg.cow.day, 100),
epsilon = rnorm(5000, mean = 0, sd = 4))
# Construct a data frame with a column for
# state names, the mean for each state, and
# a random draw from the normal distribution
# using rnorm() to simulate errors
milk.sim.dat$Y <- with(milk.sim.dat, state.mean + epsilon)
# Add the state mean and the residual to simulate
# random observations
state.avg <- aggregate(Y ~ State,
data = milk.sim.dat,
FUN = mean)
# Obtain the simulated sample averages for each state
names(state.avg)[2] <- "state.average"
milk.sim.dat <- merge(milk.sim.dat, state.avg, by = "State", all = TRUE)
milk.sim.dat$residual = with(milk.sim.dat, Y - state.average)
# Subtract the simulated state averages from each
# simulated observation i
str(milk.sim.dat)
plot(residual ~ epsilon, milk.sim.dat, pch = ".")
abline(0,1, col = "green")
```
<br>
## Optimization {#mathOptim}
One way to "guess" or estimate the mean of a distribution using a sample is to calculate the *average*. The average minimizes the sum of the squared deviations of the observations. Minimization of sums of squares - called "least-squares" - is the most common method we will use to make estimates, but it is not the only method. Another common method used is likelihood maximization, whereby parameters are estimated by choosing the values that maximize the likelihood of observing the data. The good news is that for most of the situations we will study, which involve the use of **linear models**, least-squares and maximum likelihood yield identical results.
In many cases, the optimal value for a parameter can be found analytically, meaning that an equation yields the best estimate directly. Such a formula is obtained by taking the first derivative of the sum of squares (SS) as a function of the estimate, setting it to zero, and solving for the estimate. For some advanced statistical methods no such equations are available, so numerical computations are used to approximate the optimal estimates.
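As a small numerical sketch of the idea, we can search for the value that minimizes the sum of squared deviations for the vector `Y` defined earlier, and confirm that it matches the average:
```{r, message=FALSE}
Y <- c(7, 5, 9, 6)
ss <- function(m) sum((Y - m)^2)           # SS as a function of the candidate estimate
optimize(ss, interval = range(Y))$minimum  # numerical minimizer of the SS
mean(Y)                                    # analytical least-squares solution
```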
## Linear Models
Linear models are models that can be represented by a linear combination of parameters. They are like recipes where the ingredients are the parameters. For example, to make a cup of sweet tea with milk, you combine 1 teaspoon of tea, 1 cup of hot water, 1 tablespoon of honey and 0.25 cups of milk.
<br>
\begin{equation}
1 \ cup \ tea = 1 \ tsp \ tea + 1 \ cup \ hot \ water + 1 \ tbsp \ honey + 0.25 \ cup \ milk \\
Y = X_1 \beta_1 + X_2 \beta_2 + X_3 \beta_3 + X_4 \beta_4
\end{equation}
<br>
What makes the equation *linear in the parameters* is that the parameters represented by the $\beta$'s are not in exponents, powers, quotients or operations other than multiplication and addition. Even if the variables that multiply the parameters are involved in any kind of function, as long as that function does not involve any parameters, the equation is linear *in the parameters*. For example, the following equations are linear in the parameters represented by the $\beta$'s.
<br>
\begin{equation}
Y = X_1^2 \beta_1 + log(X_2) \beta_2 + e^{X_3} \beta_3 + X_4 \beta_4 \\
Y = X_1^0 \beta_0 + X_1^1 \beta_1 + X_1^2 \beta_2 + X_1^3 \beta_3
\end{equation}
<br>
The second equation is called a **polynomial** because it only involves powers of the single variable $X_1$. Because the highest power of X is 3, the polynomial is said to be of *3rd degree* or a *cubic* polynomial. Polynomials of 2nd degree are called *quadratic*. All polynomials are linear combinations of parameters. Typical polynomials of order or degree 0 to 3 are plotted in Figure \@ref(fig:mathPolys).
<br>
```{r mathPolys, message=FALSE, warning=FALSE, paged.print=FALSE, echo=FALSE, fig.cap="Types of shapes that can be produced by polynomials of first (black), second (blue) and third (orange) degree. Note that all have ranges where they are almost straight lines. The orientation, location and range of each shape can be changed by changing the values of the parameters of the polynomial."}
plot((-20:20)/2, poly((-20:20)/2),
ylab = "Y", xlab = "X",
type = "l", col = "black",
lwd = 1.5)
lines((-20:20)/2,
poly((-20:20)/2, degree = 2)[, 2],
type = "l", col = "blue",
lwd = 1.5)
lines((-20:20)/2,
poly((-20:20)/2, degree = 3)[, 3],
type = "l", col = "orange",
lwd = 1.5)
```
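Because polynomials are linear in their parameters, they can be fit by least squares with `lm`. The following sketch uses made-up data generated from a known cubic plus noise, so the estimated coefficients should be close to the true values 1, 0.5, -0.2 and 0.05:
```{r, message=FALSE}
set.seed(7)
x <- (-20:20) / 2
y <- 1 + 0.5 * x - 0.2 * x^2 + 0.05 * x^3 + rnorm(length(x), sd = 1)
coef(lm(y ~ x + I(x^2) + I(x^3)))  # estimates of the four parameters
```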
<br>
Linear combinations are used throughout this book, and understanding them shows where the basic equations we use come from. For example, we refer to linear combinations to derive the statistical properties of the sample average, and again to determine if two averages are significantly different [(see Chapter 8)](#ch.2pops).
<br>
```{block, type = "stattip"}
The shape of polynomials can be modified by changing their parameter values. The basic shapes shown in the figure can be stretched, tilted and flipped. For example, if the parameter for $X^2$ is negative in a second-degree polynomial, the "smile" blue shape will turn into a "frown."
```
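A quick sketch of that sign flip:
```{r, message=FALSE}
curve(x^2, from = -5, to = 5, ylim = c(-25, 25),
      col = "blue", ylab = "Y", xlab = "X")  # "smile": positive coefficient on X^2
curve(-x^2, from = -5, to = 5,
      col = "orange", add = TRUE)            # "frown": negative coefficient
```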
<br>
<br>
## Symbols and Terms{#mathSymbls}
<br>

Term or symbol | Explanation
------------| ----------------------------------------------------------
Observation | Result of measuring one or several variables in one unit of experimental material. For example, the milk production of one cow in a day.
Expectation | The mean of a random variable. For example, the expectation of the number of dots when rolling a cubic die is 3.5.
Variance | The expected value of the squared total deviation of a random variable. For a die, the variance is $35/12 \approx 2.92$.
Mean | Expectation.
Average | Sum of sample values divided by sample size. For example, the average of a sample of 4 rolls of a die {5, 5, 5, 4} is (5 + 5 + 5 + 4)/4 = 4.75. The average is usually used to estimate the mean.
Sample Variance | Estimated variance based on a sample. The sum of squared deviations of the observations from the average, divided by the sample size minus the number of other parameters estimated. In the simplest case, the only other estimated parameter is the mean, so the divisor is $n - 1$.
Model | Set of equations that describes the components of each observation.
White noise | Random variation with an unstructured normal distribution: independent errors with normal distributions and equal variance.