---
title: "Practical R for MBM"
author: "Chuck Frost"
date: "7/10/2020"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Practical R for MBM field data manipulation and analysis
The purpose of this ongoing short course is to familiarize MBM personnel with the concepts of tidy data, basic data manipulation, and basic programming techniques as implemented in R. It is expected that participants have a working understanding of R and RStudio. We will not go into depth on other types of programming, specifically statistical programming and computer programming. The topics covered here will foster appropriate treatment of MBM data throughout the data life cycle and lead to cleaner and more useful data in the future.
## Topics
1. R, RStudio, packages, help, intro to functions
2. Loading and saving, data types and structures
3. Tidy Data
4. Visualization
5. Working with dates
6. Creating functions, conditional statements, looping
## 1.0.1 What is R?
An open-source programming environment that serves as:
1. A giant calculator
2. An interface for data analysis and visualization
3. An interface for data manipulation and file handling
4. An interface for a simple and efficient programming language (S, which evolved into R)
Note that R is a tool, not the "method." For example: "We used R to estimate population size." Yuck.
## 1.0.2 What is RStudio?
An integrated development environment (IDE) for R.
Provides tools, menus, help, and easy access to the best parts of R.
## 1.1.1 Packages
Base R automatically provides you with many common functions.
```{r eval=TRUE}
mean(c(1,2,3))
sum(c(1,2,3))
var(c(1,2,3))
```
Other useful functions exist thanks to R users. Collections of related functions get wrapped into packages.
If you're looking for something NOT in base R, you need to install the related package.
CRAN (the Comprehensive R Archive Network) provides packages that have been tested and approved by the R community. Check the Packages tab in RStudio for some you already have installed but (probably) not loaded. Just check the box to load one. If you know what package you need, there are 2 easy ways to install it. The first is by clicking the Install button in the Packages tab, searching CRAN for the package you want, and installing it. The second is:
```{r eval=FALSE}
install.packages("PackageName")
```
Don't forget to load the package after installation, either by checking the box under packages, or by calling:
```{r eval=FALSE}
library(PackageName)
```
Note that install.packages() takes the package name as a quoted string, "PackageName",
while library() takes the package object itself without quotes, PackageName.
## 1.1.2 Installing from GitHub
If a package isn't uploaded to CRAN, it can still be made available through GitHub. Packages installed from GitHub require an intermediate step:
```{r eval=FALSE}
install.packages("devtools")
library(devtools)
install_github("USFWS/AKaerial", ref = "development")
library(AKaerial)
```
This code does 4 things:
1. Installs the devtools package
2. Loads the devtools package
3. Calls install_github (from devtools), which looks for the GitHub account "USFWS" and repository "AKaerial" and installs the package from the "development" branch
4. Loads the AKaerial package
Installing and loading AKaerial may take a while since it includes dependencies. Dependencies are other packages that a package requires in order to operate appropriately. AKaerial "depends" on several of them, and R will check whether you have them installed.
## 1.1.3 Functions
Once you have a package loaded, you can access functions written and included in that package. Typing the name of a function will spit out the code that makes up the function. This can be messy for a long or complicated function. Typing ?FunctionName will access the help file for a given function.
```{r eval=FALSE}
AdjustCounts
?AdjustCounts
```
This can be extremely helpful (if the help file is!) when troubleshooting why your code isn't working as you want it to. Note that the top of the help file also tells you FunctionName {PackageName} in case you can't figure out where your function is coming from. This could be the case if you have dozens of packages loaded or have borrowed someone else's code to help run your analysis. Don't underestimate the power of internet searches to find if something already exists to help you do your thing!
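A few other help-related commands are worth knowing (all built into base R); these are interactive, so run them in the console:

```{r eval=FALSE}
?mean                     #help page for a function you know by name
??"standard deviation"    #fuzzy search across all installed help files
help(package = "stats")   #index of everything a package exports
example(mean)             #run the examples from a function's help page
```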
We will talk about functions in more depth later, but for now, the general idea is that functions take arguments, run processes, and (usually) return products saved as R objects.
```{r eval=TRUE}
numbers = c(1,2,3)
avg = mean(numbers)
avg
```
In this example, numbers is created as a vector of 3 numbers: 1, 2, 3. We then run the function mean with numbers as its argument and save the result as the object avg.
It can be helpful to think of functions as recipes and arguments as ingredients.
```{r eval=TRUE}
#Mix combines the ingredients into dough
Mix = function(what.to.mix){
  dough = paste(what.to.mix, collapse = " + ")
  return(dough)
}
#Bake turns dough and a baking time into cookies
Bake = function(what.to.bake, time.to.bake){
  cookies = paste(what.to.bake, "baked for", time.to.bake, "minutes")
  return(cookies)
}
my.ingredients = c("sugar", "eggs", "butter", "flour")
my.dough = Mix(my.ingredients)
my.cookies = Bake(my.dough, 10)
my.cookies
```
## 2.1.0 Loading Files
There are many ways to read data into R. We will go over the most common way data is stored and used in R: as a data frame. A data frame is a 2-dimensional array (table) where each column represents a variable and each row represents a unique observation. Let's start by creating your own data frame. This might be useful if you have a small data set.
```{r eval=TRUE}
small.data = data.frame(
name = c("Laura", "Zak", "Chuck"),
size = c("small", "medium", "large"),
tenure = c(6, "drifter",9)
)
```
We can now check the structure of the data frame. This is a great first QA/QC check! If we know that tenure should be numeric, the structure should confirm it.
```{r eval=TRUE}
str(small.data)
```
Notice how all 3 variables are shown as Factor, which is incorrect. (Note: in R 4.0 and later, data.frame() no longer converts strings to factors by default, so you may see chr instead.) Factors, in general, are tough to work with. If we try to change a value here (use $ to grab a column and brackets [ ] to select a row, or use brackets alone to select [row, column]), R generates a warning and inserts NA.
```{r eval=TRUE}
small.data$tenure[2] = 3
small.data[2,3] = 3
levels(small.data$tenure)
```
Since R couldn't determine the structure of our tenure column, it defaulted to calling it a Factor. The unique set of values gets set as the levels of that factor, and the column can then only take values from that set. If you try to assign a value outside of that set, you get a warning and the entry becomes NA. This is (almost) never what you want. We can fix this in 2 ways:
First, we can use the "as." functions in base R to force R to evaluate something as a different data type. In this case, we can't just evaluate the factor as numeric directly, since that returns the integer index of each level rather than the value itself. What a mess. We have to first convert the factor to a character string using as.character, then to a numeric using as.numeric. This is called wrapping functions: passing the output of one function directly as the argument to another, without saving the intermediate result. In the outer-most call to as.numeric, anything that R can't represent as numeric becomes NA. This is already messy and makes changes to your data hard to trace!
```{r eval=TRUE}
small.data$tenure=as.numeric(as.character(small.data$tenure))
str(small.data)
```
Or, since we are scripting our analysis, we can change it in our original code chunk and just re-run. Score 1 for a nice scripted workflow...but imagine doing this through thousands or hundreds of thousands of rows of data.
```{r eval=TRUE}
small.data = data.frame(
name = c("Laura", "Zak", "Chuck"),
size = c("small", "medium", "large"),
tenure = c(6,3,9)
)
```
While we are at it, let's just purge the factors from our data entirely. It is easy to make something a factor later if we ever see the need. To do this, we just pass the argument stringsAsFactors = FALSE to our data frame function. Note that we have to add a comma after the declaration of our final column tenure.
```{r eval=TRUE}
small.data = data.frame(
name = c("Laura", "Zak", "Chuck"),
size = c("small", "medium", "large"),
tenure = c(6,3,9),
stringsAsFactors = FALSE
)
str(small.data)
```
Now we are clean...maybe. Our structure says our name and size columns are chr (character) and our tenure correctly shows num (numeric). Now we want to add a new column (greatness) using the values in an existing column. We can both create and define the values of the new column in one step.
```{r eval=TRUE}
small.data$greatness = small.data$tenure^3
str(small.data)
small.data
```
Now we have our 4 columns, and the structure even gives us a preview of the values. Now what if we want to add another row of data? The function rbind (row bind) makes it easy, as long as our new row matches the order of the columns in our data frame. One caution: c() can only hold one data type, so c("Hannah", "small", 1, 1) is a character vector, and binding it coerces our numeric columns to character (check the str() output below).
```{r eval=TRUE}
small.data = rbind(small.data, c("Hannah", "small", 1, 1))
str(small.data)
small.data
```
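To avoid that coercion, you can bind a one-row data frame instead of a bare vector, so each column keeps its own type. A quick sketch, rebuilding the frame from scratch so the numeric columns start clean:

```{r eval=TRUE}
#rebuild the original 3-row frame
clean.data = data.frame(
  name = c("Laura", "Zak", "Chuck"),
  size = c("small", "medium", "large"),
  tenure = c(6, 3, 9),
  stringsAsFactors = FALSE
)
clean.data$greatness = clean.data$tenure^3
#bind a one-row data frame instead of a character vector
new.row = data.frame(name = "Hannah", size = "small", tenure = 1, greatness = 1,
                     stringsAsFactors = FALSE)
clean.data = rbind(clean.data, new.row)
str(clean.data)  #tenure and greatness stay numeric this time
```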
As you can imagine, the same thing can be done for columns using cbind. Let's add another one:
```{r eval=TRUE}
small.data = cbind(small.data, interests=c("Bedazzling", "K-Pop", "Couponing", "Oranges"))
small.data
```
## 2.2.0 Reading in a data file
If you have a data file stored (hopefully) as .csv or .txt, base R has a function that will read your data into a data frame for you. We will demo this with a .csv of fake aerial survey data. Remember that R works best (and sometimes only!) with tidy data. Your file should at least be rectangular (no loose columns, no long or dangling rows) and preferably will have column headings, though those can also be defined later. The function read.table is the most generic and robust method, but there is also read.csv, which saves a little time since the formatting is predefined.
Plain text is the simplest way to store data. One step up are delimited files. The most common types are tab- and comma-delimited; comma-delimited files are usually saved as .csv (comma separated values). R does not care how your files are delimited; you just have to know the delimiting character(s) and pass them as arguments in the function call.
```{r eval=FALSE}
my.data = read.table("SomeData.txt", header = TRUE, sep = " ")
```
This tells R to find the file SomeData.txt (it will default to your working directory, but you can path it anywhere). It will open it, read the header row into the data frame my.data, then scan the file for spaces (sep = " ") and add entries between spaces to the values in each row. What could possibly go wrong with space-delimiting? Or even comma-delimiting?
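Here is one answer, sketched with a throwaway temporary file (the data are made up): a space inside a value creates an extra field, and instead of failing, read.table quietly shifts the first field into row names.

```{r eval=TRUE}
tmp = tempfile(fileext = ".txt")
writeLines(c("species count", "black scoter 4"), tmp)
#the space in "black scoter" creates a third field; because the header now has
#one fewer entry than the data, read.table treats the first field as row names
read.table(tmp, header = TRUE, sep = " ")
```

The result looks plausible at a glance, but the species name has been silently split. This is one reason comma-delimited files (with quoting for text fields) are usually the safer choice.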
To illustrate further concepts below we will be using an artificial data set I modified for this demo. It is located on the [GitHub repository](https://github.com/cfrost3/MBM_R_Short_Course) for this short course. To read it in, we will use read.csv, which tells R to expect a comma-delimited file. Oh, and we should also get rid of those nasty factors for now.
```{r eval=TRUE}
aerial=read.csv(
"https://raw.githubusercontent.com/cfrost3/MBM_R_Short_Course/master/BLSC_2018_RawObs_Fake.csv",
header=TRUE,
stringsAsFactors=FALSE)
str(aerial)
```
Hey! We have real (fake) data! Anything stand out immediately in the structure of the file?
## 2.3.1 Data Types
R has 4 basic data types you will use (6 in total, actually, including raw and complex):
1. Integer (whole numbers; written 1L, 2L, 3L in code)
2. Numeric (2, 4, 2.4, 45.48)
3. Character ("cat", "DOG", "15d", "and so on")
4. Logical (TRUE, FALSE)
## 2.3.2 Data Structures
Depending on how you combine data types, you can end up with one of several data structures:
1. Vector - a collection of all one data type
2. Matrix - multidimensional collection of vectors
3. Factor - a nominal set of unique values (levels) and a vector of integer indices
4. List - open format to contain data structure elements
5. Data frame - our most common, discussed above
There are many other data structures in R, including user-defined and package-defined structures. We will mostly stick to the common ones.
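Quick constructors for each of the common structures, so you can see them in the console:

```{r eval=TRUE}
v = c(1, 2, 3)                          #vector: one data type only
m = matrix(1:6, nrow = 2)               #matrix: 2 rows x 3 columns
f = factor(c("low", "high", "low"))     #factor: 2 levels, 3 integer indices
l = list(numbers = v, table = m, anything = "even text")  #list: holds any mix
str(l)
```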
## 2.3.3 Useful Data Structure Functions
Base R comes with many quick ways to assess your data structures. We will go into more depth later, but for now, explore commands such as:
```{r eval=TRUE}
a = c(1,2,3,4,10.1)
length(a) #how long is the vector?
mean(a) #mean value
var(a) #variance
typeof(a) #what data type? (double means floating decimal numeric)
is.na(a) #which entries are NA (missing)?
summary(a) #some basic summarizing statistics
max(a) #maximum value
min(a) #minimum value
dim(small.data) #what are the dimensions of a structure?
class(small.data) #what class of structure is it?
names(small.data) #what are the names of the objects that make up the structure?
unique(small.data$size) #what are all of the unique values in a column?
```
## 3.0.0 Tidy Data
Tidy data refers to a data set that satisfies 3 conditions:
1. Every column is 1 variable
2. Every row is 1 observation
3. Every cell is 1 value
In R, tidy data lends itself seamlessly to the vectorized functions we discussed above (and thousands of others). In addition to summarizing functions such as length(), mean(), max(), min(), etc., plotting functions in R sometimes work exclusively with tidy data, and almost always work more cleanly and efficiently with it.
It is important to note here that quality controlled data (typos removed, missing data correctly recorded, etc.) and tidy data are 2 separate and important characteristics.
Messy data (data that is not tidy) is often the result of database design by a data collector, rather than an analyst or data manager. This is not meant to place blame. A data collector has one primary purpose: translate the raw observations in the field into some sort of tabular archive. That archive will generally take the form that was easiest and fastest for the collector to enter (a very reasonable result, since it is often preferable to get data collected and archived quickly). A data analyst similarly has one primary purpose: to analyze data. While the analyst would also like to work as quickly as possible, messy data can make analysis lengthy, difficult, costly, and sometimes impossible. In MBM, for example, we have easily spent over 90% of our time cleaning data instead of analyzing it (and the general consensus among analysts across fields is 75-80% of time spent on cleaning).
## 3.1.0 Common Problems
There are 2 common problems with data that we run into frequently in MBM:
1. Using values as column names
2. Multiple variables in one column
## 3.1.1 Values as Column Names
Consider the following data frame:
```{r eval=TRUE}
messy1 = data.frame(Species = rep(c("COGO", "BOOW"), each = 3),
Box=c(1:6),
Visit1 = c(5,6,6,4,5,4),
Visit2 = c(5,6,5,4,5,7),
Visit3 = c(0,6,5,4,5,0),
Visit4 = c(NA,0,0,1,0,NA),
Visit5 = c(NA,NA,NA,0,NA,NA),
stringsAsFactors = FALSE
)
messy1
```
Imagine they are visits to nests of 2 different species, and the counts of the eggs found on each visit. Why is this messy? It looks clean enough, and it was easy enough for the data collector. But notice how columns 3-7 are labeled? This is a prime example of using values of a variable (in this case, visit number) as column headers. It becomes awkward for nests that don't share the same number of visits. As an analyst, how can I script a quick summary of the number of times a nest was visited? Maybe by counting the number of cells in a row with a value greater than 0?
```{r eval=TRUE}
#give me the total of row 1 values greater than 0
v = sum(messy1[1,]>0)
v
```
Ouch. Oh yeah, the NA values. A sum containing an NA is always NA. So let's remove those.
```{r eval=TRUE}
#give me the total of row 1 values greater than 0, and skip the NAs
v = sum(messy1[1,]>0, na.rm = TRUE)
v
```
Ok, looks like it worked. But not when we spot check it. We forgot that Box will (probably) always be greater than 0. So let's tack on a modifier.
```{r eval=TRUE}
#give the total of row 1 values > 0, skip the NAs, subtract 1 for Box
v = sum(messy1[1,]>0, na.rm = TRUE)-1
v
```
Well, we now show 3 visits, but only by accident: the character Species column also compares greater than 0 (TRUE, since comparing text to a number compares them as strings), which happens to cancel out Visit3's count of 0 failing the > 0 test. What could possibly go wrong with this in a larger script? Or imagine an ACTUAL data set, where we have dozens of columns. This is the kind of thing that would drive an analyst bonkers AND would be a nightmare to fix after it has been done for years on a large data set. We won't even attempt to reshape this using base R commands. Lucky for us, we have the tidyverse package.
Tidyverse is actually a collection of packages developed by data scientists, for data scientists. When you install tidyverse, you get them all. It streamlines importing, cleaning, tidying, visualizing, analyzing, and reporting on any size data set. Most (if not all) of the functionality available in the tidyverse is also possible in base R, but often with significantly more (and generally more confusing) code. We can't possibly describe it all, but we will hit some of the main functions.
First, let's install the tidyverse package. This may take a while depending on how many of the packages you already have. We will also reload our fake aerial survey data to play with.
```{r eval=FALSE}
install.packages("tidyverse")
```
```{r eval=TRUE, message=FALSE}
library(tidyverse)
aerial=read.csv(
"https://raw.githubusercontent.com/cfrost3/MBM_R_Short_Course/master/BLSC_2018_RawObs_Fake.csv",
header=TRUE,
stringsAsFactors=FALSE)
```
If you recall, we used the $ operator to reference a column in a data frame. Tidyverse provides a cleaner syntax with the select function.
```{r eval=TRUE}
#grab Transect, Species, and Num only from aerial
aerial.subset = select(aerial, Transect, Species, Num)
head(aerial.subset) #only show me the first few rows
```
Quick, easy, and intuitive. Now let's only keep observations of 3 or more.
```{r eval=TRUE}
aerial.subset2 = filter(aerial.subset, Num > 2) #give me observations greater than 2
head(aerial.subset2)
```
These are useful, but notice how we keep saving our results under new names with an increasing number on the end? This will get messy and out of control fast as we add more filters, subsets, and summaries. To combat this, the tidyverse (specifically the magrittr package) has the pipe operator (%>%). The pipe chains together a series of functions, feeding the result of each step into the next. It makes for a cleaner sequence that you can read and track later. For example, the sequence below subsets our columns and filters to bigger observations as we did above, but also removes the START and ENDPT observations, then counts the number of observations by species that met all of the conditions.
```{r eval=TRUE}
obs.by.species = aerial %>%
#grab Transect, Species, and Num only from aerial
select(Transect, Species, Num) %>%
#Num greater than 2, but remove START and ENDPT observations
filter(Num > 2 & !(Species %in% c("START", "ENDPT")) ) %>%
aggregate(Num~Species, .,FUN=length)
names(obs.by.species)[2]="N.obs"
obs.by.species
```
We have a lot going on there. Note the operator %in%, which returns TRUE for values that appear within a given set of other values. In this case, I wrap it in !, which can be read as "not." So that line becomes "filter down to observations greater than 2 and where Species is not START or ENDPT."
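A quick standalone look at how %in% and ! behave on a small character vector:

```{r eval=TRUE}
x = c("BLSC", "START", "COGO", "ENDPT")
x %in% c("START", "ENDPT")       #TRUE where x matches the set
!(x %in% c("START", "ENDPT"))    #flipped: TRUE for the values we want to keep
x[!(x %in% c("START", "ENDPT"))] #keeps "BLSC" and "COGO"
```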
Now that we have the pipe in our toolbox, let's revisit the actual messy data problem.
```{r eval=TRUE}
messy1
```
We have 6 nest boxes that were visited up to 5 times and we just want a summary of the number of visits and the average number of eggs in the nest each visit. We have incorrectly recorded values of our variable Visit as column headers. Let's fix it.
```{r eval=TRUE}
tidy1 = messy1 %>%
pivot_longer(c("Visit1", "Visit2", "Visit3", "Visit4", "Visit5"),
names_to = "Visit",
values_to = "eggs")
tidy1
```
This result is...ehhhh. It looks ok, but we can do better.
```{r eval=TRUE}
tidy1 = messy1 %>%
pivot_longer(
cols = starts_with("Visit"),
names_to = "Visit",
names_prefix = "Visit",
values_to = "Eggs",
values_drop_na = TRUE
)
head(tidy1,10)
```
We took messy1 and applied pivot_longer, which will translate our column values into expanded row values. We further parameterized it by asking for all columns that started with Visit (cols = starts_with("Visit")), and removed the prefix Visit (names_prefix = "Visit") and sent the remainder to a new Visit column (names_to = "Visit"). We then took the row values (counts of eggs, though you can't tell from the data) and sent them to the appropriate new Species-Box-Visit row combination and labelled the new column Eggs (values_to = "Eggs"). Finally, we removed the NAs. These were causing problems above and since they were really just empty cells and not visits, they are better off gone. Now we can cleanly summarize our data!
```{r eval=TRUE}
vis = tidy1 %>%
aggregate(Visit~Species + Box, ., FUN=max)
names(vis)[3] = "Num.Visits"
vis
avg.egg = tidy1 %>%
filter(Visit == 1) %>%
aggregate(Eggs~Species, ., FUN=mean)
names(avg.egg)[2] = "Avg.Eggs"
avg.egg
#Or using just tidyverse
tidy1 %>%
filter(Visit == 1) %>%
group_by(Species) %>%
summarize(Avg.Eggs=mean(Eggs))
```
## 3.1.2 Multiple Variables in One Column
Another common problem in MBM data sets is combining the values of multiple variables into a single column. Sometimes this is what we really thought was the best way to store values, and other times we make excuses about the limitations of our data collection programs or devices to store things certain ways. The results are messy data either way, and this serves to increase the rate of decay of the actual information we can extract from our data. Consider the following (fake) data:
```{r eval=TRUE}
messy2 = data.frame(monthday=paste(rep(6,10), c(1:10), sep="/"), Grade=rep(c("A", "F"), 5))
messy2
```
In this simple example, we have monthday, which is the month and day separated by a slash, and a fictitious grade received on that day. We should really have month and day as their own columns. Let's separate.
```{r eval=TRUE}
messy2 %>%
separate(monthday, into=c("Month", "Day"), sep="/")
```
That was suspiciously easy...so let's try it with our aerial data. The column Species sounds like it should be the species. But it isn't. It is actually 3 values: species, behavior, and distance. But that isn't the end of it. The behavior was only recorded if the bird was flying. Pretty messy. So let's tidy it.
```{r eval=TRUE}
unique(aerial$Species)
aerial.tidy = aerial %>%
filter(!(Species %in% c("START", "ENDPT"))) %>% #remove start and end points
select(Lat, Long, Species, Num) %>% #filter to just 4 columns for illustration
separate(Species, into=c("Species", "Distance"), sep = -1) %>% #take away the rightmost string value
separate(Species, into=c("Species", "Behavior"), sep = 4) #take away the first 4 string values
head(aerial.tidy)
unique(aerial.tidy$Species)
```
An initial look at the Species column showed that the Distance was always the last value in the string. So we were able to tell separate that we need to pull that one out.
```{r eval=FALSE}
separate(Species, into=c("Species", "Distance"), sep = -1)
```
Then after pulling that one out, we saw that the species code was always the first 4 characters in the string, so we were able to tell separate to always give us those.
```{r eval=FALSE}
separate(Species, into=c("Species", "Behavior"), sep = 4)
```
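The sep argument is positional here: a positive number counts characters from the left, and a negative number counts from the right. A toy illustration with made-up codes (these are not real survey values):

```{r eval=TRUE}
library(tidyverse)
toy = data.frame(code = c("BLSCF2", "COGOW3"), stringsAsFactors = FALSE)
toy %>%
  separate(code, into = c("code", "Distance"), sep = -1) %>%  #peel off the last character
  separate(code, into = c("Species", "Behavior"), sep = 4)    #split after the 4th character
```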
With the pipe, we did it all in one command, and we can move on to bigger and better things, such as visualization!
## 4.0.0 Tables and Figures
The first step in any data manipulation, analysis, QA/QC process, or almost any other data-related activity should be to visualize your data. This will be the best way to catch any potential inconsistencies or errors before you embed them deep in an analysis or report. Base R provides many ways to visualize data quickly and easily, but contributed packages such as DT, kableExtra, ggplot2, and leaflet expand on base R to provide publication-quality tables and figures. We will start with base R tables.
## 4.1.1 Base R Tables
Let's make sure we have our aerial data loaded and tidy.
```{r eval=TRUE}
library(tidyverse)
aerial=read.csv(
"https://raw.githubusercontent.com/cfrost3/MBM_R_Short_Course/master/BLSC_2018_RawObs_Fake.csv",
header=TRUE,
stringsAsFactors=FALSE)
aerial.tidy = aerial %>%
filter(!(Species %in% c("START", "ENDPT"))) %>% #remove start and end points
select(Lat, Long, Species, Num) %>% #filter to just 4 columns for illustration
separate(Species, into=c("Species", "Distance"), sep = -1) %>% #take away the rightmost string value
separate(Species, into=c("Species", "Behavior"), sep = 4) #take away the first 4 string values
head(aerial.tidy)
table(aerial.tidy$Species,aerial.tidy$Distance)
```
The format is table(rows, columns), and we can even add multidimensionality:
```{r eval=TRUE}
table(aerial.tidy$Species,aerial.tidy$Distance, aerial.tidy$Behavior)
```
One important thing to remember with table in base R is that the default cell value is simply a count of the number of times the two values occurred together (NOT the total number of things we counted by species in this example). Always keep track of what you are visualizing. You can also easily calculate the proportion of observations in a table.
```{r eval=TRUE}
prop.table(table(aerial.tidy$Species,aerial.tidy$Distance))
```
This proportion table tells us that our most frequent observation (roughly 28% of the time) was BLSC in distance bin 3. Notice how we had to wrap the table() function with prop.table()? That is because while table() expects (rows, columns) format, prop.table expects a table() as input. We can even wrap the whole thing in a max() to verify our 28%.
```{r eval=TRUE}
max(prop.table(table(aerial.tidy$Species,aerial.tidy$Distance)))
```
One thing we would certainly want to know from our aerial data collection is the number (Num) of each species we counted. We can get that number (albeit inefficiently) using tables.
```{r eval=TRUE}
table(aerial.tidy$Species,aerial.tidy$Num)
str(aerial.tidy)
```
Well, that won't work. It seems we've identified a QA/QC issue. We could've caught this by checking the structure of our data frame as well (see how Num shows as chr?). Let's pluck those 2 offenders out and fix them. We will find them and fix them in 2 different ways for illustration only. One useful function for this is which(). It will return the index or indices (similar to the row number in Excel) of the row(s) of the observation(s) that fit a condition. In our case, we want to know where the d and t are hanging out.
```{r eval=TRUE}
which(aerial.tidy$Num %in% c("d", "t")) #what positions are the d and t in?
aerial.tidy$Num[427] #verify
aerial.tidy[427,] #show me that whole row 427 and all of its columns
aerial.tidy[c(427,438),] #show me both row 427 and 438 and all of their columns
```
Let's imagine that we had our observer go back through the recordings (because both the observer and recordings still exist), and they found that d should be 2 and t should be 1. Since we know the indices of the offenders, we can just overwrite them.
```{r eval=TRUE}
aerial.tidy$Num[427] = 2
```
Alternatively, we can grab the index and do something to it all at once:
```{r eval=TRUE}
aerial.tidy$Num[aerial.tidy$Num == "t"] = 1
#find any values of Num == "t" and replace with 1
```
If we table it again, we can see that we have at least removed the d and t observations. But the structure shows Num is still stored as character instead of numeric.
```{r eval=TRUE}
table(aerial.tidy$Species,aerial.tidy$Num)
str(aerial.tidy)
```
To fix that, we can just tell aerial.tidy to reconsider that column as numeric. If we get any warnings or errors, we can assume there were other offenders in there. In our case, there weren't. Now when we table it, the columns even reorder correctly.
```{r eval=TRUE}
aerial.tidy$Num = as.numeric(aerial.tidy$Num)
str(aerial.tidy)
table(aerial.tidy$Species,aerial.tidy$Num)
```
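As an aside, here is a minimal sketch (on a toy vector, not our actual data) of what that coercion warning looks like when stray characters do remain:

```{r eval=TRUE}
x = c("2", "1", "d")        #a toy Num column with one leftover character
as.numeric(x)               #the "d" becomes NA with a coercion warning
sum(is.na(as.numeric(x)))   #counting the NAs flags any remaining offenders
```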
Now if we want a basic summary of our observations, we can use margin.table to sum the table across either or both dimensions. In other words, a sum by rows, a sum by columns, or a grand total (keeping in mind our unit is the observation and NOT the number of things seen). Note that margin.table, like prop.table, requires a table object as its argument.
```{r eval=TRUE}
#The 1 specifies the first (row) dimension (how many times did we observe BLSC?)
margin.table(table(aerial.tidy$Species,aerial.tidy$Num), 1)
#The 2 specifies the second (column) dimension (how many times did we see only 1 of something?)
margin.table(table(aerial.tidy$Species,aerial.tidy$Num), 2)
#No dimension argument gives us the sum over both rows and columns (how many total observations were there?)
margin.table(table(aerial.tidy$Species,aerial.tidy$Num))
```
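If you prefer base arithmetic, the same margin sums can be checked against rowSums(), colSums(), and sum() on a toy table:

```{r eval=TRUE}
tab = table(c("a", "a", "b"), c(1, 2, 1))  #toy 2-dimensional table
margin.table(tab, 1)  #same result as rowSums(tab)
margin.table(tab, 2)  #same result as colSums(tab)
margin.table(tab)     #same result as sum(tab)
```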
## 4.1.2 Tables in Contributed Packages
We will demonstrate table functions from 2 contributed (user-developed) packages, DT and kableExtra. DT stands for data table and is best suited to visualizing large data sets in a clean, searchable, sortable, filterable way. kableExtra is a handy package for creating tables that use html styling to render in either html or pdf reports. We need to make sure we have both installed and loaded (along with htmltools and zoo, which we'll use below).
```{r eval=FALSE}
install.packages(c("DT", "kableExtra", "htmltools", "zoo"))
```
```{r eval=TRUE, message=FALSE}
library(DT)
library(kableExtra)
library(AKaerial)
library(tidyverse)
library(htmltools)
library(zoo)
```
For the first example, we will load a table of real aerial index estimates from the AKaerial package and use some of our tidyverse functions to trim it down for a specific table using datatable from the package DT.
```{r eval=TRUE}
ACPHistoric$combined %>%
filter(Species=="SPEI") %>%
select(Year, itotal, itotal.se) %>%
datatable()
```
That's a solid first effort, but we can do better. Our itotal column really shouldn't allow decimal amounts of birds, and the standard error could use some trimming. I also don't particularly like seeing the row numbers in there. And the column names could be better. And we need a title, maybe shrink the width, center that middle column, and show me all of the entries at once since there are only 13.
```{r eval=TRUE}
ACPHistoric$combined %>%
filter(Species=="SPEI") %>%
select(Year, itotal, itotal.se) %>%
mutate_at("itotal", ~round(., 0)) %>%
mutate_at("itotal.se", ~round(., 2)) %>%
datatable(rownames=FALSE, #cut out those row numbers showing
fillContainer = TRUE, #auto-size the column widths
colnames=c("Year","Indicated Total","SE"), #change column names
#set the caption obtusely using html styling
caption = htmltools::tags$caption( style = 'caption-side: top;
text-align: center;
color:black;
font-size:100% ;',
'Table 1: Indicated Total SPEI on the ACP, 2007-2019.'),
options=list(
autoWidth=TRUE,
#center the column indexed 1 (starts with 0)
columnDefs = list(list(className = 'dt-center', targets = 1)),
pageLength=length(.[,1]), #display all of the data points
dom="")) #take away the search and page functionality
```
Now that's a table! It renders a little small in this pdf, though, since html styling doesn't jibe well with pdf output. It should look great in your console or in an html report. We used a new tidyverse function, mutate_at, which applies a function to selected columns only, referenced by name or position. In this case, we applied the function round() to the piped data frame and told it to use 0 and 2 decimal places, respectively. The rest of the changes were implemented as arguments to datatable().
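Here is a small sketch of mutate_at on its own, using a toy data frame (dplyr comes along with tidyverse, which we loaded above):

```{r eval=TRUE}
library(dplyr)  #already loaded via tidyverse above
toy = data.frame(est = c(12.345, 67.891), se = c(1.2345, 6.7891))
toy.rounded = toy %>%
  mutate_at("est", ~round(., 0)) %>%  #round only the est column to whole numbers
  mutate_at("se", ~round(., 2))       #round only the se column to 2 decimals
toy.rounded
```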
Now we will implement a similar table using kableExtra.
```{r eval=TRUE}
YKGHistoric$combined %>%
filter(Species=="BRAN") %>%
select(Year, itotal, itotal.se) %>%
kable()
```
That might be good enough for some journals...but not for me. Let's switch species to EMGO (which has a harvest-management threshold we can color against), get our rounding going, change those column names, and even add a 3-year and 10-year rolling average (from the zoo package). Eh...and a caption, a footnote, and a color scheme, just to give it that "next-level" feel.
```{r eval=FALSE}
YKGHistoric$combined %>%
filter(Species=="EMGO") %>%
select(Year, itotal, itotal.se) %>%
#rollapply in zoo rolls through a vector and applies a function to the segments
mutate(avg3=rollapply(itotal,3,mean,align='right',fill=NA)) %>%
mutate(avg10=rollapply(itotal,10,mean,align='right',fill=NA)) %>%
mutate_at(c("itotal", "avg3", "avg10"), ~round(., 0)) %>%
mutate_at("itotal.se", ~round(., 2)) %>%
#apply our conditional population objective coloring
mutate_at("itotal", function(x) {
cell_spec(x, "html", color = ifelse(x > 26000, "green", "red"))
}) %>%
kable(format="html",
escape = F, #html scheme requirement to make the color statement work
#now we paste in the footnote denotation on the indicated total column
col.names = c("Year", "Indicated Total*",
"SE", "3-year Avg", "10-year Avg"),
#and add the top caption
caption = "Table 2: Indicated Total EMGO on the YK-Delta, 1985-2019, including 3- and 10-year Averages.") %>%
#adding the footnote is done in its own function so we pipe the kable to it
footnote(symbol= "Indicated Total is used by the AMBCC to regulate harvest. Values of 26000 and above result in an open harvest.") %>%
#this tells us to use the default bordered style
kable_styling("bordered",
full_width=FALSE,
font_size = 14)
```
```{r eval=TRUE, echo=FALSE, message=FALSE}
YKGHistoric$combined %>%
filter(Species=="EMGO") %>%
select(Year, itotal, itotal.se) %>%
#rollapply in zoo rolls through a vector and applies a function to the segments
mutate(avg3=rollapply(itotal,3,mean,align='right',fill=NA)) %>%
mutate(avg10=rollapply(itotal,10,mean,align='right',fill=NA)) %>%
mutate_at(c("itotal", "avg3", "avg10"), ~round(., 0)) %>%
mutate_at("itotal.se", ~round(., 2)) %>%
#apply our conditional population objective coloring
mutate_at("itotal", function(x) {
cell_spec(x, "html", color = ifelse(x > 26000, "green", "red"))
}) %>%
kable(format="html",
escape = F, #html scheme requirement to make the color statement work
#now we paste in the footnote denotation on the indicated total column
col.names = c("Year", "Indicated Total*",
"SE", "3-year Avg", "10-year Avg"),
#and add the top caption
caption = "Table 2: Indicated Total EMGO on the YK-Delta, 1985-2019, including 3- and 10-year Averages.") %>%
#adding the footnote is done in its own function so we pipe the kable to it
footnote(symbol= "Indicated Total is used by the AMBCC to regulate harvest. Values of 26000 and above result in an open harvest.") %>%
#this tells us to use the default bordered style
kable_styling("bordered",
full_width=FALSE,
font_size = 14) %>%
as_image(file="table2.png")
```
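To see what rollapply is doing in isolation, here's a toy 3-point rolling mean (zoo was loaded above):

```{r eval=TRUE}
library(zoo)  #already loaded above
v = c(10, 20, 30, 40, 50)
v3 = rollapply(v, 3, mean, align = "right", fill = NA)
v3  #NA NA 20 30 40 -- the first 2 positions can't fill a 3-point window
```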
The result could easily be hung on the nearest refrigerator (sorry, the styling will look funny as a pdf). The majority of the work in creating a nice table is getting the formatting just right for your target. It will take some practice to figure out where the option is that changes each of your style elements. Don't forget that a package probably already exists to do exactly whatever manipulation you are dreaming of doing to your data. Search engines are powerful tools.
## 4.2.1 Plotting in Base R
Base R comes loaded with relatively robust plotting capability. Just like with tables, though, the R community identified many features that were either missing, or difficult to implement, and expanded the R universe with contributed plotting content. We will start in base R with the basics.
We will revisit the aerial.tidy data frame we created earlier. Note that we aren't applying any pipes just yet, since these visualizations usually show 1 or 2 contrasts and nothing else, unlike a table, where we want to control exactly which data are (and aren't) displayed.
```{r eval=TRUE}
hist(aerial.tidy$Num)
```
The default x and y axis labels and main title leave us feeling a little empty. Let's fix that.
```{r eval=TRUE}
hist(aerial.tidy$Num,
xlab = "Number observed",
ylab = "Frequency of observation",
main = "Observations by size and frequency")
```
Now let's make a barplot. It looks similar to the histogram, but a histogram only works with numeric vectors; a barplot can also handle character and factor data once converted to a table.
```{r eval=TRUE}
counts=table(aerial$Obs_Type)
barplot(counts,
xlab="Observation Type",
ylab="Frequency",
main="Frequency of Observation Types")
```
Hmmm...looks like more QA/QC problems. They seem pretty intuitive to fix except for one. The second case of "pair" on the x axis is actually "pair " (with a space on the end). This is a common problem with character data and one that R doesn't automatically know how to treat. We would have identified it using a unique(aerial$Obs_Type) call. So let's fix these.
```{r eval=TRUE}
aerial$Obs_Type[aerial$Obs_Type %in% c("2pair", "pair ")] = "pair"
aerial$Obs_Type[aerial$Obs_Type == "shingle"] = "single"
unique(aerial$Obs_Type)
```
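For whitespace problems in general, base R's trimws() strips leading and trailing spaces, which would have caught "pair " without hand-listing it. A sketch on toy data:

```{r eval=TRUE}
obs = c("pair", "pair ", " single")
unique(obs)          #three "distinct" values because of stray spaces
unique(trimws(obs))  #trimming collapses the whitespace variants
```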
Ok, those look reasonable now. Again with the barplot.
```{r eval=TRUE}
counts=table(aerial$Obs_Type)
barplot(counts,
xlab="Observation Type",
ylab="Frequency",
main="Frequency of Observation Types")
```
Looks better. We can even plot 2-dimensional tables as stacked barplots.
```{r eval=TRUE}
counts=table(aerial.tidy$Distance, aerial.tidy$Species)
barplot(counts,
xlab="Species",
ylab="Frequency",
main="Frequency of Distance Bins by Species",
legend=rownames(counts),
col=c("red","white","yellow","blue"))