<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
<title>Liz McConnell</title>
<link>/</link>
<atom:link href="/index.xml" rel="self" type="application/rss+xml" />
<description>Liz McConnell</description>
<generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><copyright>Liz McConnell © 2020</copyright><lastBuildDate>Sat, 01 Jun 2030 13:00:00 +0000</lastBuildDate>
<image>
<url>/images/icon_hu06d963f22cc5d003786a3f474238bf81_14484_512x512_fill_lanczos_center_2.png</url>
<title>Liz McConnell</title>
<link>/</link>
</image>
<item>
<title>Example Page 1</title>
<link>/courses/example/example1/</link>
<pubDate>Sun, 05 May 2019 00:00:00 +0100</pubDate>
<guid>/courses/example/example1/</guid>
<description><p>In this tutorial, I&rsquo;ll share my top 10 tips for getting started with Academic:</p>
<h2 id="tip-1">Tip 1</h2>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis posuere tellus ac convallis placerat. Proin tincidunt magna sed ex sollicitudin condimentum. Sed ac faucibus dolor, scelerisque sollicitudin nisi. Cras purus urna, suscipit quis sapien eu, pulvinar tempor diam. Quisque risus orci, mollis id ante sit amet, gravida egestas nisl. Sed ac tempus magna. Proin in dui enim. Donec condimentum, sem id dapibus fringilla, tellus enim condimentum arcu, nec volutpat est felis vel metus. Vestibulum sit amet erat at nulla eleifend gravida.</p>
<p>Nullam vel molestie justo. Curabitur vitae efficitur leo. In hac habitasse platea dictumst. Sed pulvinar mauris dui, eget varius purus congue ac. Nulla euismod, lorem vel elementum dapibus, nunc justo porta mi, sed tempus est est vel tellus. Nam et enim eleifend, laoreet sem sit amet, elementum sem. Morbi ut leo congue, maximus velit ut, finibus arcu. In et libero cursus, rutrum risus non, molestie leo. Nullam congue quam et volutpat malesuada. Sed risus tortor, pulvinar et dictum nec, sodales non mi. Phasellus lacinia commodo laoreet. Nam mollis, erat in feugiat consectetur, purus eros egestas tellus, in auctor urna odio at nibh. Mauris imperdiet nisi ac magna convallis, at rhoncus ligula cursus.</p>
<p>Cras aliquam rhoncus ipsum, in hendrerit nunc mattis vitae. Duis vitae efficitur metus, ac tempus leo. Cras nec fringilla lacus. Quisque sit amet risus at ipsum pharetra commodo. Sed aliquam mauris at consequat eleifend. Praesent porta, augue sed viverra bibendum, neque ante euismod ante, in vehicula justo lorem ac eros. Suspendisse augue libero, venenatis eget tincidunt ut, malesuada at lorem. Donec vitae bibendum arcu. Aenean maximus nulla non pretium iaculis. Quisque imperdiet, nulla in pulvinar aliquet, velit quam ultrices quam, sit amet fringilla leo sem vel nunc. Mauris in lacinia lacus.</p>
<p>Suspendisse a tincidunt lacus. Curabitur at urna sagittis, dictum ante sit amet, euismod magna. Sed rutrum massa id tortor commodo, vitae elementum turpis tempus. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean purus turpis, venenatis a ullamcorper nec, tincidunt et massa. Integer posuere quam rutrum arcu vehicula imperdiet. Mauris ullamcorper quam vitae purus congue, quis euismod magna eleifend. Vestibulum semper vel augue eget tincidunt. Fusce eget justo sodales, dapibus odio eu, ultrices lorem. Duis condimentum lorem id eros commodo, in facilisis mauris scelerisque. Morbi sed auctor leo. Nullam volutpat a lacus quis pharetra. Nulla congue rutrum magna a ornare.</p>
<p>Aliquam in turpis accumsan, malesuada nibh ut, hendrerit justo. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Quisque sed erat nec justo posuere suscipit. Donec ut efficitur arcu, in malesuada neque. Nunc dignissim nisl massa, id vulputate nunc pretium nec. Quisque eget urna in risus suscipit ultricies. Pellentesque odio odio, tincidunt in eleifend sed, posuere a diam. Nam gravida nisl convallis semper elementum. Morbi vitae felis faucibus, vulputate orci placerat, aliquet nisi. Aliquam erat volutpat. Maecenas sagittis pulvinar purus, sed porta quam laoreet at.</p>
<h2 id="tip-2">Tip 2</h2>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis posuere tellus ac convallis placerat. Proin tincidunt magna sed ex sollicitudin condimentum. Sed ac faucibus dolor, scelerisque sollicitudin nisi. Cras purus urna, suscipit quis sapien eu, pulvinar tempor diam. Quisque risus orci, mollis id ante sit amet, gravida egestas nisl. Sed ac tempus magna. Proin in dui enim. Donec condimentum, sem id dapibus fringilla, tellus enim condimentum arcu, nec volutpat est felis vel metus. Vestibulum sit amet erat at nulla eleifend gravida.</p>
<p>Nullam vel molestie justo. Curabitur vitae efficitur leo. In hac habitasse platea dictumst. Sed pulvinar mauris dui, eget varius purus congue ac. Nulla euismod, lorem vel elementum dapibus, nunc justo porta mi, sed tempus est est vel tellus. Nam et enim eleifend, laoreet sem sit amet, elementum sem. Morbi ut leo congue, maximus velit ut, finibus arcu. In et libero cursus, rutrum risus non, molestie leo. Nullam congue quam et volutpat malesuada. Sed risus tortor, pulvinar et dictum nec, sodales non mi. Phasellus lacinia commodo laoreet. Nam mollis, erat in feugiat consectetur, purus eros egestas tellus, in auctor urna odio at nibh. Mauris imperdiet nisi ac magna convallis, at rhoncus ligula cursus.</p>
<p>Cras aliquam rhoncus ipsum, in hendrerit nunc mattis vitae. Duis vitae efficitur metus, ac tempus leo. Cras nec fringilla lacus. Quisque sit amet risus at ipsum pharetra commodo. Sed aliquam mauris at consequat eleifend. Praesent porta, augue sed viverra bibendum, neque ante euismod ante, in vehicula justo lorem ac eros. Suspendisse augue libero, venenatis eget tincidunt ut, malesuada at lorem. Donec vitae bibendum arcu. Aenean maximus nulla non pretium iaculis. Quisque imperdiet, nulla in pulvinar aliquet, velit quam ultrices quam, sit amet fringilla leo sem vel nunc. Mauris in lacinia lacus.</p>
<p>Suspendisse a tincidunt lacus. Curabitur at urna sagittis, dictum ante sit amet, euismod magna. Sed rutrum massa id tortor commodo, vitae elementum turpis tempus. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean purus turpis, venenatis a ullamcorper nec, tincidunt et massa. Integer posuere quam rutrum arcu vehicula imperdiet. Mauris ullamcorper quam vitae purus congue, quis euismod magna eleifend. Vestibulum semper vel augue eget tincidunt. Fusce eget justo sodales, dapibus odio eu, ultrices lorem. Duis condimentum lorem id eros commodo, in facilisis mauris scelerisque. Morbi sed auctor leo. Nullam volutpat a lacus quis pharetra. Nulla congue rutrum magna a ornare.</p>
<p>Aliquam in turpis accumsan, malesuada nibh ut, hendrerit justo. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Quisque sed erat nec justo posuere suscipit. Donec ut efficitur arcu, in malesuada neque. Nunc dignissim nisl massa, id vulputate nunc pretium nec. Quisque eget urna in risus suscipit ultricies. Pellentesque odio odio, tincidunt in eleifend sed, posuere a diam. Nam gravida nisl convallis semper elementum. Morbi vitae felis faucibus, vulputate orci placerat, aliquet nisi. Aliquam erat volutpat. Maecenas sagittis pulvinar purus, sed porta quam laoreet at.</p>
</description>
</item>
<item>
<title>Example Page 2</title>
<link>/courses/example/example2/</link>
<pubDate>Sun, 05 May 2019 00:00:00 +0100</pubDate>
<guid>/courses/example/example2/</guid>
<description><p>Here are some more tips for getting started with Academic:</p>
<h2 id="tip-3">Tip 3</h2>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis posuere tellus ac convallis placerat. Proin tincidunt magna sed ex sollicitudin condimentum. Sed ac faucibus dolor, scelerisque sollicitudin nisi. Cras purus urna, suscipit quis sapien eu, pulvinar tempor diam. Quisque risus orci, mollis id ante sit amet, gravida egestas nisl. Sed ac tempus magna. Proin in dui enim. Donec condimentum, sem id dapibus fringilla, tellus enim condimentum arcu, nec volutpat est felis vel metus. Vestibulum sit amet erat at nulla eleifend gravida.</p>
<p>Nullam vel molestie justo. Curabitur vitae efficitur leo. In hac habitasse platea dictumst. Sed pulvinar mauris dui, eget varius purus congue ac. Nulla euismod, lorem vel elementum dapibus, nunc justo porta mi, sed tempus est est vel tellus. Nam et enim eleifend, laoreet sem sit amet, elementum sem. Morbi ut leo congue, maximus velit ut, finibus arcu. In et libero cursus, rutrum risus non, molestie leo. Nullam congue quam et volutpat malesuada. Sed risus tortor, pulvinar et dictum nec, sodales non mi. Phasellus lacinia commodo laoreet. Nam mollis, erat in feugiat consectetur, purus eros egestas tellus, in auctor urna odio at nibh. Mauris imperdiet nisi ac magna convallis, at rhoncus ligula cursus.</p>
<p>Cras aliquam rhoncus ipsum, in hendrerit nunc mattis vitae. Duis vitae efficitur metus, ac tempus leo. Cras nec fringilla lacus. Quisque sit amet risus at ipsum pharetra commodo. Sed aliquam mauris at consequat eleifend. Praesent porta, augue sed viverra bibendum, neque ante euismod ante, in vehicula justo lorem ac eros. Suspendisse augue libero, venenatis eget tincidunt ut, malesuada at lorem. Donec vitae bibendum arcu. Aenean maximus nulla non pretium iaculis. Quisque imperdiet, nulla in pulvinar aliquet, velit quam ultrices quam, sit amet fringilla leo sem vel nunc. Mauris in lacinia lacus.</p>
<p>Suspendisse a tincidunt lacus. Curabitur at urna sagittis, dictum ante sit amet, euismod magna. Sed rutrum massa id tortor commodo, vitae elementum turpis tempus. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean purus turpis, venenatis a ullamcorper nec, tincidunt et massa. Integer posuere quam rutrum arcu vehicula imperdiet. Mauris ullamcorper quam vitae purus congue, quis euismod magna eleifend. Vestibulum semper vel augue eget tincidunt. Fusce eget justo sodales, dapibus odio eu, ultrices lorem. Duis condimentum lorem id eros commodo, in facilisis mauris scelerisque. Morbi sed auctor leo. Nullam volutpat a lacus quis pharetra. Nulla congue rutrum magna a ornare.</p>
<p>Aliquam in turpis accumsan, malesuada nibh ut, hendrerit justo. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Quisque sed erat nec justo posuere suscipit. Donec ut efficitur arcu, in malesuada neque. Nunc dignissim nisl massa, id vulputate nunc pretium nec. Quisque eget urna in risus suscipit ultricies. Pellentesque odio odio, tincidunt in eleifend sed, posuere a diam. Nam gravida nisl convallis semper elementum. Morbi vitae felis faucibus, vulputate orci placerat, aliquet nisi. Aliquam erat volutpat. Maecenas sagittis pulvinar purus, sed porta quam laoreet at.</p>
<h2 id="tip-4">Tip 4</h2>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis posuere tellus ac convallis placerat. Proin tincidunt magna sed ex sollicitudin condimentum. Sed ac faucibus dolor, scelerisque sollicitudin nisi. Cras purus urna, suscipit quis sapien eu, pulvinar tempor diam. Quisque risus orci, mollis id ante sit amet, gravida egestas nisl. Sed ac tempus magna. Proin in dui enim. Donec condimentum, sem id dapibus fringilla, tellus enim condimentum arcu, nec volutpat est felis vel metus. Vestibulum sit amet erat at nulla eleifend gravida.</p>
<p>Nullam vel molestie justo. Curabitur vitae efficitur leo. In hac habitasse platea dictumst. Sed pulvinar mauris dui, eget varius purus congue ac. Nulla euismod, lorem vel elementum dapibus, nunc justo porta mi, sed tempus est est vel tellus. Nam et enim eleifend, laoreet sem sit amet, elementum sem. Morbi ut leo congue, maximus velit ut, finibus arcu. In et libero cursus, rutrum risus non, molestie leo. Nullam congue quam et volutpat malesuada. Sed risus tortor, pulvinar et dictum nec, sodales non mi. Phasellus lacinia commodo laoreet. Nam mollis, erat in feugiat consectetur, purus eros egestas tellus, in auctor urna odio at nibh. Mauris imperdiet nisi ac magna convallis, at rhoncus ligula cursus.</p>
<p>Cras aliquam rhoncus ipsum, in hendrerit nunc mattis vitae. Duis vitae efficitur metus, ac tempus leo. Cras nec fringilla lacus. Quisque sit amet risus at ipsum pharetra commodo. Sed aliquam mauris at consequat eleifend. Praesent porta, augue sed viverra bibendum, neque ante euismod ante, in vehicula justo lorem ac eros. Suspendisse augue libero, venenatis eget tincidunt ut, malesuada at lorem. Donec vitae bibendum arcu. Aenean maximus nulla non pretium iaculis. Quisque imperdiet, nulla in pulvinar aliquet, velit quam ultrices quam, sit amet fringilla leo sem vel nunc. Mauris in lacinia lacus.</p>
<p>Suspendisse a tincidunt lacus. Curabitur at urna sagittis, dictum ante sit amet, euismod magna. Sed rutrum massa id tortor commodo, vitae elementum turpis tempus. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean purus turpis, venenatis a ullamcorper nec, tincidunt et massa. Integer posuere quam rutrum arcu vehicula imperdiet. Mauris ullamcorper quam vitae purus congue, quis euismod magna eleifend. Vestibulum semper vel augue eget tincidunt. Fusce eget justo sodales, dapibus odio eu, ultrices lorem. Duis condimentum lorem id eros commodo, in facilisis mauris scelerisque. Morbi sed auctor leo. Nullam volutpat a lacus quis pharetra. Nulla congue rutrum magna a ornare.</p>
<p>Aliquam in turpis accumsan, malesuada nibh ut, hendrerit justo. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Quisque sed erat nec justo posuere suscipit. Donec ut efficitur arcu, in malesuada neque. Nunc dignissim nisl massa, id vulputate nunc pretium nec. Quisque eget urna in risus suscipit ultricies. Pellentesque odio odio, tincidunt in eleifend sed, posuere a diam. Nam gravida nisl convallis semper elementum. Morbi vitae felis faucibus, vulputate orci placerat, aliquet nisi. Aliquam erat volutpat. Maecenas sagittis pulvinar purus, sed porta quam laoreet at.</p>
</description>
</item>
<item>
<title>Example Talk</title>
<link>/talk/example/</link>
<pubDate>Sat, 01 Jun 2030 13:00:00 +0000</pubDate>
<guid>/talk/example/</guid>
<description><div class="alert alert-note">
<div>
Click on the <strong>Slides</strong> button above to view the built-in slides feature.
</div>
</div>
<p>Slides can be added in a few ways:</p>
<ul>
<li><strong>Create</strong> slides using Academic&rsquo;s
<a href="https://sourcethemes.com/academic/docs/managing-content/#create-slides" target="_blank" rel="noopener"><em>Slides</em></a> feature and link using <code>slides</code> parameter in the front matter of the talk file</li>
<li><strong>Upload</strong> an existing slide deck to <code>static/</code> and link using <code>url_slides</code> parameter in the front matter of the talk file</li>
<li><strong>Embed</strong> your slides (e.g. Google Slides) or presentation video on this page using
<a href="https://sourcethemes.com/academic/docs/writing-markdown-latex/" target="_blank" rel="noopener">shortcodes</a>.</li>
</ul>
<p>Further talk details can easily be added to this page using <em>Markdown</em> and $\rm \LaTeX$ math code.</p>
</description>
</item>
<item>
<title>Joining by Nearby Dates</title>
<link>/post/fuzzyjoin/</link>
<pubDate>Sun, 25 Oct 2020 00:00:00 +0000</pubDate>
<guid>/post/fuzzyjoin/</guid>
<description>
<p>I’ve been wrestling with ideas for how to standardize sampling dates for a while now. The data sets I’m working with have fairly regular sampling intervals, but sometimes they aren’t sampled on the same day - and this drives me crazy. In terms of real-world sampling plans it makes total sense - weekends, rainy days, multi-day sampling events - but in terms of cleaning and working with the data it’s a pain. Here I’ve made up some fake data that shows the problem I’m having, then used {fuzzyjoin} to make it work. In the process I realized I could probably do approximately the same thing using dplyr. It’s a snowy day in Colorado, so let’s get to it.</p>
<pre class="r"><code>library(tidyverse)
library(lubridate)
well &lt;- t(data.frame(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4)) #making some fake wells
date &lt;- seq.Date(ymd(&quot;1996-01-01&quot;), ymd(&quot;1996-12-01&quot;), by = &quot;quarter&quot;) #making some fake dates
date &lt;- rep(date, times = 4)
date &lt;- date + sample(-7:7,16,replace=T)
concentration &lt;- sample(1:20,16,replace=T) #making some fake concentrations
df_a &lt;- cbind.data.frame(well, date, concentration) #first fake dataframe
df_a &lt;- remove_rownames(df_a)
date &lt;- date + sample(-7:7,16,replace=T)
concentration &lt;- sample(1:20,16,replace=T)
df_b &lt;- cbind.data.frame(well, date, concentration) #second fake dataframe
df_b &lt;- remove_rownames(df_b)
df_a #one set of samples</code></pre>
<pre><code>## well date concentration
## 1 1 1996-01-07 13
## 2 1 1996-03-27 7
## 3 1 1996-07-02 20
## 4 1 1996-09-28 17
## 5 2 1995-12-29 5
## 6 2 1996-03-30 12
## 7 2 1996-06-28 19
## 8 2 1996-10-05 14
## 9 3 1995-12-30 19
## 10 3 1996-04-05 13
## 11 3 1996-07-04 8
## 12 3 1996-10-03 9
## 13 4 1996-01-03 11
## 14 4 1996-04-07 11
## 15 4 1996-06-30 4
## 16 4 1996-09-29 5</code></pre>
<pre class="r"><code>df_b #another set of samples</code></pre>
<pre><code>## well date concentration
## 1 1 1996-01-06 7
## 2 1 1996-04-03 7
## 3 1 1996-06-30 4
## 4 1 1996-09-29 15
## 5 2 1995-12-30 1
## 6 2 1996-03-31 20
## 7 2 1996-07-02 16
## 8 2 1996-09-29 15
## 9 3 1995-12-24 10
## 10 3 1996-04-10 12
## 11 3 1996-07-03 12
## 12 3 1996-10-08 7
## 13 4 1996-01-08 11
## 14 4 1996-04-10 9
## 15 4 1996-06-30 10
## 16 4 1996-10-01 11</code></pre>
<pre class="r"><code>df_ab &lt;- rbind(df_a, df_b)
ggplot(df_ab, aes(x=date, y=concentration, color=well)) + geom_point(alpha=0.8)</code></pre>
<p><img src="/post/fuzzyjoin/index_files/figure-html/unnamed-chunk-1-1.png" width="672" />
Do you see the problem?!?! I can’t make the neat dataframe with both observations that I want. If I join on exact dates I lose a bunch of data. After some googling I found a package called {fuzzyjoin} that allows you to join columns within a certain tolerance, so they don’t have to match up exactly. Great - that’s what I want. Unfortunately, in this case I want to join by two columns, <code>well</code> and <code>date</code>, but the <code>max_dist</code> tolerance applies to both columns. Since I just made the wells 1-4, any tolerance of more than four days would also treat every pair of wells as a match. So I fuzzy joined first by date, then filtered for the observations where well.x matches well.y. After some cleaning up, it looks pretty good.</p>
<pre class="r"><code>library(fuzzyjoin)
df_c &lt;- difference_inner_join(df_a, df_b, by = &quot;date&quot;, max_dist= 14) #join dates within two weeks
df_c &lt;- df_c %&gt;%
filter(well.x == well.y) %&gt;% #keep only joined observations from the same well
rename(&quot;PCE&quot;=concentration.x, &quot;TCE&quot;=concentration.y, &quot;well&quot;=well.x) %&gt;% #change some names
mutate(date = round_date(date.x, unit = &quot;month&quot;)) %&gt;% #round to nearest month
select(-c(&quot;well.y&quot;, &quot;date.x&quot;, &quot;date.y&quot;)) #remove duplicate well name
head(df_c)</code></pre>
<pre><code>## well PCE TCE date
## 1 1 13 7 1996-01-01
## 2 1 7 7 1996-04-01
## 3 1 20 4 1996-07-01
## 4 1 17 15 1996-10-01
## 5 2 5 1 1996-01-01
## 6 2 12 20 1996-04-01</code></pre>
<pre class="r"><code>df_c %&gt;% pivot_longer(cols = c(PCE, TCE)) %&gt;%
ggplot(aes(x=date, y=value, color=well)) + geom_point(alpha=0.5)</code></pre>
<p><img src="/post/fuzzyjoin/index_files/figure-html/unnamed-chunk-2-1.png" width="672" />
When I did the <code>round_date</code> thing above I thought, well, couldn’t I just have rounded the dates first, then joined on exact date afterwards? So I did just that and got the same results. It wouldn’t always give the same results, though - for example, if a well was sampled around the middle of the month, one observation could be rounded up while the other is rounded down. The {fuzzyjoin} method joins on the distance between the dates, not on the rounded date. In most cases it won’t matter, but there’s a small distinction.</p>
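<p>To make that edge case concrete, here’s a toy example with two mid-month dates (a quick sketch using the same {lubridate} functions as above):</p>
<pre class="r"><code>d1 &lt;- ymd(&quot;1996-04-14&quot;)
d2 &lt;- ymd(&quot;1996-04-17&quot;) #only 3 days apart, well within a 14-day tolerance
round_date(d1, unit = &quot;month&quot;) #1996-04-01, rounds down
round_date(d2, unit = &quot;month&quot;) #1996-05-01, rounds up - an exact join on rounded dates misses this pair</code></pre>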
<pre class="r"><code>df_a &lt;- mutate(df_a, date = round_date(date, unit = &quot;month&quot;))
df_b &lt;- mutate(df_b, date = round_date(date, unit = &quot;month&quot;))
#df_c &lt;- inner_join(df_a, df_b, by = date)
# ^this causes an error: Error: `by` must be a (named) character vector, list, or NULL, not a `Date` object.
# (quoting the name - by = &quot;date&quot; - is what dplyr expects, and works even on Date columns)</code></pre>
<pre class="r"><code>df_a$date &lt;- as.double(df_a$date)
df_b$date &lt;- as.double(df_b$date)
df_c &lt;- inner_join(df_a, df_b, by = &quot;date&quot;)
df_c &lt;- df_c %&gt;%
filter(well.x==well.y) %&gt;%
rename(&quot;PCE&quot;=concentration.x, &quot;TCE&quot;=concentration.y, &quot;well&quot;=well.x) %&gt;%
select(-c(&quot;well.y&quot;)) #remove duplicate well name
df_c &lt;- df_c %&gt;%
mutate(date = as.Date(date, origin = &quot;1970-01-01&quot;))
head(df_c)</code></pre>
<pre><code>## well date PCE TCE
## 1 1 1996-01-01 13 7
## 2 1 1996-04-01 7 7
## 3 1 1996-07-01 20 4
## 4 1 1996-10-01 17 15
## 5 2 1996-01-01 5 1
## 6 2 1996-04-01 12 20</code></pre>
<pre class="r"><code>df_c %&gt;% pivot_longer(cols = c(PCE, TCE)) %&gt;%
ggplot(aes(x=date, y=value, color=well)) + geom_point(alpha=0.5)</code></pre>
<p><img src="/post/fuzzyjoin/index_files/figure-html/unnamed-chunk-4-1.png" width="672" />
Now it’s time to apply this to a real dataset and see what problems I’ll run into…because they’re out there…lurking…spooky!</p>
</description>
</item>
<item>
<title>Threatened Plants #TidyTuesday</title>
<link>/post/plant_threats/</link>
<pubDate>Tue, 18 Aug 2020 00:00:00 +0000</pubDate>
<guid>/post/plant_threats/</guid>
<description>
<p>It’s Tidy Tuesday!! This week the focus is threatened and extinct plants. Certainly an issue worth thinking about. I’m a bit of a plant person, but really who isn’t? Let’s dive in!</p>
<pre class="r"><code>#libraries
library(tidyverse)
library(tidytuesdayR)
library(skimr)
library(tidytext)
#load data
tuesdata &lt;- tidytuesdayR::tt_load(2020, week = 34)</code></pre>
<pre><code>##
## Downloading file 1 of 3: `plants.csv`
## Downloading file 2 of 3: `threats.csv`
## Downloading file 3 of 3: `actions.csv`</code></pre>
<pre class="r"><code>plants &lt;- tuesdata$plants
threats &lt;- tuesdata$threats
actions &lt;- tuesdata$actions
threat_filtered &lt;- threats %&gt;%
filter(threatened == 1)
action_filtered &lt;- actions %&gt;%
filter(action_taken == 1)
threat_filtered %&gt;%
count(continent, group, threat_type) %&gt;%
ggplot(aes(y = tidytext::reorder_within(threat_type, n, continent), x = n, fill = group)) +
geom_col() +
tidytext::scale_y_reordered() +
facet_wrap(~continent, scales = &quot;free&quot;, ncol = 2)</code></pre>
<p><img src="/post/plant_threats/index_files/figure-html/setup-1.png" width="672" /></p>
<pre class="r"><code>action_filtered %&gt;%
count(continent, group, action_type) %&gt;%
ggplot(aes(y = tidytext::reorder_within(action_type, n, continent), x = n, fill = group)) +
geom_col() +
tidytext::scale_y_reordered() +
facet_wrap(~continent, scales = &quot;free&quot;, ncol = 2)</code></pre>
<p><img src="/post/plant_threats/index_files/figure-html/unnamed-chunk-1-1.png" width="672" />
Some interesting data. I wonder: could we see improvement in the status of a plant after an intervention? And are some threats easy to mitigate while others can’t be stopped?</p>
<p>The <code>year_last_seen</code> column is separated into 20-year chunks and the <code>red_list_category</code> has two options, “Extinct” or “Extinct in the Wild”. I guess a successful re-introduction would be represented by a change from “Extinct in the Wild” to coming off the list entirely…but let’s see if that’s actually what’s in the data.</p>
<pre class="r"><code>actions_change &lt;- action_filtered %&gt;%
select(c(binomial_name, year_last_seen, red_list_category, action_type)) %&gt;%
mutate(date = case_when(year_last_seen == &quot;2000-2020&quot; ~ &quot;2020&quot;, year_last_seen == &quot;1980-1999&quot; ~ &quot;1999&quot;, year_last_seen == &quot;1960-1979&quot; ~ &quot;1979&quot;, year_last_seen == &quot;1940-1959&quot; ~ &quot;1959&quot;, year_last_seen == &quot;1920-1939&quot; ~ &quot;1939&quot;, year_last_seen == &quot;1900-1919&quot; ~ &quot;1919&quot;, year_last_seen == &quot;Before 1900&quot; ~ &quot;1900&quot;, T ~ &quot;NA&quot;))
most_listed &lt;- actions_change %&gt;%
count(binomial_name) %&gt;%
filter(n &gt;= 2)
most_changes &lt;- actions_change %&gt;%
filter(binomial_name %in% most_listed$binomial_name)</code></pre>
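<p>As a first check on that question (a sketch - I haven’t dug into what it actually returns), we can look for species that show up under more than one red list category:</p>
<pre class="r"><code>most_changes %&gt;%
  distinct(binomial_name, red_list_category) %&gt;%
  count(binomial_name) %&gt;%
  filter(n &gt; 1) #a species listed under more than one category would suggest a status change</code></pre>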
</description>
</item>
<item>
<title>Making Maps with Leaflet in R</title>
<link>/post/leaflet/</link>
<pubDate>Thu, 13 Aug 2020 00:00:00 +0000</pubDate>
<guid>/post/leaflet/</guid>
<description>
<p>I’ve been searching for a way to interactively display some geospatial data. After trying out a few things I found {leaflet} - so easy and beautiful! I started out reading this <a href="https://rstudio.github.io/leaflet/">excellent primer</a>, then got to work with specific examples. Let’s take a look and see what it can do -&gt;</p>
<pre class="r"><code>#libraries
library(tidyverse) #for manipulating and visualizing data
library(leaflet) #for making interactive maps
#upload data from Geotracker - LA County
URL &lt;- &quot;https://geotracker.waterboards.ca.gov/data_download/geo_by_county/LosAngelesGeoXY.zip&quot;
download.file(URL, destfile=&#39;LosAngelesGeoXY.zip&#39;, method=&#39;curl&#39;)
unzip(&#39;LosAngelesGeoXY.zip&#39;)
LA_XY &lt;- read.delim(&quot;LosAngelesGeoXY.txt&quot;)
head(LA_XY)</code></pre>
<pre><code>## COUNTY GLOBAL_ID FIELD_PT_NAME FIELD_PT_CLASS XY_SURVEY_DATE LATITUDE
## 1 Los Angeles T0603701211 GMX-RPZ7 PZ 2001-01-25 34.02664
## 2 Los Angeles T0603701211 GMX-RPZ8 PZ 2001-07-11 34.01503
## 3 Los Angeles T0603701211 GMX-RPZ9 PZ 2001-07-11 34.01506
## 4 Los Angeles T0603701211 KMX-MW1 MW 1999-08-27 34.01560
## 5 Los Angeles T0603701211 KMX-MW2 MW 1999-08-27 34.01787
## 6 Los Angeles T0603701211 KMX-MW3 MW 1999-08-27 34.01728
## LONGITUDE XY_METHOD XY_DATUM XY_ACC_VAL XY_SURVEY_ORG GPS_EQUIP_TYPE
## 1 -118.4214 CGPS NAD83 3 Calvada Survey L530
## 2 -118.4168 CGPS NAD83 3 Calvada Survey L530
## 3 -118.4169 CGPS NAD83 3 Calvada Survey L530
## 4 -118.4266 CGPS NAD83 3 Calvada Survey L530
## 5 -118.4251 CGPS NAD83 3 Calvada Survey L530
## 6 -118.4246 CGPS NAD83 3 Calvada Survey L530
## XY_SURVEY_DESC
## 1
## 2
## 3
## 4
## 5
## 6</code></pre>
<p>Similar to the well measurements data, there is <a href="https://www.waterboards.ca.gov/ust/electronic_submittal/docs/geotrackersurvey_xyz_4_14_05.pdf">documentation</a> on the geotracker for each of these column ids. The <code>GLOBAL_ID</code> represents a site and matches up with the concentration data files. Let’s take a closer look at the <code>FIELD_PT_CLASS</code>.</p>
<pre class="r"><code>LA_XY %&gt;%
select(FIELD_PT_CLASS) %&gt;%
summary()</code></pre>
<pre><code>## FIELD_PT_CLASS
## : 1
## BH: 11
## MW:109
## PZ: 6
## SG: 30</code></pre>
<p>There are four codes in our data - BH, MW, PZ, SG. You might guess that BH is borehole and MW is monitoring well. I would guess that PZ stands for piezometer and I’m actually not sure what SG stands for. The documentation lists 33 valid values for <code>FIELD_PT_CLASS</code>, but PZ and SG are not on the list.</p>
<p>The date on the documentation is 2005, so maybe these codes have been added in the 15 years since then. A disclaimer at the top of the list states that new values are added occasionally, but the link to see the current list is broken :(</p>
<p>In any case, we’ll plot these points and see what we can add to make them useful.</p>
<pre class="r"><code>LA_wells_map &lt;- leaflet(LA_XY) %&gt;%
addProviderTiles(&quot;CartoDB.Positron&quot;) %&gt;% #using the CartoDB.Positron tiles; there are other options!
addCircleMarkers(lng = ~LONGITUDE,
lat = ~LATITUDE,
)
#LA_wells_map </code></pre>
<iframe seamless src="LA_wells_map.html" width="100%" height="500">
</iframe>
<p>Look at that! With just a few lines of code we’ve made a map of points that we can pan and zoom. I was honestly expecting a lot more sites. There are other basemaps that you can use, so take a look at <a href="https://rstudio.github.io/leaflet/basemaps.html">what’s available</a> and pick what’s right for you. At this point my map doesn’t display that much information though, so let’s see what we can improve.</p>
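<p>For instance, swapping in satellite imagery is a one-line change (a hedged example - any provider name from the linked gallery should slot in the same way):</p>
<pre class="r"><code>LA_wells_map_satellite &lt;- leaflet(LA_XY) %&gt;%
  addProviderTiles(&quot;Esri.WorldImagery&quot;) %&gt;% #satellite tiles instead of the light CartoDB basemap
  addCircleMarkers(lng = ~LONGITUDE, lat = ~LATITUDE)</code></pre>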
<pre class="r"><code>#make a palette to add colors
pal &lt;- colorFactor(topo.colors(5), LA_XY$FIELD_PT_CLASS)
LA_wells_map_with_color &lt;- leaflet(LA_XY) %&gt;%
addProviderTiles(&quot;CartoDB.Positron&quot;) %&gt;%
addCircleMarkers(lng = ~LONGITUDE,
lat = ~LATITUDE,
label= ~as.character(FIELD_PT_CLASS), #add a label
color = ~pal(FIELD_PT_CLASS)) #add colors
#LA_wells_map_with_color</code></pre>
<iframe seamless src="LA_wells_map_with_color.html" width="100%" height="500">
</iframe>
<p>This shows which class the points are and when you hover over the class it shows the class code. If you zoom in to the site near San Gabriel, you can see three monitoring wells (the two close to each other are probably the same one) and a pattern of <code>SG</code> class points that is fairly uniform. The site near El Monte is made up only of boreholes, as is the one near Manhattan Beach. The others have a mix of monitoring wells and piezometers.</p>
<p>The label field in our call to leaflet creates a tag that shows up when we hover the mouse over a point. There’s another field for popup, which will make a box that appears when you click on a point. Let’s add some data and put it in a popup box. To do this we’ll have to get the concentration data for the sites and join by the <code>FIELD_PT_NAME</code>.</p>
<pre class="r"><code>URL &lt;- &quot;https://geotracker.waterboards.ca.gov/data_download/edf_by_county/LosAngelesEDF.zip&quot;
download.file(URL, destfile=&#39;LosAngelesEDF.zip&#39;, method=&#39;curl&#39;)
unzip(&#39;LosAngelesEDF.zip&#39;)
LA_EDF &lt;- read.delim(&quot;LosAngelesEDF.txt&quot;)
str(LA_EDF)</code></pre>
<pre><code>## &#39;data.frame&#39;: 128666 obs. of 23 variables:
## $ COUNTY : Factor w/ 2 levels &quot;\032&quot;,&quot;Los Angeles&quot;: 2 2 2 2 2 2 2 2 2 2 ...
## $ GLOBAL_ID : Factor w/ 8 levels &quot;&quot;,&quot;SL603799209&quot;,..: 3 3 3 3 3 3 3 3 3 3 ...
## $ FIELD_PT_NAME: Factor w/ 127 levels &quot;&quot;,&quot;15200&quot;,&quot;15210&quot;,..: 50 31 40 50 45 39 39 39 32 40 ...
## $ LOGDATE : Factor w/ 274 levels &quot;&quot;,&quot;2006-07-10&quot;,..: 56 46 46 56 47 46 46 46 46 46 ...
## $ LOGTIME : int 545 1230 945 850 1130 730 850 715 1215 900 ...
## $ LOGCODE : Factor w/ 10 levels &quot;&quot;,&quot;AEII&quot;,&quot;EAIH&quot;,..: 4 4 4 4 4 4 4 4 4 4 ...
## $ SAMPID : Factor w/ 2537 levels &quot;&quot;,&quot;15200&quot;,&quot;15210&quot;,..: 2062 1835 1954 2061 2033 1934 1935 1936 1852 1952 ...
## $ MATRIX : Factor w/ 7 levels &quot;&quot;,&quot;AX&quot;,&quot;GS&quot;,&quot;IA&quot;,..: 6 6 6 6 6 6 6 6 6 6 ...
## $ LABWO : Factor w/ 45 levels &quot;&quot;,&quot;1609-03&quot;,&quot;1803292&quot;,..: 45 45 45 45 45 45 45 45 45 45 ...
## $ LABCODE : Factor w/ 9 levels &quot;&quot;,&quot;AAC&quot;,&quot;ASLL&quot;,..: 5 5 5 5 5 5 5 5 5 5 ...
## $ LABSAMPID : Factor w/ 2656 levels &quot;&quot;,&quot;060703521&quot;,..: 525 437 443 527 455 439 440 435 436 442 ...
## $ ANMCODE : Factor w/ 24 levels &quot;&quot;,&quot;8260+OX&quot;,&quot;8260FA&quot;,..: 23 23 23 23 23 23 23 23 23 23 ...
## $ LABLOTCTL : Factor w/ 652 levels &quot;&quot;,&quot;060711B01&quot;,..: 144 123 127 144 129 123 123 123 123 123 ...
## $ ANADATE : Factor w/ 384 levels &quot;&quot;,&quot;2006-07-11&quot;,..: 87 72 74 87 76 72 72 72 72 72 ...
## $ BASIS : Factor w/ 7 levels &quot;&quot;,&quot;A&quot;,&quot;D&quot;,&quot;L&quot;,..: 5 5 5 5 5 5 5 5 5 5 ...
## $ PARLABEL : Factor w/ 150 levels &quot;&quot;,&quot;4:2FTS&quot;,&quot;6:2FTS&quot;,..: 80 12 56 141 90 141 100 82 141 68 ...
## $ PARVAL : num 10 1 1 1 10 1 5 1 1 2 ...
## $ PARVQ : Factor w/ 5 levels &quot;&quot;,&quot;&lt;&quot;,&quot;=&quot;,&quot;ND&quot;,..: 2 2 2 2 3 2 2 2 2 2 ...
## $ LABDL : num 5.4 0.24 0.36 0.23 4.3 0.23 1.5 0.26 0.23 0.18 ...
## $ REPDL : num 10 1 1 1 10 1 5 1 1 2 ...
## $ UNITS : Factor w/ 10 levels &quot;&quot;,&quot;MG/KG&quot;,&quot;MG/L&quot;,..: 8 8 8 8 8 8 8 8 8 8 ...
## $ DILFAC : num 1 1 1 1 1 1 5 1 1 1 ...
## $ LNOTE : Factor w/ 72 levels &quot;&quot;,&quot;B&quot;,&quot;B,VCO&quot;,..: 1 1 1 1 68 1 1 1 1 1 ...</code></pre>
<p>Ooof that’s a lot of data. The question now is what do we want to put in the popup? Most recent concentration of a small class of chemicals? Maximum concentration? Could we add a plot of the concentration over time? I haven’t been able to get the plot-within-a-plot working (yet), so let’s go with maximum concentration of Trichloroethene (TCE) at each well.</p>
<pre class="r"><code>TCE_conc &lt;- LA_EDF %&gt;%
filter(PARLABEL == &quot;TCE&quot;) %&gt;%
group_by(FIELD_PT_NAME) %&gt;%
mutate(TCE = max(PARVAL)) %&gt;%
select(FIELD_PT_NAME, TCE, UNITS) %&gt;%
filter(UNITS %in% c(&quot;UG/L&quot;, &quot;MG/L&quot;)) %&gt;%
unique() %&gt;%
mutate(value = case_when(UNITS == &quot;MG/L&quot; ~ 1000, T ~ 1)) %&gt;% #create a col with conversion factor
mutate(TCE = TCE*value) #convert mg/L to ug/L
LA_TCE_xy &lt;- TCE_conc %&gt;%
inner_join(LA_XY, by = &quot;FIELD_PT_NAME&quot;) %&gt;%
select(FIELD_PT_NAME, TCE, LATITUDE, LONGITUDE)
TCEpal &lt;- colorNumeric(
palette = &quot;YlGnBu&quot;,
domain = 0:110)
popup1 &lt;- &quot;TCE concentration: &quot; #I bet there&#39;s a better way to do this
popup2 &lt;- &quot; ug/L&quot; #Tell me in the comments!
LA_TCE_xy$pop1 &lt;- popup1 #Like and subscribe!
LA_TCE_xy$pop2 &lt;- popup2
LA_TCE_xy$popup &lt;- paste0(LA_TCE_xy$pop1, LA_TCE_xy$TCE, LA_TCE_xy$pop2)
LA_wells_TCE_map &lt;- leaflet(LA_TCE_xy) %&gt;%
addProviderTiles(&quot;CartoDB.Positron&quot;) %&gt;%
addCircleMarkers(lng = ~LONGITUDE,
lat = ~LATITUDE,
color = ~TCEpal(TCE),
popup = ~popup) %&gt;% #add colors
addLegend(pal = TCEpal, values = ~TCE, opacity = 0.7, title = NULL,
position = &quot;bottomright&quot;)
#LA_wells_TCE_map</code></pre>
<iframe seamless src="LA_wells_TCE_map.html" width="100%" height="500">
</iframe>
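<p>(On the “better way” flagged in the code comments above: one tidier option is probably a single <code>paste0</code> inside <code>mutate</code> - a sketch, not a definitive answer:)</p>
<pre class="r"><code>LA_TCE_xy &lt;- LA_TCE_xy %&gt;%
  mutate(popup = paste0(&quot;TCE concentration: &quot;, TCE, &quot; ug/L&quot;)) #one step instead of three helper columns</code></pre>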
<p>You may notice some annoying things about this map: most of the values are very low, so they skew the colors toward zero, and the legend puts the highest values at the bottom, which is counterintuitive to me. Also, we’ve lost some points - the spatial file had 157 points, but after joining with the concentration data we’re left with 69.</p>
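<p>If you’re curious which wells dropped out, an <code>anti_join</code> finds the locations with no matching TCE result (a quick check, assuming the same key column):</p>
<pre class="r"><code>LA_XY %&gt;%
  anti_join(TCE_conc, by = &quot;FIELD_PT_NAME&quot;) %&gt;% #well locations with no TCE record
  nrow()</code></pre>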
<p>One common way to visualize data with a large range is with a log transformation. Let’s try it out!</p>
<pre class="r"><code>LA_TCE_xy &lt;- LA_TCE_xy %&gt;% #log transform the concentrations
mutate(TCE = case_when(TCE == 0 ~ 0.0001, T ~ TCE)) %&gt;%
mutate(log_TCE = log(TCE))
logpal &lt;- colorNumeric(
palette = &quot;YlGnBu&quot;,
domain = -9.2105:14.701)
LA_wells_logTCE_map &lt;- leaflet(LA_TCE_xy) %&gt;%
addProviderTiles(&quot;CartoDB.Positron&quot;) %&gt;%
addCircleMarkers(lng = ~LONGITUDE,
lat = ~LATITUDE,
color = ~logpal(log_TCE),
popup = ~popup)
#LA_wells_logTCE_map</code></pre>
<iframe seamless src="LA_wells_logTCE_map.html" width="100%" height="500">
</iframe>
<p>If we use the log of the TCE value it’s easier to see the changes in color, though it could still use some fiddling. I took out the legend because the popup still shows the real value while the color is now based on the log value. There seem to be <a href="https://stackoverflow.com/questions/40276569/reverse-order-in-r-leaflet-continuous-legend">some workarounds</a> for this but I didn’t try them out. There are probably better color schemes out there but that’s like entering another world, so I’ll leave it at this.</p>
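<p>For the record, the workaround in that thread reverses the palette and then re-sorts the legend labels - roughly like this (untested here, so treat it as a sketch):</p>
<pre class="r"><code>revpal &lt;- colorNumeric(palette = &quot;YlGnBu&quot;, domain = -9.2105:14.701, reverse = TRUE)
LA_wells_logTCE_map %&gt;%
  addLegend(pal = revpal, values = ~log_TCE, opacity = 0.7, position = &quot;bottomright&quot;,
            labFormat = labelFormat(transform = function(x) sort(x, decreasing = TRUE)))</code></pre>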
<p>I’m sure that there are many ways to improve this map and build on these concepts in leaflet. Take a look at what’s out there and let me know if you discover anything you love!</p>
</description>
</item>
<item>
<title>Weapons of Math Destruction</title>
<link>/post/wmds/</link>
<pubDate>Mon, 03 Aug 2020 00:00:00 +0000</pubDate>
<guid>/post/wmds/</guid>
<description>
<p>Earlier this summer I got an email about reading groups for grad students hosted by the department of cell and molecular biology. Who wasn’t hungry for a reading group this summer? The first book I signed up to read was Weapons of Math Destruction by Cathy O’Neil. I recognized it because I had heard an <a href="https://www.npr.org/2016/09/12/493654950/weapons-of-math-destruction-outlines-dangers-of-relying-on-data-analytics">NPR interview</a> with the author on the radio. I actually listened to the audiobook because I wasn’t in Fort Collins and libraries were closed (COVID). It was a really engaging story - basically the whole book is composed of examples of “Weapons of Math Destruction”, or WMDs, a term that O’Neil coins in the first chapter to describe models that are important, opaque, and reinforce inequality.</p>
<hr />
<p>The features of WMDs are highlighted through examples (so many examples!) of real algorithms that are in use every day. But O’Neil doesn’t just throw lots of examples of WMDs at us; she also describes models that have some negative aspects but are not, on the whole, that harmful, and models that are sound uses of statistics. It’s complicated and forces you to confront how insidious these algorithms are.</p>
<p>Something that surprised me was her discussion of credit scores. I was all ready to jump on board against credit scores, but that wasn’t her conclusion in the book. This is <em>not</em> a WMD. The FICO model has a feedback loop, but it corrects itself rather than perpetuates inequality. If many people who are predicted to be risky borrowers pay on time, the model is updated to account for the new data and make better predictions. In addition, the model is fairly transparent - if you are trying to improve your credit score you can easily look up ways to do so. And credit scores are also regulated, so you have legal rights to see your score and know what goes into it.</p>
<p>In stark contrast are ubiquitous e-scores used by credit card companies, car dealerships, and advertising agencies that you probably have no idea exist. These scores are created by models that combine lots of seemingly disparate information about you and predict if you’re likely to buy their product. The problem is that the factors the model uses are proxies like zip code, which means that inequality created by racist zoning laws is perpetuated by an algorithm. Predatory for-profit college lead generation companies use algorithms that target people who are earnestly searching for opportunities at upward mobility - and make a fortune doing it.</p>
<p>Another outraging story was about scheduling software used by corporations with large low-wage workforces like McDonald’s, Starbucks, and Walmart. They’re designed to optimize savings for the company, so they create ever-changing and erratic schedules that keep workers from qualifying for benefits. In 2014 the <em>New York Times</em> published an article profiling Jannette Navarro, a Starbucks employee and single mother trying to work her way through college. Her life was so dominated by her work schedule that she couldn’t make her classes consistently or plan for any sort of normal day-care schedule. After the story, legislators drafted a bill to regulate scheduling software, but it went nowhere.</p>
<p>I’ll leave you with the most shocking statistic I read, related to the insurance industry.</p>
<blockquote>
<p>&quot;In Florida, adults with clean driving records and poor credit scores paid an average of $1,552 more than the same drivers with excellent credit and a <em>drunk driving conviction</em>.&quot;</p>
</blockquote>
<p>A seemingly innocuous number created by an algorithm is plugged into another algorithm for an utterly ridiculous outcome.</p>
<hr />
<p>In the discussion group one person said they were vaguely aware that things like this were happening, but didn’t know the extent or specific examples. It does get somewhat repetitive since the author really pounds the ideas into you with tons of examples, but that didn’t bother me. Another person asked that now that we’ve read the book, what can we do about it? We didn’t come up with a lot of answers - be aware and support policies that regulate them. Are there other actions that people can take to combat the proliferation of Weapons of Math Destruction?</p>
</description>
</item>
<item>
<title>USGS Data Retrieval</title>
<link>/post/streamflowdata/</link>
<pubDate>Sat, 11 Jul 2020 00:00:00 +0000</pubDate>
<guid>/post/streamflowdata/</guid>
<description>
<p>It took me a few times to get this right - the interface of the USGS website is a rabbit hole of buttons and options for data retrieval. It’s not so hard to get the data you’re interested in, but pay close attention because it’s easy to do the wrong thing. Probably the easiest way is to use the <code>dataRetrieval</code> <a href="https://cran.r-project.org/web/packages/dataRetrieval/vignettes/dataRetrieval.html">package</a>. The last commits on their <a href="https://github.com/USGS-R/dataRetrieval">github</a> are from 2 months ago, so it’s pretty up-to-date.</p>
<p>To use this package, you need to know a few things about the structure of the data. Each USGS streamflow station has a number - you’ll need to know the number for the station you want. There’s also a number associated with each measured value; <code>00060</code> for discharge in cubic feet per second, <code>00065</code> for gauge height in feet, <code>00010</code> for temperature in degrees Celsius, and many more. The last critical thing to know is the date range of the data you’re interested in.</p>
<ul>
<li>Station Number</li>
<li>Parameter Code</li>
<li>Date Range of Interest</li>
</ul>
<p>Here’s an example from the documentation:</p>
<pre class="r"><code>#install.packages(&quot;dataRetrieval&quot;) #uncomment to install the package
library(dataRetrieval)
# Choptank River near Greensboro, MD:
siteNumber &lt;- &quot;01491000&quot;
parameterCd &lt;- &quot;00060&quot; # Discharge
startDate &lt;- &quot;2009-10-01&quot;
endDate &lt;- &quot;2012-09-30&quot;
discharge &lt;- readNWISdv(siteNumber,
parameterCd, startDate, endDate)</code></pre>
<p>Here’s what the resulting data frame looks like:</p>
<p><img src="discharge.png" /></p>
<p>The fifth column is a quality code - A stands for approved for release by the USGS.</p>
<p>Another notable argument to the <code>readNWISdv</code> function is <code>statCd</code>. This allows you to request daily data and specify the statistic used. For example, you can request the daily maximum, mean, median, or total.</p>
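<p>For example, requesting the daily maximum instead of the default daily mean looks like this (a sketch using the USGS statistic code for the daily maximum, <code>00001</code>; the default is <code>00003</code>, the mean):</p>
<pre class="r"><code>max_discharge &lt;- readNWISdv(siteNumber, parameterCd,
                            startDate, endDate,
                            statCd = &quot;00001&quot;) #daily maximum discharge</code></pre>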
<p>Pretty simple, right?</p>
<p>There are some other handy functions and options in this library. You’ll notice that the column names aren’t really descriptive of the data. You can use another function from the package to rename them. Try this:</p>
<pre class="r"><code>discharge &lt;- renameNWISColumns(discharge)
names(discharge)</code></pre>
<pre><code>## [1] &quot;agency_cd&quot; &quot;site_no&quot; &quot;Date&quot; &quot;Flow&quot; &quot;Flow_cd&quot;</code></pre>
<p>That can clean things up a lot - especially if you’re importing many parameters.</p>
<pre class="r"><code>library(ggplot2)
ggplot(discharge, aes(x=Date, y=Flow)) + geom_point(alpha = 0.4)</code></pre>
<p><img src="/post/streamflowdata/index_files/figure-html/unnamed-chunk-3-1.png" width="672" /></p>
<p>What I’m interested in is the streamflow data, so this is as far as I need to go. You can also import different kinds of USGS data using other functions of the same package. It’s a treasure trove! Check out the <a href="https://cran.r-project.org/web/packages/dataRetrieval/vignettes/dataRetrieval.html">documentation</a> to get the full scope of its powers.</p>
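<p>As a taste (hedged - check the documentation for the full list), <code>readNWISsite</code> pulls station metadata and <code>readNWISuv</code> pulls instantaneous "unit value" readings:</p>
<pre class="r"><code>site_info &lt;- readNWISsite(siteNumber) #station metadata: name, coordinates, drainage area, ...
recent_uv &lt;- readNWISuv(siteNumber, parameterCd,
                        &quot;2012-09-01&quot;, &quot;2012-09-30&quot;) #sub-daily discharge readings for one month</code></pre>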
<hr />
<p>I also came across another package called <code>waterData</code> that can do the same retrieval and maybe some more analysis on the data. I haven’t looked further into this, but the last commits to their <a href="https://github.com/USGS-R/waterData">github</a> are from 3 years ago, so I chose the <code>dataRetrieval</code> package.</p>
<hr />
<p>Now I’ll let you in on the less fun bit, which is the way I originally tried to download the data locally then upload to R. I started out fumbling around the website trying to find the gauge closest to the site I’m interested in. There’s an interactive map on the <a href="https://waterdata.usgs.gov/nwis/rt">homepage</a>; you click on a state to get to a more refined map and click around some more to find the gauge you want.</p>
<p><img src="map.png" /></p>
<p>You get to a really promising page that gives you some options to select parameters, a date range, and whether you want the output to be a graph, table, or tab-separated format. This looks good, but is not.</p>
<p><img src="options.png" /></p>
<p>I chose the tab-separated option; after a lengthy wait, a new tab with the output opened in the browser - so no file download. I guess I could have copied and pasted all of that into a text file, but I decided to search for a way to actually download the file.</p>
<p>I found a <a href="https://help.waterdata.usgs.gov/tutorials/overview/a-primer-on-downloading-data">USGS primer on downloading data</a> (always read the instructions first!), which took some different turns than the path I took previously. The user interface is difficult to navigate and not intuitive, but it’s a great resource for data. I won’t go through the whole process, because it’s mostly just clicking on the right spot and choosing your desired options, but here is the most important part:</p>
<p><img src="savetofile.png" />
There are still extra steps of choosing the options in the website, downloading the data, then reading it into R. This is the code I used to read the data once I had downloaded it as a text file and renamed it "streamflow". If I were to go back, I would <em>not</em> do it this way, but I guess it doesn’t require downloading a new package and is fine if you’re not downloading a lot of data.</p>
<pre class="r"><code>library(tidyverse)
streamflow &lt;- read.table(&quot;streamflow&quot;, skip = 32) #the first 32 lines are metadata
colnames(streamflow) &lt;- c(&quot;agency&quot;, &quot;site_no&quot;, &quot;date&quot;, &quot;time&quot;, &quot;timezone&quot;, &quot;Flow&quot;, &quot;quality_code&quot;)
#discharge is in cubic feet per second, quality code A means approved
streamflow$date &lt;- as.Date(streamflow$date)
ggplot(streamflow, aes(x=date, y=Flow)) + geom_point(alpha=0.4)</code></pre>
<p><img src="/post/streamflowdata/index_files/figure-html/unnamed-chunk-4-1.png" width="672" /></p>
<p>Hope this info helps you avoid wasting time like I did on my first try! Do you know of a better way to get the data into R? Did you encounter any pitfalls?</p>
</description>
</item>
<item>
<title>Mining Data from California's Geotracker Database</title>
<link>/post/geotracker/</link>
<pubDate>Fri, 03 Jul 2020 00:00:00 +0000</pubDate>
<guid>/post/geotracker/</guid>
<description><p>
<a href="https://geotracker.waterboards.ca.gov/" target="_blank" rel="noopener">Geotracker</a> is a public database that is used to store environmental data from regulated sites in California. I&rsquo;m going to download some data of a contaminated site and clean it up to try to derive some insights about the site. I picked Alpine county because it is the least populated one and I want to work with a smaller file. Some of the more industrial counties can have thousands of sites.</p>
<pre><code class="language-R">URL &lt;- &quot;https://geotracker.waterboards.ca.gov/data_download/edf_by_county/AlpineEDF.zip&quot;
download.file(URL, destfile='alpine.zip', method='curl')
</code></pre>
<p>After downloading the zipped file I need to unzip it and read in the .txt file. Let&rsquo;s see what&rsquo;s inside.</p>
<pre><code class="language-R">unzip('alpine.zip')
alpine &lt;- read.delim(&quot;AlpineEDF.txt&quot;)
head(alpine)
</code></pre>
<table>
<caption>A data.frame: 6 × 23</caption>
<thead>
<tr><th></th><th scope=col>COUNTY</th><th scope=col>GLOBAL_ID</th><th scope=col>FIELD_PT_NAME</th><th scope=col>LOGDATE</th><th scope=col>LOGTIME</th><th scope=col>LOGCODE</th><th scope=col>SAMPID</th><th scope=col>MATRIX</th><th scope=col>LABWO</th><th scope=col>LABCODE</th><th scope=col>⋯</th><th scope=col>ANADATE</th><th scope=col>BASIS</th><th scope=col>PARLABEL</th><th scope=col>PARVAL</th><th scope=col>PARVQ</th><th scope=col>LABDL</th><th scope=col>REPDL</th><th scope=col>UNITS</th><th scope=col>DILFAC</th><th scope=col>LNOTE</th></tr>
<tr><th></th><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;int&gt;</th><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;chr&gt;</th><th scope=col>⋯</th><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;chr&gt;</th></tr>
</thead>
<tbody>
<tr><th scope=row>1</th><td>Alpine</td><td>T0600300005</td><td>K1 </td><td>2012-10-08</td><td>1110</td><td>CLSR</td><td>K1-15 </td><td>SO</td><td>CVJ0496</td><td>CLSR</td><td>⋯</td><td>2012-10-12</td><td>W</td><td>PHCD </td><td> 0</td><td>ND</td><td>0.033</td><td> 1.0</td><td>MG/KG</td><td>1</td><td> </td></tr>
<tr><th scope=row>2</th><td>Alpine</td><td>T0600300005</td><td>KMPUD#2</td><td>2008-12-30</td><td>1330</td><td>CLSR</td><td>KMPUD #2 </td><td>W </td><td>CSA0056</td><td>CLSR</td><td>⋯</td><td>2009-01-07</td><td>N</td><td>MOIL </td><td>79</td><td>= </td><td>9.100</td><td>50.0</td><td>UG/L </td><td>1</td><td>DU,DU</td></tr>
<tr><th scope=row>3</th><td>Alpine</td><td>T0600300005</td><td>KMPUD#2</td><td>2002-04-16</td><td>1203</td><td>KFR </td><td>KMPUD #2 </td><td>WX</td><td>NA </td><td>ALPS</td><td>⋯</td><td>2002-04-18</td><td>N</td><td>MTBE </td><td> 0</td><td>ND</td><td>0.250</td><td> 0.5</td><td>UG/L </td><td>1</td><td> </td></tr>
<tr><th scope=row>4</th><td>Alpine</td><td>T0600300005</td><td>KMPUD#2</td><td>2002-04-16</td><td>1203</td><td>KFR </td><td>KMPUD #2 </td><td>WX</td><td>NA </td><td>ALPS</td><td>⋯</td><td>2002-04-18</td><td>N</td><td>BZME </td><td> 0</td><td>ND</td><td>0.250</td><td> 0.5</td><td>UG/L </td><td>1</td><td> </td></tr>
<tr><th scope=row>5</th><td>Alpine</td><td>T0600300005</td><td>KMPUD#2</td><td>2002-10-31</td><td>1805</td><td>KFR </td><td>KMPUD WELL #2</td><td>WX</td><td>NA </td><td>ALPS</td><td>⋯</td><td>2002-11-04</td><td>N</td><td>DCE11</td><td> 0</td><td>ND</td><td>0.500</td><td> 1.0</td><td>UG/L </td><td>1</td><td> </td></tr>
<tr><th scope=row>6</th><td>Alpine</td><td>T0600300005</td><td>KMPUD#2</td><td>2002-10-31</td><td>1805</td><td>KFR </td><td>KMPUD WELL #2</td><td>WX</td><td>NA </td><td>ALPS</td><td>⋯</td><td>2002-11-04</td><td>N</td><td>BZME </td><td> 0</td><td>ND</td><td>0.250</td><td> 0.5</td><td>UG/L </td><td>1</td><td> </td></tr>
</tbody>
</table>
<p>There&rsquo;s
<a href="https://www.waterboards.ca.gov/ust/electronic_submittal/docs/edf_data_dict_2001.pdf" target="_blank" rel="noopener">documentation</a> about all of the fields, but we can already make sense of some of them. GLOBAL_ID represents a site, and FIELD_PT_NAME is a well. PARLABEL is the code for the name of the contaminant and PARVAL is the concentration reported. There&rsquo;s also helpful QA/QC information, and the latitude and longitude of the wells are in a different file on the geotracker website. We&rsquo;ll focus on one site with the most observations reported.</p>
<pre><code class="language-R">library(dplyr)
alpine %&gt;%
group_by(GLOBAL_ID) %&gt;%
count() %&gt;%
arrange(desc(n))
</code></pre>
<pre><code>Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
</code></pre>
<table>
<caption>A grouped_df: 7 × 2</caption>
<thead>
<tr><th scope=col>GLOBAL_ID</th><th scope=col>n</th></tr>
<tr><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;int&gt;</th></tr>
</thead>
<tbody>
<tr><td>T0600300011</td><td>11764</td></tr>
<tr><td>T0600300005</td><td> 5776</td></tr>
<tr><td>T0600397314</td><td> 5131</td></tr>
<tr><td>T0600300013</td><td> 3780</td></tr>
<tr><td>T0600300007</td><td> 2145</td></tr>
<tr><td>T0600300008</td><td> 1532</td></tr>
<tr><td> </td><td> 1</td></tr>
</tbody>
</table>
<p>Looks like <code>T0600300011</code> is our site. Let&rsquo;s see what the main chemicals of concern are and at what concentrations.</p>
<pre><code class="language-R">site &lt;- alpine %&gt;%
filter(GLOBAL_ID == &quot;T0600300011&quot;)
site %&gt;%
group_by(PARLABEL) %&gt;%
tally(sort = TRUE) %&gt;%
head()
</code></pre>
<table>
<caption>A tibble: 6 × 2</caption>
<thead>
<tr><th scope=col>PARLABEL</th><th scope=col>n</th></tr>
<tr><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;int&gt;</th></tr>
</thead>
<tbody>
<tr><td>BZ </td><td>260</td></tr>
<tr><td>BZME </td><td>260</td></tr>
<tr><td>EBZ </td><td>260</td></tr>
<tr><td>XYLENES1314</td><td>203</td></tr>
<tr><td>XYLO </td><td>203</td></tr>
<tr><td>MTBE </td><td>194</td></tr>
</tbody>
</table>
<p>The most results came from benzene (BZ), toluene (BZME), ethylbenzene (EBZ), xylene - isomers m &amp; p (XYLENES1314), o-xylene (XYLO), and methyl-tert-butyl ether (MTBE). This is a hydrocarbon site. Now let&rsquo;s try to see which wells are near the source area.</p>
<pre><code class="language-R">top_wells &lt;- site %&gt;%
filter(PARLABEL %in% c(&quot;BZ&quot;, &quot;BZME&quot;, &quot;EBZ&quot;, &quot;XYLENES1314&quot;, &quot;XYLO&quot;, &quot;MTBE&quot;)) %&gt;%
select(c(&quot;FIELD_PT_NAME&quot;, &quot;LOGDATE&quot;, &quot;MATRIX&quot;, &quot;PARLABEL&quot;, &quot;PARVAL&quot;, &quot;PARVQ&quot;, &quot;LABDL&quot;, &quot;REPDL&quot;, &quot;UNITS&quot;,)) %&gt;%
group_by(FIELD_PT_NAME) %&gt;%
tally(sort = TRUE) %&gt;%
filter(n&gt;100)
# the wells with the most data are:
top_wells
</code></pre>
<table>
<caption>A tibble: 6 × 2</caption>
<thead>
<tr><th scope=col>FIELD_PT_NAME</th><th scope=col>n</th></tr>
<tr><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;int&gt;</th></tr>
</thead>
<tbody>
<tr><td>MW-3</td><td>167</td></tr>
<tr><td>MW-2</td><td>161</td></tr>
<tr><td>MWA4</td><td>157</td></tr>
<tr><td>MW-1</td><td>155</td></tr>
<tr><td>MWA1</td><td>143</td></tr>
<tr><td>MWA2</td><td>143</td></tr>
</tbody>
</table>
<p>The most sampled wells are MW-3, MW-2, MWA4, MW-1, MWA1, and MWA2. Naming conventions for wells can mean nothing or can mean a lot, and get wackier the longer a site has been around.</p>
<pre><code class="language-R">site &lt;- site %&gt;%
filter(FIELD_PT_NAME %in% top_wells$FIELD_PT_NAME) %&gt;%
filter(PARLABEL %in% c(&quot;BZ&quot;, &quot;BZME&quot;, &quot;EBZ&quot;, &quot;XYLENES1314&quot;, &quot;XYLO&quot;, &quot;MTBE&quot;)) %&gt;%
select(c(&quot;FIELD_PT_NAME&quot;, &quot;LOGDATE&quot;, &quot;MATRIX&quot;, &quot;PARLABEL&quot;, &quot;PARVAL&quot;, &quot;PARVQ&quot;, &quot;LABDL&quot;, &quot;REPDL&quot;, &quot;UNITS&quot;,))
head(site)
</code></pre>
<table>
<caption>A data.frame: 6 × 9</caption>
<thead>
<tr><th></th><th scope=col>FIELD_PT_NAME</th><th scope=col>LOGDATE</th><th scope=col>MATRIX</th><th scope=col>PARLABEL</th><th scope=col>PARVAL</th><th scope=col>PARVQ</th><th scope=col>LABDL</th><th scope=col>REPDL</th><th scope=col>UNITS</th></tr>
<tr><th></th><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;chr&gt;</th></tr>
</thead>
<tbody>
<tr><th scope=row>1</th><td>MW-1</td><td>2002-10-08</td><td>W</td><td>BZ </td><td>0</td><td>ND</td><td>5.0</td><td>5.0</td><td>UG/L</td></tr>
<tr><th scope=row>2</th><td>MW-1</td><td>2003-09-08</td><td>W</td><td>BZ </td><td>0</td><td>ND</td><td>5.0</td><td>5.0</td><td>UG/L</td></tr>
<tr><th scope=row>3</th><td>MW-1</td><td>2002-11-25</td><td>W</td><td>XYLO</td><td>0</td><td>ND</td><td>5.0</td><td>5.0</td><td>UG/L</td></tr>
<tr><th scope=row>4</th><td>MW-1</td><td>2003-03-05</td><td>W</td><td>MTBE</td><td>0</td><td>ND</td><td>5.0</td><td>5.0</td><td>UG/L</td></tr>
<tr><th scope=row>5</th><td>MW-1</td><td>2003-03-05</td><td>W</td><td>EBZ </td><td>0</td><td>ND</td><td>5.0</td><td>5.0</td><td>UG/L</td></tr>
<tr><th scope=row>6</th><td>MW-1</td><td>2003-03-05</td><td>W</td><td>EBZ </td><td>0</td><td>ND</td><td>0.5</td><td>0.5</td><td>UG/L</td></tr>
</tbody>
</table>
<p>Let's take a closer look at the units. The first six observations are reported in &mu;g/L, but are they all? </p>
<p>First we&rsquo;ll want to clean up the classes - the <code>FIELD_PT_NAME</code> should be a factor and <code>LOGDATE</code> should be a date.</p>
<pre><code class="language-R">site$FIELD_PT_NAME &lt;- as.factor(site$FIELD_PT_NAME)
site$LOGDATE &lt;- as.Date(site$LOGDATE)
site$MATRIX &lt;- as.factor(site$MATRIX)
site$PARLABEL &lt;- as.factor(site$PARLABEL)
site$PARVQ &lt;- as.factor(site$PARVQ)
site$UNITS &lt;- as.factor(site$UNITS)
levels(site$PARLABEL) &lt;- list(Benzene=&quot;BZ&quot;, Toluene=&quot;BZME&quot;, Ethylbenzene=&quot;EBZ&quot;, Xylenes_MP=&quot;XYLENES1314&quot;, Xylene_O=&quot;XYLO&quot;, MTBE=&quot;MTBE&quot;)
summary(site)
</code></pre>
<pre><code> FIELD_PT_NAME LOGDATE MATRIX PARLABEL
MW-1:155 Min. :2002-07-11 W:926 Benzene :179
MW-2:161 1st Qu.:2003-05-27 Toluene :179
MW-3:167 Median :2004-05-05 Ethylbenzene:179
MWA1:143 Mean :2005-11-06 Xylenes_MP :132
MWA2:143 3rd Qu.:2008-05-05 Xylene_O :132
MWA4:157 Max. :2010-04-14 MTBE :125
PARVAL PARVQ LABDL REPDL UNITS
Min. : 0.00 = :135 Min. :0.080 Min. : 0.500 UG/L:926
1st Qu.: 0.00 ND:791 1st Qu.:0.380 1st Qu.: 0.500
Median : 0.00 Median :0.500 Median : 0.500
Mean : 20.95 Mean :1.909 Mean : 3.568
3rd Qu.: 0.00 3rd Qu.:5.000 3rd Qu.: 5.000
Max. :1150.00 Max. :9.900 Max. :50.000
</code></pre>
<p>Turns out that they <i>are</i> all reported in &mu;g/L, which is good, because that means the concentrations are probably fairly low - remember one &mu;g/L is one part per <i>billion</i>. This isn't always the case - be sure to keep an eye out for units and do conversions as necessary before working with the data. The PARVQ tells us if the chemical was detected in the sample or below the detection limit (ND). For 791 of the observations, the chemical was not detected in the sample, while in only 135 observations a true detected concentration is reported. In cases where the chemical is not detected, is NA, zero, the lab detection limit, report detection limit, or something else reported? </p>
<pre><code class="language-R">site %&gt;%
filter(PARVQ == 'ND') %&gt;%
head()
</code></pre>
<table>
<caption>A data.frame: 6 × 9</caption>
<thead>
<tr><th></th><th scope=col>FIELD_PT_NAME</th><th scope=col>LOGDATE</th><th scope=col>MATRIX</th><th scope=col>PARLABEL</th><th scope=col>PARVAL</th><th scope=col>PARVQ</th><th scope=col>LABDL</th><th scope=col>REPDL</th><th scope=col>UNITS</th></tr>
<tr><th></th><th scope=col>&lt;fct&gt;</th><th scope=col>&lt;date&gt;</th><th scope=col>&lt;fct&gt;</th><th scope=col>&lt;fct&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;fct&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;fct&gt;</th></tr>
</thead>
<tbody>
<tr><th scope=row>1</th><td>MW-1</td><td>2002-10-08</td><td>W</td><td>Benzene </td><td>0</td><td>ND</td><td>5.0</td><td>5.0</td><td>UG/L</td></tr>
<tr><th scope=row>2</th><td>MW-1</td><td>2003-09-08</td><td>W</td><td>Benzene </td><td>0</td><td>ND</td><td>5.0</td><td>5.0</td><td>UG/L</td></tr>
<tr><th scope=row>3</th><td>MW-1</td><td>2002-11-25</td><td>W</td><td>Xylene_O </td><td>0</td><td>ND</td><td>5.0</td><td>5.0</td><td>UG/L</td></tr>
<tr><th scope=row>4</th><td>MW-1</td><td>2003-03-05</td><td>W</td><td>MTBE </td><td>0</td><td>ND</td><td>5.0</td><td>5.0</td><td>UG/L</td></tr>
<tr><th scope=row>5</th><td>MW-1</td><td>2003-03-05</td><td>W</td><td>Ethylbenzene</td><td>0</td><td>ND</td><td>5.0</td><td>5.0</td><td>UG/L</td></tr>
<tr><th scope=row>6</th><td>MW-1</td><td>2003-03-05</td><td>W</td><td>Ethylbenzene</td><td>0</td><td>ND</td><td>0.5</td><td>0.5</td><td>UG/L</td></tr>
</tbody>
</table>
<p>It looks like the value reported for non-detected samples is zero. This can cause problems when analyzing the data statistically. Let&rsquo;s visualize what these concentrations look like at MW-1.</p>
<pre><code class="language-R">MW1 &lt;- site %&gt;%
filter(FIELD_PT_NAME == &quot;MW-1&quot;)
MW2 &lt;- site %&gt;%
filter(FIELD_PT_NAME == &quot;MW-2&quot;)
MW3 &lt;- site %&gt;%
filter(FIELD_PT_NAME == &quot;MW-3&quot;)
MWA1 &lt;- site %&gt;%
filter(FIELD_PT_NAME == &quot;MWA1&quot;)
MWA2 &lt;- site %&gt;%
filter(FIELD_PT_NAME == &quot;MWA2&quot;)
MWA4 &lt;- site %&gt;%
filter(FIELD_PT_NAME == &quot;MWA4&quot;)
library(ggplot2)
ggplot(MW1, aes(x = LOGDATE, y = PARVAL, color = PARLABEL)) +
geom_line() +
geom_point(alpha = 0.6) +
labs(title =&quot;MW-1&quot;, x =&quot;Date&quot;, y = &quot;Concentration in ug/L&quot;)
ggplot(MW2, aes(x = LOGDATE, y = PARVAL, color = PARLABEL)) +
geom_line() +
geom_point(alpha = 0.6) +
labs(title =&quot;MW-2&quot;, x =&quot;Date&quot;, y = &quot;Concentration in ug/L&quot;)
ggplot(MW3, aes(x = LOGDATE, y = PARVAL, color = PARLABEL)) +
geom_line() +
geom_point(alpha = 0.6) +
labs(title =&quot;MW-3&quot;, x =&quot;Date&quot;, y = &quot;Concentration in ug/L&quot;)
ggplot(MWA1, aes(x = LOGDATE, y = PARVAL, color = PARLABEL)) +
geom_line() +
geom_point(alpha = 0.6) +
labs(title =&quot;MWA1&quot;, x =&quot;Date&quot;, y = &quot;Concentration in ug/L&quot;)
ggplot(MWA2, aes(x = LOGDATE, y = PARVAL, color = PARLABEL)) +
geom_line() +
geom_point(alpha = 0.6) +
labs(title =&quot;MWA2&quot;, x =&quot;Date&quot;, y = &quot;Concentration in ug/L&quot;)
</code></pre>
<p><img src="./index_19_0.png" alt="png"></p>
<p><img src="./index_19_1.png" alt="png"></p>
<p><img src="./index_19_2.png" alt="png"></p>
<p><img src="./index_19_3.png" alt="png"></p>
<p><img src="./index_19_4.png" alt="png"></p>
<p>At these wells it looks like there were some spikes before 2004, but nothing much going on later. Let&rsquo;s look at the last well from our selection - MWA4.</p>
<pre><code class="language-R">ggplot(MWA4, aes(x = LOGDATE, y = PARVAL, color = PARLABEL)) +
geom_line() +
geom_point(alpha = 0.6) +
labs(title =&quot;MWA4&quot;, x =&quot;Date&quot;, y = &quot;Concentration in ug/L&quot;)
</code></pre>
<p><img src="./index_21_0.png" alt="png"></p>
<p>This one has a lot higher concentrations - take a look at the values on the y axis. Yikes! Let&rsquo;s see what the
<a href="https://www.epa.gov/ground-water-and-drinking-water/national-primary-drinking-water-regulations" target="_blank" rel="noopener">EPA Maximum Contaminant Levels (MCLs)</a> are for these chemicals.</p>
<pre><code class="language-R">ggplot(MWA4, aes(x = LOGDATE, y = PARVAL, color = PARLABEL)) +
geom_line() +
geom_point(alpha = 0.6) +
labs(title =&quot;MWA4&quot;, x =&quot;Date&quot;, y = &quot;Concentration in ug/L&quot;) +
geom_hline(yintercept=5, linetype=&quot;dashed&quot;, color=&quot;#F8766D&quot;) +
geom_hline(yintercept=1000, linetype=&quot;dashed&quot;, color=&quot;#B79F00&quot;) +
geom_hline(yintercept=700, linetype=&quot;dashed&quot;, color=&quot;#00BA38&quot;)
</code></pre>
<p><img src="./index_23_0.png" alt="png"></p>
<p>There is no MCL for MTBE (there are probably state guidelines) and the MCL for total xylenes is 10,000, which is off this chart. Benzene has the lowest MCL by far, and it is usually the chemical that drives cleanup. It&rsquo;s a little hard for us to see it on this plot, so let&rsquo;s zoom in some more.</p>
<pre><code class="language-R">MWA4_benzene &lt;- MWA4 %&gt;%
filter(PARLABEL == &quot;Benzene&quot;)
ggplot(MWA4_benzene, aes(x = LOGDATE, y = PARVAL, color = &quot;Benzene&quot;)) +
geom_line() +
geom_point(alpha = 0.6) +
labs(title =&quot;MWA4&quot;, x =&quot;Date&quot;, y = &quot;Benzene Concentration in ug/L&quot;) +
geom_hline(yintercept=5, linetype=&quot;dashed&quot;, color=&quot;red&quot;)
</code></pre>
<p><img src="./index_25_0.png" alt="png"></p>
<p>Around 2004 this well has almost seven times the MCL of benzene, but it quickly went down to non-detectable or very low for the rest of the recorded period. If I was especially interested in this site I would try to request some of the accompanying reports and perform a Mann-Kendall analysis of trends in the wells, but mostly I wanted to show how you can download data from Geotracker and manipulate it in R.</p>
<p>This tutorial is done entirely in a jupyter notebook running an R kernal inside of a docker container. The source files are on github and if you have any questions about using jupyter or docker don&rsquo;t hesitate to contact me!</p>
</description>
</item>
<item>
<title>Accessing Historical Weather Data with Dark Sky API</title>
<link>/post/weather/</link>
<pubDate>Wed, 01 Jul 2020 00:00:00 +0000</pubDate>
<guid>/post/weather/</guid>
<description>
<p>In my IoT class last year with <a href="https://soilcrop.agsci.colostate.edu/faculty-2/ham-jay/">Jay Ham</a> we used a website called <a href="https://darksky.net/">Dark Sky</a> to get current weather conditions. I’ve been thinking about this recently, since I would like to see if I can match up weather conditions to the changes in the depth to water of wells at a site. I was inspired to look into this based on a talk from Jonathan Kennel from the Univesity of Guelph ( <em>Happy Canada Day!</em> ) and several conversations with my advisor.</p>
<p>I’ll walk through how I imported the data to R.</p>
<hr />
<p>Dark Sky is a great resource, however, when I went to re-visit the website I found that they have joined Apple. This means they are no longer creating new accounts. Luckily I already had one from my class last fall. The API is still supported through at least the end of 2021. Later I’ll mention some ways that you could get similar (maybe better) data through other channels.</p>
<p>The API allows up to 1000 calls per day for free. Using the <em>Time Machine</em> you can request data from a certain date and location. I focused on hourly data, though it’s probably finer resolution than I need.</p>
<p>“The <code>hourly</code> data block will contain data points starting at midnight (local time) of the day requested, and continuing until midnight (local time) of the following day.”</p>
<p>The docs include a sample API call, which includes your key, the location, and the date requested.</p>
<p><code>GET https://api.darksky.net/forecast/0123456789abcdef9876543210fedcba/42.3601,-71.0589,255657600?exclude=currently,flags</code></p>
<p>A quick visit to my best friend stack overflow provided <a href="https://stackoverflow.com/questions/46069322/r-api-call-for-json-data-and-converting-to-dataframe">a little more clarity</a> about how to use the API in R. The date is in UNIX format. I wanted to start at January 1, 2000, so I used a handy <a href="https://www.unixtimestamp.com/">UNIX converter</a> to find my desired date number, 946684800. I replaced the url and now I’m ready to call the API.</p>
<pre class="r"><code>#GET the url
req &lt;- httr::GET(&quot;https://api.darksky.net/forecast/{key}/30.012188,-94.024525,946684800?exclude=currently,flags&quot;)
req$status_code
# should be 200 if it worked. If you get 400, something has gone wrong.
# extract req$content
cont &lt;- req$content
#Convert to char
char &lt;- rawToChar(req$content)
#Convert to df
df &lt;- jsonlite::fromJSON(char)</code></pre>
<p>It worked! I removed my private API key, so you’ll have to take my word for it. But now I have a new problem - the call only works for one day at a time. I want a lot of days, so I decided to write a loop.</p>
<p>One thing I couldn’t get to work was changing the date inside the string for the API url. I posted in the R for Data Science Slack, and a few minutes later I learned a handy new trick - you can put a variable inside a string by just inserting it with quotes around the variable. Something like this:
<code>&quot;https://api.darksky.net/forecast/{key}/30.012188,-94.024525,&quot;,day,&quot;?exclude=currently,flags&quot;</code></p>
<p>Great! I ran the loop and it worked, kinda. It errored out after the first run because rbind could not combine two data frames with different numbers of columns. After looking at the next few days to see which columns were off I saw that it went from 15 to 16 to 17, then back down. Very annoying! They must have added some new info for some days, but this made the data inconsistent so I had to add a select function to the loop. I selected for the 15 columns that were consistent across all days and ran it again. Success!</p>
<p>Here’s the code I ended up with:</p>
<pre class="r"><code>library(dplyr)
#initialize all_hours so there&#39;s something to rbind to
df &lt;- jsonlite::fromJSON(paste0(&quot;https://api.darksky.net/forecast/{key}/30.012188,-94.024525,946684800?exclude=currently,flags&quot;))
all_hours &lt;- df$hourly$data
# cols holds the column names that appear every day; here I take the
# 15 consistent columns from the first day (adjust for your location)
cols &lt;- names(all_hours)
# make a vector of unix dates I want (minus the first one, which I already put in all_hours)
unix_day &lt;- seq(946771200, 1033084800, by=86400)
for (day in unix_day){
df &lt;- jsonlite::fromJSON(paste0(&quot;https://api.darksky.net/forecast/{key}/30.012188,-94.024525,&quot;,
day,
&quot;?exclude=currently,flags&quot;))
  hourly &lt;- select(df$hourly$data, all_of(cols))
all_hours &lt;- rbind(hourly, all_hours)}
#convert unix time to date
all_hours$time &lt;- as.POSIXct(all_hours$time, origin=&quot;1970-01-01&quot;)</code></pre>
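<p>Growing a data frame with rbind inside a loop works, but it copies all_hours on every iteration. A sketch of the same loop that collects each day in a list and binds once at the end (assuming key, cols, and unix_day are defined as above):</p>
<pre class="r"><code># same loop, but bind once at the end instead of rbind-ing each day
daily &lt;- lapply(unix_day, function(day) {
  df &lt;- jsonlite::fromJSON(paste0(
    &quot;https://api.darksky.net/forecast/{key}/30.012188,-94.024525,&quot;,
    day, &quot;?exclude=currently,flags&quot;))
  select(df$hourly$data, all_of(cols))
})
all_hours &lt;- dplyr::bind_rows(daily)</code></pre>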
<p>I selected the columns I want and saved them as weather.csv. Let’s zoom in to two rainy days in May 2000.</p>
<pre class="r"><code>library(ggplot2)
weather &lt;- read.csv(&quot;weather.csv&quot;)
#weather &lt;- weather[2810:2845,]
#convert unix time to date
weather$time &lt;- as.POSIXct(weather$time, origin=&quot;1970-01-01&quot;)
ggplot(weather, aes(x = time, y = precipIntensity, group =1)) +
geom_point(alpha = 0.4, color = &quot;blue&quot;) +
geom_line(alpha = 0.4, color = &quot;blue&quot;, size = 0.5) +
theme_gray() +
theme(axis.text.x=element_text(angle=90, hjust=1)) +
labs(title =&quot;Precipitation Over Time&quot;, x = &quot;Date&quot;, y = &quot;Precipitation in Inches&quot;)</code></pre>
<p><img src="/post/weather/index_files/figure-html/unnamed-chunk-3-1.png" width="672" /></p>
<p>You can clearly see the precipitation-generating storm events over a few months.</p>
<p>It will take me a few days of using my 1000 free API calls to cover the period I’m interested in, but overall it was really easy.</p>
<hr />
<div id="other-weather-options" class="section level2">
<h2>Other Weather Options</h2>
<p>Clearly getting historical weather data for a certain location can be really useful, but since Dark Sky is no longer creating accounts it’s not a very practical resource.</p>
<p>The helpful commenters on the R for Data Science Slack also suggested using <a href="https://www.ncdc.noaa.gov/data-access/land-based-station-data">NOAA</a> to get the data. This is probably a more robust dataset anyway, but the downside is that they don’t have the option to use lat/long as a location - you have to pick from one of their existing stations. In my case there wasn’t one near the field site I was interested in, but they’re spread out across the country so I bet it will be a great source for a lot of people. There’s a good tutorial on accessing this data at <a href="https://ropensci.org/tutorials/rnoaa_tutorial/">R Open Sci</a>.</p>
<p>I’ve also hear the <a href="https://www.wunderground.com/pws/overview">Weather Underground</a> has a good API, but it looks like you need to contribute weather data with an IoT device to access it. Cool! But may not be useful to some.</p>
<p>Are there any other sources of weather data that you know of? Do you have suggestions to improve the approach I used for the Dark Sky data?</p>
</div>
</description>
</item>
<item>
<title>Tidy Tuesday - African American History</title>
<link>/post/2020-06-26-ttafamhistory/</link>
<pubDate>Fri, 26 Jun 2020 21:13:14 -0500</pubDate>
<guid>/post/2020-06-26-ttafamhistory/</guid>
<description>
<p>If you’re not familiar with Tidy Tuesday, it is a weekly project hosted online by the R for data science community. Every Tuesday a new dataset is released and people are encouraged to explore, analyse, and visualize it in interesting ways. This is my first week exploring tidy tuesday data. Information about the project and datasets is at <a href="https://github.com/rfordatascience/tidytuesday">the tidytuesday github</a>. Before working with this data I watched Julia Silge’s excellent <a href="https://juliasilge.com/blog/captive-africans-voyages/">screencast</a> and picked up some great ways to find missing values and recode data.</p>
<div id="a-little-history" class="section level2">
<h2>A Little History</h2>
<p>I learned a lot just by looking at the data provided. I was not previously aware of the history captured in the african_names dataset - which lists the names of enslaved people that were freed as they were being illegally smuggled to the Americas. The most names were recorded at the port of Freetown in Sierra Leone before making the trans-atlantic journey. Here’s the description of the dataset excepted on the tidytuesday github page:</p>
<hr />
<p><em>During the last 60 years of the trans-Atlantic slave trade, courts around the Atlantic basins condemned over two thousand vessels for engaging in the traffic and recorded the details of captives found on board including their African names. The African Names Database was created from these records, now located in the Registers of Liberated Africans at the Sierra Leone National Archives, Freetown, as well as Series FO84, FO313, CO247 and CO267 held at the British National Archives in London. Links are provided to the ships in the Voyages Database from which the liberated Africans were rescued, as well as to the African Origins site where users can hear the names pronounced and help us identify the languages in which they think the names are used.</em></p>
<hr />
<table>
<caption><span id="tab:unnamed-chunk-2">Table 1: </span>Data summary</caption>
<tbody>
<tr class="odd">
<td align="left">Name</td>
<td align="left">african_names</td>
</tr>
<tr class="even">
<td align="left">Number of rows</td>
<td align="left">91490</td>
</tr>
<tr class="odd">
<td align="left">Number of columns</td>
<td align="left">11</td>
</tr>
<tr class="even">
<td align="left">_______________________</td>
<td align="left"></td>
</tr>
<tr class="odd">
<td align="left">Column type frequency:</td>
<td align="left"></td>
</tr>
<tr class="even">
<td align="left">character</td>
<td align="left">6</td>
</tr>
<tr class="odd">
<td align="left">numeric</td>
<td align="left">5</td>
</tr>
<tr class="even">
<td align="left">________________________</td>
<td align="left"></td>
</tr>
<tr class="odd">
<td align="left">Group variables</td>
<td align="left">None</td>
</tr>
</tbody>
</table>
<p><strong>Variable type: character</strong></p>
<table>
<thead>
<tr class="header">
<th align="left">skim_variable</th>
<th align="right">n_missing</th>
<th align="right">complete_rate</th>
<th align="right">min</th>
<th align="right">max</th>
<th align="right">empty</th>
<th align="right">n_unique</th>
<th align="right">whitespace</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">name</td>
<td align="right">0</td>
<td align="right">1.00</td>
<td align="right">2</td>
<td align="right">24</td>
<td align="right">0</td>
<td align="right">62330</td>
<td align="right">0</td>
</tr>
<tr class="even">
<td align="left">gender</td>
<td align="right">12878</td>
<td align="right">0.86</td>
<td align="right">3</td>
<td align="right">5</td>
<td align="right">0</td>
<td align="right">4</td>
<td align="right">0</td>
</tr>
<tr class="odd">
<td align="left">ship_name</td>
<td align="right">1</td>
<td align="right">1.00</td>
<td align="right">2</td>
<td align="right">59</td>
<td align="right">0</td>
<td align="right">443</td>
<td align="right">0</td>
</tr>
<tr class="even">
<td align="left">port_disembark</td>
<td align="right">0</td>
<td align="right">1.00</td>
<td align="right">6</td>
<td align="right">19</td>
<td align="right">0</td>
<td align="right">5</td>
<td align="right">0</td>
</tr>
<tr class="odd">
<td align="left">port_embark</td>
<td align="right">1126</td>
<td align="right">0.99</td>
<td align="right">4</td>
<td align="right">31</td>
<td align="right">0</td>
<td align="right">59</td>
<td align="right">0</td>
</tr>
<tr class="even">
<td align="left">country_origin</td>
<td align="right">79404</td>
<td align="right">0.13</td>
<td align="right">3</td>
<td align="right">31</td>
<td align="right">0</td>
<td align="right">563</td>
<td align="right">0</td>