@@ -294,12 +294,13 @@ str(subset)
294
294
295
295
OK, that's a lot to unpack! Some things to notice:
296
296
297
- - the object type ` data.frame ` is displayed in the first row along with its
297
+ - The object type `data.frame` is displayed in the first row along with its
298
298
dimensions, in this case 801 observations (rows) and 4 variables (columns)
299
- - Each variable (column) has a name (e.g. ` sample_id ` ). This is followed
300
- by the object mode (e.g. chr, int, etc.). Notice that before each
299
+ - Each variable (column) has a name (e.g. `sample_id`). Notice that before each
301
300
variable name there is a `$` - this will be important later.
302
-
301
+ - Each variable name is followed by the data type it contains (e.g. chr, int).
302
+ The `int` type denotes an integer, a type of numeric data that can only
303
+ store whole numbers (i.e. no decimal points).
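
To make the data types reported by `str()` concrete, here is a minimal sketch using a small made-up data frame. The object `toy` and its values are assumptions for illustration only and are not part of the lesson's variant data.

```r
# A toy data frame (hypothetical values, not the lesson's variant data)
toy <- data.frame(
  sample_id = c("s1", "s2", "s3"),    # character column
  depth     = c(4L, 17L, 9L),         # integer column: the L suffix marks whole numbers
  quality   = c(91.5, 228, 210.25),   # double (numeric) column: decimals are allowed
  stringsAsFactors = FALSE            # keep text as character in older R versions too
)

# str() shows one line per column, each starting with `$`, followed by the type
str(toy)

# sapply() applies class() to every column and collects the results
col_types <- sapply(toy, class)
col_types
```

Here `depth` should be reported as `int` and `quality` as `num`, because only the latter contains decimal values.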
303
304
304
305
305
306
::::::::::::::::::::::::::::::::::::::: challenge
@@ -379,11 +380,109 @@ head(alt_alleles)
379
380
```
380
381
381
382
There are 801 alleles (one for each row). To simplify, let's look at just the
382
- single-nucleotide alleles (SNPs). We can use some of the vector indexing skills
383
- from the last episode.
383
+ single-nucleotide alleles (SNPs).
384
+
385
+ Let's review some of the vector indexing skills from the last episode that can help:
386
+
387
+
388
+ ``` r
389
+ # This compares every allele with the single nucleotide "A" and returns a TRUE/FALSE vector
390
+ alt_alleles == "A"
391
+ ```
392
+
393
+ ``` output
394
+ [1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
395
+ [13] FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
396
+ [25] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
397
+ [37] TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE
398
+ [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
399
+ [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE
400
+ [73] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
401
+ [85] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
402
+ [97] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
403
+ [109] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
404
+ [121] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
405
+ [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
406
+ [145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
407
+ [157] FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
408
+ [169] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
409
+ [181] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
410
+ [193] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
411
+ [205] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
412
+ [217] FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
413
+ [229] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
414
+ [241] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
415
+ [253] FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE FALSE
416
+ [265] FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
417
+ [277] TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
418
+ [289] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE
419
+ [301] FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE
420
+ [313] FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE TRUE FALSE
421
+ [325] FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE
422
+ [337] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
423
+ [349] FALSE TRUE TRUE TRUE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
424
+ [361] FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE FALSE
425
+ [373] FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE
426
+ [385] FALSE FALSE FALSE TRUE FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE
427
+ [397] TRUE FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE FALSE FALSE TRUE
428
+ [409] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE
429
+ [421] FALSE TRUE FALSE TRUE TRUE FALSE TRUE FALSE TRUE FALSE FALSE FALSE
430
+ [433] TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE
431
+ [445] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE FALSE TRUE
432
+ [457] TRUE TRUE FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE
433
+ [469] TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE
434
+ [481] FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
435
+ [493] TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE
436
+ [505] TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE
437
+ [517] FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
438
+ [529] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE TRUE TRUE FALSE
439
+ [541] TRUE FALSE FALSE TRUE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
440
+ [553] FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE
441
+ [565] FALSE TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE FALSE FALSE TRUE
442
+ [577] FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
443
+ [589] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
444
+ [601] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
445
+ [613] FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE TRUE TRUE FALSE
446
+ [625] TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE
447
+ [637] FALSE FALSE FALSE TRUE TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE
448
+ [649] TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
449
+ [661] FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE TRUE FALSE FALSE
450
+ [673] TRUE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
451
+ [685] FALSE TRUE TRUE FALSE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE
452
+ [697] TRUE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
453
+ [709] FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
454
+ [721] FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
455
+ [733] TRUE FALSE TRUE FALSE FALSE TRUE TRUE FALSE TRUE FALSE TRUE FALSE
456
+ [745] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
457
+ [757] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
458
+ [769] FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
459
+ [781] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
460
+ [793] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
461
+ ```
384
462
463
+ ``` r
464
+ # Then, we use that TRUE/FALSE vector as an index to pull out all the elements that match.
465
+ alt_alleles[alt_alleles == "A"]
466
+ ```
467
+
468
+ ``` output
469
+ [1] "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A"
470
+ [19] "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A"
471
+ [37] "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A"
472
+ [55] "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A"
473
+ [73] "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A"
474
+ [91] "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A"
475
+ [109] "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A"
476
+ [127] "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A"
477
+ [145] "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A"
478
+ [163] "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A"
479
+ [181] "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A"
480
+ [199] "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A"
481
+ ```
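
The same indexing idea also works with several values at once. The sketch below is an illustration rather than the lesson's own approach: it uses the base R `%in%` operator on a short made-up vector (`toy_alleles` is a hypothetical stand-in for `alt_alleles`).

```r
# Hypothetical stand-in for alt_alleles (the real vector has 801 entries)
toy_alleles <- c("G", "T", "ACCCCCCCG", "A", "C", "TGGGGGGGA", "A")

# %in% returns TRUE wherever an element matches any value in the set
is_snp <- toy_alleles %in% c("A", "T", "G", "C")
is_snp

# Index with the logical vector to keep only the single-nucleotide alleles
toy_snps <- toy_alleles[is_snp]
toy_snps
```

Note that this keeps the matching alleles in their original order, whereas combining separate subsets with `c()` groups them by nucleotide.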
385
482
386
483
``` r
484
+ # If we repeat this for each nucleotide A, T, G, and C, and connect them using `c()`,
485
+ # we can index all the single nucleotide changes.
387
486
snps <- c(alt_alleles[alt_alleles == "A"],
388
487
alt_alleles[alt_alleles == "T"],
389
488
alt_alleles[alt_alleles == "G"],
@@ -418,7 +517,18 @@ Error in plot.window(...): need finite 'ylim' values
418
517
```
419
518
420
519
Whoops! Though the `plot()` function will do its best to give us a quick plot,
421
- it is unable to do so here. One way to fix this it to tell R to treat the SNPs
520
+ it is unable to do so here. Let's use `str()` to see why this might be:
521
+
522
+
523
+ ``` r
524
+ str(snps)
525
+ ```
526
+
527
+ ``` output
528
+ chr [1:707] "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" ...
529
+ ```
530
+
531
+ R may not know how to plot a character vector! One way to fix this is to tell R to treat the SNPs
422
532
as categories (i.e. a factor vector); we will create a new object to avoid
423
533
confusion using the `factor()` function:
424
534
@@ -463,7 +573,17 @@ summary(factor_snps)
463
573
211 139 154 203
464
574
```
465
575
466
- As you can imagine, this is already useful when you want to generate a tally.
576
+ ``` r
577
+ # Compare the character vector
578
+ summary(snps)
579
+ ```
580
+
581
+ ``` output
582
+ Length Class Mode
583
+ 707 character character
584
+ ```
585
+
586
+ As you can imagine, factors are already useful when you want to generate a tally.
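
As a small illustration of why factors help with tallies, the sketch below builds a factor from a made-up character vector (`toy_snps` is hypothetical, not the lesson's `snps` object) and counts each category with base R's `table()`.

```r
# Hypothetical SNP calls for illustration only
toy_snps <- c("A", "T", "A", "G", "C", "A", "T")

# Convert the character vector to a factor; levels are sorted alphabetically by default
toy_factor <- factor(toy_snps)
levels(toy_factor)

# table() counts how many times each level occurs
counts <- table(toy_factor)
counts
```

Calling `summary(toy_factor)` should give the same per-level counts, while `summary(toy_snps)` on the character vector only reports its length, class, and mode.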
467
587
468
588
::::::::::::::::::::::::::::::::::::::::: callout
469
589
@@ -489,7 +609,7 @@ possible SNP we could generate a plot:
489
609
plot(factor_snps)
490
610
```
491
611
492
- <img src="fig/03-basics-factors-dataframes-rendered-unnamed-chunk-16-1.png" style="display: block; margin: auto;" />
612
+ <img src="fig/03-basics-factors-dataframes-rendered-unnamed-chunk-17-1.png" style="display: block; margin: auto;" />
493
613
494
614
This isn't a particularly pretty example of a plot but it works. We'll be
495
615
learning much more about creating nice, publication-quality graphics later in
@@ -526,7 +646,7 @@ Now we see our plot has be reordered:
526
646
plot(ordered_factor_snps)
527
647
```
528
648
529
- <img src="fig/03-basics-factors-dataframes-rendered-unnamed-chunk-18-1.png" style="display: block; margin: auto;" />
649
+ <img src="fig/03-basics-factors-dataframes-rendered-unnamed-chunk-19-1.png" style="display: block; margin: auto;" />
530
650
531
651
Factors come in handy in many places when using R. Even using more
532
652
sophisticated plotting packages such as ggplot2 will sometimes require you
@@ -555,7 +675,7 @@ These packages will be installed into "~/work/genomics-r-intro/genomics-r-intro/
555
675
556
676
# Installing packages --------------------------------------------------------
557
677
- Installing ggplot2 ... OK [linked from cache]
558
- Successfully installed 1 package in 7.2 milliseconds.
678
+ Successfully installed 1 package in 9.9 milliseconds.
559
679
```
560
680
561
681
``` r
@@ -569,7 +689,7 @@ These packages will be installed into "~/work/genomics-r-intro/genomics-r-intro/
569
689
570
690
# Installing packages --------------------------------------------------------
571
691
- Installing dplyr ... OK [linked from cache]
572
- Successfully installed 1 package in 6.4 milliseconds.
692
+ Successfully installed 1 package in 5.5 milliseconds.
573
693
```
574
694
575
695
These two packages are among the most popular add-on packages used in R, and they are part of a large set of very useful packages called the [tidyverse](https://www.tidyverse.org). Packages in the tidyverse are designed to work well together and are made to work with tidy data (which we described earlier in this lesson).