Mauricio Matoma
library(dslabs)Para conocer el directorio de trabajo, se utiliza la función getwd() (funciona de manera similar a pwd de Unix).
getwd()[1] "D:/Documentos/R/HarvardX_PH125/PH125.6x/HarvardX_PH125.6x"
Se cambia la dirección de trabajo con setwd(). Para este caso, se utilizarán datos dummy del paquete dslabs
path <- system.file("extdata", package = "dslabs")
list.files(path) [1] "2010_bigfive_regents.xls"
[2] "calificaciones.csv"
[3] "carbon_emissions.csv"
[4] "fertility-two-countries-example.csv"
[5] "HRlist2.txt"
[6] "life-expectancy-and-fertility-two-countries-example.csv"
[7] "murders.csv"
[8] "olive.csv"
[9] "RD-Mortality-Report_2015-18-180531.pdf"
[10] "ssa-death-probability.csv"
Extraer la ruta completa de un archivo a través de file.path()
filename <- "murders.csv"
fullpath <- file.path(path,filename)
fullpath[1] "C:/Users/matom/AppData/Local/R/win-library/4.3/dslabs/extdata/murders.csv"
Copiar el archivo dentro del proyecto
file.copy(fullpath, getwd())[1] FALSE
file.exists(filename)[1] TRUE
La función read_lines() permite revisar los encabezados de un archivo, por ejemplo:
library(readr)
read_lines("murders.csv", n_max=3)[1] "state,abb,region,population,total" "Alabama,AL,South,4779736,135"
[3] "Alaska,AK,West,710231,19"
Leer el archivo:
dat <- read.csv("murders.csv")
head(dat) state abb region population total
1 Alabama AL South 4779736 135
2 Alaska AK West 710231 19
3 Arizona AZ West 6392017 232
4 Arkansas AR South 2915918 93
5 California CA West 37253956 1257
6 Colorado CO West 5029196 65
Leer datos con función de base R
dat2 <- read.csv(filename)
head(dat2) state abb region population total
1 Alabama AL South 4779736 135
2 Alaska AK West 710231 19
3 Arizona AZ West 6392017 232
4 Arkansas AR South 2915918 93
5 California CA West 37253956 1257
6 Colorado CO West 5029196 65
class(dat2)[1] "data.frame"
La función read_csv() importa directamente los archivos a partir de una url
url <- "https://raw.githubusercontent.com/rafalab/dslabs/master/inst/extdata/murders.csv"
dat <- read.csv(url)Para tener un archivo local, se descarga el archivo
download.file(url, "murders.csv")url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data"
data <- read_csv(url, col_names = FALSE)
head(data)# A tibble: 6 × 32
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 8.42e5 M 18.0 10.4 123. 1001 0.118 0.278 0.300 0.147 0.242 0.0787
2 8.43e5 M 20.6 17.8 133. 1326 0.0847 0.0786 0.0869 0.0702 0.181 0.0567
3 8.43e7 M 19.7 21.2 130 1203 0.110 0.160 0.197 0.128 0.207 0.0600
4 8.43e7 M 11.4 20.4 77.6 386. 0.142 0.284 0.241 0.105 0.260 0.0974
5 8.44e7 M 20.3 14.3 135. 1297 0.100 0.133 0.198 0.104 0.181 0.0588
6 8.44e5 M 12.4 15.7 82.6 477. 0.128 0.17 0.158 0.0809 0.209 0.0761
# ℹ 20 more variables: X13 <dbl>, X14 <dbl>, X15 <dbl>, X16 <dbl>, X17 <dbl>,
# X18 <dbl>, X19 <dbl>, X20 <dbl>, X21 <dbl>, X22 <dbl>, X23 <dbl>,
# X24 <dbl>, X25 <dbl>, X26 <dbl>, X27 <dbl>, X28 <dbl>, X29 <dbl>,
# X30 <dbl>, X31 <dbl>, X32 <dbl>
nrow(data)[1] 569
ncol(data)[1] 32
library(dslabs)
library(tidyverse)
path <- system.file("extdata", package = "dslabs")
filename <- file.path(path,"fertility-two-countries-example.csv")
wide_data <- read_csv(filename)
#View(wide_data)select(wide_data, country, "1960":"1967")# A tibble: 2 × 9
country `1960` `1961` `1962` `1963` `1964` `1965` `1966` `1967`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Germany 2.41 2.44 2.47 2.49 2.49 2.48 2.44 2.37
2 South Korea 6.16 5.99 5.79 5.57 5.36 5.16 4.99 4.85
pivot longer utiliza dos argumentos. Primero, el data frame que será modificado y el segundo argumento contiene las columnas de los valores a mover.
Por ejemplo, se modificará de esta forma:
select(wide_data, country, "1960":"1967")# A tibble: 2 × 9
country `1960` `1961` `1962` `1963` `1964` `1965` `1966` `1967`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Germany 2.41 2.44 2.47 2.49 2.49 2.48 2.44 2.37
2 South Korea 6.16 5.99 5.79 5.57 5.36 5.16 4.99 4.85
A esta forma (nombre de las columnas a niveles de una variable en una sola columna):
wide_data %>% pivot_longer("1960":"2015")# A tibble: 112 × 3
country name value
<chr> <chr> <dbl>
1 Germany 1960 2.41
2 Germany 1961 2.44
3 Germany 1962 2.47
4 Germany 1963 2.49
5 Germany 1964 2.49
6 Germany 1965 2.48
7 Germany 1966 2.44
8 Germany 1967 2.37
9 Germany 1968 2.28
10 Germany 1969 2.17
# ℹ 102 more rows
Nótese que la información se modifica a nombres (years) y values (fertility). Los podemos modificar de la siguiente forma:
new_tidy_data <- wide_data %>%
pivot_longer("1960":"2015", names_to = "year", values_to = "fertility")
head(new_tidy_data)# A tibble: 6 × 3
country year fertility
<chr> <chr> <dbl>
1 Germany 1960 2.41
2 Germany 1961 2.44
3 Germany 1962 2.47
4 Germany 1963 2.49
5 Germany 1964 2.49
6 Germany 1965 2.48
Transformar la variable que aparece como caracter de la forma
new_tidy_data <- wide_data %>%
pivot_longer("1960":"2015", names_to = "year", values_to = "fertility") %>% mutate(year=as.numeric(year))
head(new_tidy_data)# A tibble: 6 × 3
country year fertility
<chr> <dbl> <dbl>
1 Germany 1960 2.41
2 Germany 1961 2.44
3 Germany 1962 2.47
4 Germany 1963 2.49
5 Germany 1964 2.49
6 Germany 1965 2.48
O de la forma:
new_tidy_data <- wide_data %>%
pivot_longer("1960":"2015", names_to = "year", values_to = "fertility",
names_transform = list(year=as.numeric))
head(new_tidy_data)# A tibble: 6 × 3
country year fertility
<chr> <dbl> <dbl>
1 Germany 1960 2.41
2 Germany 1961 2.44
3 Germany 1962 2.47
4 Germany 1963 2.49
5 Germany 1964 2.49
6 Germany 1965 2.48
Finalmente, plotear:
new_tidy_data %>% ggplot(aes(year,fertility, color=country)) +
geom_point() + geom_line() + theme_bw()ggsave("images/plot.png")Función inversa de pivot_longer
names_from : Contiene los niveles de variable o filas que serán transformadas en columnas
values_from: Contiene los valores deseados
new_wide_data <- new_tidy_data %>%
pivot_wider(names_from = year, values_from = fertility)
select(new_wide_data, country, "1960":"1967")# A tibble: 2 × 9
country `1960` `1961` `1962` `1963` `1964` `1965` `1966` `1967`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Germany 2.41 2.44 2.47 2.49 2.49 2.48 2.44 2.37
2 South Korea 6.16 5.99 5.79 5.57 5.36 5.16 4.99 4.85
path <- system.file("extdata", package="dslabs")
fname <- "life-expectancy-and-fertility-two-countries-example.csv"
filename <- file.path(path,fname)
raw_dat <- read_csv(filename)
#View(raw_dat)
select(raw_dat, 1:4)# A tibble: 2 × 4
country `1960_fertility` `1960_life_expectancy` `1961_fertility`
<chr> <dbl> <dbl> <dbl>
1 Germany 2.41 69.3 2.44
2 South Korea 6.16 53.0 5.99
dat <- raw_dat %>% pivot_longer(-country)
head(dat)# A tibble: 6 × 3
country name value
<chr> <chr> <dbl>
1 Germany 1960_fertility 2.41
2 Germany 1960_life_expectancy 69.3
3 Germany 1961_fertility 2.44
4 Germany 1961_life_expectancy 69.8
5 Germany 1962_fertility 2.47
6 Germany 1962_life_expectancy 70.0
La función separate() toma tres argumentos. El primero es el nombre de la columna que será separada. El segundo el nombre para las nuevas columnas y el tercero será el caracter que separa las variables
dat %>% separate(name, c("year", "name"), sep = "_")Warning: Expected 2 pieces. Additional pieces discarded in 112 rows [2, 4, 6, 8, 10, 12,
14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, ...].
# A tibble: 224 × 4
country year name value
<chr> <chr> <chr> <dbl>
1 Germany 1960 fertility 2.41
2 Germany 1960 life 69.3
3 Germany 1961 fertility 2.44
4 Germany 1961 life 69.8
5 Germany 1962 fertility 2.47
6 Germany 1962 life 70.0
7 Germany 1963 fertility 2.49
8 Germany 1963 life 70.1
9 Germany 1964 fertility 2.49
10 Germany 1964 life 70.7
# ℹ 214 more rows
Con el argumento convert = TRUE, se transforma automáticamente el año en valor numérico
dat %>% separate(name, c("year", "name"), convert = TRUE)Warning: Expected 2 pieces. Additional pieces discarded in 112 rows [2, 4, 6, 8, 10, 12,
14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, ...].
# A tibble: 224 × 4
country year name value
<chr> <int> <chr> <dbl>
1 Germany 1960 fertility 2.41
2 Germany 1960 life 69.3
3 Germany 1961 fertility 2.44
4 Germany 1961 life 69.8
5 Germany 1962 fertility 2.47
6 Germany 1962 life 70.0
7 Germany 1963 fertility 2.49
8 Germany 1963 life 70.1
9 Germany 1964 fertility 2.49
10 Germany 1964 life 70.7
# ℹ 214 more rows
Se puede agregar otra columna para el “expectancy. Se llenarán algunos valores nulos automáticamente (espacios fertility)
dat %>% separate(name, c("year", "name1", "name2"),
fill = "right" ,
convert = TRUE)# A tibble: 224 × 5
country year name1 name2 value
<chr> <int> <chr> <chr> <dbl>
1 Germany 1960 fertility <NA> 2.41
2 Germany 1960 life expectancy 69.3
3 Germany 1961 fertility <NA> 2.44
4 Germany 1961 life expectancy 69.8
5 Germany 1962 fertility <NA> 2.47
6 Germany 1962 life expectancy 70.0
7 Germany 1963 fertility <NA> 2.49
8 Germany 1963 life expectancy 70.1
9 Germany 1964 fertility <NA> 2.49
10 Germany 1964 life expectancy 70.7
# ℹ 214 more rows
Sin embargo, es un mejor enfoque unir los nombres extra que se generen a partir del argumento extra= “merge”
dat %>% separate(name, c("year", "name"), sep="_",
extra = "merge", convert = TRUE)# A tibble: 224 × 4
country year name value
<chr> <int> <chr> <dbl>
1 Germany 1960 fertility 2.41
2 Germany 1960 life_expectancy 69.3
3 Germany 1961 fertility 2.44
4 Germany 1961 life_expectancy 69.8
5 Germany 1962 fertility 2.47
6 Germany 1962 life_expectancy 70.0
7 Germany 1963 fertility 2.49
8 Germany 1963 life_expectancy 70.1
9 Germany 1964 fertility 2.49
10 Germany 1964 life_expectancy 70.7
# ℹ 214 more rows
Finalmente, se separan los datos de fertility y life_expectancy en columnas
dat %>% separate(name, c("year", "name"), sep = "_",
extra="merge", convert = TRUE) %>%
pivot_wider()# A tibble: 112 × 4
country year fertility life_expectancy
<chr> <int> <dbl> <dbl>
1 Germany 1960 2.41 69.3
2 Germany 1961 2.44 69.8
3 Germany 1962 2.47 70.0
4 Germany 1963 2.49 70.1
5 Germany 1964 2.49 70.7
6 Germany 1965 2.48 70.6
7 Germany 1966 2.44 70.8
8 Germany 1967 2.37 71.0
9 Germany 1968 2.28 70.6
10 Germany 1969 2.17 70.5
# ℹ 102 more rows
Función inversa de separate()
dat %>%
separate(name, c("year", "name1", "name2"),
fill="right", convert = TRUE) %>%
unite(variable_name, name1, name2)# A tibble: 224 × 4
country year variable_name value
<chr> <int> <chr> <dbl>
1 Germany 1960 fertility_NA 2.41
2 Germany 1960 life_expectancy 69.3
3 Germany 1961 fertility_NA 2.44
4 Germany 1961 life_expectancy 69.8
5 Germany 1962 fertility_NA 2.47
6 Germany 1962 life_expectancy 70.0
7 Germany 1963 fertility_NA 2.49
8 Germany 1963 life_expectancy 70.1
9 Germany 1964 fertility_NA 2.49
10 Germany 1964 life_expectancy 70.7
# ℹ 214 more rows
Y finalmente, dispersar las columnas
dat %>%
separate(name, c("year", "name1", "name2"),
fill="right", convert = TRUE) %>% #Separar
unite(name, name1, name2) %>% #Unir como un solo nombre
spread(name, value) %>% #Spread() sirve como pivot_wide()
rename(fertility = fertility_NA)# A tibble: 112 × 4
country year fertility life_expectancy
<chr> <int> <dbl> <dbl>
1 Germany 1960 2.41 69.3
2 Germany 1961 2.44 69.8
3 Germany 1962 2.47 70.0
4 Germany 1963 2.49 70.1
5 Germany 1964 2.49 70.7
6 Germany 1965 2.48 70.6
7 Germany 1966 2.44 70.8
8 Germany 1967 2.37 71.0
9 Germany 1968 2.28 70.6
10 Germany 1969 2.17 70.5
# ℹ 102 more rows
Guardar como RDS
data_final <- dat %>%
separate(name, c("year", "name1", "name2"),
fill="right", convert = TRUE) %>% #Separar
unite(name, name1, name2) %>% #Unir como un solo nombre
spread(name, value) %>% #Spread() sirve como pivot_wide()
rename(fertility = fertility_NA)
saveRDS(data_final, "data/data_final.rds")library(tidyverse)
library(dslabs)co2 Jan Feb Mar Apr May Jun Jul Aug Sep Oct
1959 315.42 316.31 316.50 317.56 318.13 318.00 316.39 314.65 313.68 313.18
1960 316.27 316.81 317.42 318.87 319.87 319.43 318.01 315.74 314.00 313.68
1961 316.73 317.54 318.38 319.31 320.42 319.61 318.42 316.63 314.83 315.16
1962 317.78 318.40 319.53 320.42 320.85 320.45 319.45 317.25 316.11 315.27
1963 318.58 318.92 319.70 321.22 322.08 321.31 319.58 317.61 316.05 315.83
1964 319.41 320.07 320.74 321.40 322.06 321.73 320.27 318.54 316.54 316.71
1965 319.27 320.28 320.73 321.97 322.00 321.71 321.05 318.71 317.66 317.14
1966 320.46 321.43 322.23 323.54 323.91 323.59 322.24 320.20 318.48 317.94
1967 322.17 322.34 322.88 324.25 324.83 323.93 322.38 320.76 319.10 319.24
1968 322.40 322.99 323.73 324.86 325.40 325.20 323.98 321.95 320.18 320.09
1969 323.83 324.26 325.47 326.50 327.21 326.54 325.72 323.50 322.22 321.62
1970 324.89 325.82 326.77 327.97 327.91 327.50 326.18 324.53 322.93 322.90
1971 326.01 326.51 327.01 327.62 328.76 328.40 327.20 325.27 323.20 323.40
1972 326.60 327.47 327.58 329.56 329.90 328.92 327.88 326.16 324.68 325.04
1973 328.37 329.40 330.14 331.33 332.31 331.90 330.70 329.15 327.35 327.02
1974 329.18 330.55 331.32 332.48 332.92 332.08 331.01 329.23 327.27 327.21
1975 330.23 331.25 331.87 333.14 333.80 333.43 331.73 329.90 328.40 328.17
1976 331.58 332.39 333.33 334.41 334.71 334.17 332.89 330.77 329.14 328.78
1977 332.75 333.24 334.53 335.90 336.57 336.10 334.76 332.59 331.42 330.98
1978 334.80 335.22 336.47 337.59 337.84 337.72 336.37 334.51 332.60 332.38
1979 336.05 336.59 337.79 338.71 339.30 339.12 337.56 335.92 333.75 333.70
1980 337.84 338.19 339.91 340.60 341.29 341.00 339.39 337.43 335.72 335.84
1981 339.06 340.30 341.21 342.33 342.74 342.08 340.32 338.26 336.52 336.68
1982 340.57 341.44 342.53 343.39 343.96 343.18 341.88 339.65 337.81 337.69
1983 341.20 342.35 342.93 344.77 345.58 345.14 343.81 342.21 339.69 339.82
1984 343.52 344.33 345.11 346.88 347.25 346.62 345.22 343.11 340.90 341.18
1985 344.79 345.82 347.25 348.17 348.74 348.07 346.38 344.51 342.92 342.62
1986 346.11 346.78 347.68 349.37 350.03 349.37 347.76 345.73 344.68 343.99
1987 347.84 348.29 349.23 350.80 351.66 351.07 349.33 347.92 346.27 346.18
1988 350.25 351.54 352.05 353.41 354.04 353.62 352.22 350.27 348.55 348.72
1989 352.60 352.92 353.53 355.26 355.52 354.97 353.75 351.52 349.64 349.83
1990 353.50 354.55 355.23 356.04 357.00 356.07 354.67 352.76 350.82 351.04
1991 354.59 355.63 357.03 358.48 359.22 358.12 356.06 353.92 352.05 352.11
1992 355.88 356.63 357.72 359.07 359.58 359.17 356.94 354.92 352.94 353.23
1993 356.63 357.10 358.32 359.41 360.23 359.55 357.53 355.48 353.67 353.95
1994 358.34 358.89 359.95 361.25 361.67 360.94 359.55 357.49 355.84 356.00
1995 359.98 361.03 361.66 363.48 363.82 363.30 361.94 359.50 358.11 357.80
1996 362.09 363.29 364.06 364.76 365.45 365.01 363.70 361.54 359.51 359.65
1997 363.23 364.06 364.61 366.40 366.84 365.68 364.52 362.57 360.24 360.83
Nov Dec
1959 314.66 315.43
1960 314.84 316.03
1961 315.94 316.85
1962 316.53 317.53
1963 316.91 318.20
1964 317.53 318.55
1965 318.70 319.25
1966 319.63 320.87
1967 320.56 321.80
1968 321.16 322.74
1969 322.69 323.95
1970 323.85 324.96
1971 324.63 325.85
1972 326.34 327.39
1973 327.99 328.48
1974 328.29 329.41
1975 329.32 330.59
1976 330.14 331.52
1977 332.24 333.68
1978 333.75 334.78
1979 335.12 336.56
1980 336.93 338.04
1981 338.19 339.44
1982 339.09 340.32
1983 340.98 342.82
1984 342.80 344.04
1985 344.06 345.38
1986 345.48 346.72
1987 347.64 348.78
1988 349.91 351.18
1989 351.14 352.37
1990 352.69 354.07
1991 353.64 354.89
1992 354.09 355.33
1993 355.30 356.78
1994 357.59 359.05
1995 359.61 360.74
1996 360.80 362.38
1997 362.49 364.34
#View(co2)
head(co2)[1] 315.42 316.31 316.50 317.56 318.13 318.00
co2_wide <- data.frame(matrix(co2, ncol = 12, byrow = TRUE)) %>%
setNames(1:12) %>%
mutate(year = as.character(1959:1997))
#View(co2_wide)co2_tidy <- pivot_longer(co2_wide, -year, names_to = "month", values_to = "co2")
#View(co2_tidy)co2_tidy %>% ggplot(aes(as.numeric(month), co2, color = year)) + geom_line() + theme_bw() + xlab("Month")ggsave("images/graf2.png")library(dslabs)
data("admissions")
dat <- admissions %>% select(-applicants)
#View(dat)dat_tidy <- pivot_wider(dat, names_from = gender, values_from = admitted)
dat_tidy# A tibble: 6 × 3
major men women
<chr> <dbl> <dbl>
1 A 62 82
2 B 63 68
3 C 37 34
4 D 33 35
5 E 28 24
6 F 6 7
tmp <- admissions %>%
pivot_longer(cols = c(admitted, applicants), names_to = "key", values_to = "value")
tmp# A tibble: 24 × 4
major gender key value
<chr> <chr> <chr> <dbl>
1 A men admitted 62
2 A men applicants 825
3 B men admitted 63
4 B men applicants 560
5 C men admitted 37
6 C men applicants 325
7 D men admitted 33
8 D men applicants 417
9 E men admitted 28
10 E men applicants 191
# ℹ 14 more rows
tmp2 <- unite(tmp, column_name, c(key, gender))
tmp2# A tibble: 24 × 3
major column_name value
<chr> <chr> <dbl>
1 A admitted_men 62
2 A applicants_men 825
3 B admitted_men 63
4 B applicants_men 560
5 C admitted_men 37
6 C applicants_men 325
7 D admitted_men 33
8 D applicants_men 417
9 E admitted_men 28
10 E applicants_men 191
# ℹ 14 more rows
tmp2 <- unite(tmp, column_name, c(key, gender))
tmp2# A tibble: 24 × 3
major column_name value
<chr> <chr> <dbl>
1 A admitted_men 62
2 A applicants_men 825
3 B admitted_men 63
4 B applicants_men 560
5 C admitted_men 37
6 C applicants_men 325
7 D admitted_men 33
8 D applicants_men 417
9 E admitted_men 28
10 E applicants_men 191
# ℹ 14 more rows
library(dslabs)
library(tidyverse)
library(ggrepel)
data(murders)
#View(murders)
data(results_us_election_2016)
#View(results_us_election_2016)
#Ordenar datos por orden alfabético de state a través de arrange
results_us_election_2016 <- results_us_election_2016 %>% arrange(state)Por medio de la función left_join, es posible unir tablas por una variable en común. En este caso, la variable state:
tab <- left_join(murders, results_us_election_2016, by="state")
head(tab) state abb region population total electoral_votes clinton trump others
1 Alabama AL South 4779736 135 9 34.4 62.1 3.6
2 Alaska AK West 710231 19 3 36.6 51.3 12.2
3 Arizona AZ West 6392017 232 11 45.1 48.7 6.2
4 Arkansas AR South 2915918 93 6 33.7 60.6 5.8
5 California CA West 37253956 1257 55 61.7 31.6 6.7
6 Colorado CO West 5029196 65 9 48.2 43.3 8.6
#View(tab)Graficar
tab %>% ggplot(aes(population/10^6, electoral_votes, label=abb)) +
geom_point()+
geom_text_repel()+
scale_x_continuous(trans = "log2") +
scale_y_continuous(trans= "log2") +
geom_smooth(method = "lm", se=FALSE) +
xlab("Population (10^6)")+
ylab("Electoral votes (log)")+
theme_bw()ggsave("images/plot2.png")Otro ejemplo
tab1 <- slice(murders, 1:6) %>% select(state,population)
tab1 state population
1 Alabama 4779736
2 Alaska 710231
3 Arizona 6392017
4 Arkansas 2915918
5 California 37253956
6 Colorado 5029196
tab2 <- slice(results_us_election_2016, c(1:3, 5, 7:8)) %>%
select(state,electoral_votes)
tab2 state electoral_votes
1 Alabama 9
2 Alaska 3
3 Arizona 11
4 California 55
5 Connecticut 7
6 Delaware 3
left_join(tab1, tab2) state population electoral_votes
1 Alabama 4779736 9
2 Alaska 710231 3
3 Arizona 6392017 11
4 Arkansas 2915918 NA
5 California 37253956 55
6 Colorado 5029196 NA
Otras formas de unir:
tab1 %>% left_join(tab2) state population electoral_votes
1 Alabama 4779736 9
2 Alaska 710231 3
3 Arizona 6392017 11
4 Arkansas 2915918 NA
5 California 37253956 55
6 Colorado 5029196 NA
tab1 %>% right_join(tab2) state population electoral_votes
1 Alabama 4779736 9
2 Alaska 710231 3
3 Arizona 6392017 11
4 California 37253956 55
5 Connecticut NA 7
6 Delaware NA 3
Si quiero conservar sólo las filas que tienen información en ambas tablas, utilizo inner_join()
inner_join(tab1, tab2) state population electoral_votes
1 Alabama 4779736 9
2 Alaska 710231 3
3 Arizona 6392017 11
4 California 37253956 55
Si quiero conservar todas las filas, aunque tengan valores nulos, utilizo full_join()
full_join(tab1,tab2) state population electoral_votes
1 Alabama 4779736 9
2 Alaska 710231 3
3 Arizona 6392017 11
4 Arkansas 2915918 NA
5 California 37253956 55
6 Colorado 5029196 NA
7 Connecticut NA 7
8 Delaware NA 3
La función semi_join() conserva la información de la tabla 1 que también está en la tabla 2
semi_join(tab1, tab2) state population
1 Alabama 4779736
2 Alaska 710231
3 Arizona 6392017
4 California 37253956
La función anti_join() conserva la información de la tabla 1 que no está en la tabla 2. Es una función inversa de semi_join()
anti_join(tab1,tab2) state population
1 Arkansas 2915918
2 Colorado 5029196
La función bind_cols() combina tablas por columnas sin importar las variables en común
bind_cols(a=1:3,b=4:6)# A tibble: 3 × 2
a b
<int> <int>
1 1 4
2 2 5
3 3 6
También sirve para concatenar data frames
tab1 <- tab[,1:3]
tab2 <- tab[,4:6]
tab3 <- tab[,7:9]
new_tab <- bind_cols(tab1,tab2,tab3)
head(new_tab) state abb region population total electoral_votes clinton trump others
1 Alabama AL South 4779736 135 9 34.4 62.1 3.6
2 Alaska AK West 710231 19 3 36.6 51.3 12.2
3 Arizona AZ West 6392017 232 11 45.1 48.7 6.2
4 Arkansas AR South 2915918 93 6 33.7 60.6 5.8
5 California CA West 37253956 1257 55 61.7 31.6 6.7
6 Colorado CO West 5029196 65 9 48.2 43.3 8.6
bind_rows() funciona de manera similar, pero concatenando filas
tab1 <- tab[1:2,]
tab2 <- tab[3:4,]
bind_rows(tab1, tab2) state abb region population total electoral_votes clinton trump others
1 Alabama AL South 4779736 135 9 34.4 62.1 3.6
2 Alaska AK West 710231 19 3 36.6 51.3 12.2
3 Arizona AZ West 6392017 232 11 45.1 48.7 6.2
4 Arkansas AR South 2915918 93 6 33.7 60.6 5.8
Intersect() aplica para vectores y para data frames
intersect(1:10, 6:15)[1] 6 7 8 9 10
head(tab,7) state abb region population total electoral_votes clinton trump
1 Alabama AL South 4779736 135 9 34.4 62.1
2 Alaska AK West 710231 19 3 36.6 51.3
3 Arizona AZ West 6392017 232 11 45.1 48.7
4 Arkansas AR South 2915918 93 6 33.7 60.6
5 California CA West 37253956 1257 55 61.7 31.6
6 Colorado CO West 5029196 65 9 48.2 43.3
7 Connecticut CT Northeast 3574097 97 7 54.6 40.9
others
1 3.6
2 12.2
3 6.2
4 5.8
5 6.7
6 8.6
7 4.5
tab1 <- tab[1:5,]
tab1 state abb region population total electoral_votes clinton trump others
1 Alabama AL South 4779736 135 9 34.4 62.1 3.6
2 Alaska AK West 710231 19 3 36.6 51.3 12.2
3 Arizona AZ West 6392017 232 11 45.1 48.7 6.2
4 Arkansas AR South 2915918 93 6 33.7 60.6 5.8
5 California CA West 37253956 1257 55 61.7 31.6 6.7
tab2 <- tab[3:7,]
tab2 state abb region population total electoral_votes clinton trump
3 Arizona AZ West 6392017 232 11 45.1 48.7
4 Arkansas AR South 2915918 93 6 33.7 60.6
5 California CA West 37253956 1257 55 61.7 31.6
6 Colorado CO West 5029196 65 9 48.2 43.3
7 Connecticut CT Northeast 3574097 97 7 54.6 40.9
others
3 6.2
4 5.8
5 6.7
6 8.6
7 4.5
intersect(tab1, tab2) state abb region population total electoral_votes clinton trump others
1 Arizona AZ West 6392017 232 11 45.1 48.7 6.2
2 Arkansas AR South 2915918 93 6 33.7 60.6 5.8
3 California CA West 37253956 1257 55 61.7 31.6 6.7
Unión funciona similar para vectores y data frames
union(c("a","b","c"), c("b","c","d"))[1] "a" "b" "c" "d"
head(tab,7) state abb region population total electoral_votes clinton trump
1 Alabama AL South 4779736 135 9 34.4 62.1
2 Alaska AK West 710231 19 3 36.6 51.3
3 Arizona AZ West 6392017 232 11 45.1 48.7
4 Arkansas AR South 2915918 93 6 33.7 60.6
5 California CA West 37253956 1257 55 61.7 31.6
6 Colorado CO West 5029196 65 9 48.2 43.3
7 Connecticut CT Northeast 3574097 97 7 54.6 40.9
others
1 3.6
2 12.2
3 6.2
4 5.8
5 6.7
6 8.6
7 4.5
tab1 <- tab[1:5,]
tab2 <- tab[3:7,]
union(tab1,tab2) state abb region population total electoral_votes clinton trump
1 Alabama AL South 4779736 135 9 34.4 62.1
2 Alaska AK West 710231 19 3 36.6 51.3
3 Arizona AZ West 6392017 232 11 45.1 48.7
4 Arkansas AR South 2915918 93 6 33.7 60.6
5 California CA West 37253956 1257 55 61.7 31.6
6 Colorado CO West 5029196 65 9 48.2 43.3
7 Connecticut CT Northeast 3574097 97 7 54.6 40.9
others
1 3.6
2 12.2
3 6.2
4 5.8
5 6.7
6 8.6
7 4.5
La función setdiff() permite encontrar todas las filas de x que no están en y. Por lo tanto, el orden de los argumentos importa.
setdiff(1:10,6:15)[1] 1 2 3 4 5
setdiff(6:15,1:10) [1] 11 12 13 14 15
head(tab,7) state abb region population total electoral_votes clinton trump
1 Alabama AL South 4779736 135 9 34.4 62.1
2 Alaska AK West 710231 19 3 36.6 51.3
3 Arizona AZ West 6392017 232 11 45.1 48.7
4 Arkansas AR South 2915918 93 6 33.7 60.6
5 California CA West 37253956 1257 55 61.7 31.6
6 Colorado CO West 5029196 65 9 48.2 43.3
7 Connecticut CT Northeast 3574097 97 7 54.6 40.9
others
1 3.6
2 12.2
3 6.2
4 5.8
5 6.7
6 8.6
7 4.5
tab1 <- tab[1:5,]
tab2 <- tab[3:7,]
setdiff(tab1,tab2) state abb region population total electoral_votes clinton trump others
1 Alabama AL South 4779736 135 9 34.4 62.1 3.6
2 Alaska AK West 710231 19 3 36.6 51.3 12.2
Por último, la función setequal() returna TRUE si x y y contienen las mismas filas sin importar el orden:
setequal(1:5, 1:6)[1] FALSE
setequal(1:5,5:1)[1] TRUE
setequal(tab1, tab2)[1] FALSE
library(Lahman)
library(writexl)
#View(Batting)
top <- Batting %>%
filter(yearID==2016) %>%
arrange(desc(HR)) %>% #arrange by descending HR count
slice(1:10) #Take entries 1:10
top %>% as_tibble()# A tibble: 10 × 22
playerID yearID stint teamID lgID G AB R H X2B X3B HR
<chr> <int> <int> <fct> <fct> <int> <int> <int> <int> <int> <int> <int>
1 trumbma01 2016 1 BAL AL 159 613 94 157 27 1 47
2 cruzne02 2016 1 SEA AL 155 589 96 169 27 1 43
3 daviskh01 2016 1 OAK AL 150 555 85 137 24 2 42
4 doziebr01 2016 1 MIN AL 155 615 104 165 35 5 42
5 encared01 2016 1 TOR AL 160 601 99 158 34 0 42
6 arenano01 2016 1 COL NL 160 618 116 182 35 6 41
7 cartech02 2016 1 MIL NL 160 549 84 122 27 1 41
8 frazito01 2016 1 CHA AL 158 590 89 133 21 0 40
9 bryankr01 2016 1 CHN NL 155 603 121 176 35 3 39
10 canoro01 2016 1 SEA AL 161 655 107 195 33 2 39
# ℹ 10 more variables: RBI <int>, SB <int>, CS <int>, BB <int>, SO <int>,
# IBB <int>, HBP <int>, SH <int>, SF <int>, GIDP <int>
Revisar el data frame People, que tiene información demográfica de todos los jugadores
People %>% as_tibble()# A tibble: 21,010 × 26
playerID birthYear birthMonth birthDay birthCity birthCountry birthState
<chr> <int> <int> <int> <chr> <chr> <chr>
1 aardsda01 1981 12 27 Denver USA CO
2 aaronha01 1934 2 5 Mobile USA AL
3 aaronto01 1939 8 5 Mobile USA AL
4 aasedo01 1954 9 8 Orange USA CA
5 abadan01 1972 8 25 Palm Beach USA FL
6 abadfe01 1985 12 17 La Romana D.R. La Romana
7 abadijo01 1850 11 4 Philadelphia USA PA
8 abbated01 1877 4 15 Latrobe USA PA
9 abbeybe01 1869 11 11 Essex USA VT
10 abbeych01 1866 10 14 Falls City USA NE
# ℹ 21,000 more rows
# ℹ 19 more variables: deathYear <int>, deathMonth <int>, deathDay <int>,
# deathCountry <chr>, deathState <chr>, deathCity <chr>, nameFirst <chr>,
# nameLast <chr>, nameGiven <chr>, weight <int>, height <int>, bats <fct>,
# throws <fct>, debut <chr>, bbrefID <chr>, finalGame <chr>, retroID <chr>,
# deathDate <date>, birthDate <date>
#View(People)Vincular información al top 10:
top_names <- top %>%
left_join(People) %>%
select(playerID,nameFirst, nameLast, HR)
top_names playerID nameFirst nameLast HR
1 trumbma01 Mark Trumbo 47
2 cruzne02 Nelson Cruz 43
3 daviskh01 Khris Davis 42
4 doziebr01 Brian Dozier 42
5 encared01 Edwin Encarnacion 42
6 arenano01 Nolan Arenado 41
7 cartech02 Chris Carter 41
8 frazito01 Todd Frazier 40
9 bryankr01 Kris Bryant 39
10 canoro01 Robinson Cano 39
#View(top_names)
saveRDS(top_names, "data/top_names.rds")
write_csv(top_names, "data/top_names.csv")
write_rds(top_names, "data/top_names2.rds")Inspect the Salaries data frame. Filter this data frame to the 2016
salaries, then use the correct bind join function to add
a salary column to the top_names data frame from the previous
question. Name the new data frame top_salary. Use this code framework:
#View(Salaries)
top_salaries <- Salaries %>%
filter(yearID==2016) %>%
right_join(top_names)
top_salaries yearID teamID lgID playerID salary nameFirst nameLast HR
1 2016 BAL AL trumbma01 9150000 Mark Trumbo 47
2 2016 CHA AL frazito01 8250000 Todd Frazier 40
3 2016 CHN NL bryankr01 652000 Kris Bryant 39
4 2016 COL NL arenano01 5000000 Nolan Arenado 41
5 2016 MIL NL cartech02 2500000 Chris Carter 41
6 2016 MIN AL doziebr01 3000000 Brian Dozier 42
7 2016 OAK AL daviskh01 524500 Khris Davis 42
8 2016 SEA AL canoro01 24000000 Robinson Cano 39
9 2016 SEA AL cruzne02 14250000 Nelson Cruz 43
10 2016 TOR AL encared01 10000000 Edwin Encarnacion 42
write_xlsx(top_salaries, "data/top_salaries.xlsx")Inspect the AwardsPlayers table. Filter awards to include only the
year 2016.
How many players from the top 10 home run hitters won at least one award in 2016?
#View(AwardsPlayers)
AwardsPlayers_2016 <- AwardsPlayers %>%
filter(yearID==2016)
top_names %>% left_join(AwardsPlayers_2016) playerID nameFirst nameLast HR awardID yearID lgID tie
1 trumbma01 Mark Trumbo 47 Silver Slugger 2016 AL <NA>
2 cruzne02 Nelson Cruz 43 <NA> NA <NA> <NA>
3 daviskh01 Khris Davis 42 <NA> NA <NA> <NA>
4 doziebr01 Brian Dozier 42 <NA> NA <NA> <NA>
5 encared01 Edwin Encarnacion 42 <NA> NA <NA> <NA>
6 arenano01 Nolan Arenado 41 Silver Slugger 2016 NL <NA>
7 arenano01 Nolan Arenado 41 Gold Glove 2016 NL <NA>
8 cartech02 Chris Carter 41 <NA> NA <NA> <NA>
9 frazito01 Todd Frazier 40 <NA> NA <NA> <NA>
10 bryankr01 Kris Bryant 39 Most Valuable Player 2016 NL <NA>
11 bryankr01 Kris Bryant 39 Hank Aaron Award 2016 NL <NA>
12 bryankr01 Kris Bryant 39 TSN All-Star 2016 NL <NA>
13 canoro01 Robinson Cano 39 <NA> NA <NA> <NA>
notes
1 OF
2 <NA>
3 <NA>
4 <NA>
5 <NA>
6 3B
7 3B
8 <NA>
9 <NA>
10 <NA>
11 3B
12 3B
13 <NA>
How many players won an award in 2016 but were not one of the top 10 home run hitters in 2016?
dat_awa <- AwardsPlayers_2016 %>%
anti_join(top_names) %>%
select(playerID) %>%
unique()
dat_awa playerID
1 altuvjo01
2 rizzoan01
3 lindofr01
4 hosmeer01
5 lestejo01
6 mcgowdu01
7 perezsa02
8 cabremi01
10 donaljo02
11 bogaexa01
12 bettsmo01
13 troutmi01
14 ortizda01
15 ramoswi01
17 murphda08
18 seageco01
19 blackch02
20 cespeyo01
21 yelicch01
22 arrieja01
24 morelmi01
25 kinslia01
26 beltrad01
28 gardnbr01
29 kiermke01
31 keuchda01
32 poseybu01
34 panikjo01
35 crawfbr01
36 martest01
37 inciaen01
38 heywaja01
39 greinza01
41 porceri01
42 scherma01
43 fulmemi01
45 grandcu01
47 zobribe01
48 millean01
49 baezja01
52 rendoan01
54 brittza01
55 janseke01
58 klubeco01
62 freemfr01
77 bradlja02
Otras formas de resolver los ejercicios:
Awards_2016 <- AwardsPlayers %>% filter(yearID == 2016)
length(intersect(Awards_2016$playerID, top_names$playerID))[1] 3
length(setdiff(Awards_2016$playerID, top_names$playerID))[1] 46
El paquete rvest (incluído dentro de tidyverse), permite extraer información de páginas web (formato html y xml). Esto es posible a través de la función html_table()
library(rvest)Warning: package 'rvest' was built under R version 4.3.3
url <- "https://en.wikipedia.org/wiki/Murder_in_the_United_States_by_state"
h <- read_html(url)
class(h)[1] "xml_document" "xml_node"
h{html_document}
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-sticky-header-enabled vector-toc-available" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="skin--responsive skin-vector skin-vector-search-vue mediawik ...
Se utiliza la función html_nodes() para extraer elementos de la página, en este caso table. Revisar sus propiedades y extraer la tabla 2
tab <- h %>% html_nodes("table")
tab <- tab[[2]]
tab{html_node}
<table class="wikitable sortable sticky-header static-row-numbers sort-under col1left" style="text-align:right">
[1] <caption>Gun deaths by intent\n</caption>
[2] <tbody>\n<tr>\n<th>State</th>\n<th>Gun <br> deaths</th>\n<th>Suicide</th> ...
Y se utiliza la función html_table() para convertir en data frame
tab <- tab %>% html_table()
head(tab)# A tibble: 6 × 6
State `Gun deaths` Suicide Homicide Accident Law
<chr> <chr> <chr> <chr> <int> <int>
1 United States 48,830 26,328 20,958 549 537
2 Texas 4,613 2,528 1,942 53 38
3 California 3,576 1,575 1,861 32 89
4 Florida 3,142 1,928 1,150 18 25
5 Georgia 2,200 1,115 1,021 25 22
6 Illinois 1,995 656 1,292 15 18
library(rvest)
url <- "https://web.archive.org/web/20181024132313/http://www.stevetheump.com/Payrolls.htm"
h <- read_html(url)We learned that tables in html are associated with the table node.
Use the html_nodes() function and the table node type to extract the
first table. Store it in an object nodes:
nodes <- html_nodes(h, "table")The html_nodes() function returns a list of objects of
class xml_node. We can see the content of each one using, for example,
the html_text() function. You can see the content for an arbitrarily
picked component like this:
html_text(nodes[[8]])[1] "Team\nPayroll\nAverge\nMedianNew York Yankees\n$ 197,962,289\n$ 6,186,321\n$ 1,937,500Philadelphia Phillies\n$ 174,538,938\n$ 5,817,964\n$ 1,875,000Boston Red Sox\n$ 173,186,617\n$ 5,093,724\n$ 1,556,250Los Angeles Angels\n$ 154,485,166\n$ 5,327,074\n$ 3,150,000Detroit Tigers\n$ 132,300,000\n$ 4,562,068\n$ 1,100,000Texas Rangers\n$ 120,510,974\n$ 4,635,037\n$ 3,437,500Miami Marlins\n$ 118,078,000\n$ 4,373,259\n$ 1,500,000San Francisco Giants\n$ 117,620,683\n$ 3,920,689\n$ 1,275,000St. Louis Cardinals\n$ 110,300,862\n$ 3,939,316\n$ 800,000Milwaukee Brewers\n$ 97,653,944\n$ 3,755,920\n$ 1,981,250Chicago White Sox\n$ 96,919,500\n$ 3,876,780\n$ 530,000Los Angeles Dodgers\n$ 95,143,575\n$ 3,171,452\n$ 875,000Minnesota Twins\n$ 94,085,000\n$ 3,484,629\n$ 750,000New York Mets\n$ 93,353,983\n$ 3,457,554\n$ 875,000Chicago Cubs\n$ 88,197,033\n$ 3,392,193\n$ 1,262,500Atlanta Braves\n$ 83,309,942\n$ 2,776,998\n$ 577,500Cincinnati Reds\n$ 82,203,616\n$ 2,935,843\n$ 1,150,000Seattle Mariners\n$ 81,978,100\n$ 2,927,789\n$ 495,150Baltimore Orioles\n$ 81,428,999\n$ 2,807,896\n$ 1,300,000Washington Nationals\n$ 81,336,143\n$ 2,623,746\n$ 800,000Cleveland Indians\n$ 78,430,300\n$ 2,704,493\n$ 800,000Colorado Rockies\n$ 78,069,571\n$ 2,692,054\n$ 482,000Toronto Blue Jays\n$ 75,489,200\n$ 2,696,042\n$ 1,768,750Arizona Diamondbacks\n$ 74,284,833\n$ 2,653,029\n$ 1,625,000Tampa Bay Rays\n$ 64,173,500\n$ 2,291,910\n$ 1,425,000Pittsburgh Pirates\n$ 63,431,999\n$ 2,187,310\n$ 916,666Kansas City Royals\n$ 60,916,225\n$ 2,030,540\n$ 870,000Houston Astros\n$ 60,651,000\n$ 2,332,730\n$ 491,250Oakland Athletics\n$ 55,372,500\n$ 1,845,750\n$ 487,500San Diego Padres\n$ 55,244,700\n$ 1,973,025\n$ 1,207,500"
If the content of this object is an html table, we can use
the html_table() function to convert it to a data frame:
tab1 <- html_table(nodes[[8]])html_table(nodes[[8]])# A tibble: 30 × 4
Team Payroll Averge Median
<chr> <chr> <chr> <chr>
1 New York Yankees $ 197,962,289 $ 6,186,321 $ 1,937,500
2 Philadelphia Phillies $ 174,538,938 $ 5,817,964 $ 1,875,000
3 Boston Red Sox $ 173,186,617 $ 5,093,724 $ 1,556,250
4 Los Angeles Angels $ 154,485,166 $ 5,327,074 $ 3,150,000
5 Detroit Tigers $ 132,300,000 $ 4,562,068 $ 1,100,000
6 Texas Rangers $ 120,510,974 $ 4,635,037 $ 3,437,500
7 Miami Marlins $ 118,078,000 $ 4,373,259 $ 1,500,000
8 San Francisco Giants $ 117,620,683 $ 3,920,689 $ 1,275,000
9 St. Louis Cardinals $ 110,300,862 $ 3,939,316 $ 800,000
10 Milwaukee Brewers $ 97,653,944 $ 3,755,920 $ 1,981,250
# ℹ 20 more rows
1 point possible (graded)
Create a table called tab_1 using entry 10 of nodes. Create a table
called tab_2 using entry 19 of nodes.
Note that the column names should be c("Team", "Payroll", "Average").
You can see that these column names are actually in the first data row
of each table, and that tab_1 has an extra first column No. that
should be removed so that the column names for both tables match.
Remove the extra column in tab_1, remove the first row of each
dataset, and change the column names for each table to
c("Team", "Payroll", "Average"). Use a full_join() by the Team to
combine these two tables.
How many rows are in the joined data table?
library(tidyverse)
tab1 <- html_table(nodes[[10]])
head(tab1)# A tibble: 6 × 4
X1 X2 X3 X4
<chr> <chr> <chr> <chr>
1 No. Team Payroll Average
2 1. New York Yankees $206,333,389 $8,253,336
3 2. Boston Red Sox $162,747,333 $5,611,977
4 3. Chicago Cubs $146,859,000 $5,439,222
5 4. Philadelphia Phillies $141,927,381 $5,068,835
6 5. New York Mets $132,701,445 $5,103,902
nrow(tab1)[1] 31
tab1 <- tab1 %>% select("X2","X3","X4") %>% slice(2:31) %>% set_names(c("Team", "Payroll", "Average"))
head(tab1)# A tibble: 6 × 3
Team Payroll Average
<chr> <chr> <chr>
1 New York Yankees $206,333,389 $8,253,336
2 Boston Red Sox $162,747,333 $5,611,977
3 Chicago Cubs $146,859,000 $5,439,222
4 Philadelphia Phillies $141,927,381 $5,068,835
5 New York Mets $132,701,445 $5,103,902
6 Detroit Tigers $122,864,929 $4,550,553
tab2 <- html_table(nodes[[19]])
head(tab2)# A tibble: 6 × 3
X1 X2 X3
<chr> <chr> <chr>
1 Team Payroll Average
2 NY Yankees $109,791,893 $3,541,674
3 Boston $109,558,908 $3,423,716
4 Los Angeles $108,980,952 $3,757,964
5 NY Mets $93,174,428 $3,327,658
6 Cleveland $91,974,979 $3,065,833
nrow(tab2)[1] 31
tab2 <- tab2 %>% slice(2:nrow(tab2)) %>% set_names(c("Team", "Payroll", "Average"))
head(tab2)# A tibble: 6 × 3
Team Payroll Average
<chr> <chr> <chr>
1 NY Yankees $109,791,893 $3,541,674
2 Boston $109,558,908 $3,423,716
3 Los Angeles $108,980,952 $3,757,964
4 NY Mets $93,174,428 $3,327,658
5 Cleveland $91,974,979 $3,065,833
6 Atlanta $91,851,687 $2,962,958
full_join(tab1,tab2, by="Team")# A tibble: 58 × 5
Team Payroll.x Average.x Payroll.y Average.y
<chr> <chr> <chr> <chr> <chr>
1 New York Yankees $206,333,389 $8,253,336 <NA> <NA>
2 Boston Red Sox $162,747,333 $5,611,977 <NA> <NA>
3 Chicago Cubs $146,859,000 $5,439,222 $64,015,833 $2,462,147
4 Philadelphia Phillies $141,927,381 $5,068,835 <NA> <NA>
5 New York Mets $132,701,445 $5,103,902 <NA> <NA>
6 Detroit Tigers $122,864,929 $4,550,553 <NA> <NA>
7 Chicago White Sox $108,273,197 $4,164,354 $62,363,000 $2,309,741
8 Los Angeles Angels $105,013,667 $3,621,161 <NA> <NA>
9 Seattle Mariners $98,376,667 $3,513,452 <NA> <NA>
10 San Francisco Giants $97,828,833 $3,493,887 <NA> <NA>
# ℹ 48 more rows
s <- '10"'
cat(s)10"
ss <- "10'"
cat(ss)10'
#Scape the quote a través de /
sss <- '5\'10"'
cat(sss)5'10"
ssss <- "5'10\""
cat(ssss)5'10"
# read in raw murders data from Wikipedia
library(rvest)
url <- "https://en.wikipedia.org/w/index.php?title=Gun_violence_in_the_United_States_by_state&direction=prev&oldid=810166167"
murders_raw <- read_html(url) %>%
html_nodes("table") %>%
html_table() %>%
.[[1]] %>%
setNames(c("state", "population", "total", "murder_rate"))
# inspect data and column classes
head(murders_raw)# A tibble: 6 × 4
state population total murder_rate
<chr> <chr> <chr> <dbl>
1 Alabama 4,853,875 348 7.2
2 Alaska 737,709 59 8
3 Arizona 6,817,565 309 4.5
4 Arkansas 2,977,853 181 6.1
5 California 38,993,940 1,861 4.8
6 Colorado 5,448,819 176 3.2
class(murders_raw$population)[1] "character"
class(murders_raw$total)[1] "character"
library(tidyverse)
murders_raw$population[1:3][1] "4,853,875" "737,709" "6,817,565"
as.numeric(murders_raw$population[1:3]) #La coerción as.numeric() no funciona acá[1] NA NA NA
Por lo tanto debe utilizarse stringr y sus funciones str_ para modificar datos. En este caso, se utilizará str_detect() dentro de una función, para detectar si las columnas tienen comas.
commas <- function(x) any(str_detect(x,","))
murders_raw %>% summarize_all(funs(commas))# A tibble: 1 × 4
state population total murder_rate
<lgl> <lgl> <lgl> <lgl>
1 FALSE TRUE TRUE FALSE
Utilizar la función str_replace_all() para reemplazar todos los valores que tengan coma por un espacio nulo y luego convertir el string a numeric.
test_1 <- str_replace_all(murders_raw$population, ",", "")
test_1 <- as.numeric(test_1)
head(test_1)[1] 4853875 737709 6817565 2977853 38993940 5448819
También, es posible utilizar la función parse_number() que realiza los dos procedimiento de manera simultánea
test_2 <- parse_number(murders_raw$population)
head(test_2)[1] 4853875 737709 6817565 2977853 38993940 5448819
identical(test_1,test_2)[1] TRUE
La tabla final se obtendría a través del siguiente código:
murders_new <- murders_raw %>% mutate_at(2:3, parse_number) #La función mutate_at funciona como sapply. Se especifican las columnas en el primer argumento y en el segundo se aplica la función.
head(murders_new)# A tibble: 6 × 4
state population total murder_rate
<chr> <dbl> <dbl> <dbl>
1 Alabama 4853875 348 7.2
2 Alaska 737709 59 8
3 Arizona 6817565 309 4.5
4 Arkansas 2977853 181 6.1
5 California 38993940 1861 4.8
6 Colorado 5448819 176 3.2
-
Use the
str_detect()function to determine whether a string contains a certain pattern. -
Use the
str_replace_all()function to replace all instances of one pattern with another pattern. To remove a pattern, replace with the empty string (""). -
The
parse_number()function removes punctuation from strings and converts them to numeric. -
mutate_at()performs the same transformation on the specified column numbers.
library(dslabs)
library(tidyverse)
library(writexl)
data("reported_heights")
#View(reported_heights)
class(reported_heights$height)[1] "character"
write_xlsx(reported_heights, "data/reported_heights.xlsx")# convert to numeric, inspect, count NAs
x <- as.numeric(reported_heights$height)Warning: NAs introduced by coercion
head(x)[1] 75 70 68 74 61 65
sum(is.na(x))[1] 81
# keep only entries that result in NAs
reported_heights %>% mutate(new_height = as.numeric(height)) %>%
filter(is.na(new_height)) %>%
head(n=10)Warning: There was 1 warning in `mutate()`.
ℹ In argument: `new_height = as.numeric(height)`.
Caused by warning:
! NAs introduced by coercion
time_stamp sex height new_height
1 2014-09-02 15:16:28 Male 5' 4" NA
2 2014-09-02 15:16:37 Female 165cm NA
3 2014-09-02 15:16:52 Male 5'7 NA
4 2014-09-02 15:16:56 Male >9000 NA
5 2014-09-02 15:16:56 Male 5'7" NA
6 2014-09-02 15:17:09 Female 5'3" NA
7 2014-09-02 15:18:00 Male 5 feet and 8.11 inches NA
8 2014-09-02 15:19:48 Male 5'11 NA
9 2014-09-04 00:46:45 Male 5'9'' NA
10 2014-09-04 10:29:44 Male 5'10'' NA
# calculate cutoffs that cover 99.999% of human population
alpha <- 1/10^6
qnorm(1-alpha/2, 69.1, 2.9)[1] 83.28575
qnorm(alpha/2, 63.7, 2.7)[1] 50.49258
# keep only entries that either result in NAs or are outside the plausible range of heights
not_inches <- function(x, smallest = 50, tallest = 84){
inches <- suppressWarnings(as.numeric(x))
ind <- is.na(inches) | inches < smallest | inches > tallest
ind
}
# number of problematic entries
problems <- reported_heights %>%
filter(not_inches(height)) %>%
.$height
length(problems)[1] 292
# 10 examples of x'y or x'y" or x'y\"
pattern <- "^\\d\\s*'\\s*\\d{1,2}\\.*\\d*'*\"*$"
str_subset(problems, pattern) %>% head(n=10) %>% cat5' 4" 5'7 5'7" 5'3" 5'11 5'9'' 5'10'' 5' 10 5'5" 5'2"
# 10 examples of x.y or x,y
pattern <- "^[4-6]\\s*[\\.|,]\\s*([0-9]|10|11)$"
str_subset(problems, pattern) %>% head(n=10) %>% cat5.3 5.5 6.5 5.8 5.6 5,3 5.9 6,8 5.5 6.2
# 10 examples of entries in cm rather than inches
ind <- which(between(suppressWarnings(as.numeric(problems))/2.54, 54, 81) )
ind <- ind[!is.na(ind)]
problems[ind] %>% head(n=10) %>% cat150 175 177 178 163 175 178 165 165 180
Para detectar una coma
pattern <- ","
str_detect(murders_raw$total,pattern ) [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[49] FALSE FALSE FALSE
Para detectar valores que tienen cm, u otro patrón, se utiliza la función str_subset()-
str_subset(reported_heights$height, "cm")[1] "165cm" "170 cm"
yes <- c("180 cm", "70 inches")
no <- c("180", "70''")
s <- c(yes, no)
s[1] "180 cm" "70 inches" "180" "70''"
Detectar los valores con cm o inches
str_detect(s, "cm") | str_detect(s, "inches")[1] TRUE TRUE FALSE FALSE
Otra forma de detectar cm o inches, a través de la expresión “cm|inches”
str_detect(s, "cm|inches")[1] TRUE TRUE FALSE FALSE
Para detectar patrones que incliyan dígitos , se utiliza la regex “\d”. Por ejemplo:
yes <- c("5","6","5'10","5 feet", "4'11")
no <- c("",".","Five","six")
s <- c(yes,no)
pattern <- "\\d"
str_detect(s,pattern)[1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
La función str_view() resalta el primer patrón detectado
str_view(s,pattern)[1] │ <5>
[2] │ <6>
[3] │ <5>'<1><0>
[4] │ <5> feet
[5] │ <4>'<1><1>
Mientras que la función str_view_all() muestra todo el patrón
str_view_all(s,pattern)Warning: `str_view_all()` was deprecated in stringr 1.5.0.
ℹ Please use `str_view()` instead.
[1] │ <5>
[2] │ <6>
[3] │ <5>'<1><0>
[4] │ <5> feet
[5] │ <4>'<1><1>
[6] │
[7] │ .
[8] │ Five
[9] │ six
-
A regular expression (regex) is a way to describe a specific pattern of characters of text. A set of rules has been designed to do this specifically and efficiently.
-
stringr functions can take a regex as a pattern.
-
str_detect()indicates whether a pattern is present in a string. -
The main difference between a regex and a regular string is that a regex can include special characters.
-
The | symbol inside a regex means “or”.
-
Use
'\\d'to represent digits. The backlash is used to distinguish it from the character'd'. In R, you must use two backslashes for digits in regular expressions; in some other languages, you will only use one backslash for regex special characters. -
str_view()highlights the first occurrence of a pattern, and thestr_view_all()function highlights all occurrences of the pattern.
Si quiero resaltar carácteres específicos, debo expresarlos dentro de llaves
str_view(s,"[56]")[1] │ <5>
[2] │ <6>
[3] │ <5>'10
[4] │ <5> feet
Si quiero detectar rangos, utilizo -. Por ejemplo, para las expresiones entre 4 y 7, la expresión regular sería “[4-7]”
yes <- as.character(4:7)
no <- as.character(1:3)
s <- c(yes,no)
s[1] "4" "5" "6" "7" "1" "2" "3"
str_detect(s,"[4-7]")[1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE
str_view(s,"[4-7]")[1] │ <4>
[2] │ <5>
[3] │ <6>
[4] │ <7>
Para detectar todas las letras del alfabeto en minúsculas se utiliza la expresión “[a-z]” y en mayúscula “[A-Z]”. Para detectar todas las letras en minúscula o mayúscula la expresión sería “[a-zA-Z]”
^ representa el principio de un string
$ representa el final de un string
Para obtener los string que tienen un sólo dígito, se utilizaría la siguiente expresión:
pattern <- "^\\d$"
yes <- c("1","5","9")
no <- c("12","123", " 1", "a4", "b")
s <- c(yes, no)
str_view(s, pattern)[1] │ <1>
[2] │ <5>
[3] │ <9>
Para obtener los string con uno o dos dígitos, se utiliza la expresión \d{1,2}
pattern <- "^\\d{1,2}$"
yes <- c("1","5","9","12")
no <- c("123","a4","b")
s <- c(yes, no)
s[1] "1" "5" "9" "12" "123" "a4" "b"
str_view(s,pattern)[1] │ <1>
[2] │ <5>
[3] │ <9>
[4] │ <12>
Para detectar el patrón número, pie, número, pulgada (ej. 5’7”) se utiliza la expresión “^[4-7]’\d{1,2]\”$”
pattern <- "^[4-7]'\\d{1,2}\"$"
yes <- c("5'7\"", "6'2\"", "5'12\"")
no <- c("6,2\"", "6.2\"","I am 5'11\"", "3'2\"", "64")
str_detect(yes,pattern)[1] TRUE TRUE TRUE
str_detect(no, pattern)[1] FALSE FALSE FALSE FALSE FALSE
-
Define strings to test your regular expressions, including some elements that match and some that do not. This allows you to check for the two types of errors: failing to match and matching incorrectly.
-
Square brackets define character classes: groups of characters that count as matching the pattern. You can use ranges to define character classes, such as
[0-9]for digits and[a-zA-Z]for all letters. -
Anchors define patterns that must start or end at specific places.
^and$represent the beginning and end of the string respectively. -
Curly braces are quantifiers that state how many times a certain character can be repeated in the pattern.
\\d{1,2}matches exactly 1 or 2 consecutive digits.
Sólo 14 string cumplen con el patrón deseado
pattern <- "^[4-7]'\\d{1,2}\"$"
sum(str_detect(problems,pattern))[1] 14
Por ejemplo:
problems[c(2,10,11,12,15)] %>% str_view_all(pattern)[1] │ 5' 4"
[2] │ <5'7">
[3] │ <5'3">
[4] │ 5 feet and 8.11 inches
[5] │ 5.5
El problema es que escribieron feet and inches en lugar de ’ y ”
str_subset(problems, "inches")[1] "5 feet and 8.11 inches" "Five foot eight inches" "5 feet 7inches"
[4] "5ft 9 inches" "5 ft 9 inches" "5 feet 6 inches"
O ’ ’ en ligar de ”
str_subset(problems, "''") [1] "5'9''" "5'10''" "5'10''" "5'3''" "5'7''" "5'6''" "5'7.5''"
[8] "5'7.5''" "5'10''" "5'11''" "5'10''" "5'5''"
Si no se usa el símbolo para pulgadas (“) al final, la expresión sería”^[4-7]’\d{1,2}$”
pattern <- "^[4-7]'\\d{1,2}$"
problems %>% str_replace("feet|ft|foot", "'") %>%
str_replace("inches|in|''|\"", "") %>%
str_detect(pattern) %>%
sum [1] 48
En regex es posible representar espacio con la expresión \s
pattern2 <- "^[4-7]'\\s\\d{1,2}\"$"
str_subset(problems,pattern2)[1] "5' 4\"" "5' 11\"" "5' 7\""
También, es posible reemplazar una cantidad de caracteres similares de manera posterior a * . Es decir, cero o más ejemplos del caracter previo. Esto también se puede aplicar después de espacio (\s)
yes <- c("AB", "A1B", "A11B", "A111B", "A1111B")
no <- c("A2B", "A21B")
str_detect(yes, "A1*B") #Cero o más ejemplos del caracter previo a *[1] TRUE TRUE TRUE TRUE TRUE
str_detect(no, "A1*B")[1] FALSE FALSE
# test how *, ? and + differ
data.frame(string = c("AB", "A1B", "A11B", "A111B", "A1111B"),
none_or_more = str_detect(yes, "A1*B"),
nore_or_once = str_detect(yes, "A1?B"),
once_or_more = str_detect(yes, "A1+B")) string none_or_more nore_or_once once_or_more
1 AB TRUE TRUE FALSE
2 A1B TRUE TRUE TRUE
3 A11B TRUE FALSE TRUE
4 A111B TRUE FALSE TRUE
5 A1111B TRUE FALSE TRUE
pattern <- "^[4-7]\\s*'\\s*\\d{1,2}$"
problems %>%
str_replace("feet|ft|foot" ,"'") %>% #reemplazar con '
str_replace("inches|in|''|\"", "") %>% #remover todos los símbolos de inches
str_detect(pattern) %>%
sum[1] 53
-
str_replace()replaces the first instance of the detected pattern with a specified string. -
Spaces are characters and R does not ignore them. Spaces are specified by the special character
\\s. -
Additional quantifiers include
*, + and ?.*means 0 or more instances of the previous character.?means 0 or 1 instances.+means 1 or more instances. -
Before removing characters from strings with functions like
str_replace()andstr_replace_all(), consider whether that replacement would have unintended effects.
pattern_without_groups <- "[4-7],\\d*$"pattern_with_groups <- "^([4-7]),(\\d*)$"yes <- c("5,9","5,11","6,","6,1")
no <- c("5'9", ",", "2,8", "6.1.1")
s <- c(yes,no)
str_detect(s,pattern_without_groups)[1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
str_detect(s,pattern_with_groups)[1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
A través de str_match() , es posible obtener los grupos si el patrón a detectar está en grupos y str_extract() los extrae
str_match(s,pattern_with_groups) [,1] [,2] [,3]
[1,] "5,9" "5" "9"
[2,] "5,11" "5" "11"
[3,] "6," "6" ""
[4,] "6,1" "6" "1"
[5,] NA NA NA
[6,] NA NA NA
[7,] NA NA NA
[8,] NA NA NA
La segunda y tercera columna representa los grupos, mientras que la primera columna representa el valor original. Una vez agrupadas las expresiones, es posible señalar cómo reemplazar el primero o segundo grupo a través de las expresiones \1 y \2 o la cantidad correspondiente. En este caso, reemplazar “,” por “’”
pattern_with_groups <- "^([4-7]),(\\d*)$"
yes <- c("5,9","5,11","6,","6,1")
no <- c("5'9", ",", "2,8", "6.1.1")
s <- c(yes,no)
str_replace(s,pattern_with_groups, "\\1'\\2")[1] "5'9" "5'11" "6'" "6'1" "5'9" "," "2,8" "6.1.1"
Identificar strings que cumplen las condiciones
pattern_with_groups <- "^([4-7])\\s*[,\\.\\s+]\\s*(\\d*)$"
str_subset(problems,pattern_with_groups) %>% head [1] "5.3" "5.25" "5.5" "6.5" "5.8" "5.6"
Luego , reemplazar
str_subset(problems, pattern_with_groups) %>%
str_replace(pattern_with_groups, "\\1'\\2") %>% head[1] "5'3" "5'25" "5'5" "6'5" "5'8" "5'6"
reemp1 <- str_subset(problems, pattern_with_groups) %>%
str_replace(pattern_with_groups, "\\1'\\2")
write_xlsx(as.data.frame(reemp1), "data/reemp_1.xlsx")-
Groups are defined using parentheses.
-
Once we define groups, we can use the function
str_match()to extract the values these groups define.str_extract()extracts only strings that match a pattern, not the values defined by groups. -
You can refer to the
ith group with\\i.For example, refer to the value in the second group with\\2.
# function to detect entries with problems
not_inches_or_cm <- function(x, smallest = 50, tallest = 84){
inches <- suppressWarnings(as.numeric(x))
ind <- !is.na(inches) &
((inches >= smallest & inches <= tallest) |
(inches/2.54 >= smallest & inches/2.54 <= tallest))
!ind
}
# identify entries with problems
problems <- reported_heights %>%
filter(not_inches_or_cm(height)) %>%
.$height
length(problems)[1] 200
#View(problems)Para convertir algunos string
converted <- problems %>%
str_replace("feet|foot|ft", "'") %>%
str_replace("inches|in|''\"","") %>%
str_replace("^([4-7])\\s*[,\\.\\s+]\\s*(\\d*)$", "\\1'\\2")
#head(converted)pattern <- "^[4-7]\\s*'\\s*\\d{1,2}$"
index <- str_detect(converted, pattern)
mean(index)[1] 0.48
Todos los string que no hacen parte del index, se obtienen a través de:
converted[!index] [1] "6" "5' 4\"" "165cm" "511"
[5] "6" "2" ">9000" "5'7\""
[9] "5'3\"" "5 ' and 8.11 " "11111" "5'9''"
[13] "6" "5'10''" "103.2" "19"
[17] "5" "300" "6'" "6"
[21] "Five ' eight " "5'5\"" "5'2\"" "7"
[25] "214" "6" "5'10''" "5'3''"
[29] "0.7" "5'7''" "6" "2'33"
[33] "5'3\"" "5'6''" "612" "1,70"
[37] "87" "5'7.5''" "5'7.5''" "111"
[41] "5'2\"" "5' 7.78\"" "12" "6"
[45] "yyy" "89" "34" "25"
[49] "6" "6" "22" "684"
[53] "6" "1" "1" "6*12"
[57] "87" "6" "1.6" "120"
[61] "120" "23" "1.7" "6"
[65] "5'8\"" "5'11\"" "5'7\"" "5' 11\""
[69] "5" "6'1\"" "69\"" "5' 7\""
[73] "5'10''" "5' 9 " "5 ' 9 " "6"
[77] "5'11''" "5'8\"" "6" "86"
[81] "708,661" "5 ' 6 " "5'10''" "6"
[85] "6'3\"" "649,606" "10000" "1"
[89] "728,346" "0" "5'5''" "5'7\""
[93] "6" "6" "6" "100"
[97] "6'4\"" "88" "6" "170 cm"
[101] "7,283,465" "5" "5" "34"
# read raw murders data line by line
library(tidyverse)
filename <- system.file("extdata/murders.csv", package = "dslabs")
lines <- readLines(filename)
lines %>% head()[1] "state,abb,region,population,total" "Alabama,AL,South,4779736,135"
[3] "Alaska,AK,West,710231,19" "Arizona,AZ,West,6392017,232"
[5] "Arkansas,AR,South,2915918,93" "California,CA,West,37253956,1257"
Para separar archivos separados por coma, se utiliza la función str_split()
x <- str_split(lines, ",")
x %>% head[[1]]
[1] "state" "abb" "region" "population" "total"
[[2]]
[1] "Alabama" "AL" "South" "4779736" "135"
[[3]]
[1] "Alaska" "AK" "West" "710231" "19"
[[4]]
[1] "Arizona" "AZ" "West" "6392017" "232"
[[5]]
[1] "Arkansas" "AR" "South" "2915918" "93"
[[6]]
[1] "California" "CA" "West" "37253956" "1257"
col_names <- x[[1]]
col_names #Extraer los nombres de los estados[1] "state" "abb" "region" "population" "total"
x <- x[-1] #Extraer exclusivamente los datos, sin encabezadoLa función map, del paquete purrr permite aplicar una función simultáneamente
library(purrr)
map(x, function(y)y[1]) %>% head()[[1]]
[1] "Alabama"
[[2]]
[1] "Alaska"
[[3]]
[1] "Arizona"
[[4]]
[1] "Arkansas"
[[5]]
[1] "California"
[[6]]
[1] "Colorado"
Otra forma
map(x, 1) %>% head()[[1]]
[1] "Alabama"
[[2]]
[1] "Alaska"
[[3]]
[1] "Arizona"
[[4]]
[1] "Arkansas"
[[5]]
[1] "California"
[[6]]
[1] "Colorado"
Las funciones map_chr() retorna un vector de caracteres y map_int() retorna vector de enteros
dat <- data.frame(map_chr(x,1), #Convierte vectores en caracteres
map_chr(x,2),
map_chr(x,3),
map_chr(x,4),
map_chr(x,5)) %>%
mutate_all(parse_guess) %>% #Transforma todos los vectores al tipo de datos que el programa suponga (double, char, logical...)
set_names(col_names) #Asigna nombre a los vectores
dat %>% head state abb region population total
1 Alabama AL South 4779736 135
2 Alaska AK West 710231 19
3 Arizona AZ West 6392017 232
4 Arkansas AR South 2915918 93
5 California CA West 37253956 1257
6 Colorado CO West 5029196 65
Se obtiene código más sencillo a través de stringr y el argumento simpligy = TRUE esto obliga a la función a retornar una matriz en lugar de una lista
x <- str_split(lines, ",", simplify = TRUE)
col_names <- x[1,]
x <- x[-1,]
x %>% as_data_frame() %>%
setNames(col_names) %>%
mutate_all(parse_guess)Warning: `as_data_frame()` was deprecated in tibble 2.0.0.
ℹ Please use `as_tibble()` (with slightly different semantics) to convert to a
tibble, or `as.data.frame()` to convert to a data frame.
Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
`.name_repair` is omitted as of tibble 2.0.0.
ℹ Using compatibility `.name_repair`.
ℹ The deprecated feature was likely used in the tibble package.
Please report the issue at <https://github.com/tidyverse/tibble/issues>.
# A tibble: 51 × 5
state abb region population total
<chr> <chr> <chr> <dbl> <dbl>
1 Alabama AL South 4779736 135
2 Alaska AK West 710231 19
3 Arizona AZ West 6392017 232
4 Arkansas AR South 2915918 93
5 California CA West 37253956 1257
6 Colorado CO West 5029196 65
7 Connecticut CT Northeast 3574097 97
8 Delaware DE South 897934 38
9 District of Columbia DC South 601723 99
10 Florida FL South 19687653 669
# ℹ 41 more rows
Ejemplo
s <- c("5'10", "6'1")
tab <- data.frame(x = s)
tab x
1 5'10
2 6'1
Separar inches y feet a través de separate()
tab %>% separate(x, into=c("feet","inches"), sep="'" ) #Se debe específicar el nombre de las columnas y el símbolo porel cual separarse feet inches
1 5 10
2 6 1
También es posible realizar esto a través de expresiones regulares
tab %>% extract(x, c("feet", "inches"), regex = "([\\d])'(\\d{1,2})") feet inches
1 5 10
2 6 1
s <- c("5'10", "6'1\"", "5'8inches")
tab <- data.frame(x=s)
tab x
1 5'10
2 6'1"
3 5'8inches
tab %>% separate(x, c("feet", "inches"), sep="'", fill="right") feet inches
1 5 10
2 6 1"
3 5 8inches
Pero si utilizamos extract()
tab %>% extract(x, into = c("feet", "inches"), regex="(\\d)'(\\d{1,2})") feet inches
1 5 10
2 6 1
3 5 8
For case 1, if we add a '0 to, for example, convert all 6 to 6'0,
then our pattern will match. This can be done using groups using the
following code:
yes <- c("5", "6", "5")
no <- c("5'", "5''", "5'4")
s <- c(yes, no)
str_replace(s, "^([4-7])$", "\\1'0") #^$ poner este rango es otra manera de expresar = "mantenga esto"[1] "5'0" "6'0" "5'0" "5'" "5''" "5'4"
The pattern says it has to start (^), be followed with a digit
between 4 and 7, and then end there ($). The parenthesis defines
the group that we pass as \\1 to the replace regex.
We can adapt this code slightly to handle case 2 as well which covers
the entry 5'. Note that the 5' is left untouched by the code
above. This is because the extra ' makes the pattern not match
since we have to end with a 5 or 6. To handle case 2, we want to permit
the 5 or 6 to be followed by no or one symbol for feet. So we can simply
add '{0,1} after the ' to do this. We can also use the none
or once special character ?. As we saw previously, this is
different from * which is none or more. We now see that this code
also handles the fourth case as well:
str_replace(s, "^([56])'?$", "\\1'0")[1] "5'0" "6'0" "5'0" "5'0" "5''" "5'4"
Note that here we only permit 5 and 6 but not 4 and 7. This is because
heights of exactly 5 and exactly 6 feet tall are quite common, so we
assume those that typed 5 or 6 really meant either 60 or 72 inches.
However, heights of exactly 4 or exactly 7 feet tall are so rare that,
although we accept 84 as a valid entry, we assume that a 7 was
entered in error.
We can use quantifiers to deal with case 3. These entries are not
matched because the inches include decimals and our pattern does not
permit this. We need allow the second group to include decimals and not
just digits. This means we must permit zero or one period
. followed by zero or more digits. So we will use
both ? and *. Also remember that for this particular case,
the period needs to be escaped since it is a special character (it means
any character except a line break).
So we can adapt our pattern, currently ^[4-7]\\s*'\\s*\\d{1,2}$,
to permit a decimal at the end:
pattern <- "^[4-7]\\s*'\\s*(\\d+\\.?\\d*)$"Case 5, meters using commas, we can approach similarly to how we
converted the x.y to x'y. A difference is that we require that the
first digit is 1 or 2:
yes <- c("1,7", "1, 8", "2, " )
no <- c("5,8", "5,3,2", "1.7")
s <- c(yes, no)
str_replace(s, "^([12])\\s*,\\s*(\\d*)$", "\\1\\.\\2")[1] "1.7" "1.8" "2." "5,8" "5,3,2" "1.7"
In general, spaces at the start or end of the string are uninformative. These can be particularly deceptive because sometimes they can be hard to see:
s <- "Hi "
cat(s)Hi
identical(s, "Hi")[1] FALSE
str_trim("5 ' 9 ")[1] "5 ' 9"
One of the entries writes out numbers as
words: Five foot eight inches. Although not efficient, we could
add 12 extra str_replace to convert zero
to 0, one to 1, and so on. To avoid having to write
two separate operations for Zero and zero, One
and one, etc., we can use the str_to_lower() function to make
all words lower case first:
s <- c("Five feet eight inches")
str_to_lower(s)[1] "five feet eight inches"
We are now ready to define a procedure that handles converting all the problematic cases.
We can now put all this together into a function that takes a string vector and tries to convert as many strings as possible to a single format. Below is a function that puts together the previous code replacements:
convert_format <- function(s){
s %>%
str_replace("feet|foot|ft", "'") %>% #convert feet symbols to '
str_replace_all("inches|in|''|\"|cm|and", "") %>% #remove inches and other symbols
str_replace("^([4-7])\\s*[,\\.\\s+]\\s*(\\d*)$", "\\1'\\2") %>% #change x.y, x,y x y
str_replace("^([56])'?$", "\\1'0") %>% #add 0 when to 5 or 6
str_replace("^([12])\\s*,\\s*(\\d*)$", "\\1\\.\\2") %>% #change european decimal
str_trim() #remove extra space
}We can also write a function that converts words to numbers:
words_to_numbers <- function(s){
str_to_lower(s) %>%
str_replace_all("zero", "0") %>%
str_replace_all("one", "1") %>%
str_replace_all("two", "2") %>%
str_replace_all("three", "3") %>%
str_replace_all("four", "4") %>%
str_replace_all("five", "5") %>%
str_replace_all("six", "6") %>%
str_replace_all("seven", "7") %>%
str_replace_all("eight", "8") %>%
str_replace_all("nine", "9") %>%
str_replace_all("ten", "10") %>%
str_replace_all("eleven", "11")
}library(dslabs)
data("gapminder")
gapminder %>%
filter(region=="Caribbean") %>%
ggplot(aes(year, life_expectancy, color=country)) +
geom_line() + theme_bw()Utilizando recode() para reducir los nombres
# recode long country names and remake plot
gapminder %>% filter(region=="Caribbean") %>%
mutate(country = recode(country,
'Antigua and Barbuda'="Barbuda",
'Dominican Republic' = "DR",
'St. Vincent and the Grenadines' = "St. Vincent",
'Trinidad and Tobago' = "Trinidad")) %>%
ggplot(aes(year, life_expectancy, color = country)) +
geom_line() + ylab("Life expectancy") + theme_bw()graf_final <- gapminder %>% filter(region=="Caribbean") %>%
mutate(country = recode(country,
'Antigua and Barbuda'="Barbuda",
'Dominican Republic' = "DR",
'St. Vincent and the Grenadines' = "St. Vincent",
'Trinidad and Tobago' = "Trinidad")) %>%
ggplot(aes(year, life_expectancy, color = country)) +
geom_line() + ylab("Life expectancy") + theme_bw()
graf_finalggsave( "images/graf_final.png")Saving 7 x 5 in image
Import raw Brexit referendum polling data from Wikipedia:
library(rvest)
library(tidyverse)
library(stringr)
url <- "https://en.wikipedia.org/w/index.php?title=Opinion_polling_for_the_United_Kingdom_European_Union_membership_referendum&oldid=896735054"
tab <- read_html(url) %>% html_nodes("table")
polls <- tab[[6]] %>% html_table(fill = TRUE)
nrow(polls)[1] 134
head(polls)# A tibble: 6 × 9
`Date(s) conducted` Remain Leave Undecided Lead Sample `Conducted by`
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Date(s) conducted "" "" Undecided Lead Sample Conducted by
2 23 June 2016 "48.1%" "51.9%" N/A 3.8% 33,577,342 Results of the…
3 23 June "52%" "48%" N/A 4% 4,772 YouGov
4 22 June "55%" "45%" N/A 10% 4,700 Populus
5 20–22 June "51%" "49%" N/A 2% 3,766 YouGov
6 20–22 June "49%" "46%" 1% 3% 1,592 Ipsos MORI
# ℹ 2 more variables: `Polling type` <chr>, Notes <chr>
polls <- polls %>% slice(-1) %>% setNames(c("dates", "remain", "leave", "undecided", "lead", "samplesize", "pollster", "poll_type", "notes"))
#View(polls)
sum(str_detect(polls$remain, "%"))[1] 129
head(str_replace(polls$undecided, "N/A", "0"))[1] "0" "0" "0" "0" "1%" "9%"
library(dslabs)
library(tidyverse)
data("polls_us_election_2016")
polls_us_election_2016$startdate %>% head()[1] "2016-11-03" "2016-11-01" "2016-11-02" "2016-11-04" "2016-11-03"
[6] "2016-11-03"
class(polls_us_election_2016$startdate)[1] "Date"
polls_us_election_2016 %>% filter(pollster=="Ipsos" & state=="U.S.") %>% ggplot(aes(startdate, rawpoll_trump)) +
geom_line() + xlab("Start date") + ylab("Rawpoll Trump") + theme_bw()ggsave("images/rawpoll_Trump.png")Abordar fechas con lubridate()
library(lubridate)
dates <- sample(polls_us_election_2016$startdate, 10) %>% sort()
dates [1] "2016-03-18" "2016-07-15" "2016-09-06" "2016-09-08" "2016-09-21"
[6] "2016-09-27" "2016-09-30" "2016-10-13" "2016-10-20" "2016-10-21"
También, es posible extraer los días, meses y años
data.frame(date = dates,
month = month(dates),
day = day(dates),
year = year(dates)) date month day year
1 2016-03-18 3 18 2016
2 2016-07-15 7 15 2016
3 2016-09-06 9 6 2016
4 2016-09-08 9 8 2016
5 2016-09-21 9 21 2016
6 2016-09-27 9 27 2016
7 2016-09-30 9 30 2016
8 2016-10-13 10 13 2016
9 2016-10-20 10 20 2016
10 2016-10-21 10 21 2016
Extraer los labels
month(dates, label=TRUE) [1] Mar Jul Sep Sep Sep Sep Sep Oct Oct Oct
12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
También es útil la función ymd() dentro de fechas, que supone (tipo sparse) el tipo de datos a partir de strings. Se sugiere que el formato yyyy-mm-dd sea el estándar de datos.
x <- c(20090101, "2009-01-02", "2009 01 03", "2009-1-4",
"2009-1, 5", "Created on 2009 1 6", "200901 !!! 07")
ymd(x)[1] "2009-01-01" "2009-01-02" "2009-01-03" "2009-01-04" "2009-01-05"
[6] "2009-01-06" "2009-01-07"
Otros tipos de parse.
x <- "09/01/02"
ydm(x)[1] "2009-02-01"
myd(x)[1] "2001-09-02"
dym(x)[1] "2001-02-09"
dmy(x)[1] "2002-01-09"
De igual forma, se puede extraer el tiempo local o internacional a através de la función now()
now() #Hora local (-5)[1] "2025-04-19 06:54:00 -05"
now("GMT") #Hora global (Londres)[1] "2025-04-19 11:54:00 GMT"
De igual forma, es posible extraer horas minutos y segundos actuales
now() %>% hour()[1] 6
now() %>% minute()[1] 54
now() %>% second()[1] 0.542948
Y restringir al formato de fecha
# parse time
x <- c("12:34:56")
hms(x)[1] "12H 34M 56S"
#parse datetime
x <- "Nov/2/2012 12:34:56"
mdy_hms(x)[1] "2012-11-02 12:34:56 UTC"
-
Dates are a separate data type in R.The tidyverse includes functionality for dealing with dates through the lubridate package.
-
Extract the year, month and day from a date object with the
year(),month()andday()functions. -
Parsers convert strings into dates with the standard YYYY-MM-DD format (ISO 8601 format). Use the parser with the name corresponding to the string format of year, month and day (
ymd(),ydm(),myd(),mdy(),dmy(),dym()). -
Get the current time with the
Sys.time()function. Use thenow()function instead to specify a time zone. -
You can extract values from time objects with the
hour(),minute()andsecond()functions. -
Parsers convert strings into times (for example,
hms()). Parsers can also create combined date-time objects (for example,mdy_hms()).
data(brexit_polls)
#View(brexit_polls)How many polls had a start date (startdate) in April (month number 4)?
sum(month(brexit_polls$startdate) == "4")[1] 25
Use the round_date() function on the enddate column with the
argument unit="week". How many polls ended the week of 2016-06-12?
sum(round_date(brexit_polls$enddate, unit = "week") == "2016-06-12")[1] 13
Use the weekdays() function from lubridate to determine the
weekday on which each poll ended (enddate).
table(weekdays(brexit_polls$enddate)) Friday Monday Saturday Sunday Thursday Tuesday Wednesday
14 20 4 37 17 23 12
Load the movielens data frame from dslabs.
data(movielens)
#View(movielens)This data frame contains a set of about 100,000 movie reviews. The
timestamp column contains the review date as the number of seconds
since 1970-01-01 (epoch time).
Convert the timestamp column to dates using the lubridate
as_datetime() function.
Which year had the most movie reviews?
as_datetime(movielens$timestamp) %>% year() %>% table().
1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007
3 6239 3294 1825 5901 13869 4658 3938 4462 4658 7161 7493 1548
2008 2009 2010 2011 2012 2013 2014 2015 2016
3676 3434 2518 4450 3849 1969 2224 6610 6225
Which hour of the day had the most movie reviews?
as_datetime(movielens$timestamp) %>% hour() %>% table().
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
3960 5296 4056 4007 3584 3773 4361 2932 2667 2866 2948 2862 3361 2384 3723 4404
16 17 18 19 20 21 22 23
4803 4717 5094 5161 7011 6428 6008 3598







