Patrick Barks 2019-01-15
Regular expressions (regex) provide a way to find patterns within strings of text. You can find the R help page for regular expressions using ?regex.
grep(pattern, x) returns the indices of the elements of the vector x that match the given pattern.
spp <- c("Polar bear",
"Grizzly bear",
"Cheetah",
"Giant panda")
# find the bears
grep("bear", spp)
## [1] 1 2
The grepl function is similar, but returns logical values.
grepl("bear", spp)
## [1] TRUE TRUE FALSE FALSE
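Because grepl returns one logical value per element, its result can be used directly for subsetting, negating, or counting matches. A minimal sketch (the variable names is_bear, bears, not_bears and n_bears are illustrative, not part of the original example):

```r
spp <- c("Polar bear", "Grizzly bear", "Cheetah", "Giant panda")

is_bear   <- grepl("bear", spp)  # one TRUE/FALSE per element
bears     <- spp[is_bear]        # keep the elements that match
not_bears <- spp[!is_bear]       # keep the elements that do not match
n_bears   <- sum(is_bear)        # TRUE counts as 1, so this counts matches
```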
We could also tell grep to return the matching values rather than their indices, as in
grep("bear", spp, value = TRUE)
## [1] "Polar bear" "Grizzly bear"
To match multiple search terms we can use the "or" symbol |, as in
grep("bear|panda", spp, value = TRUE) # match "bear" or "panda"
## [1] "Polar bear" "Grizzly bear" "Giant panda"
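When the search terms are stored in a vector, the "or" pattern can be assembled with paste rather than typed by hand. A small sketch, assuming a hypothetical vector of terms that contain no special regex characters:

```r
spp <- c("Polar bear", "Grizzly bear", "Cheetah", "Giant panda")

terms   <- c("bear", "panda")            # hypothetical search terms
pattern <- paste(terms, collapse = "|")  # builds the string "bear|panda"
matches <- grep(pattern, spp, value = TRUE)
```

Terms that contain characters such as ( or . would need to be escaped before being pasted together.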
There are a number of ways to make our search case-insensitive.
spp <- c("Bear", "bear", "cheetah", "Bear")
grep("bear", spp, ignore.case = TRUE)
## [1] 1 2 4
grep("Bear|bear", spp)
## [1] 1 2 4
grep("[Bb]ear", spp)
## [1] 1 2 4
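To match a special character within a search string, we generally need to 'escape' it with two preceding backslashes \\. Two are needed because R itself treats the backslash as an escape character inside a string, so "\\(" reaches the regex engine as \(.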
x0 <- c("(a)", "b", "(c)")
grep("\\(", x0) # match an opening parenthesis
## [1] 1 3
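Backslashes are not the only option. As a rough sketch, a special character can also be placed inside a bracket expression, or the whole pattern can be treated as a literal string with fixed = TRUE. The files vector below is a made-up example to show the same idea for the dot:

```r
x0 <- c("(a)", "b", "(c)")

idx_escaped <- grep("\\(", x0)              # escape with two backslashes
idx_bracket <- grep("[(]", x0)              # put it in a bracket expression
idx_fixed   <- grep("(", x0, fixed = TRUE)  # treat the pattern as literal text

# Hypothetical file names: an unescaped dot matches any character
files     <- c("a.csv", "acsv")
dot_regex <- grep(".csv", files)                # matches both elements
dot_fixed <- grep(".csv", files, fixed = TRUE)  # matches only "a.csv"
```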
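To match a pattern only at the start or end of a string, we can use the anchors ^ (start) and $ (end).
x1 <- c("weather_1.csv",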
"weather_2.csv",
"weather_3.csv",
"csvntf.txt",
"denmark_weather_2012.csv")
grep("weather", x1) # match "weather" anywhere
## [1] 1 2 3 5
grep("^weather", x1) # match "weather" only at start
## [1] 1 2 3
grep("csv", x1) # match "csv" anywhere
## [1] 1 2 3 4 5
grep("csv$", x1) # match "csv" only at end
## [1] 1 2 3 5
The period . acts as a wildcard that matches any single character.
x2 <- c("bear", "bean", "bat", "boar", "beer")
grep("b..r", x2, value = TRUE)
## [1] "bear" "boar" "beer"
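The anchors ^ and $ can be combined with the wildcard to control where in a string the pattern must sit. A short sketch using a made-up vector of words:

```r
words <- c("bear", "bears", "forbear", "boar")  # hypothetical example data

anywhere <- grep("b..r", words, value = TRUE)    # pattern may occur anywhere
at_start <- grep("^b..r", words, value = TRUE)   # string must start with it
at_end   <- grep("b..r$", words, value = TRUE)   # string must end with it
whole    <- grep("^b..r$", words, value = TRUE)  # string must be exactly it
```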
To match a given character a specific number of times, we can use the following expressions.
x3 <- c("ag", "acg", "accg", "acccg", "accccg")
grep("ac*g", x3, value = TRUE) # * matches 0+ times
## [1] "ag" "acg" "accg" "acccg" "accccg"
grep("ac+g", x3, value = TRUE) # + matches at least 1 time
## [1] "acg" "accg" "acccg" "accccg"
grep("ac?g", x3, value = TRUE) # ? matches at most 1 time
## [1] "ag" "acg"
grep("ac{2}g", x3, value = TRUE) # {n} matches exactly n times
## [1] "accg"
grep("ac{2,}g", x3, value = TRUE) # {n,} matches at least n times
## [1] "accg" "acccg" "accccg"
grep("ac{2,3}g", x3, value = TRUE) # {n,m} matches between n and m times
## [1] "accg" "acccg"
Note that the repetition expressions apply to the preceding character ("c" in the examples above).
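To repeat a sequence of several characters rather than a single one, the sequence can be wrapped in parentheses so that the repetition applies to the whole group. A minimal sketch with a hypothetical vector:

```r
x <- c("ab", "abab", "abb", "ababab")  # hypothetical example data

b_twice   <- grep("^ab{2}$", x, value = TRUE)    # {2} applies to "b" only
ab_twice  <- grep("^(ab){2}$", x, value = TRUE)  # {2} applies to the group "ab"
ab_repeat <- grep("^(ab)+$", x, value = TRUE)    # one or more copies of "ab"
```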
We can also match whole classes of characters, such as digits or letters, using character classes.
x4 <- c("1000", "one thousand", "1000.00")
grep("[[:digit:]]", x4) # match any digit
## [1] 1 3
grep("[[:alpha:]]", x4) # match any alphabetic character
## [1] 2
grep("[[:alnum:]]", x4) # match any alphanumeric character
## [1] 1 2 3
grep("[[:lower:]]", x4) # match any lowercase
## [1] 2
grep("[[:upper:]]", x4) # match any uppercase
## integer(0)
grep("[[:space:]]", x4) # match any space character
## [1] 2
grep("[[:punct:]]", x4) # match any punctuation
## [1] 3
Note that the interpretation of character classes depends on the locale.
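Character classes can be combined with other characters inside a single bracket expression, negated with a leading ^, and anchored to test a whole string. A brief sketch reusing the strings from above (the result names are illustrative):

```r
x4 <- c("1000", "one thousand", "1000.00")

only_digits    <- grep("^[[:digit:]]+$", x4)   # strings made up of digits only
digits_or_dots <- grep("^[[:digit:].]+$", x4)  # digits and literal dots only
has_non_digit  <- grep("[^[:digit:]]", x4)     # contains at least one non-digit
```

Inside a bracket expression the dot is a literal character, so it does not need to be escaped there.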
In the examples above, we found the elements of a vector that match a given pattern. Sometimes we might want to find the specific position within each element where the pattern occurs (e.g. to extract the matching pattern).
x5 <- c("Blue: 13",
"Red: 142",
"5 Yellow")
For example, to find the position of the integers within the strings above, we could use regexpr, as in
rx <- regexpr("[[:digit:]]+", x5) # find the first run of one or more digits
rx
## [1] 7 6 1
## attr(,"match.length")
## [1] 2 3 1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
The output tells us the position and length of the matching pattern within each element. We could then extract each matching pattern using regmatches, as in
regmatches(x5, rx)
## [1] "13" "142" "5"
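Note that regexpr reports only the first match in each element, and regmatches silently drops elements that have no match. To extract every match, gregexpr can be used instead, in which case regmatches returns a list with one entry per element. A small sketch with made-up strings:

```r
x <- c("Blue: 13, 7", "Red: 142", "Yellow")  # hypothetical example data

# First match per element; "Yellow" has no match and is dropped
first_match <- regmatches(x, regexpr("[[:digit:]]+", x))

# All matches per element, as a list the same length as x
all_matches <- regmatches(x, gregexpr("[[:digit:]]+", x))
```

Here all_matches[[1]] holds both "13" and "7", while the entry for "Yellow" is an empty character vector.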
Sometimes we want to extract a specific pattern, but only if it's preceded or followed by some other pattern. We can use look-ahead (?=...) or look-behind (?<=...) expressions to do so (note that to use these we must set the argument perl = TRUE). Here's an example where we want to extract numbers from strings, but only if they represent a dollar value (i.e. are preceded by "$" or followed by " dollars").
x6 <- c("Car: 8000 dollars",
"Computer: $500",
"1 table",
"4 chairs",
"18 dollars cash")
# match integers followed by " dollars"
rx1 <- regexpr("[[:digit:]]+(?= dollars)", x6, perl = TRUE)
regmatches(x6, rx1)
## [1] "8000" "18"
# match integers preceded by "$" ("$" is escaped because it is otherwise the end-of-string anchor)
rx2 <- regexpr("(?<=\\$)[[:digit:]]+", x6, perl = TRUE)
regmatches(x6, rx2)
## [1] "500"
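The two conditions can also be combined into a single pattern with the "or" symbol, so that one call extracts every dollar value. A sketch under the same assumptions as above (the names rx_both and dollar_values are illustrative):

```r
x6 <- c("Car: 8000 dollars",
        "Computer: $500",
        "1 table",
        "4 chairs",
        "18 dollars cash")

# digits preceded by a literal "$", or digits followed by " dollars"
rx_both <- regexpr("(?<=\\$)[[:digit:]]+|[[:digit:]]+(?= dollars)",
                   x6, perl = TRUE)
dollar_values <- regmatches(x6, rx_both)
```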