episodes\01-regular-expressions.md
- Elevated "Learning common regex metacharacters" to improve the layout of the episode
- Numbered challenges / exercises (LibraryCarpentry#198)
- Removed decayed link, replaced with regex101.com library of community-submitted regex (LibraryCarpentry#210)
episodes\01-regular-expressions.md
- Numbered exercises (LibraryCarpentry#198)
- Clarified that the learner should add a space after `community` for the first challenge in exercise 2.1 (LibraryCarpentry#220 (comment))
- Added exercise 2.4 on use of regex in R (self-organised workshop, this lesson is taught after https://datacarpentry.org/r-socialsci/)
episodes/01-regular-expressions.md (+20 -18)
@@ -46,7 +46,7 @@ Most regular expression implementations employ similar syntaxes and metacharacte
A very simple use of a regular expression would be to locate the same word spelled two different ways. For example the regular expression `organi[sz]e` matches both `organise` and `organize`. But because it locates all matches for the pattern in the file, not just for that word, it would also match `reorganise`, `reorganize`, `organises`, `organizes`, `organised`, `organized`, etc.
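The behaviour described above can be checked with a short Python sketch. The word list is an illustrative assumption, not part of the lesson's data, and Python's `re` module stands in for whichever regex tool you are using.

```python
import re

# Illustrative word list (an assumption, not taken from the lesson files).
words = ["organise", "organize", "reorganise", "organized", "organic"]

# [sz] accepts either an "s" or a "z" in that one position.
pattern = re.compile(r"organi[sz]e")

# search() looks for the pattern anywhere inside each string,
# so longer words that contain it are matched as well.
matches = [w for w in words if pattern.search(w)]
print(matches)
```

Here `organic` is the only word left out, because the character after `organi` is neither `s` nor `z`.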

-###Learning common regex metacharacters
+## Learning common regex metacharacters

Square brackets can be used to define a list or range of characters to be found. So:
@@ -100,6 +100,8 @@ Or, any other string that starts a line, begins with a letter `o` in lower or ca
-This logic is useful when you have lots of files in a directory, when those files have logical file names, and when you want to isolate a selection of files. It can be used for looking at cells in spreadsheets for certain values, or for extracting some data from a column of a spreadsheet to make new columns. There are many other contexts in which regex is useful when using a computer to search through a document, spreadsheet, or file structure. Some real-world use cases for regex are included on a [ACRL Tech Connect blog post](https://acrl.ala.org/techconnect/post/fear-no-longer-regular-expressions/).
+This logic is useful when you have lots of files in a directory, when those files have logical file names, and when you want to isolate a selection of files. It can be used for looking at cells in spreadsheets for certain values, or for extracting some data from a column of a spreadsheet to make new columns. There are many other contexts in which regex is useful when using a computer to search through a document, spreadsheet, or file structure. You can browse real-world use cases in the [regex101.com library of community-submitted regex patterns](https://regex101.com/library).

To embed this knowledge we will not - however - be using computers. Instead we'll use pen and paper for now.
@@ -275,7 +277,7 @@ Then test each other on the answers. If you want to check your logic use [regex1

::::::::::::::::::::::::::::::::::::::: challenge

-## Using square brackets
+## 1. Using square brackets

What will the regular expression `Fr[ea]nc[eh]` match?

@@ -300,7 +302,7 @@ Note that the way this regular expression is constructed, it will match misspell

::::::::::::::::::::::::::::::::::::::: challenge

-## Using dollar signs
+## 2. Using dollar signs

What will the regular expression `Fr[ea]nc[eh]$` match?

@@ -325,7 +327,7 @@ This will match the pattern only when it appears at the end of a line. It will a

::::::::::::::::::::::::::::::::::::::: challenge

-## Introducing options
+## 3. Introducing options

What would match the strings `French` and `France` that appear at the beginning of a line?

@@ -347,7 +349,7 @@ This will also find words where there were characters after `French` such as `Fr

::::::::::::::::::::::::::::::::::::::: challenge

-## Case insensitivity
+## 4. Case insensitivity

How do you match the whole words `colour` and `color` (case insensitive)?

@@ -375,7 +377,7 @@ so `/colou?r/i` will match all case insensitive variants of `colour` and `color`

::::::::::::::::::::::::::::::::::::::: challenge

-## Word boundaries
+## 5. Word boundaries

How would you find the whole word `headrest` or `head rest` but not <code>head  rest</code> (that is, with two spaces between `head` and `rest`)?

@@ -397,7 +399,7 @@ Note that although `\bhead\s?rest\b` does work, it will also match zero or one t

::::::::::::::::::::::::::::::::::::::: challenge

-## Matching non-linguistic patterns
+## 6. Matching non-linguistic patterns

How would you find a string that ends with four letters preceded by at least one zero?

@@ -415,7 +417,7 @@ How would you find a string that ends with four letters preceded by at least one

::::::::::::::::::::::::::::::::::::::: challenge

-## Matching digits
+## 7. Matching digits

How do you match any four-digit string anywhere?

@@ -437,7 +439,7 @@ Note: this will also match four-digit strings within longer strings of numbers a

::::::::::::::::::::::::::::::::::::::: challenge

-## Matching dates
+## 8. Matching dates

How would you match the date format `dd-MM-yyyy`?

@@ -459,7 +461,7 @@ Depending on your data, you may choose to remove the word bounding.

::::::::::::::::::::::::::::::::::::::: challenge

-## Matching multiple date formats
+## 9. Matching multiple date formats

How would you match the date format `dd-MM-yyyy` or `dd-MM-yy` at the end of a line only?

@@ -481,7 +483,7 @@ Note this will also find strings such as `31-01-198` at the end of a line, so yo

::::::::::::::::::::::::::::::::::::::: challenge

-## Matching publication formats
+## 10. Matching publication formats

How would you match publication formats such as `British Library : London, 2015` and `Manchester University Press: Manchester, 1999`?
For this exercise, open a browser and go to [https://regex101.com](https://regex101.com). Regex101.com is a free regular expression debugger with real time explanation, error detection, and highlighting.

Open the [swcCoC.md file](https://github.com/LibraryCarpentry/lc-data-intro/tree/main/episodes/data/swcCoC.md), copy the text, and paste that into the test string box.

-For a quick test to see if it is working, type the string `community` into the regular expression box.
+For a quick test to see if it is working, type the string `community` into the regular expression box.

If you look in the box on the right of the screen, you see that the expression matches six instances of the string 'community' (the instances are also highlighted within the text).

::::::::::::::::::::::::::::::::::::::: challenge

### Taking spaces into consideration

-Type `community`. You get three matches. Why not six?
+Add a space after `community`. You get three matches. Why not six?

::::::::::::::: solution

@@ -135,7 +135,7 @@ Find all of the words starting with Comm or comm that are plural.
-## Exercise finding phone numbers, Using regex101.com
+## Exercise: finding phone numbers

Does this Code of Conduct contain a phone number?

@@ -355,9 +355,79 @@ This expression should find one match in the document.
One of the reasons we stress the value of consistent and predictable directory and filenaming conventions is that working in this way enables you to use the computer to select files based on the characteristics of their file names. For example, if you have a bunch of files where the first four digits are the year and you only want to do something with files from '2017', then you can. Or if you have 'journal' somewhere in a filename when you have data about journals, you can use the computer to select just those files. Equally, using plain text formats means that you can go further and select files or elements of files based on characteristics of the data *within* those files. See Workshop Overview: [File Naming \& Formatting](https://librarycarpentry.org/lc-overview/06-file-naming-formatting) for further background.
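As a rough sketch of that idea, the Python snippet below selects files by name. The file names are hypothetical and only assume a convention where the year comes first.

```python
import re

# Hypothetical file names that follow a "year first" naming convention.
filenames = [
    "2017-01-journal-usage.csv",
    "2017-06-loans.csv",
    "2016-12-journal-usage.csv",
    "notes-2017.txt",
]

# ^ anchors the match to the start of the name, so "notes-2017.txt" is skipped.
from_2017 = [f for f in filenames if re.search(r"^2017", f)]

# Without an anchor, "journal" may appear anywhere in the name.
journals = [f for f in filenames if re.search(r"journal", f)]

print(from_2017)
print(journals)
```

The anchor is what makes the naming convention pay off: a predictable position for the year lets a very short pattern do the selection.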
## Exercise: Extracting substrings in R using regex
+
+You can use regular expressions in many functions in base R, for example **`grep`** and **`sub`**. We will look at some functions from **`stringr`**, a powerful package for character strings that works well with packages we saw already like **`dplyr`** and **`tidyr`**. To learn more about **`stringr`** after the workshop, you may want to check out this handy [string manipulation with stringr cheatsheet](https://rstudio.github.io/cheatsheets/html/strings.html).
+
+We will look at just two functions in **`stringr`** that can take regular expressions as an argument:
+
+* `str_extract(string, pattern)`: returns the first pattern match found in each string, as a vector
+
+* `str_replace(string, pattern, replacement)`: finds the first pattern match in a string and replaces it with a replacement string
+
+These functions will return the first pattern match only. To return all possible matches, we can use **`str_extract_all()`** and **`str_replace_all()`**.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+### ESCAPING METACHARACTERS IN REGULAR EXPRESSIONS IN R
+Regular expressions in R follow the general syntax we have seen so far, with one main exception. In R, strings use a backslash `\` to escape special behavior, and regular expressions are themselves written as strings. This creates a problem when we use a metacharacter like **`\d`**: R tries to read `\d` as a string escape and rejects it as unrecognized. We get around this by adding an extra `\` beforehand, as in **`\\d`**.
## Extracting a substring in Google Sheets using regex
+Let's build a regular expression to extract the month from a date, written as a string in the format YYYY-MM-DD:
+
+```R
+library(stringr)
+str_extract("2024-05-09", "-\\d{2}-")
+```
+```output
+[1] "-05-"
+```
+This returns the month (MM), with the dashes on either side. There are more advanced ways to remove these, but a simple approach would be to use **`str_replace_all()`** to replace every "-" with "" (that is, nothing):
+```R
+str_replace_all("-05-", "-", "")
+```
+```output
+[1] "05"
+```
+We could go one step further and complete both steps in one line of code by wrapping our first **`str_extract()`** function inside **`str_replace_all()`**
Open the "SAFI_clean.csv" dataset we worked with on days 1 and 2 in R. It contains a column "interview_date" with the date of each interview in the format "YYYY-MM-DD". How can we apply the regular expression we just built to add a new column "interview_month" containing just the month (MM)?
+
+Hint: the mutate() function from dplyr can create new columns containing modified values
+
+::::::::::::::: solution
+### Solution
+```R
+library(tidyverse)
+library(here)
+
+#Read the data
+interviews<- read_csv(
+  here("data", "SAFI_clean.csv"),
+  na="NULL")
+
+# Add a new column "interview_month", containing the month (MM) extracted from the interview_date (YYYY-MM-DD)
## Exercise: Extracting substrings in Google Sheets using regex
+
+You can also use regular expressions in Google Sheets.
+

::::::::::::::::::::::::::::::::::::::: challenge

@@ -381,9 +451,7 @@ This is one way to solve this challenge. You might have found others. Inside the
Latitude and longitude are in decimal degree format and can be positive or negative, so we start with an optional dash for negative values, then use `\d+` to match one or more digits, followed by a period `\.`. Note that we had to escape the period using `\`. After the period we look for one or more digits `\d+` again, followed by a literal comma `,`. We then match a literal space followed by an optional dash `-` (there are a few `0.0` latitude/longitude values that are probably errors, but we want to retain them so we can deal with them). We then repeat the `\d+\.\d+` we used for the latitude match.
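The pattern described in this explanation can be sketched in Python as follows. The expression is reconstructed from the prose above rather than copied from the lesson's solution, and the sample text is made up for illustration.

```python
import re

# Reconstructed from the description: optional dash, digits, escaped period,
# digits, a literal comma and space, then the same shape again for longitude.
coordinate = re.compile(r"-?\d+\.\d+, -?\d+\.\d+")

# Made-up sample text (an assumption for this sketch).
text = "Stations: 51.5074, -0.1278; -33.8688, 151.2093; 0.0, 0.0; bad 51, 12"

found = coordinate.findall(text)
print(found)
```

The last pair, `51, 12`, is not matched because it has no decimal part, while the `0.0, 0.0` pair is retained so it can be reviewed later.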
0 commit comments