Commit 24cc373

Numbered exercises, broken links amended
episodes\01-regular-expressions.md
- Elevated "Learning common regex metacharacters" to improve the layout of the episode
- Numbered challenges / exercises (LibraryCarpentry#198)
- Removed decayed link, replaced with regex101.com library of community-submitted regex (LibraryCarpentry#210)

episodes\02-match-extract-strings.md
- Numbered exercises (LibraryCarpentry#198)
- Clarified that the learner should add a space after community for the first challenge in exercise 2.1 (LibraryCarpentry#220 (comment))
- Added exercise 2.4 on use of regex in R (self-organised workshop; this lesson is taught after https://datacarpentry.org/r-socialsci/)
1 parent 14a7da0 commit 24cc373

2 files changed: +96 -26 lines changed
episodes/01-regular-expressions.md

+20-18
@@ -46,7 +46,7 @@ Most regular expression implementations employ similar syntaxes and metacharacte
4646

4747
A very simple use of a regular expression would be to locate the same word spelled two different ways. For example the regular expression `organi[sz]e` matches both `organise` and `organize`. But because it locates all matches for the pattern in the file, not just for that word, it would also match `reorganise`, `reorganize`, `organises`, `organizes`, `organised`, `organized`, etc.
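
As a rough illustration of the behaviour described above, here is a minimal sketch using Python's built-in `re` module. The choice of Python and the sample sentence are assumptions made for this example; the lesson itself is tool-agnostic.

```python
import re

# Hypothetical sample text, invented for illustration only.
text = "We organise events, they reorganize teams, and she organized the files."

# organi[sz]e matches "organise" or "organize" wherever it occurs,
# including inside longer words such as "reorganize" and "organized".
matches = re.findall(r"organi[sz]e", text)
print(matches)

# Adding \w* on each side extends every match to the whole surrounding word,
# which makes it easier to see which words the pattern was found in.
whole_words = re.findall(r"\w*organi[sz]e\w*", text)
print(whole_words)
```

The first list shows only the matched fragment, while the second shows the longer words that contain it.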
4848

49-
### Learning common regex metacharacters
49+
## Learning common regex metacharacters
5050

5151
Square brackets can be used to define a list or range of characters to be found. So:
5252

@@ -100,6 +100,8 @@ Or, any other string that starts a line, begins with a letter `o` in lower or ca
100100

101101
::::::::::::::::::::::::::::::::::::::::::::::::::
102102

103+
## Additional regex metacharacters
104+
103105
Other useful special characters are:
104106

105107
- `*` matches the preceding element zero or more times. For example, ab\*c matches "ac", "abc", "abbbc", etc.
@@ -113,7 +115,7 @@ So, what are these going to match?
113115

114116
::::::::::::::::::::::::::::::::::::::: challenge
115117

116-
## `^[Oo]rgani.e\w*`
118+
## 1. `^[Oo]rgani.e\w*`
117119

118120
What will the regular expression `^[Oo]rgani.e\w*` match?
119121

@@ -138,7 +140,7 @@ Or, any other string that starts a line, begins with a letter `o` in lower or ca
138140

139141
::::::::::::::::::::::::::::::::::::::: challenge
140142

141-
## `[Oo]rgani.e\w+$`
143+
## 2. `[Oo]rgani.e\w+$`
142144

143145
What will the regular expression `[Oo]rgani.e\w+$` match?
144146

@@ -163,7 +165,7 @@ Or, any other string that ends a line, begins with a letter `o` in lower or capi
163165

164166
::::::::::::::::::::::::::::::::::::::: challenge
165167

166-
## `^[Oo]rgani.e\w?\b`
168+
## 3. `^[Oo]rgani.e\w?\b`
167169

168170
What will the regular expression `^[Oo]rgani.e\w?\b` match?
169171

@@ -188,7 +190,7 @@ Or, any other string that starts a line, begins with a letter `o` in lower or ca
188190

189191
::::::::::::::::::::::::::::::::::::::: challenge
190192

191-
## `^[Oo]rgani.e\w?$`
193+
## 4. `^[Oo]rgani.e\w?$`
192194

193195
What will the regular expression `^[Oo]rgani.e\w?$` match?
194196

@@ -213,7 +215,7 @@ Or, any other string that starts and ends a line, begins with a letter `o` in lo
213215

214216
::::::::::::::::::::::::::::::::::::::: challenge
215217

216-
## `\b[Oo]rgani.e\w{2}\b`
218+
## 5. `\b[Oo]rgani.e\w{2}\b`
217219

218220
What will the regular expression `\b[Oo]rgani.e\w{2}\b` match?
219221

@@ -238,7 +240,7 @@ Or, any other string that begins with a letter `o` in lower or capital case afte
238240

239241
::::::::::::::::::::::::::::::::::::::: challenge
240242

241-
## `\b[Oo]rgani.e\b|\b[Oo]rgani.e\w{1}\b`
243+
## 6. `\b[Oo]rgani.e\b|\b[Oo]rgani.e\w{1}\b`
242244

243245
What will the regular expression `\b[Oo]rgani.e\b|\b[Oo]rgani.e\w{1}\b` match?
244246

@@ -261,7 +263,7 @@ Or, any other string that begins with a letter `o` in lower or capital case afte
261263

262264
::::::::::::::::::::::::::::::::::::::::::::::::::
263265

264-
This logic is useful when you have lots of files in a directory, when those files have logical file names, and when you want to isolate a selection of files. It can be used for looking at cells in spreadsheets for certain values, or for extracting some data from a column of a spreadsheet to make new columns. There are many other contexts in which regex is useful when using a computer to search through a document, spreadsheet, or file structure. Some real-world use cases for regex are included on a [ACRL Tech Connect blog post](https://acrl.ala.org/techconnect/post/fear-no-longer-regular-expressions/) .
266+
This logic is useful when you have lots of files in a directory, when those files have logical file names, and when you want to isolate a selection of files. It can be used for looking at cells in spreadsheets for certain values, or for extracting some data from a column of a spreadsheet to make new columns. There are many other contexts in which regex is useful when using a computer to search through a document, spreadsheet, or file structure. You can browse real-world use cases in the [regex101.com library of community-submitted regex patterns](https://regex101.com/library).
265267

266268
To embed this knowledge we will not - however - be using computers. Instead we'll use pen and paper for now.
267269

@@ -275,7 +277,7 @@ Then test each other on the answers. If you want to check your logic use [regex1
275277

276278
::::::::::::::::::::::::::::::::::::::: challenge
277279

278-
## Using square brackets
280+
## 1. Using square brackets
279281

280282
What will the regular expression `Fr[ea]nc[eh]` match?
281283

@@ -300,7 +302,7 @@ Note that the way this regular expression is constructed, it will match misspell
300302

301303
::::::::::::::::::::::::::::::::::::::: challenge
302304

303-
## Using dollar signs
305+
## 2. Using dollar signs
304306

305307
What will the regular expression `Fr[ea]nc[eh]$` match?
306308

@@ -325,7 +327,7 @@ This will match the pattern only when it appears at the end of a line. It will a
325327

326328
::::::::::::::::::::::::::::::::::::::: challenge
327329

328-
## Introducing options
330+
## 3. Introducing options
329331

330332
What would match the strings `French` and `France` that appear at the beginning of a line?
331333

@@ -347,7 +349,7 @@ This will also find words where there were characters after `French` such as `Fr
347349

348350
::::::::::::::::::::::::::::::::::::::: challenge
349351

350-
## Case insensitivity
352+
## 4. Case insensitivity
351353

352354
How do you match the whole words `colour` and `color` (case insensitive)?
353355

@@ -375,7 +377,7 @@ so `/colou?r/i` will match all case insensitive variants of `colour` and `color`
375377

376378
::::::::::::::::::::::::::::::::::::::: challenge
377379

378-
## Word boundaries
380+
## 5. Word boundaries
379381

380382
How would you find the whole word `headrest` and/or `head rest` but not <code>head  rest</code> (that is, with two spaces between `head` and `rest`)?
381383

@@ -397,7 +399,7 @@ Note that although `\bhead\s?rest\b` does work, it will also match zero or one t
397399

398400
::::::::::::::::::::::::::::::::::::::: challenge
399401

400-
## Matching non-linguistic patterns
402+
## 6. Matching non-linguistic patterns
401403

402404
How would you find a string that ends with four letters preceded by at least one zero?
403405

@@ -415,7 +417,7 @@ How would you find a string that ends with four letters preceded by at least one
415417

416418
::::::::::::::::::::::::::::::::::::::: challenge
417419

418-
## Matching digits
420+
## 7. Matching digits
419421

420422
How do you match any four-digit string anywhere?
421423

@@ -437,7 +439,7 @@ Note: this will also match four-digit strings within longer strings of numbers a
437439

438440
::::::::::::::::::::::::::::::::::::::: challenge
439441

440-
## Matching dates
442+
## 8. Matching dates
441443

442444
How would you match the date format `dd-MM-yyyy`?
443445

@@ -459,7 +461,7 @@ Depending on your data, you may choose to remove the word bounding.
459461

460462
::::::::::::::::::::::::::::::::::::::: challenge
461463

462-
## Matching multiple date formats
464+
## 9. Matching multiple date formats
463465

464466
How would you match the date format `dd-MM-yyyy` or `dd-MM-yy` at the end of a line only?
465467

@@ -481,7 +483,7 @@ Note this will also find strings such as `31-01-198` at the end of a line, so yo
481483

482484
::::::::::::::::::::::::::::::::::::::: challenge
483485

484-
## Matching publication formats
486+
## 10. Matching publication formats
485487

486488
How would you match publication formats such as `British Library : London, 2015` and `Manchester University Press: Manchester, 1999`?
487489

episodes/02-match-extract-strings.md

+76-8
@@ -17,21 +17,21 @@ exercises: 30
1717

1818
::::::::::::::::::::::::::::::::::::::::::::::::::
1919

20-
## Exercise Using Regex101.com
20+
## Exercise: Using Regex101.com
2121

2222
For this exercise, open a browser and go to [https://regex101.com](https://regex101.com). Regex101.com is a free regular expression debugger with real time explanation, error detection, and highlighting.
2323

2424
Open the [swcCoC.md file](https://github.com/LibraryCarpentry/lc-data-intro/tree/main/episodes/data/swcCoC.md), copy the text, and paste that into the test string box.
2525

26-
For a quick test to see if it is working, type the string `community` into the regular expression box.
26+
For a quick test to see if it is working, type the string `community ` into the regular expression box.
2727

2828
If you look in the box on the right of the screen, you see that the expression matches six instances of the string 'community' (the instances are also highlighted within the text).
2929

3030
::::::::::::::::::::::::::::::::::::::: challenge
3131

3232
### Taking spaces into consideration
3333

34-
Type `community `. You get three matches. Why not six?
34+
Add a space after `community`. You get three matches. Why not six?
3535

3636
::::::::::::::: solution
3737

@@ -135,7 +135,7 @@ Find all of the words starting with Comm or comm that are plural.
135135

136136
::::::::::::::::::::::::::::::::::::::::::::::::::
137137

138-
## Exercise finding email addresses using regex101.com
138+
## Exercise: finding email addresses
139139

140140
For this exercise, open a browser and go to [https://regex101.com](https://regex101.com).
141141

@@ -217,7 +217,7 @@ See the previous exercise for the explanation of the expression up to the `+`
217217

218218
::::::::::::::::::::::::::::::::::::::::::::::::::
219219

220-
## Exercise finding phone numbers, Using regex101.com
220+
## Exercise: finding phone numbers
221221

222222
Does this Code of Conduct contain a phone number?
223223

@@ -355,9 +355,79 @@ This expression should find one match in the document.
355355

356356
One of the reasons we stress the value of consistent and predictable directory and filenaming conventions is that working in this way enables you to use the computer to select files based on the characteristics of their file names. For example, if you have a bunch of files where the first four digits are the year and you only want to do something with files from '2017', then you can. Or if you have 'journal' somewhere in a filename when you have data about journals, you can use the computer to select just those files. Equally, using plain text formats means that you can go further and select files or elements of files based on characteristics of the data *within* those files. See Workshop Overview: [File Naming \& Formatting](https://librarycarpentry.org/lc-overview/06-file-naming-formatting) for further background.
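
The file-selection idea above can be sketched as follows. This is only an illustration in Python's built-in `re` module, and the file names are hypothetical, invented to mirror the "year first" and "journal somewhere in the name" examples.

```python
import re

# Hypothetical file names, invented for illustration only.
filenames = [
    "2016-journal-usage.csv",
    "2017-journal-usage.csv",
    "2017-ebook-loans.tsv",
    "notes-2017.txt",
]

# ^2017 keeps only names whose first four characters are the year 2017.
from_2017 = [name for name in filenames if re.search(r"^2017", name)]
print(from_2017)

# Without an anchor, "journal" may appear anywhere in the file name.
journals = [name for name in filenames if re.search(r"journal", name)]
print(journals)
```

Note that `notes-2017.txt` is not selected by the first pattern, because the `^` anchor requires the year to start the name.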
357357

358+
359+
::::::::::::::::::::::::::::::::::::::::::::::::::
360+
## Exercise: Extracting substrings in R using regex
361+
362+
You can use regular expressions in many functions in base R, for example **`grep`** and **`sub`**. We will look at some functions from **`stringr`**, a powerful package for character strings that works well with packages we saw already like **`dplyr`** and **`tidyr`**. To learn more about **`stringr`** after the workshop, you may want to check out this handy [string manipulation with stringr cheatsheet](https://rstudio.github.io/cheatsheets/html/strings.html).
363+
364+
We will look at just two functions in **`stringr`** that can take regular expressions as an argument:
365+
366+
* `str_extract(string, pattern)`: return the first pattern match found in each string, as a vector
367+
368+
* `str_replace(string, pattern, replacement)`: finds the first pattern match in a string, and replaces it with a replacement string
369+
370+
These functions will return the first pattern match only. To return all possible matches, we can use **`str_extract_all()`** and **`str_replace_all()`**.
371+
372+
::::::::::::::::::::::::::::::::::::::::: callout
373+
### ESCAPING METACHARACTERS IN REGULAR EXPRESSIONS IN R
374+
Regular expressions in R follow the general syntax we have seen so far, with one main exception. In R, strings use a backslash `\` to escape special behavior - but regular expressions are themselves written as strings in R. This creates a problem when we use a metacharacter, like **`\d`**: R tries to read it as a string escape sequence and, because `\d` is not a valid one, reports an error. We get around this by using an extra `\` beforehand, like **`\\d`**.
358375
::::::::::::::::::::::::::::::::::::::::::::::::::
359376

360-
## Extracting a substring in Google Sheets using regex
377+
Let's build a regular expression to extract the month from a date, written as a string in the format YYYY-MM-DD:
378+
379+
```R
380+
library(stringr)
381+
str_extract("2024-05-09", "-\\d{2}-")
382+
```
383+
```output
384+
[1] "-05-"
385+
```
386+
This returns the month (MM), with the dashes on either side. There are more advanced ways to remove these, but a simple approach would be to use **`str_replace_all()`** to replace every "-" with "" (that is, nothing):
387+
```R
388+
str_replace_all("-05-", "-", "")
389+
```
390+
```output
391+
[1] "05"
392+
```
393+
We could go one step further and complete both steps in one line of code by wrapping our first **`str_extract()`** function inside **`str_replace_all()`**:
394+
```R
395+
str_replace_all(str_extract("2024-05-09", "-\\d{2}-"), "-", "")
396+
```
397+
```output
398+
[1] "05"
399+
```
400+
::::::::::::::::::::::::::::::::::::::: challenge
401+
402+
### Using regex on a dataframe in R with stringr
403+
404+
Open the "SAFI_clean.csv" dataset we worked with on days 1 and 2 in R. It contains a column "interview_date" with the date of each interview in the format "YYYY-MM-DD". How can we apply the regular expression we just built to add a new column "interview_month" containing just the month (MM)?
405+
406+
Hint: the `mutate()` function from **`dplyr`** can create new columns containing modified values.
407+
408+
::::::::::::::: solution
409+
### Solution
410+
```R
411+
library(tidyverse)
412+
library(here)
413+
414+
# Read the data
415+
interviews <- read_csv(
416+
here("data", "SAFI_clean.csv"),
417+
na = "NULL")
418+
419+
# Add a new column "interview_month", containing the month (MM) extracted from the interview_date (YYYY-MM-DD)
420+
df <- interviews %>%
421+
mutate(interview_month = str_replace_all(str_extract(interview_date,"-(\\d{2})-"),"-", ""))
422+
```
423+
:::::::::::::::::::::::::
424+
425+
::::::::::::::::::::::::::::::::::::::::::::::::::
426+
427+
## Exercise: Extracting substrings in Google Sheets using regex
428+
429+
You can also use regular expressions in Google Sheets.
430+
361431

362432
::::::::::::::::::::::::::::::::::::::: challenge
363433

@@ -381,9 +451,7 @@ This is one way to solve this challenge. You might have found others. Inside the
381451
Latitude and longitude are in decimal degree format and can be positive or negative, so we start with an optional dash for negative values, then use `\d+` for a match of one or more digits followed by a period `\.`. Note we had to escape the period using `\`. After the period we look for one or more digits `\d+` again, followed by a literal comma `,`. We then have a literal space match followed by an optional dash `-` (there are a few `0.0` latitude/longitudes that are probably errors, but we'd want to retain them so we can deal with them). We then repeat the `\d+\.\d+` we used for the latitude match.
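
To check this reasoning outside Google Sheets, the description above can be sketched in Python's built-in `re` module. The pattern below is reconstructed from the prose (an assumption, since the original expression is not shown in this excerpt), and the cell values are hypothetical.

```python
import re

# Pattern reconstructed from the description above (an assumption):
# optional dash, digits, escaped period, digits, comma, space, then the same again.
coord_pattern = re.compile(r"-?\d+\.\d+, -?\d+\.\d+")

# Hypothetical cell values, invented for illustration only.
cells = [
    "site A (-0.1521, 37.3084)",
    "site B -19.1125, -33.4839 revisited",
    "no coordinates here",
]

# Collect the first coordinate pair found in each cell, or None if there is none.
found = []
for cell in cells:
    match = coord_pattern.search(cell)
    found.append(match.group(0) if match else None)

print(found)
```

Cells without a coordinate pair yield `None`, which makes them easy to spot and review.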
382452

383453
:::::::::::::::::::::::::
384-
385454
::::::::::::::::::::::::::::::::::::::::::::::::::
386-
387455
:::::::::::::::::::::::::::::::::::::::: keypoints
388456

389457
- Regular expressions are useful for searching and cleaning data.
