Commit fdac247

committed
source commit: 4670617
0 parents  commit fdac247

19 files changed

+1760
-0
lines changed

01-regular-expressions.md

+510
Large diffs are not rendered by default.

02-match-extract-strings.md

+395
@@ -0,0 +1,395 @@
1+
---
2+
title: Matching & Extracting Strings
3+
teaching: 0
4+
exercises: 30
5+
---
6+
7+
::::::::::::::::::::::::::::::::::::::: objectives
8+
9+
- Use regular expressions to match words, email addresses, and phone numbers.
10+
- Use regular expressions to extract substrings from strings (e.g. addresses).
11+
12+
::::::::::::::::::::::::::::::::::::::::::::::::::
13+
14+
:::::::::::::::::::::::::::::::::::::::: questions
15+
16+
- How can you use regular expressions to match and extract strings?
17+
18+
::::::::::::::::::::::::::::::::::::::::::::::::::
19+
20+
## Exercise Using Regex101.com
21+
22+
For this exercise, open a browser and go to [https://regex101.com](https://regex101.com). Regex101.com is a free regular expression debugger with real time explanation, error detection, and highlighting.
23+
24+
Open the [swcCoC.md file](https://github.com/LibraryCarpentry/lc-data-intro/tree/main/episodes/data/swcCoC.md), copy the text, and paste that into the test string box.
25+
26+
For a quick test to see if it is working, type the string `community` into the regular expression box.
27+
28+
If you look in the box on the right of the screen, you see that the expression matches six instances of the string 'community' (the instances are also highlighted within the text).
29+
30+
::::::::::::::::::::::::::::::::::::::: challenge
31+
32+
### Taking spaces into consideration
33+
34+
Type `community `. You get three matches. Why not six?
35+
36+
::::::::::::::: solution
37+
38+
### Solution
39+
40+
The string 'community-led' matches the first search, but drops out of this result because the space does not match the character `-`.
41+
42+
:::::::::::::::::::::::::
43+
44+
::::::::::::::::::::::::::::::::::::::::::::::::::
45+
46+
::::::::::::::::::::::::::::::::::::::: challenge
47+
48+
### Taking any character into consideration
49+
50+
If you want to match 'community-led' by adding other regex characters to the expression `community`, what would they be?
51+
52+
::::::::::::::: solution
53+
54+
### Solution
55+
56+
For instance, adding `\S+\b` gives `community\S+\b`. This matches 'community' followed by one or more non-space characters, ending at a word boundary.
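
As a rough illustration outside regex101, the same two searches can be tried with Python's `re` module. The sample sentence is made up for this sketch and is not taken from swcCoC.md.

```python
import re

# Hypothetical sample text, not the contents of swcCoC.md.
text = "Our community values a community-led process."

# The literal pattern followed by a space skips 'community-led'.
with_space = re.findall(r"community ", text)

# Adding \S+\b extends the match across the hyphen to the end of the word.
extended = re.findall(r"community\S+\b", text)

print(with_space)  # ['community ']
print(extended)    # ['community-led']
```

Note that `community\S+\b` needs at least one non-space character after 'community', so it no longer matches the plain word followed by a space.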
57+
58+
:::::::::::::::::::::::::
59+
60+
::::::::::::::::::::::::::::::::::::::::::::::::::
61+
62+
::::::::::::::::::::::::::::::::::::::: challenge
63+
64+
### Exploring effect of expressions matching different words
65+
66+
Change the expression to `communi` and you get 15 full matches of several words. Why?
67+
68+
::::::::::::::: solution
69+
70+
### Solution
71+
72+
Because the string 'communi' is present in all of those words, including `communi`cation and `communi`ty. Because the expression does not have a word boundary, this expression would also match in`communi`cado, were it present in this text. If you want to test this, type `incommunicado` into the text somewhere and see if it is found.
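
A small Python sketch of the same idea, using a made-up sentence (an assumption for illustration only). The `\w*` on either side is added here only so that the whole word is printed instead of just the fragment `communi`.

```python
import re

# Hypothetical sample text containing three words that include 'communi'.
text = "Clear communication helps the community; nobody is incommunicado."

# Without a word boundary, 'communi' is found inside all three words.
no_boundary = re.findall(r"\w*communi\w*", text)

# With \b in front, only words that start with 'communi' are found.
with_boundary = re.findall(r"\bcommuni\w*", text)

print(no_boundary)    # ['communication', 'community', 'incommunicado']
print(with_boundary)  # ['communication', 'community']
```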
73+
74+
:::::::::::::::::::::::::
75+
76+
::::::::::::::::::::::::::::::::::::::::::::::::::
77+
78+
::::::::::::::::::::::::::::::::::::::: challenge
79+
80+
### Taking capitalization into consideration
81+
82+
Type the expression `[Cc]ommuni`. You get 16 matches. Why?
83+
84+
::::::::::::::: solution
85+
86+
### Solution
87+
88+
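The character class `[Cc]` matches either a capital `C` or a lowercase `c`, so the expression finds every string beginning with `Communi` or `communi`: the 15 lowercase matches from the previous challenge plus any that start with a capital `C`, 16 in total.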
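
The effect of the character class can be sketched in Python with a made-up sentence (an assumption, not the lesson text):

```python
import re

# Hypothetical sample text with one capitalized and two lowercase instances.
text = "Community members welcome clear communication in our community."

# A lowercase-only pattern misses the capitalized word.
lower_only = re.findall(r"communi", text)

# [Cc] accepts either case for the first letter.
either_case = re.findall(r"[Cc]ommuni", text)

print(len(lower_only))   # 2
print(len(either_case))  # 3
```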
89+
90+
:::::::::::::::::::::::::
91+
92+
::::::::::::::::::::::::::::::::::::::::::::::::::
93+
94+
::::::::::::::::::::::::::::::::::::::: challenge
95+
96+
### Regex characters that indicate location
97+
98+
Type the expression `^[Cc]ommuni`. You get no matches. Why?
99+
100+
::::::::::::::: solution
101+
102+
### Solution
103+
104+
There is no matching string present at the start of a line. Look at the text and replace the string after the `^` with something that matches a word at the start of a line. Does it find a match?
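
A minimal Python sketch of the `^` anchor, using a made-up two-line string. One assumption to note: in Python, `^` means "start of a line" only when `re.MULTILINE` is passed; without that flag it matches only at the very start of the string.

```python
import re

# Hypothetical two-line sample; the second line starts with 'communication'.
text = "Our community\ncommunication matters."

# Without MULTILINE, ^ anchors to the start of the whole string ('Our ...').
string_start = re.findall(r"^[Cc]ommuni", text)

# With MULTILINE, ^ also matches right after each newline.
line_start = re.findall(r"^[Cc]ommuni", text, flags=re.MULTILINE)

print(string_start)  # []
print(line_start)    # ['communi']
```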
105+
106+
:::::::::::::::::::::::::
107+
108+
::::::::::::::::::::::::::::::::::::::::::::::::::
109+
110+
::::::::::::::::::::::::::::::::::::::: challenge
111+
112+
### Finding plurals
113+
114+
Find all of the words starting with Comm or comm that are plural.
115+
116+
::::::::::::::: solution
117+
118+
### Solution
119+
120+
```
121+
[Cc]omm\w+s\b
122+
```
123+
124+
`[Cc]` finds capital and lowercase `c`
125+
126+
`omm` is a straightforward character match
127+
128+
`\w+` matches one or more word characters (`\w` is a word character and `+` repeats the preceding element one or more times)
129+
130+
`s` is a straightforward character match
131+
132+
`\b` ensures the 's' is located at the end of the word.
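
To see the expression at work outside regex101, here is a short Python sketch on a made-up sentence (an assumption for illustration). It also shows a limitation: the pattern finds words that end in 's', which includes verbs such as 'commits', not only true plurals.

```python
import re

# Hypothetical sample text.
text = "Committees and communities share comments, but one community commits code."

# Words starting with Comm/comm and ending in 's'.
plurals = re.findall(r"[Cc]omm\w+s\b", text)

print(plurals)  # ['Committees', 'communities', 'comments', 'commits']
```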
133+
134+
:::::::::::::::::::::::::
135+
136+
::::::::::::::::::::::::::::::::::::::::::::::::::
137+
138+
## Exercise finding email addresses using regex101.com
139+
140+
For this exercise, open a browser and go to [https://regex101.com](https://regex101.com).
141+
142+
Open the [swcCoC.md file](https://github.com/LibraryCarpentry/lc-data-intro/tree/main/episodes/data/swcCoC.md), copy it, and paste it into the test string box.
143+
144+
::::::::::::::::::::::::::::::::::::::: challenge
145+
146+
### Start with what you know
147+
148+
What character do all email addresses have in common?
149+
150+
::::::::::::::: solution
151+
152+
### Solution
153+
154+
The '@' character.
155+
156+
:::::::::::::::::::::::::
157+
158+
::::::::::::::::::::::::::::::::::::::::::::::::::
159+
160+
::::::::::::::::::::::::::::::::::::::: challenge
161+
162+
### Add to what you know
163+
164+
The string before the "@" could contain any kind of word character, special character or digit in any combination and length. How would you express this in regex? Hint: addresses often have a dash (`-`) or dot (`.`) in them, and neither of these is included in the word character expression (`\w`). How do you capture this in the expression?
165+
166+
::::::::::::::: solution
167+
168+
### Solution
169+
170+
```
171+
[\w.-]+@
172+
```
173+
174+
`\w` matches any word character (including digits and underscore)
175+
176+
`.` matches a literal period (when used in between square brackets, `.` does not mean "any character", it literally means ".")
177+
178+
`-` matches a dash (placed last inside the square brackets, it is read as a literal dash rather than as a range)
179+
180+
`[]` the brackets enclose a character class, which matches any one of the listed characters: a word character OR a dot OR a dash.
181+
182+
`+` repeats the character class 1 or more times, so any mix of word characters, dots, and dashes is matched
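
The difference between `\w+@` and `[\w.-]+@` can be sketched in Python. The addresses below are invented placeholders on the reserved `example.org` domain, not addresses from the Code of Conduct.

```python
import re

# Hypothetical sample text with a dot and a dash before the '@'.
text = "Write to jane.doe@example.org or help-desk@example.org."

# \w alone stops at the dot or dash, so only part of the name is captured.
word_only = re.findall(r"\w+@", text)

# The character class also accepts '.' and '-'.
local_parts = re.findall(r"[\w.-]+@", text)

print(word_only)    # ['doe@', 'desk@']
print(local_parts)  # ['jane.doe@', 'help-desk@']
```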
183+
184+
:::::::::::::::::::::::::
185+
186+
::::::::::::::::::::::::::::::::::::::::::::::::::
187+
188+
::::::::::::::::::::::::::::::::::::::: challenge
189+
190+
### Finish the expression
191+
192+
The string after the "@" could contain any kind of word character, special character or digit in any combination and length as well as the dash. In addition, we know that it will have some characters after a period (`.`). Most common top-level domains (the characters after the final period, such as `org` or `com`) have two or three characters, but many more are now possible. Find the latest list [here](https://stats.research.icann.org/dns/tld_report/). What expression would capture this? Hint: the `.` is also a metacharacter, so you will have to use the escape `\` to express a literal period. Note: for the string after the period, we did not try to match a `-` character, since those rarely appear in the characters after the period at the end of an email address.
193+
194+
::::::::::::::: solution
195+
196+
### Solution
197+
198+
```
199+
[\w.-]+\.\w{2,3} OR [\w.-]+\.\w+
200+
```
201+
202+
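The block above shows two alternative expressions; the word `OR` separates them and is not part of either regex. Appending either one to `[\w.-]+@` from the previous exercise gives a complete email pattern. See the previous exercise for the explanation of the expression up to the first `+`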
203+
204+
`\.` matches the literal period ('.') rather than acting as the metacharacter `.`, which matches any character
205+
206+
`\w` matches any word character (including digits and underscore)
207+
208+
`+` after `\w` (in the second alternative) repeats the word character 1 or more times.
209+
210+
`{2,3}` limits the number of word characters and/or digits to a two or three-character string.
211+
212+
`[]` the brackets enclose a character class, which matches any one of the listed characters: a word character OR a dot OR a dash.
213+
214+
`+` after the brackets repeats the character class 1 or more times
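
Putting the pieces together, here is a Python sketch that joins the before-"@" part from the previous exercise with each of the two endings. The addresses are invented placeholders, and the variable names are chosen only for this illustration.

```python
import re

# Hypothetical sample text; '.museum' is a top-level domain longer than three letters.
text = "Contact jane.doe@example.org or help-desk@mail.example.museum today."

# Ending limited to two or three word characters.
short_tld = re.findall(r"[\w.-]+@[\w.-]+\.\w{2,3}", text)

# Ending with one or more word characters.
any_tld = re.findall(r"[\w.-]+@[\w.-]+\.\w+", text)

print(short_tld)  # ['jane.doe@example.org', 'help-desk@mail.example.mus']
print(any_tld)    # ['jane.doe@example.org', 'help-desk@mail.example.museum']
```

The `{2,3}` version cuts the longer domain short, which is why the looser `\w+` ending can be the safer choice when longer top-level domains are expected.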
215+
216+
:::::::::::::::::::::::::
217+
218+
::::::::::::::::::::::::::::::::::::::::::::::::::
219+
220+
## Exercise finding phone numbers using regex101.com
221+
222+
Does this Code of Conduct contain a phone number?
223+
224+
What to consider:
225+
226+
1. It may or may not have a country code, perhaps starting with a "+".
227+
2. It will have an area code, potentially enclosed in parentheses.
228+
3. It may have the sections all separated with a "-".
229+
230+
::::::::::::::::::::::::::::::::::::::: challenge
231+
232+
### Start with what you know: find strings of digits
233+
234+
Start with what we know, which is the most basic format of a phone number: three digits, a dash, and four digits. How would we write a regex expression that matches this?
235+
236+
::::::::::::::: solution
237+
238+
### Solution
239+
240+
```
241+
\d{3}-\d{4}
242+
```
243+
244+
`\d` matches digits
245+
246+
`{3}` matches 3 digits
247+
248+
`-` matches the character '-'
249+
250+
`\d` matches any digit
251+
252+
`{4}` matches 4 digits.
253+
254+
This expression should find three matches in the document.
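
A quick Python sketch of this pattern on made-up phone numbers (the sample string is an assumption, not the lesson text). It shows that the expression also matches the last seven digits of longer numbers.

```python
import re

# Hypothetical sample text with three differently formatted numbers.
text = "Call 555-0199, or (555) 555-0142, or 555-555-0123."

# Three digits, a dash, four digits.
seven_digit = re.findall(r"\d{3}-\d{4}", text)

print(seven_digit)  # ['555-0199', '555-0142', '555-0123']
```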
255+
256+
:::::::::::::::::::::::::
257+
258+
::::::::::::::::::::::::::::::::::::::::::::::::::
259+
260+
::::::::::::::::::::::::::::::::::::::: challenge
261+
262+
### Match a string that includes an area code with a dash
263+
264+
Start with what we know, which is the most basic format of a phone number: three digits, a dash, and four digits. How would we expand the expression to include an area code (three digits and a dash)?
265+
266+
::::::::::::::: solution
267+
268+
### Solution
269+
270+
```
271+
\d{3}-\d{3}-\d{4}
272+
```
273+
274+
`\d` matches digits
275+
276+
`{3}` matches 3 digits
277+
278+
`-` matches the character '-'
279+
280+
`\d` matches any digit
281+
282+
`{4}` matches 4 digits.
283+
284+
This expression should find one match in the document.
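
The same check can be sketched in Python on made-up numbers (an assumption for illustration). Only the number written with two dashes matches.

```python
import re

# Hypothetical sample text with three differently formatted numbers.
text = "Call 555-0199, or (555) 555-0142, or 555-555-0123."

# Area code, dash, three digits, dash, four digits.
with_area_code = re.findall(r"\d{3}-\d{3}-\d{4}", text)

print(with_area_code)  # ['555-555-0123']
```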
285+
286+
:::::::::::::::::::::::::
287+
288+
::::::::::::::::::::::::::::::::::::::::::::::::::
289+
290+
::::::::::::::::::::::::::::::::::::::: challenge
291+
292+
### Match a string that includes an area code within parentheses, separated from the rest of the phone number with or without a space
293+
294+
Start with what we know, which is the most basic format of a phone number: three digits, a dash, and four digits. How would we expand the expression to include a phone number with an area code in parentheses, separated from the phone number with or without a space?
295+
296+
::::::::::::::: solution
297+
298+
### Solution
299+
300+
```
301+
\(\d{3}\) ?\d{3}-\d{4}
302+
```
303+
304+
`\(` the escape character `\` makes the opening parenthesis a straightforward character match
305+
306+
`\d` matches digits
307+
308+
`{3}` matches 3 digits
309+
310+
`\)` the escape character `\` makes the closing parenthesis a straightforward character match
311+
312+
` ?` matches zero or one spaces
313+
314+
See the previous exercise for the explanation of the rest of the expression.
315+
316+
This expression should find two matches in the document.
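
A Python sketch of the optional space, using made-up numbers (an assumption for illustration):

```python
import re

# Hypothetical sample text: with a space, without a space, and with dashes only.
text = "Call (555) 555-0142 or (555)555-0187, not 555-555-0123."

# Escaped parentheses around the area code, then an optional space.
parenthesized = re.findall(r"\(\d{3}\) ?\d{3}-\d{4}", text)

print(parenthesized)  # ['(555) 555-0142', '(555)555-0187']
```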
317+
318+
:::::::::::::::::::::::::
319+
320+
::::::::::::::::::::::::::::::::::::::::::::::::::
321+
322+
::::::::::::::::::::::::::::::::::::::: challenge
323+
324+
### Match a phone number containing a country code
325+
326+
Country codes are preceded by a "+" and can have up to three digits. We also have to consider that there may or may not be a space between the country code and anything appearing next.
327+
328+
::::::::::::::: solution
329+
330+
### Solution
331+
332+
```
333+
\+\d{1,3} ?\(\d{3}\)\s?\d{3}-\d{4}
334+
```
335+
336+
`\+` the escape character `\` makes the plus sign a straightforward character match
337+
338+
`\d` matches digits
339+
340+
`{1,3}` matches 1 to 3 digits
341+
342+
` ?` matches zero or one spaces
343+
344+
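See the previous exercise for the explanation of the rest of the expression. Note that `\s?` plays the same role here as ` ?` did there, except that `\s` matches any whitespace character (such as a tab), not only a space.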
345+
346+
This expression should find one match in the document.
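
The full expression can be tried in Python on made-up numbers (an assumption for illustration). Numbers without a leading "+" and country code are skipped.

```python
import re

# Hypothetical sample text: two numbers with country codes, one without.
text = "Dial +1 (555) 555-0142 or +44(555)555-0187; (555) 555-0100 has no country code."

# Plus sign, one to three digits, optional space, then the parenthesized format.
with_country = re.findall(r"\+\d{1,3} ?\(\d{3}\)\s?\d{3}-\d{4}", text)

print(with_country)  # ['+1 (555) 555-0142', '+44(555)555-0187']
```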
347+
348+
:::::::::::::::::::::::::
349+
350+
::::::::::::::::::::::::::::::::::::::::::::::::::
351+
352+
::::::::::::::::::::::::::::::::::::::::: callout
353+
354+
### Using regular expressions when working with files and directories
355+
356+
One of the reasons we stress the value of consistent and predictable directory and file naming conventions is that working in this way enables you to use the computer to select files based on the characteristics of their file names. For example, if you have a bunch of files where the first four digits are the year and you only want to do something with files from '2017', then you can. Or if you have 'journal' somewhere in a filename when you have data about journals, you can use the computer to select just those files. Equally, using plain text formats means that you can go further and select files or elements of files based on characteristics of the data *within* those files. See Workshop Overview: [File Naming \& Formatting](https://librarycarpentry.org/lc-overview/06-file-naming-formatting) for further background.
357+
358+
::::::::::::::::::::::::::::::::::::::::::::::::::
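
The file selection described in the callout above could be sketched in Python as follows. The file names are invented for this example; in practice they might come from a directory listing.

```python
import re

# Hypothetical file names following a year-first naming convention.
filenames = [
    "2017_journal_loans.csv",
    "2017_budget.txt",
    "2018_journal_loans.csv",
    "notes_2017.txt",
]

# Files whose names start with the year 2017.
from_2017 = [name for name in filenames if re.search(r"^2017", name)]

# Files with 'journal' anywhere in the name.
journals = [name for name in filenames if re.search(r"journal", name)]

print(from_2017)  # ['2017_journal_loans.csv', '2017_budget.txt']
print(journals)   # ['2017_journal_loans.csv', '2018_journal_loans.csv']
```

Because of the `^` anchor, `notes_2017.txt` is not selected: the year appears in its name but not at the start.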
359+
360+
## Extracting a substring in Google Sheets using regex
361+
362+
::::::::::::::::::::::::::::::::::::::: challenge
363+
364+
### Extracting a substring in Google Sheets using regex
365+
366+
1. Download and unzip the [2017 Public Library Survey](https://github.com/LibraryCarpentry/lc-data-intro/blob/main/episodes/files/PLS_FY17.zip) (originally from the IMLS data site) as a CSV file.
367+
2. Upload the CSV file to Google Sheets and open as a Google Sheet if it does not do this by default.
368+
3. Look in the `ADDRESS` column and notice that the values contain the latitude and longitude in parentheses after the library address.
369+
4. Construct a regular expression to match and extract the latitude and longitude into a new column named 'latlong'. HINT: Look up the function `REGEXEXTRACT` in Google Sheets. That function expects a string as its first argument (a cell in the `ADDRESS` column) and a quoted regular expression as its second.
370+
371+
::::::::::::::: solution
372+
373+
### Solution
374+
375+
This is one way to solve this challenge. You might have found others. Inside the cell you can use the below to extract the latitude and longitude into a single cell. You can then copy the formula down to the end of the column.
376+
377+
```source
378+
=REGEXEXTRACT(G2,"-?\d+\.\d+, -?\d+\.\d+")
379+
```
380+
381+
Latitude and longitude are in decimal degree format and can be positive or negative, so we start with an optional dash (`-?`) for negative values, then use `\d+` to match one or more digits, followed by a period `\.`. Note that we had to escape the period using `\`. After the period we look for one or more digits `\d+` again, followed by a literal comma `,`. We then match a literal space followed by another optional dash `-?`, and repeat the `\d+\.\d+` we used for the latitude. (There are a few `0.0` latitude/longitude values that are probably errors, but the expression still matches them, so they are retained and can be dealt with later.)
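
For comparison, the same expression can be tested in Python with `re.search`, which returns the first match much as `REGEXEXTRACT` does. The address string below is made up to resemble the format described above; it is not a row from the survey file.

```python
import re

# Hypothetical ADDRESS value with latitude and longitude in parentheses.
address = "123 MAIN ST, ANYTOWN, AK 99501 (61.2181, -149.9003)"

# Same pattern as in the REGEXEXTRACT formula.
match = re.search(r"-?\d+\.\d+, -?\d+\.\d+", address)

# Fall back to None when a cell has no coordinates.
latlong = match.group(0) if match else None

print(latlong)  # 61.2181, -149.9003
```

The street number and postal code are skipped because they are not followed by a period and more digits.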
382+
383+
:::::::::::::::::::::::::
384+
385+
::::::::::::::::::::::::::::::::::::::::::::::::::
386+
387+
:::::::::::::::::::::::::::::::::::::::: keypoints
388+
389+
- Regular expressions are useful for searching and cleaning data.
390+
- Test regular expressions interactively with [regex101.com](https://regex101.com/) or [RegExr.com](https://www.regexr.com/), and visualise them with [regexper.com](https://regexper.com/).
391+
- Test yourself with [RegexCrossword.com](https://regexcrossword.com/) or via the quiz and exercises in this lesson.
392+
393+
::::::::::::::::::::::::::::::::::::::::::::::::::
394+
395+
