13
13
- How can I search within files?
14
14
- How can I combine existing commands to do new things?
15
15
16
- ::::::::::::::::::::::::::::::::::::::::::::::::::
16
+
17
17
18
18
## Searching files
19
19
@@ -26,7 +26,7 @@ regular expressions in this lesson, and are instead going to specify the strings
26
26
we are searching for.
27
27
Let's give it a try!
28
28
29
- ::::::::::::::::::::::::::::::::::::::::: callout
29
+
30
30
31
31
## Nucleotide abbreviations
32
32
@@ -36,18 +36,20 @@ in a sequencing file represents a position where the sequencing machine was not
36
36
confidently determine the nucleotide in that position. You can think of an `N` as being aNy
37
37
nucleotide at that position in the DNA sequence.
38
38
39
- ::::::::::::::::::::::::::::::::::::::::::::::::::
39
+
40
40
41
41
We'll search for strings inside of our fastq files. Let's first make sure we are in the correct
42
42
directory:
43
43
44
- ``` bash
45
- $ cd ~/obss_2023/commandline/shell_data/untrimmed_fastq
46
- ```
44
+ !!! terminal "code"
45
+
46
+ ```bash
47
+ $ cd ~/shell_data/untrimmed_fastq
48
+ ```
47
49
48
50
Suppose we want to see how many reads in our file have really bad segments containing 10 consecutive unknown nucleotides (Ns).
49
51
50
- ::::::::::::::::::::::::::::::::::::::::: callout
52
+
51
53
52
54
## Determining quality
53
55
@@ -58,16 +60,17 @@ research you will most likely use a bioinformatics tool that has a built-in prog
58
60
filtering out low-quality reads. You'll learn how to use one such tool in
59
61
[a later lesson](https://datacarpentry.org/wrangling-genomics/02-quality-control).
60
62
61
- ::::::::::::::::::::::::::::::::::::::::::::::::::
62
63
63
- Let's search for the string NNNNNNNNNN in the SRR098026 file:
64
64
65
- ``` bash
66
- $ grep NNNNNNNNNN SRR098026.fastq
67
- ```
65
+ !!! terminal-2 "Let's search for the string `NNNNNNNNNN` in the SRR098026 file:"
66
+
67
+
68
+ ```bash
69
+ $ grep NNNNNNNNNN SRR098026.fastq
70
+ ```
68
71
69
72
This command returns a lot of output to the terminal. Every single line in the SRR098026
70
- file that contains at least 10 consecutive Ns is printed to the terminal, regardless of how long or short the file is.
73
+ file that contains at least 10 consecutive `N`s is printed to the terminal, regardless of how long or short the file is.
71
74
We may be interested not only in the actual sequence which contains this string, but
72
75
in the name (or identifier) of that sequence. We discussed in a previous lesson
73
76
that the identifier line immediately precedes the nucleotide sequence for each read
@@ -79,73 +82,69 @@ We can use the `-B` argument for grep to return a specific number of lines befor
79
82
each match. The `-A` argument returns a specific number of lines after each matching line. Here we want the line _before_ and the two lines _after_ each
80
83
matching line, so we add `-B1 -A2` to our grep command:
81
84
82
- ``` bash
83
- $ grep -B1 -A2 NNNNNNNNNN SRR098026.fastq
84
- ```
85
-
86
- One of the sets of lines returned by this command is:
87
-
88
- ``` output
89
- @SRR098026.177 HWUSI-EAS1599_1:2:1:1:2025 length=35
90
- CNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
91
- +SRR098026.177 HWUSI-EAS1599_1:2:1:1:2025 length=35
92
- #!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
93
- ```
94
-
95
- ::::::::::::::::::::::::::::::::::::::: challenge
96
-
97
- ## Exercise
98
-
99
- 1. Search for the sequence `GNATNACCACTTCC` in the `SRR098026.fastq` file.
100
- Have your search return all matching lines and the name (or identifier) for each sequence
101
- that contains a match.
102
-
103
- 2. Search for the sequence `AAGTT` in both FASTQ files.
104
- Have your search return all matching lines and the name (or identifier) for each sequence
105
- that contains a match.
106
-
107
- ::::::::::::::: solution
108
-
109
- ## Solution
110
-
111
- 1. `grep -B1 GNATNACCACTTCC SRR098026.fastq`
112
-
113
- ```
114
- @SRR098026.245 HWUSI-EAS1599_1:2:1:2:801 length=35
115
- GNATNACCACTTCCAGTGCTGANNNNNNNGGGATG
116
- ```
117
-
118
- 2. `grep -B1 AAGTT *.fastq`
119
-
120
- ```
121
- [email protected] 209DTAAXX_Lenski2_1_7:8:3:247:351 length=36
122
- SRR097977.fastq:GATTGCTTTAATGAAAAAGTCATATAAGTTGCCATG
123
- --
124
- [email protected] 209DTAAXX_Lenski2_1_7:8:3:544:566 length=36
125
- SRR097977.fastq:TTGTCCACGCTTTTCTATGTAAAGTTTATTTGCTTT
126
- --
127
- [email protected] 209DTAAXX_Lenski2_1_7:8:3:724:110 length=36
128
- SRR097977.fastq:TGAAGCCTGCTTTTTTATACTAAGTTTGCATTATAA
129
- --
130
- [email protected] 209DTAAXX_Lenski2_1_7:8:3:258:281 length=36
131
- SRR097977.fastq:GTGGCGCTGCTGCATAAGTTGGGTTATCAGGTCGTT
132
- --
133
- [email protected] 209DTAAXX_Lenski2_1_7:8:3:353:318 length=36
134
- SRR097977.fastq:GGCAAAATGGTCCTCCAGCCAGGCCAGAAGCAAGTT
135
- --
136
- [email protected] 209DTAAXX_Lenski2_1_7:8:3:703:655 length=36
137
- SRR097977.fastq:TTTATTTGTAAAGTTTTGTTGAAATAAGGGTTGTAA
138
- --
139
- [email protected] 209DTAAXX_Lenski2_1_7:8:3:592:919 length=36
140
- SRR097977.fastq:TTCTTACCATCCTGAAGTTTTTTCATCTTCCCTGAT
141
- --
142
- [email protected] HWUSI-EAS1599_1:2:1:1:1505 length=35
143
- SRR098026.fastq:GNNNNNNNNCAAAGTTGATCNNNNNNNNNTGTGCG
144
- ```
145
-
146
- :::::::::::::::::::::::::
85
+ !!! terminal-2 "code"
86
+
87
+ ```bash
88
+ $ grep -B1 -A2 NNNNNNNNNN SRR098026.fastq
89
+ ```
90
+
91
+ One of the sets of lines returned by this command is:
92
+
93
+ ```output
94
+ @SRR098026.177 HWUSI-EAS1599_1:2:1:1:2025 length=35
95
+ CNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
96
+ +SRR098026.177 HWUSI-EAS1599_1:2:1:1:2025 length=35
97
+ #!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
98
+ ```
99
+
100
+
101
+ !!! dumbbell "Exercise"
102
+
103
+ 1. Search for the sequence `GNATNACCACTTCC` in the `SRR098026.fastq` file.
104
+ Have your search return all matching lines and the name (or identifier) for each sequence
105
+ that contains a match.
106
+
107
+ 2. Search for the sequence `AAGTT` in both FASTQ files.
108
+ Have your search return all matching lines and the name (or identifier) for each sequence
109
+ that contains a match
110
+
111
+ ??? success "Solution"
112
+
113
+ 1. `grep -B1 GNATNACCACTTCC SRR098026.fastq`
114
+
115
+ ```
116
+ @SRR098026.245 HWUSI-EAS1599_1:2:1:2:801 length=35
117
+ GNATNACCACTTCCAGTGCTGANNNNNNNGGGATG
118
+ ```
119
+
120
+ 2. `grep -B1 AAGTT *.fastq`
121
+
122
+ ```
123
+ [email protected] 209DTAAXX_Lenski2_1_7:8:3:247:351 length=36
124
+ SRR097977.fastq:GATTGCTTTAATGAAAAAGTCATATAAGTTGCCATG
125
+ --
126
+ [email protected] 209DTAAXX_Lenski2_1_7:8:3:544:566 length=36
127
+ SRR097977.fastq:TTGTCCACGCTTTTCTATGTAAAGTTTATTTGCTTT
128
+ --
129
+ [email protected] 209DTAAXX_Lenski2_1_7:8:3:724:110 length=36
130
+ SRR097977.fastq:TGAAGCCTGCTTTTTTATACTAAGTTTGCATTATAA
131
+ --
132
+ [email protected] 209DTAAXX_Lenski2_1_7:8:3:258:281 length=36
133
+ SRR097977.fastq:GTGGCGCTGCTGCATAAGTTGGGTTATCAGGTCGTT
134
+ --
135
+ [email protected] 209DTAAXX_Lenski2_1_7:8:3:353:318 length=36
136
+ SRR097977.fastq:GGCAAAATGGTCCTCCAGCCAGGCCAGAAGCAAGTT
137
+ --
138
+ [email protected] 209DTAAXX_Lenski2_1_7:8:3:703:655 length=36
139
+ SRR097977.fastq:TTTATTTGTAAAGTTTTGTTGAAATAAGGGTTGTAA
140
+ --
141
+ [email protected] 209DTAAXX_Lenski2_1_7:8:3:592:919 length=36
142
+ SRR097977.fastq:TTCTTACCATCCTGAAGTTTTTTCATCTTCCCTGAT
143
+ --
144
+ [email protected] HWUSI-EAS1599_1:2:1:1:1505 length=35
145
+ SRR098026.fastq:GNNNNNNNNCAAAGTTGATCNNNNNNNNNTGTGCG
146
+ ```
147
147
148
- ::::::::::::::::::::::::::::::::::::::::::::::::::
149
148
150
149
## Redirecting output
151
150
@@ -165,11 +164,12 @@ Let's try out this command and copy all the records (including all four lines of
165
164
in our FASTQ files that contain
166
165
'NNNNNNNNNN' to another file called `bad_reads.txt`.
167
166
168
- ``` bash
169
- $ grep -B1 -A2 NNNNNNNNNN SRR098026.fastq > bad_reads.txt
170
- ```
167
+ !!! terminal "Code"
168
+
169
+ ```bash
170
+ $ grep -B1 -A2 NNNNNNNNNN SRR098026.fastq > bad_reads.txt
171
+ ```
171
172
172
- ::::::::::::::::::::::::::::::::::::::::: callout
173
173
174
174
## File extensions
175
175
@@ -180,7 +180,7 @@ name it with a `.fastq` extension. However, using a `.fastq` extension will lead
180
180
when we move to using wildcards later in this episode. We'll point out where this becomes
181
181
important. For now, it's good that you're thinking about file extensions!
182
182
183
- ::::::::::::::::::::::::::::::::::::::::::::::::::
183
+
184
184
185
185
The prompt should sit there a little bit, and then it should look like nothing
186
186
happened. But type `ls`. You should see a new file called `bad_reads.txt`.
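
The "nothing happened" behaviour can be reproduced on a toy file. This is a minimal sketch, assuming a made-up two-record file `demo.fastq` in a temporary directory rather than the lesson data; the file names and the `workdir` variable are illustrative only, and the `$` prompt is omitted so the lines can be pasted as a script.

```bash
# Build a tiny FASTQ file with one good read and one all-N read
# (demo.fastq is a hypothetical name, not a lesson file).
workdir=$(mktemp -d)
cd "$workdir"
printf '@read1\nACGT\n+\n!!!!\n@read2\nNNNNNNNNNN\n+\n!!!!!!!!!!\n' > demo.fastq

# With > the matching record goes into the file, so grep prints nothing.
grep -B1 -A2 NNNNNNNNNN demo.fastq > demo_bad_reads.txt

# The new file shows up in the listing and holds the full four-line record.
ls
cat demo_bad_reads.txt
```

The output file deliberately uses a `.txt` extension, in line with the file-extension note above.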
@@ -190,73 +190,61 @@ We can check the number of lines in our new file using a command called `wc`.
190
190
in a file. The FASTQ file may change over time, so given the potential for updates,
191
191
make sure your file matches your instructor's output.
192
192
193
- As of Sept. 2020, wc gives the following output:
193
+ !!! terminal-2 " As of Sept. 2020, wc gives the following output:"
194
194
195
- ``` bash
196
- $ wc bad_reads.txt
197
- ```
195
+ ```bash
196
+ $ wc bad_reads.txt
197
+ ```
198
198
199
- ``` output
200
- 802 1338 24012 bad_reads.txt
201
- ```
202
-
203
- This will tell us the number of lines, words and characters in the file. If we
204
- want only the number of lines, we can use the `-l` flag for `lines`.
205
-
206
- ``` bash
207
- $ wc -l bad_reads.txt
208
- ```
209
-
210
- ``` output
211
- 802 bad_reads.txt
212
- ```
199
+ ```output
200
+ 802 1338 24012 bad_reads.txt
201
+ ```
213
202
214
- ::::::::::::::::::::::::::::::::::::::: challenge
203
+ This will tell us the number of lines, words and characters in the file. If we
204
+ want only the number of lines, we can use the `-l` flag for `lines`.
205
+
206
+ ```bash
207
+ $ wc -l bad_reads.txt
208
+ ```
209
+
210
+ ```output
211
+ 802 bad_reads.txt
212
+ ```
215
213
216
- ## Exercise
214
+ !!! dumbbell " Exercise"
217
215
218
- How many sequences are there in `SRR098026.fastq`? Remember that every sequence is formed by four lines.
216
+ How many sequences are there in `SRR098026.fastq`? Remember that every sequence is formed by four lines.
219
217
220
- ::::::::::::::: solution
221
218
222
- ## Solution
223
219
224
- ``` bash
225
- $ wc -l SRR098026.fastq
226
- ```
220
+ ??? success "Solution"
221
+
222
+ ```bash
223
+ $ wc -l SRR098026.fastq
224
+ ```
227
225
228
- ``` output
229
- 996
230
- ```
226
+ ```output
227
+ 996
228
+ ```
231
229
232
230
Now you can divide this number by four to get the number of sequences in your fastq file.
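
The division can also be done in the shell itself. A minimal sketch, assuming a small made-up file `demo.fastq` in a temporary directory rather than the lesson data (the variable names are illustrative, and the `$` prompt is omitted so the lines can be pasted as a script):

```bash
# Build a toy FASTQ file containing two four-line records
# (demo.fastq is a hypothetical name, not a lesson file).
workdir=$(mktemp -d)
printf '@read1\nACGT\n+\n!!!!\n@read2\nNNNN\n+\n!!!!\n' > "$workdir/demo.fastq"

# Feeding the file through < makes wc print only the count, without the file name.
n_lines=$(wc -l < "$workdir/demo.fastq")

# Shell arithmetic: each record occupies four lines.
n_reads=$(( n_lines / 4 ))
echo "$n_reads"
```

Applied to the 996 lines reported for `SRR098026.fastq` above, the same arithmetic gives 249 sequences.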
233
231
234
- :::::::::::::::::::::::::
235
-
236
- ::::::::::::::::::::::::::::::::::::::::::::::::::
237
232
238
- ::::::::::::::::::::::::::::::::::::::: challenge
233
+ !!! dumbbell "Exercise"
239
234
240
- ## Exercise
235
+ How many sequences in `SRR098026.fastq` contain at least 3 consecutive Ns?
241
236
242
- How many sequences in `SRR098026.fastq` contain at least 3 consecutive Ns?
237
+ ??? success "Solution"
238
+
239
+ ```bash
240
+ $ grep NNN SRR098026.fastq > bad_reads.txt
241
+ $ wc -l bad_reads.txt
242
+ ```
243
243
244
- ::::::::::::::: solution
244
+ ```output
245
+ 249
246
+ ```
245
247
246
- ## Solution
247
-
248
- ``` bash
249
- $ grep NNN SRR098026.fastq > bad_reads.txt
250
- $ wc -l bad_reads.txt
251
- ```
252
-
253
- ``` output
254
- 249
255
- ```
256
-
257
- :::::::::::::::::::::::::
258
-
259
- ::::::::::::::::::::::::::::::::::::::::::::::::::
260
248
261
249
We might want to search multiple FASTQ files for sequences that match our search pattern.
262
250
However, we need to be careful, because each time we use the `>` operator to redirect output