Commit 056c68e

update episode 4
1 parent 099ece9 commit 056c68e

1 file changed

+122
-134
lines changed

docs/04-redirection.md

@@ -13,7 +13,7 @@
1313
- How can I search within files?
1414
- How can I combine existing commands to do new things?
1515

16-
::::::::::::::::::::::::::::::::::::::::::::::::::
16+
1717

1818
## Searching files
1919

@@ -26,7 +26,7 @@ regular expressions in this lesson, and are instead going to specify the strings
2626
we are searching for.
2727
Let's give it a try!
2828

29-
::::::::::::::::::::::::::::::::::::::::: callout
29+
3030

3131
## Nucleotide abbreviations
3232

@@ -36,18 +36,20 @@ in a sequencing file represents a position where the sequencing machine was not
3636
confidently determine the nucleotide in that position. You can think of an `N` as being aNy
3737
nucleotide at that position in the DNA sequence.
3838

39-
::::::::::::::::::::::::::::::::::::::::::::::::::
39+
4040

4141
We'll search for strings inside of our fastq files. Let's first make sure we are in the correct
4242
directory:
4343

44-
```bash
45-
$ cd ~/obss_2023/commandline/shell_data/untrimmed_fastq
46-
```
44+
!!! terminal "code"
45+
46+
```bash
47+
$ cd ~/shell_data/untrimmed_fastq
48+
```
4749

4850
Suppose we want to see how many reads in our file have really bad segments containing 10 consecutive unknown nucleotides (Ns).
4951

50-
::::::::::::::::::::::::::::::::::::::::: callout
52+
5153

5254
## Determining quality
5355

@@ -58,16 +60,17 @@ research you will most likely use a bioinformatics tool that has a built-in prog
5860
filtering out low-quality reads. You'll learn how to use one such tool in
5961
[a later lesson](https://datacarpentry.org/wrangling-genomics/02-quality-control).
6062

61-
::::::::::::::::::::::::::::::::::::::::::::::::::
6263

63-
Let's search for the string NNNNNNNNNN in the SRR098026 file:
6464

65-
```bash
66-
$ grep NNNNNNNNNN SRR098026.fastq
67-
```
65+
!!! terminal-2 "Let's search for the string `NNNNNNNNNN` in the SRR098026 file:"
66+
67+
68+
```bash
69+
$ grep NNNNNNNNNN SRR098026.fastq
70+
```
6871

6972
This command returns a lot of output to the terminal. Every single line in the SRR098026
70-
file that contains at least 10 consecutive Ns is printed to the terminal, regardless of how long or short the file is.
73+
file that contains at least 10 consecutive `N`s is printed to the terminal, regardless of how long or short the file is.
7174
We may be interested not only in the actual sequence which contains this string, but
7275
in the name (or identifier) of that sequence. We discussed in a previous lesson
7376
that the identifier line immediately precedes the nucleotide sequence for each read
@@ -79,73 +82,69 @@ We can use the `-B` argument for grep to return a specific number of lines befor
7982
each match. The `-A` argument returns a specific number of lines after each matching line. Here we want the line _before_ and the two lines _after_ each
8083
matching line, so we add `-B1 -A2` to our grep command:
8184

82-
```bash
83-
$ grep -B1 -A2 NNNNNNNNNN SRR098026.fastq
84-
```
85-
86-
One of the sets of lines returned by this command is:
87-
88-
```output
89-
@SRR098026.177 HWUSI-EAS1599_1:2:1:1:2025 length=35
90-
CNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
91-
+SRR098026.177 HWUSI-EAS1599_1:2:1:1:2025 length=35
92-
#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
93-
```
94-
95-
::::::::::::::::::::::::::::::::::::::: challenge
96-
97-
## Exercise
98-
99-
1. Search for the sequence `GNATNACCACTTCC` in the `SRR098026.fastq` file.
100-
Have your search return all matching lines and the name (or identifier) for each sequence
101-
that contains a match.
102-
103-
2. Search for the sequence `AAGTT` in both FASTQ files.
104-
Have your search return all matching lines and the name (or identifier) for each sequence
105-
that contains a match.
106-
107-
::::::::::::::: solution
108-
109-
## Solution
110-
111-
1. `grep -B1 GNATNACCACTTCC SRR098026.fastq`
112-
113-
```
114-
@SRR098026.245 HWUSI-EAS1599_1:2:1:2:801 length=35
115-
GNATNACCACTTCCAGTGCTGANNNNNNNGGGATG
116-
```
117-
118-
2. `grep -B1 AAGTT *.fastq`
119-
120-
```
121-
[email protected] 209DTAAXX_Lenski2_1_7:8:3:247:351 length=36
122-
SRR097977.fastq:GATTGCTTTAATGAAAAAGTCATATAAGTTGCCATG
123-
--
124-
[email protected] 209DTAAXX_Lenski2_1_7:8:3:544:566 length=36
125-
SRR097977.fastq:TTGTCCACGCTTTTCTATGTAAAGTTTATTTGCTTT
126-
--
127-
[email protected] 209DTAAXX_Lenski2_1_7:8:3:724:110 length=36
128-
SRR097977.fastq:TGAAGCCTGCTTTTTTATACTAAGTTTGCATTATAA
129-
--
130-
[email protected] 209DTAAXX_Lenski2_1_7:8:3:258:281 length=36
131-
SRR097977.fastq:GTGGCGCTGCTGCATAAGTTGGGTTATCAGGTCGTT
132-
--
133-
[email protected] 209DTAAXX_Lenski2_1_7:8:3:353:318 length=36
134-
SRR097977.fastq:GGCAAAATGGTCCTCCAGCCAGGCCAGAAGCAAGTT
135-
--
136-
[email protected] 209DTAAXX_Lenski2_1_7:8:3:703:655 length=36
137-
SRR097977.fastq:TTTATTTGTAAAGTTTTGTTGAAATAAGGGTTGTAA
138-
--
139-
[email protected] 209DTAAXX_Lenski2_1_7:8:3:592:919 length=36
140-
SRR097977.fastq:TTCTTACCATCCTGAAGTTTTTTCATCTTCCCTGAT
141-
--
142-
[email protected] HWUSI-EAS1599_1:2:1:1:1505 length=35
143-
SRR098026.fastq:GNNNNNNNNCAAAGTTGATCNNNNNNNNNTGTGCG
144-
```
145-
146-
:::::::::::::::::::::::::
85+
!!! terminal-2 "code"
86+
87+
```bash
88+
$ grep -B1 -A2 NNNNNNNNNN SRR098026.fastq
89+
```
90+
91+
One of the sets of lines returned by this command is:
92+
93+
```output
94+
@SRR098026.177 HWUSI-EAS1599_1:2:1:1:2025 length=35
95+
CNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
96+
+SRR098026.177 HWUSI-EAS1599_1:2:1:1:2025 length=35
97+
#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
98+
```
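To see what `-B1 -A2` does on a file small enough to read in full, here is a minimal sketch. The file `demo.fastq` and its two records are made up for illustration and are not part of the lesson data.

```bash
# Build a tiny two-record FASTQ-style file in a temporary directory (hypothetical data).
workdir=$(mktemp -d)
cat > "$workdir/demo.fastq" <<'EOF'
@read1 length=12
ACGTACGTACGT
+read1 length=12
IIIIIIIIIIII
@read2 length=12
CNNNNNNNNNNT
+read2 length=12
#!!!!!!!!!!!
EOF

# -B1 adds the line before each match; -A2 adds the two lines after it.
# Only read2 contains 10 consecutive Ns, so its full four-line record is returned.
context=$(grep -B1 -A2 NNNNNNNNNN "$workdir/demo.fastq")
echo "$context"
```

Because the match falls on the second line of a record, one line before and two lines after is exactly enough to recover the whole record.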
99+
100+
101+
!!! dumbbell "Exercise"
102+
103+
1. Search for the sequence `GNATNACCACTTCC` in the `SRR098026.fastq` file.
104+
Have your search return all matching lines and the name (or identifier) for each sequence
105+
that contains a match.
106+
107+
2. Search for the sequence `AAGTT` in both FASTQ files.
108+
Have your search return all matching lines and the name (or identifier) for each sequence
109+
that contains a match.
110+
111+
??? success "Solution"
112+
113+
1. `grep -B1 GNATNACCACTTCC SRR098026.fastq`
114+
115+
```
116+
@SRR098026.245 HWUSI-EAS1599_1:2:1:2:801 length=35
117+
GNATNACCACTTCCAGTGCTGANNNNNNNGGGATG
118+
```
119+
120+
2. `grep -B1 AAGTT *.fastq`
121+
122+
```
123+
[email protected] 209DTAAXX_Lenski2_1_7:8:3:247:351 length=36
124+
SRR097977.fastq:GATTGCTTTAATGAAAAAGTCATATAAGTTGCCATG
125+
--
126+
[email protected] 209DTAAXX_Lenski2_1_7:8:3:544:566 length=36
127+
SRR097977.fastq:TTGTCCACGCTTTTCTATGTAAAGTTTATTTGCTTT
128+
--
129+
[email protected] 209DTAAXX_Lenski2_1_7:8:3:724:110 length=36
130+
SRR097977.fastq:TGAAGCCTGCTTTTTTATACTAAGTTTGCATTATAA
131+
--
132+
[email protected] 209DTAAXX_Lenski2_1_7:8:3:258:281 length=36
133+
SRR097977.fastq:GTGGCGCTGCTGCATAAGTTGGGTTATCAGGTCGTT
134+
--
135+
[email protected] 209DTAAXX_Lenski2_1_7:8:3:353:318 length=36
136+
SRR097977.fastq:GGCAAAATGGTCCTCCAGCCAGGCCAGAAGCAAGTT
137+
--
138+
[email protected] 209DTAAXX_Lenski2_1_7:8:3:703:655 length=36
139+
SRR097977.fastq:TTTATTTGTAAAGTTTTGTTGAAATAAGGGTTGTAA
140+
--
141+
[email protected] 209DTAAXX_Lenski2_1_7:8:3:592:919 length=36
142+
SRR097977.fastq:TTCTTACCATCCTGAAGTTTTTTCATCTTCCCTGAT
143+
--
144+
[email protected] HWUSI-EAS1599_1:2:1:1:1505 length=35
145+
SRR098026.fastq:GNNNNNNNNCAAAGTTGATCNNNNNNNNNTGTGCG
146+
```
147147

148-
::::::::::::::::::::::::::::::::::::::::::::::::::
149148

150149
## Redirecting output
151150

@@ -165,11 +164,12 @@ Let's try out this command and copy all the records (including all four lines of
165164
in our FASTQ files that contain
166165
'NNNNNNNNNN' to another file called `bad_reads.txt`.
167166

168-
```bash
169-
$ grep -B1 -A2 NNNNNNNNNN SRR098026.fastq > bad_reads.txt
170-
```
167+
!!! terminal "Code"
168+
169+
```bash
170+
$ grep -B1 -A2 NNNNNNNNNN SRR098026.fastq > bad_reads.txt
171+
```
171172

172-
::::::::::::::::::::::::::::::::::::::::: callout
173173

174174
## File extensions
175175

@@ -180,7 +180,7 @@ name it with a `.fastq` extension. However, using a `.fastq` extension will lead
180180
when we move to using wildcards later in this episode. We'll point out where this becomes
181181
important. For now, it's good that you're thinking about file extensions!
182182

183-
::::::::::::::::::::::::::::::::::::::::::::::::::
183+
184184

185185
The prompt should sit there a little bit, and then it should look like nothing
186186
happened. But type `ls`. You should see a new file called `bad_reads.txt`.
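The same behaviour can be tried on a throwaway file first. This is a sketch under assumptions: `reads.txt` and its contents are hypothetical, and the commands run in a temporary directory so nothing in `untrimmed_fastq` is touched.

```bash
# Work in a scratch directory with a small made-up input file.
workdir=$(mktemp -d)
printf 'ACGTACGT\nACNNNNNNNNNNGT\nTTTTGGGG\n' > "$workdir/reads.txt"

# With > the matching line goes into the file instead of the terminal,
# so this command prints nothing.
grep NNNNNNNNNN "$workdir/reads.txt" > "$workdir/bad_reads.txt"

# The new file now shows up in the listing and holds the matching line.
ls "$workdir"
cat "$workdir/bad_reads.txt"
```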
@@ -190,73 +190,61 @@ We can check the number of lines in our new file using a command called `wc`.
190190
in a file. The FASTQ file may change over time, so given the potential for updates,
191191
make sure your file matches your instructor's output.
192192

193-
As of Sept. 2020, wc gives the following output:
193+
!!! terminal-2 "As of Sept. 2020, wc gives the following output:"
194194

195-
```bash
196-
$ wc bad_reads.txt
197-
```
195+
```bash
196+
$ wc bad_reads.txt
197+
```
198198

199-
```output
200-
802 1338 24012 bad_reads.txt
201-
```
202-
203-
This will tell us the number of lines, words and characters in the file. If we
204-
want only the number of lines, we can use the `-l` flag for `lines`.
205-
206-
```bash
207-
$ wc -l bad_reads.txt
208-
```
209-
210-
```output
211-
802 bad_reads.txt
212-
```
199+
```output
200+
802 1338 24012 bad_reads.txt
201+
```
213202

214-
::::::::::::::::::::::::::::::::::::::: challenge
203+
This will tell us the number of lines, words and characters in the file. If we
204+
want only the number of lines, we can use the `-l` flag for `lines`.
205+
206+
```bash
207+
$ wc -l bad_reads.txt
208+
```
209+
210+
```output
211+
802 bad_reads.txt
212+
```
215213

216-
## Exercise
214+
!!! dumbbell "Exercise"
217215

218-
How many sequences are there in `SRR098026.fastq`? Remember that every sequence is formed by four lines.
216+
How many sequences are there in `SRR098026.fastq`? Remember that every sequence is formed by four lines.
219217

220-
::::::::::::::: solution
221218

222-
## Solution
223219

224-
```bash
225-
$ wc -l SRR098026.fastq
226-
```
220+
??? success "Solution"
221+
222+
```bash
223+
$ wc -l SRR098026.fastq
224+
```
227225

228-
```output
229-
996
230-
```
226+
```output
227+
996
228+
```
231229

232230
Now you can divide this number by four to get the number of sequences in your fastq file.
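The division can also be done in the shell itself. The sketch below uses a made-up two-record file, `demo.fastq`, plus two features the lesson has not introduced at this point: `<` to feed a file to `wc` so that only the number is printed, and `$(( ))` for integer arithmetic.

```bash
# A hypothetical FASTQ-style file with two records (eight lines).
workdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > "$workdir/demo.fastq"

# Reading the file through < makes wc print the count without the file name.
lines=$(wc -l < "$workdir/demo.fastq")

# Each record is four lines, so integer division gives the number of sequences.
sequences=$(( lines / 4 ))
echo "$sequences"
```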
233231

234-
:::::::::::::::::::::::::
235-
236-
::::::::::::::::::::::::::::::::::::::::::::::::::
237232

238-
::::::::::::::::::::::::::::::::::::::: challenge
233+
!!! dumbbell "Exercise"
239234

240-
## Exercise
235+
How many sequences in `SRR098026.fastq` contain at least 3 consecutive Ns?
241236

242-
How many sequences in `SRR098026.fastq` contain at least 3 consecutive Ns?
237+
??? success "Solution"
238+
239+
```bash
240+
$ grep NNN SRR098026.fastq > bad_reads.txt
241+
$ wc -l bad_reads.txt
242+
```
243243

244-
::::::::::::::: solution
244+
```output
245+
249
246+
```
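As an aside, `grep` can report the number of matching lines itself with the `-c` flag, which avoids the intermediate file. The sketch below compares both routes on a hypothetical three-record file, `demo.fastq`. Note that the redirect route overwrites any existing `bad_reads.txt`.

```bash
# Hypothetical data: two of the three sequences contain at least 3 consecutive Ns.
workdir=$(mktemp -d)
printf '@r1\nACNNNT\n+\nIIIIII\n@r2\nACGTAC\n+\nIIIIII\n@r3\nNNNNGT\n+\n!!!!II\n' > "$workdir/demo.fastq"

# Route used in the solution: redirect the matches to a file, then count its lines.
grep NNN "$workdir/demo.fastq" > "$workdir/bad_reads.txt"
n_via_file=$(wc -l < "$workdir/bad_reads.txt")

# Alternative: -c makes grep print the count of matching lines directly.
n_direct=$(grep -c NNN "$workdir/demo.fastq")
echo "$n_direct"
```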
245247

246-
## Solution
247-
248-
```bash
249-
$ grep NNN SRR098026.fastq > bad_reads.txt
250-
$ wc -l bad_reads.txt
251-
```
252-
253-
```output
254-
249
255-
```
256-
257-
:::::::::::::::::::::::::
258-
259-
::::::::::::::::::::::::::::::::::::::::::::::::::
260248

261249
We might want to search multiple FASTQ files for sequences that match our search pattern.
262250
However, we need to be careful, because each time we use the `>` command to redirect output
