update episode 4
DininduSenanayake committed Jul 28, 2024
1 parent 099ece9 commit 056c68e
Showing 1 changed file with 122 additions and 134 deletions.
256 changes: 122 additions & 134 deletions docs/04-redirection.md
@@ -13,7 +13,7 @@
- How can I search within files?
- How can I combine existing commands to do new things?

::::::::::::::::::::::::::::::::::::::::::::::::::


## Searching files

@@ -26,7 +26,7 @@ regular expressions in this lesson, and are instead going to specify the strings
we are searching for.
Let's give it a try!

::::::::::::::::::::::::::::::::::::::::: callout


## Nucleotide abbreviations

@@ -36,18 +36,20 @@ in a sequencing file represents a position where the sequencing machine was not
confidently determine the nucleotide in that position. You can think of an `N` as being aNy
nucleotide at that position in the DNA sequence.

::::::::::::::::::::::::::::::::::::::::::::::::::


We'll search for strings inside of our fastq files. Let's first make sure we are in the correct
directory:

```bash
$ cd ~/obss_2023/commandline/shell_data/untrimmed_fastq
```
!!! terminal "code"

```bash
$ cd ~/shell_data/untrimmed_fastq
```

Suppose we want to see how many reads in our file have really bad segments containing 10 consecutive unknown nucleotides (Ns).

::::::::::::::::::::::::::::::::::::::::: callout


## Determining quality

@@ -58,16 +60,17 @@ research you will most likely use a bioinformatics tool that has a built-in prog
filtering out low-quality reads. You'll learn how to use one such tool in
[a later lesson](https://datacarpentry.org/wrangling-genomics/02-quality-control).

::::::::::::::::::::::::::::::::::::::::::::::::::

Let's search for the string NNNNNNNNNN in the SRR098026 file:

```bash
$ grep NNNNNNNNNN SRR098026.fastq
```
!!! terminal-2 "Let's search for the string `NNNNNNNNNN` in the SRR098026 file:"


```bash
$ grep NNNNNNNNNN SRR098026.fastq
```

This command returns a lot of output to the terminal. Every single line in the SRR098026
file that contains at least 10 consecutive Ns is printed to the terminal, regardless of how long or short the file is.
file that contains at least 10 consecutive `N`s is printed to the terminal, regardless of how long or short the file is.
We may be interested not only in the actual sequence which contains this string, but
in the name (or identifier) of that sequence. We discussed in a previous lesson
that the identifier line immediately precedes the nucleotide sequence for each read
@@ -79,73 +82,69 @@ We can use the `-B` argument for grep to return a specific number of lines befor
each match. The `-A` argument returns a specific number of lines after each matching line. Here we want the line _before_ and the two lines _after_ each
matching line, so we add `-B1 -A2` to our grep command:

```bash
$ grep -B1 -A2 NNNNNNNNNN SRR098026.fastq
```

One of the sets of lines returned by this command is:

```output
@SRR098026.177 HWUSI-EAS1599_1:2:1:1:2025 length=35
CNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR098026.177 HWUSI-EAS1599_1:2:1:1:2025 length=35
#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
```

::::::::::::::::::::::::::::::::::::::: challenge

## Exercise

1. Search for the sequence `GNATNACCACTTCC` in the `SRR098026.fastq` file.
Have your search return all matching lines and the name (or identifier) for each sequence
that contains a match.

2. Search for the sequence `AAGTT` in both FASTQ files.
Have your search return all matching lines and the name (or identifier) for each sequence
that contains a match.

::::::::::::::: solution

## Solution

1. `grep -B1 GNATNACCACTTCC SRR098026.fastq`

```
@SRR098026.245 HWUSI-EAS1599_1:2:1:2:801 length=35
GNATNACCACTTCCAGTGCTGANNNNNNNGGGATG
```

2. `grep -B1 AAGTT *.fastq`

```
[email protected] 209DTAAXX_Lenski2_1_7:8:3:247:351 length=36
SRR097977.fastq:GATTGCTTTAATGAAAAAGTCATATAAGTTGCCATG
--
[email protected] 209DTAAXX_Lenski2_1_7:8:3:544:566 length=36
SRR097977.fastq:TTGTCCACGCTTTTCTATGTAAAGTTTATTTGCTTT
--
[email protected] 209DTAAXX_Lenski2_1_7:8:3:724:110 length=36
SRR097977.fastq:TGAAGCCTGCTTTTTTATACTAAGTTTGCATTATAA
--
[email protected] 209DTAAXX_Lenski2_1_7:8:3:258:281 length=36
SRR097977.fastq:GTGGCGCTGCTGCATAAGTTGGGTTATCAGGTCGTT
--
[email protected] 209DTAAXX_Lenski2_1_7:8:3:353:318 length=36
SRR097977.fastq:GGCAAAATGGTCCTCCAGCCAGGCCAGAAGCAAGTT
--
[email protected] 209DTAAXX_Lenski2_1_7:8:3:703:655 length=36
SRR097977.fastq:TTTATTTGTAAAGTTTTGTTGAAATAAGGGTTGTAA
--
[email protected] 209DTAAXX_Lenski2_1_7:8:3:592:919 length=36
SRR097977.fastq:TTCTTACCATCCTGAAGTTTTTTCATCTTCCCTGAT
--
[email protected] HWUSI-EAS1599_1:2:1:1:1505 length=35
SRR098026.fastq:GNNNNNNNNCAAAGTTGATCNNNNNNNNNTGTGCG
```

:::::::::::::::::::::::::
!!! terminal-2 "code"

```bash
$ grep -B1 -A2 NNNNNNNNNN SRR098026.fastq
```

One of the sets of lines returned by this command is:

```output
@SRR098026.177 HWUSI-EAS1599_1:2:1:1:2025 length=35
CNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR098026.177 HWUSI-EAS1599_1:2:1:1:2025 length=35
#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
```
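
To see what `-B1 -A2` selects without scrolling through a real data set, the following sketch builds a made-up two-record file and runs the same search on it. The file name `toy.fastq` and its reads are illustrative assumptions, not part of the lesson data.

```bash
# Build a tiny two-record FASTQ file in a temporary directory (made-up data).
workdir=$(mktemp -d)
cat > "$workdir/toy.fastq" <<'EOF'
@read1 length=12
ACGTACGTACGT
+read1 length=12
IIIIIIIIIIII
@read2 length=12
ANNNNNNNNNNT
+read2 length=12
#!!!!!!!!!!#
EOF

# -B1 adds the identifier line before the match;
# -A2 adds the '+' line and the quality line after it.
context=$(grep -B1 -A2 NNNNNNNNNN "$workdir/toy.fastq")
echo "$context"
```

Only the second record contains ten consecutive `N`s, so the four lines printed are that record's identifier, sequence, `+` line, and quality line.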


!!! dumbbell "Exercise"

1. Search for the sequence `GNATNACCACTTCC` in the `SRR098026.fastq` file.
Have your search return all matching lines and the name (or identifier) for each sequence
that contains a match.

2. Search for the sequence `AAGTT` in both FASTQ files.
Have your search return all matching lines and the name (or identifier) for each sequence
that contains a match.

??? success "Solution"

1. `grep -B1 GNATNACCACTTCC SRR098026.fastq`

```
@SRR098026.245 HWUSI-EAS1599_1:2:1:2:801 length=35
GNATNACCACTTCCAGTGCTGANNNNNNNGGGATG
```

2. `grep -B1 AAGTT *.fastq`
```
[email protected] 209DTAAXX_Lenski2_1_7:8:3:247:351 length=36
SRR097977.fastq:GATTGCTTTAATGAAAAAGTCATATAAGTTGCCATG
--
[email protected] 209DTAAXX_Lenski2_1_7:8:3:544:566 length=36
SRR097977.fastq:TTGTCCACGCTTTTCTATGTAAAGTTTATTTGCTTT
--
[email protected] 209DTAAXX_Lenski2_1_7:8:3:724:110 length=36
SRR097977.fastq:TGAAGCCTGCTTTTTTATACTAAGTTTGCATTATAA
--
[email protected] 209DTAAXX_Lenski2_1_7:8:3:258:281 length=36
SRR097977.fastq:GTGGCGCTGCTGCATAAGTTGGGTTATCAGGTCGTT
--
[email protected] 209DTAAXX_Lenski2_1_7:8:3:353:318 length=36
SRR097977.fastq:GGCAAAATGGTCCTCCAGCCAGGCCAGAAGCAAGTT
--
[email protected] 209DTAAXX_Lenski2_1_7:8:3:703:655 length=36
SRR097977.fastq:TTTATTTGTAAAGTTTTGTTGAAATAAGGGTTGTAA
--
[email protected] 209DTAAXX_Lenski2_1_7:8:3:592:919 length=36
SRR097977.fastq:TTCTTACCATCCTGAAGTTTTTTCATCTTCCCTGAT
--
[email protected] HWUSI-EAS1599_1:2:1:1:1505 length=35
SRR098026.fastq:GNNNNNNNNCAAAGTTGATCNNNNNNNNNTGTGCG
```

::::::::::::::::::::::::::::::::::::::::::::::::::
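
When `grep` is given more than one file, it labels every output line with the file it came from. A minimal sketch of that behaviour, assuming GNU `grep` and two made-up files (`first.fastq` and `second.fastq`) rather than the lesson data:

```bash
# Work in a scratch directory so the wildcard only sees the two toy files.
workdir=$(mktemp -d)
cd "$workdir"
printf '@a1\nCCAAGTTCC\n+\nIIIIIIIII\n' > first.fastq
printf '@b1\nGGGGGGGGG\n+\nIIIIIIIII\n@b2\nTTAAGTTTT\n+\nIIIIIIIII\n' > second.fastq

# *.fastq expands to both files; -B1 adds the identifier line above each match.
hits=$(grep -B1 AAGTT *.fastq)
echo "$hits"
```

Matching lines are prefixed with the file name and a colon, context lines with the file name and a hyphen, and `--` separates groups of lines that are not adjacent in the input. This is the same layout as the multi-file solution output.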

## Redirecting output

@@ -165,11 +164,12 @@ Let's try out this command and copy all the records (including all four lines of
in our FASTQ files that contain
'NNNNNNNNNN' to another file called `bad_reads.txt`.

```bash
$ grep -B1 -A2 NNNNNNNNNN SRR098026.fastq > bad_reads.txt
```
!!! terminal "Code"

```bash
$ grep -B1 -A2 NNNNNNNNNN SRR098026.fastq > bad_reads.txt
```
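
The effect of `>` can be checked on a small scale. This sketch uses a made-up file (`toy.fastq`) in a scratch directory; the output file name matches the lesson's, but its contents here are illustrative only.

```bash
workdir=$(mktemp -d)
cd "$workdir"
printf '@r1\nACGTACGTACGT\n+\nIIIIIIIIIIII\n@r2\nNNNNNNNNNNNN\n+\n!!!!!!!!!!!!\n' > toy.fastq

# Without '>', grep would print the record to the terminal;
# with it, the record is written to bad_reads.txt instead.
grep -B1 -A2 NNNNNNNNNN toy.fastq > bad_reads.txt

# The new file now exists and holds the four lines of the matching record.
ls
cat bad_reads.txt
```

The `grep` command itself prints nothing; the record only becomes visible when the file is listed and displayed.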

::::::::::::::::::::::::::::::::::::::::: callout

## File extensions

@@ -180,7 +180,7 @@ name it with a `.fastq` extension. However, using a `.fastq` extension will lead
when we move to using wildcards later in this episode. We'll point out where this becomes
important. For now, it's good that you're thinking about file extensions!

::::::::::::::::::::::::::::::::::::::::::::::::::


The prompt should pause briefly, and then it should look as though nothing
happened. But if you type `ls`, you should see a new file called `bad_reads.txt`.
@@ -190,73 +190,61 @@ We can check the number of lines in our new file using a command called `wc`.
in a file. The FASTQ file may change over time, so given the potential for updates,
make sure your file matches your instructor's output.

As of Sept. 2020, wc gives the following output:
!!! terminal-2 "As of Sept. 2020, wc gives the following output:"

```bash
$ wc bad_reads.txt
```
```bash
$ wc bad_reads.txt
```

```output
802 1338 24012 bad_reads.txt
```

This will tell us the number of lines, words and characters in the file. If we
want only the number of lines, we can use the `-l` flag for `lines`.

```bash
$ wc -l bad_reads.txt
```

```output
802 bad_reads.txt
```
```output
802 1338 24012 bad_reads.txt
```

::::::::::::::::::::::::::::::::::::::: challenge
This will tell us the number of lines, words and characters in the file. If we
want only the number of lines, we can use the `-l` flag for `lines`.

```bash
$ wc -l bad_reads.txt
```

```output
802 bad_reads.txt
```
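
As a small check of what the numbers mean, the sketch below runs `wc` on a two-line file whose contents are easy to count by hand. The file name `sample.txt` is made up, and reading the file with `<` is an addition not covered at this point in the lesson.

```bash
workdir=$(mktemp -d)
cd "$workdir"
printf 'one two\nthree\n' > sample.txt

wc sample.txt       # lines, words, characters (bytes), then the file name
wc -l sample.txt    # lines only, then the file name

# Reading from standard input makes wc print the number without a file name.
line_count=$(wc -l < sample.txt)
word_count=$(wc -w < sample.txt)
echo "lines: $line_count, words: $word_count"
```

The file has two lines and three words, so those are the counts to expect in the first two columns of the plain `wc` output.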

## Exercise
!!! dumbbell "Exercise"

How many sequences are there in `SRR098026.fastq`? Remember that every sequence is formed by four lines.
How many sequences are there in `SRR098026.fastq`? Remember that every sequence is formed by four lines.

::::::::::::::: solution

## Solution

```bash
$ wc -l SRR098026.fastq
```
??? success "Solution"

```bash
$ wc -l SRR098026.fastq
```

```output
996 SRR098026.fastq
```
```output
996 SRR098026.fastq
```

Now you can divide this number by four to get the number of sequences in your fastq file.

:::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::::::::::::::::
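
The division in the solution can also be done in the shell. A minimal sketch using arithmetic expansion, which is not covered in this episode; the value 996 is the line count reported by `wc -l` in the solution.

```bash
# Each FASTQ record spans four lines, so records = lines / 4.
total_lines=996
sequences=$(( total_lines / 4 ))
echo "$sequences"
```

This prints 249. On the real file, `total_lines` could instead be set with `total_lines=$(wc -l < SRR098026.fastq)`, assuming you are in the `untrimmed_fastq` directory.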

::::::::::::::::::::::::::::::::::::::: challenge
!!! dumbbell "Exercise"

## Exercise
How many sequences in `SRR098026.fastq` contain at least 3 consecutive Ns?

How many sequences in `SRR098026.fastq` contain at least 3 consecutive Ns?
??? success "Solution"

```bash
$ grep NNN SRR098026.fastq > bad_reads.txt
$ wc -l bad_reads.txt
```

::::::::::::::: solution
```output
249 bad_reads.txt
```

## Solution

```bash
$ grep NNN SRR098026.fastq > bad_reads.txt
$ wc -l bad_reads.txt
```

```output
249 bad_reads.txt
```

:::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::::::::::::::::
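
If only the count is needed, the intermediate file can be skipped. The sketch below uses the `-c` option of `grep`, which reports the number of matching lines instead of printing them. It is an alternative to the solution, not the lesson's method, and `toy.fastq` is made-up data.

```bash
workdir=$(mktemp -d)
cd "$workdir"
printf '@r1\nACGTNNNACGT\n+\nIIIIIIIIIII\n@r2\nACGTACGTACG\n+\nIIIIIIIIIII\n@r3\nNNNNNNNNNNN\n+\n!!!!!!!!!!!\n' > toy.fastq

# Count lines containing at least three consecutive Ns
# without creating or overwriting bad_reads.txt.
n_count=$(grep -c NNN toy.fastq)
echo "$n_count"
```

Two of the three toy reads contain `NNN`, so the count printed is 2. Because nothing is redirected, any existing `bad_reads.txt` is left untouched.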

We might want to search multiple FASTQ files for sequences that match our search pattern.
However, we need to be careful, because each time we use the `>` command to redirect output